Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Reynold Xin
Actually it's *way* harder to upgrade Scala from 2.10 to 2.11 than to upgrade the JVM runtime from 7 to 8, because Scala 2.10 and 2.11 are not binary compatible, whereas JVM 7 and 8 are binary compatible except in certain esoteric cases. On Thu, Mar 24, 2016 at 4:44 PM, Kostas Sakellis

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Reynold Xin
If you want to go down that route, you should also ask somebody who has experience managing a large organization's applications and has tried to update the Scala version. On Thu, Mar 24, 2016 at 4:48 PM, Marcelo Vanzin <van...@cloudera.com> wrote: > On Thu, Mar 24, 2016 at 4:46 PM, Reyno

Re: Dynamic allocation availability on standalone mode. Misleading doc.

2016-03-07 Thread Reynold Xin
The doc fix was merged in 1.6.1, so it will get updated automatically once we push the 1.6.1 docs. On Mon, Mar 7, 2016 at 5:40 PM, Saisai Shao wrote: > Yes, we need to fix the document. > > On Tue, Mar 8, 2016 at 9:07 AM, Mark Hamstra > wrote:

Re: BUILD FAILURE due to...Unable to find configuration file at location dev/scalastyle-config.xml

2016-03-07 Thread Reynold Xin
+Sean, who was playing with this. On Mon, Mar 7, 2016 at 11:38 PM, Jacek Laskowski wrote: > Hi, > > Got the BUILD FAILURE. Anyone looking into it? > > ➜ spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6 > -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests

Re: More Robust DataSource Parameters

2016-03-07 Thread Reynold Xin
…would be greatly > appreciated). > > With the above answer to #1 and contingent on finding a solution to the > API stability part of it, would you be supportive of a change to do this? > If so, I'll submit a JIRA first and solicit/brainstorm some ideas on how to > do #2

Re: getting a list of executors for use in getPreferredLocations

2016-03-03 Thread Reynold Xin
What do you mean by consistent? Throughout the life cycle of an app, executors can come and go, so the set really has no consistency. Do you just need it for a specific job? On Thu, Mar 3, 2016 at 3:08 PM, Cody Koeninger wrote: > I need getPreferredLocations to
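
A minimal sketch (not from the thread) of one way to snapshot the currently registered executors for a getPreferredLocations implementation; the keys of getExecutorMemoryStatus are "host:port" strings, and the snapshot is only valid at the moment it is taken:

    import org.apache.spark.SparkContext

    // Snapshot the hosts of currently registered executors. Note the result
    // also includes the driver, and it can change as executors come and go.
    def currentExecutorHosts(sc: SparkContext): Seq[String] =
      sc.getExecutorMemoryStatus.keys.map(_.split(":")(0)).toSeq.distinct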

Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-02 Thread Reynold Xin
SQL is very common and even some business analysts learn it. Scala and Python are great, but the easiest language to use is often the language a user already knows. And for a lot of users, that is SQL. On Wednesday, March 2, 2016, Jerry Lam wrote: > Hi guys, > > FYI...

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-07 Thread Reynold Xin
+1 (binding) On Sun, Mar 6, 2016 at 12:08 PM, Egor Pahomov wrote: > +1 > > Spark ODBC server is fine, SQL is fine. > > 2016-03-03 12:09 GMT-08:00 Yin Yang : > >> Skipping docker tests, the rest are green: >> >> [INFO] Spark Project External Kafka

Re: Typo in community databricks cloud docs

2016-03-07 Thread Reynold Xin
Thanks - I've fixed it and it will go out next time we update. For future reference, you can email supp...@databricks.com directly for this. Again - thanks for reporting this. On Sat, Mar 5, 2016 at 4:23 PM, Eugene Morozov wrote: > Hi, I'm not sure where to put

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
…performance of > spark.local.dir, even on large memory systems. > > "Currently it is not possible to not write shuffle files to disk." > > What changes >would< make it possible? > > The only one that seems possible is to clone the shuffle service and make > it in-memory. > > >

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
spark.shuffle.spill actually has nothing to do with whether we write shuffle files to disk. Currently it is not possible to not write shuffle files to disk, and typically it is not a problem because the network fetch throughput is lower than what disks can sustain. In most cases, especially with

Re: java.lang.OutOfMemoryError: Unable to acquire bytes of memory

2016-04-04 Thread Reynold Xin
Nezih, Have you had a chance to figure out why this is happening? On Tue, Mar 22, 2016 at 1:32 AM, james wrote: > I guess different workload cause diff result ? > > > > -- > View this message in context: >

Re: java.lang.OutOfMemoryError: Unable to acquire bytes of memory

2016-04-04 Thread Reynold Xin
BTW do you still see this when dynamic allocation is off? On Mon, Apr 4, 2016 at 6:16 PM, Reynold Xin <r...@databricks.com> wrote: > Nezih, > > Have you had a chance to figure out why this is happening? > > > On Tue, Mar 22, 2016 at 1:32 AM, james <yiaz...@gma

Re: Build changes after SPARK-13579

2016-04-04 Thread Reynold Xin
pyspark and R On Mon, Apr 4, 2016 at 9:59 PM, Marcelo Vanzin wrote: > No, tests (except pyspark) should work without having to package anything > first. > > On Mon, Apr 4, 2016 at 9:58 PM, Koert Kuipers wrote: > > do i need to run sbt package before

Re: Build with Thrift Server & Scala 2.11

2016-04-05 Thread Reynold Xin
What do you mean? The Jenkins build for Spark uses 2.11 and also builds the thrift server. On Tuesday, April 5, 2016, Raymond Honderdors wrote: > Is anyone looking into this one, Build with Thrift Server & Scala 2.11? > > If so, when can we expect it? > > > >

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-29 Thread Reynold Xin
They work. On Tue, Mar 29, 2016 at 10:01 AM, Koert Kuipers wrote: > if scala prior to 2.10.4 didn't support java 8, does that mean that > 3rd party scala libraries compiled with a scala version < 2.10.4 might not > work on java 8? > > > On Mon, Mar 28, 2016 at 7:06 PM,

[discuss] using deep learning to improve Spark

2016-04-01 Thread Reynold Xin
Hi all, Hope you all enjoyed the Tesla 3 unveiling earlier tonight. I'd like to bring your attention to a project called DeepSpark that we have been working on for the past three years. We realized that scaling software development was challenging. A large fraction of software engineering has

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
…spark.local.dir as a buffer pool of > > others. > > > > Hence, the performance of Spark is gated by the performance of > > spark.local.dir, even on large memory systems. > > > > "Currently it is not possible to not write shuffle files to disk." > > > > What c

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
…to two orders of magnitude slower than some simple in-memory data partitioning algorithm (e.g. radix sort). Addressing that can speed up certain Spark workloads (large joins, large aggregations) quite a bit. On Fri, Apr 1, 2016 at 2:22 PM, Reynold Xin <r...@databricks.com> wrote: > Su

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
…ings but use spark.local.dir as a buffer pool of >> > others. >> > >> > Hence, the performance of Spark is gated by the performance of >> > spark.local.dir, even on large memory systems. >> > >> > "Currently it is not possible to not write shu

Re: explain codegen

2016-04-04 Thread Reynold Xin
…Kind regards, >> >> Herman van Hövell >> >> 2016-04-04 12:15 GMT+02:00 Ted Yu <yuzhih...@gmail.com>: >> >>> Could the error I encountered be due to missing import(s) of implicit?

Re: explain codegen

2016-04-03 Thread Reynold Xin
Works for me on latest master. scala> sql("explain codegen select 'a' as a group by 1").head res3: org.apache.spark.sql.Row = [Found 2 WholeStageCodegen subtrees. == Subtree 1 / 2 == WholeStageCodegen : +- TungstenAggregate(key=[], functions=[], output=[a#10]) : +- INPUT +- Exchange

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-04-03 Thread Reynold Xin
…standalone might be an easier upgrade, I guess?). >>> >>> >>> Proposal is for 1.6x line to continue to be supported with critical >>> fixes; newer features will require 2.x and so jdk8 >>> >>> Regards >>> Mridul >>> >>> >

Re: Coding style question (about extra anonymous closure within functional transformations)

2016-04-13 Thread Reynold Xin
We prefer the latter. I don't think there are performance differences though. It depends on how big the change is -- massive style updates can make backports harder. On Wed, Apr 13, 2016 at 7:46 PM, Hyukjin Kwon wrote: > Hi all, > > I recently noticed that actually there
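
A hypothetical illustration of the two styles in question (the thread snippet does not show them; `process` is a placeholder function, and `sc` is assumed to be a live SparkContext):

    val rdd = sc.parallelize(Seq("a", "b", "c"))
    def process(s: String): String = s.toUpperCase

    // Extra anonymous closure wrapping the function:
    rdd.map { s => process(s) }

    // Preferred: pass the function directly.
    rdd.map(process)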

Re: DynamoDB data source questions

2016-04-13 Thread Reynold Xin
Responses inline On Wed, Apr 13, 2016 at 7:45 AM, Travis Crawford wrote: > Hi Spark gurus, > > At Medium we're using Spark for an ETL job that scans DynamoDB tables and > loads into Redshift. Currently I use a parallel scanner implementation that > writes files to

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Reynold Xin
Usually no - but sortByKey does, because the range boundaries need to be computed before the RDD can even be defined. It is a long-standing problem that's unfortunately very difficult to solve without breaking the RDD API. In DataFrame/Dataset we don't have this issue though. On Sun, Apr 24, 2016 at 10:54
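
A minimal sketch of the behavior being described (assuming a live SparkContext `sc`):

    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))

    // Merely defining the sorted RDD launches a sampling job: RangePartitioner
    // has to sample the data to compute the range boundaries of its partitions.
    val sorted = pairs.sortByKey()   // a job already shows up in the UI here

    sorted.count()                   // the action that runs the "real" job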

Re: Proposal of closing some PRs which at least one of committers suggested so

2016-04-23 Thread Reynold Xin
I pushed a commit to close all but the last one. On Sat, Apr 23, 2016 at 2:08 AM, Sean Owen wrote: > Except for the last one I think they're closeable. We can't close any > PR directly. It's possible to push an empty commit with comments like > "Closes #" to make the

Re: HDFS as Shuffle Service

2016-04-28 Thread Reynold Xin
Hm while this is an attractive idea in theory, in practice I think you are substantially overestimating HDFS' ability to handle a lot of small, ephemeral files. It has never really been optimized for that use case. On Thu, Apr 28, 2016 at 11:15 AM, Michael Gummelt wrote:

Re: RDD.broadcast

2016-04-28 Thread Reynold Xin
This is a nice feature in broadcast join. It is just a little bit complicated to do and as a result hasn't been prioritized as highly yet. On Thu, Apr 28, 2016 at 5:51 AM, wrote: > I was aiming to show the operations with pseudo-code, but I apparently > failed,
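
For the DataFrame side, a broadcast-join hint does exist already; a sketch assuming a SparkSession named `spark` (this is the DataFrame API, not the RDD-level feature requested in the thread, and the column names are placeholders):

    import org.apache.spark.sql.functions.broadcast

    val largeDF = spark.range(1000000).withColumnRenamed("id", "key")
    val smallDF = spark.range(100).withColumnRenamed("id", "key")

    // broadcast() hints the planner to ship the small side to every executor
    // instead of shuffling both sides.
    val joined = largeDF.join(broadcast(smallDF), "key")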

Re: Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Reynold Xin
The old one is deprecated but should still work though. On Thu, May 19, 2016 at 3:51 PM, Arun Allamsetty wrote: > Hi Doug, > > If you look at the API docs here: >

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Reynold Xin
…3) > > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541) > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(D

[vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-17 Thread Reynold Xin
Hi, In the past the Apache Spark community has created preview packages (not official releases) and used those as opportunities to ask community members to test the upcoming versions of Apache Spark. Several people in the Apache community have suggested we conduct votes for these preview

Re: right outer joins on Datasets

2016-05-20 Thread Reynold Xin
I filed https://issues.apache.org/jira/browse/SPARK-15441 On Thu, May 19, 2016 at 8:48 AM, Andres Perez wrote: > Hi all, I'm getting some odd behavior when using the joinWith > functionality for Datasets. Here is a small test case: > > val left = List(("a", 1), ("a", 2),
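
A sketch of the kind of typed outer join under discussion (assuming a SparkSession named `spark`; the data is illustrative, not the thread's actual test case):

    import spark.implicits._

    val left  = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
    val right = Seq(("a", "x"), ("c", "y")).toDS()

    // Typed right outer join: rows with no match on the left come back with a
    // null left side, which is where the odd behavior was being reported.
    val joined = left.joinWith(right, left("_1") === right("_1"), "right_outer")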

Re: Dataset reduceByKey

2016-05-20 Thread Reynold Xin
Andres - this is great feedback. Let me think about it a little bit more and reply later. On Thu, May 19, 2016 at 11:12 AM, Andres Perez wrote: > Hi all, > > We were in the process of porting an RDD program to one which uses > Datasets. Most things were easy to transition,
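
For reference, the closest Dataset analogue of RDD.reduceByKey at the time was groupByKey followed by reduceGroups; a sketch assuming a SparkSession named `spark`:

    import spark.implicits._

    val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

    val reduced = ds
      .groupByKey(_._1)                              // key by the first field
      .reduceGroups((x, y) => (x._1, x._2 + y._2))   // reduce within each key
      .map(_._2)                                     // drop the duplicated key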

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Reynold Xin
Users would be able to run this already with the 3 lines of code you supplied, right? In general there are a lot of methods already on SparkContext, and we lean towards the more conservative side in introducing new API variants. Note that this is something we are doing automatically in Spark SQL
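
Roughly the kind of three-line snippet being referred to (a sketch, not the exact code from the thread; the path is a placeholder): reading many small files while packing several of them into each partition via Hadoop's CombineTextInputFormat:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

    val lines = sc
      .newAPIHadoopFile("/path/to/small/files", classOf[CombineTextInputFormat],
        classOf[LongWritable], classOf[Text])
      .map(_._2.toString)   // keep only the line text, drop the byte offsets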

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Reynold Xin
…> On Thu, May 19, 2016, 2:43 AM Reynold Xin <r...@databricks.com> wrote: > >> Users would be able to run this already with the 3 lines of code you >> supplied, right? In general there are a lot of methods a

Re: ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-05-24 Thread Reynold Xin
Thanks, Koert. This is great. Please keep them coming. On Tue, May 24, 2016 at 9:27 AM, Koert Kuipers wrote: > https://issues.apache.org/jira/browse/SPARK-15507 > > On Tue, May 24, 2016 at 12:21 PM, Ted Yu wrote: > >> Please log a JIRA. >> >> Thanks >>

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-21 Thread Reynold Xin
This vote passes with 14 +1s (5 binding*) and no 0 or -1! Thanks to everyone who voted. I'll start work on publishing the release. +1: Reynold Xin* Sean Owen* Ovidiu-Cristian MARCU Krishna Sankar Michael Armbrust* Yin Huai Joseph Bradley* Xiangrui Meng* Herman van Hövell tot Westerflier Vishnu

Re: Quick question on spark performance

2016-05-20 Thread Reynold Xin
It's probably due to GC. On Fri, May 20, 2016 at 5:54 PM, Yash Sharma wrote: > Hi All, > I am here to get some expert advice on a use case I am working on. > > Cluster & job details below - > > Data - 6 Tb > Cluster - EMR - 15 Nodes C3-8xLarge (shared by other MR apps) > >

Re: spark on kubernetes

2016-05-22 Thread Reynold Xin
Kubernetes itself already has facilities for HTTP proxying, doesn't it? On Sat, May 21, 2016 at 9:30 AM, Gurvinder Singh wrote: > Hi, > > I am currently working on deploying Spark on kuberentes (K8s) and it is > working fine. I am running Spark with standalone mode

Re: Adding HDFS read-time metrics per task (RE: SPARK-1683)

2016-05-11 Thread Reynold Xin
Adding Kay On Wed, May 11, 2016 at 12:01 PM, Brian Cho wrote: > Hi, > > I'm interested in adding read-time (from HDFS) to Task Metrics. The > motivation is to help debug performance issues. After some digging, its > briefly mentioned in SPARK-1683 that this feature didn't

Re: CompileException for spark-sql generated code in 2.0.0-SNAPSHOT

2016-05-17 Thread Reynold Xin
It seems like the problem here is that we are not using unique names for mapelements_isNull? On Tue, May 17, 2016 at 3:29 PM, Koert Kuipers wrote: > hello all, we are slowly expanding our test coverage for spark > 2.0.0-SNAPSHOT to more in-house projects. today i ran into

Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-12 Thread Reynold Xin
…ns > > On Thu, May 12, 2016 at 3:35 PM, Reynold Xin <r...@databricks.com> wrote: > > That's true. I think I want to differentiate end-user vs developer. > Public > > isn't the best word. Maybe EndUser? > > > > On Thu, May 12, 2016 at 3:34 PM, Shivaram Venkatarama

Re: code change for adding takeSample of DataFrame

2016-05-13 Thread Reynold Xin
Sure go for it. Thanks. On Thu, May 12, 2016 at 11:41 PM, 段石石 wrote: > Hi, all: > > > I have added takeSample to DataFrame, which samples a specified number of > rows. It has a similar version in RDD, but it is not supported in DataFrame > now. And now the local test is
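
A sketch of what the proposed method would provide, expressed with the APIs that already existed (assuming a SparkSession named `spark`):

    val df = spark.range(1000).toDF("id")

    // Exact-size sampling via the underlying RDD; DataFrame.sample() only takes
    // an approximate fraction, which is what motivated the proposal.
    val rows = df.rdd.takeSample(withReplacement = false, num = 10, seed = 42L)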

Re: Nested/Chained case statements generate codegen over 64k exception

2016-05-14 Thread Reynold Xin
It might be best to fix this with fallback first, and then figure out how we can do it more intelligently. On Sat, May 14, 2016 at 2:29 AM, Jonathan Gray wrote: > Hi, > > I've raised JIRA SPARK-15258 (with code attached to re-produce problem) > and would like to have a

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-18 Thread Reynold Xin
18 May 2016, at 16:28, Sean Owen <so...@cloudera.com> wrote: > > I think it's a good idea. Although releases have been preceded > by release candidates for developers, it would be good to get a formal > preview/beta release ratified for public consumption ahead of a new >

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-18 Thread Reynold Xin
fairly big issue for people using Spark with Mesos. > > Thanks! > Mike > > From: <r...@databricks.com> on behalf of Reynold Xin <r...@apache.org> > Date: Wednesday, May 18, 2016 at 6:40 AM > To: "dev@spark.apache.org" <dev@spark.apache.org> > Subjec

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-18 Thread Reynold Xin
/SPARK-15370?jql=project%20=%20SPARK%20AND%20resolution%20=%20Unresolved%20AND%20affectedVersion%20=%202.0.0> > > To rephrase: for 2.0 do you have specific issues that are not a priority > and will released maybe with 2.1 for example? > > Keep up the good work! > > On 18 May

[discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-12 Thread Reynold Xin
We currently have three levels of interface annotation: - unannotated: stable public API - DeveloperApi: A lower-level, unstable API intended for developers. - Experimental: An experimental user-facing API. After using these annotations for ~ 2 years, I would like to propose the following

Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-12 Thread Reynold Xin
That's true. I think I want to differentiate end-user vs developer. Public isn't the best word. Maybe EndUser? On Thu, May 12, 2016 at 3:34 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > On Thu, May 12, 2016 at 2:29 PM, Reynold Xin <r...@databricks.com> wrote: &

Re: spark 2 segfault

2016-05-02 Thread Reynold Xin
Definitely looks like a bug. Ted - are you looking at this? On Mon, May 2, 2016 at 7:15 AM, Koert Kuipers wrote: > Created issue: > https://issues.apache.org/jira/browse/SPARK-15062 > > On Mon, May 2, 2016 at 6:48 AM, Ted Yu wrote: > >> I tried the

[ANNOUNCE] Spark branch-2.0

2016-05-01 Thread Reynold Xin
Hi devs, Three weeks ago I mentioned on the dev list creating branch-2.0 (effectively "feature freeze") in 2 - 3 weeks. I've just created Spark's branch-2.0 to form the basis of the 2.0 release. We have closed ~ 1700 issues. That's huge progress, and we should celebrate that. Compared with past

Re: [build system] short downtime monday morning (5-2-16), 7-9am PDT

2016-05-02 Thread Reynold Xin
Thanks, Shane! On Monday, May 2, 2016, shane knapp wrote: > workers -01 and -04 are back up, as is -06 (as i hit the wrong power > button by accident). :) > > -01 and -04 got hung on shutdown, so i'll investigate them and see > what exactly happened. regardless, we should

Re: SQLContext and "stable identifier required"

2016-05-03 Thread Reynold Xin
Probably not. Want to submit a pull request? On Tuesday, May 3, 2016, Koert Kuipers wrote: > yes it works fine if i switch to using the implicits on the SparkSession > (which is a val) > > but do we want to break the old way of doing the import? > > On Tue, May 3, 2016 at
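
For context, the compiler rule in the subject line: Scala only lets you import the members of a stable identifier (a val or object), never a var or def. A minimal illustration (assuming a live SparkContext `sc`):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._      // OK: sqlContext is a val

    // var sqlContext2 = new org.apache.spark.sql.SQLContext(sc)
    // import sqlContext2.implicits._  // compile error: stable identifier required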

Re: ClassFormatError in latest spark 2 SNAPSHOT build

2016-04-15 Thread Reynold Xin
Can you post the generated code? df.queryExecution.debug.codeGen() (Or something similar to that) On Friday, April 15, 2016, Koert Kuipers wrote: > not sure why, but i am getting this today using spark 2 snapshots... > i am on java 7 and scala 2.11 > > 16/04/15 12:35:46
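
The API being gestured at is likely the debug package; a hedged sketch with the names as of the 2.0 code base (assuming a SparkSession named `spark`):

    import org.apache.spark.sql.execution.debug._

    val df = spark.range(10).selectExpr("id + 1 AS x")
    df.debugCodegen()   // prints the generated Java for each WholeStageCodegen subtree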

Re: Implicit from ProcessingTime to scala.concurrent.duration.Duration?

2016-04-18 Thread Reynold Xin
> mind? You got me curious :) > > Regards, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Mon, Apr 18, 2016 at 7:44 PM, Reyn

Re: Implicit from ProcessingTime to scala.concurrent.duration.Duration?

2016-04-18 Thread Reynold Xin
The problem with this is that we might introduce event time based trigger in the future, and then it would be more confusing... On Monday, April 18, 2016, Jacek Laskowski wrote: > Hi, > > While working with structured streaming (aka SparkSQL Streams :)) I > thought about adding
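
For context, how the trigger was written at the time (a sketch; the socket source and console sink are placeholders, and `spark` is an assumed SparkSession). ProcessingTime already accepted both a string and a scala.concurrent.duration.Duration, which is what made an implicit conversion look attractive:

    import org.apache.spark.sql.streaming.ProcessingTime
    import scala.concurrent.duration._

    val streamingDF = spark.readStream
      .format("socket").option("host", "localhost").option("port", "9999")
      .load()

    streamingDF.writeStream
      .format("console")
      .trigger(ProcessingTime(10.seconds))   // equivalently: ProcessingTime("10 seconds")
      .start()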

auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
We have hit a new high in open pull requests: 469 today. While we can certainly get more review bandwidth, many of these are old and still open for other reasons. Some are stale because the original authors have become busy and inactive, and some others are stale because the committers are not

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
and reopening the > PR, it's fine, but I don't know if it really addresses the underlying > issue. > > On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin <r...@databricks.com> wrote: > > We have hit a new high in open pull requests: 469 today. While we can > > certainly get mo

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Reynold Xin
That's not the only one. For example, the hash shuffle manager has been off by default since Spark 1.2, and we'd like to remove it in 2.0: https://github.com/apache/spark/pull/12423 How difficult would it be to just change the package name to, say, v2? On Mon, Apr 18, 2016 at 1:51 PM, Mark Grover

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Reynold Xin
, we wouldn't be able to do that if we want to allow Spark 1.6 shuffle service to read something generated by Spark 2.1. On Mon, Apr 18, 2016 at 1:59 PM, Marcelo Vanzin <van...@cloudera.com> wrote: > On Mon, Apr 18, 2016 at 1:53 PM, Reynold Xin <r...@databricks.com> wrote: > >

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Reynold Xin
anzin <van...@cloudera.com> wrote: > On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin <r...@databricks.com> wrote: > > The bigger problem is that it is much easier to maintain backward > > compatibility rather than dictating forward compatibility. For example, > as > &

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Reynold Xin
Ted - what's the "bq" thing? Are you using some 3rd party (e.g. Atlassian) syntax? They are not being rendered in email. On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu wrote: > bq. it's actually in use right now in spite of not being in any upstream > HBase release > > If it is

Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-17 Thread Reynold Xin
First, really thank you for leading the discussion. I am concerned that it'd hurt Spark more than it helps. As many others have pointed out, this unnecessarily creates a new tier of connectors or 3rd party libraries appearing to be endorsed by the Spark PMC or the ASF. We can alleviate this

Re: Question about Scala style, explicit typing within transformation functions and anonymous val.

2016-04-17 Thread Reynold Xin
Hi Hyukjin, Thanks for asking. For 1 the change is almost always better. For 2 it depends on the context. In general if the type is not obvious, it helps readability to explicitly declare them. For 3 again it depends on context. So while it is a good idea to change 1 to reflect a more

Re: Impact of STW GC events for the driver JVM on overall cluster

2016-04-17 Thread Reynold Xin
Your understanding is correct. If the driver is stuck in GC, then during that period it cannot schedule any tasks. On Sun, Apr 17, 2016 at 10:27 AM, Rahul Tanwani wrote: > Hi Devs, > > In case of stop the world GC events on the driver JVM, since all the > application

Re: YARN Shuffle service and its compatibility

2016-04-19 Thread Reynold Xin
…isn't a good way to run 2 at once. >> >> >> Tom >> >> >> On Monday, April 18, 2016 5:23 PM, Marcelo Vanzin <van...@cloudera.com> >> wrote: >> >> >> On Mon, Apr 18, 2016 at 3:09 PM, Reynold Xin <r...@databricks.com> wrote: >>

Re: Should localProperties be inheritable? Should we change that or document it?

2016-04-15 Thread Reynold Xin
I think this was added a long time ago by me in order to make certain things work for Shark (good old times ...). You are probably right that by now some apps depend on the fact that this is inheritable, and changing that could break them in weird ways. Do you mind documenting this, and also add
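
A minimal demonstration of the inheritance in question: a local property set on the parent thread is visible from a child thread spawned afterwards, because the underlying storage is an InheritableThreadLocal (assuming a live SparkContext `sc`; the property name is a placeholder):

    sc.setLocalProperty("my.tag", "parent-value")

    val t = new Thread(new Runnable {
      def run(): Unit =
        // Inherited from the parent thread at thread-creation time.
        println(sc.getLocalProperty("my.tag"))   // prints "parent-value"
    })
    t.start()
    t.join()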

more uniform exception handling?

2016-04-18 Thread Reynold Xin
Josh's pull request on rpc exception handling got me to think ... In my experience, there have been a few exception-related things that created a lot of trouble for us in production debugging: 1. Some exception is thrown, but is caught by some

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
…, >>>> but did eventually end up successfully being merged. >>>> >>>> I guess if this just ends up being a committer ping and reopening the >>>> PR, it's fine, but I don't know if it really addresses the underlying >>>> issue. >>

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
…> On Mon, Apr 18, 2016 at 12:43 PM, Reynold Xin <r...@databricks.com> wrote: > >> Part of it is how difficult it is to automate this. We can build a >> perfect engine with a lot of rules that understand everything. But the more >> complicated rules we need

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
closed, especially if due to committers not having the >>>>> bandwidth to check on things, will be very discouraging to new folks. >>>>> Doubly so for those inexperienced with opensource. Even if the message >>>>> says "feel free to reopen for so-and-so

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Reynold Xin
…processing all data before the shuffle. I > think that is the reason. Do I understand correctly? > > Thanks. > > Zhan Zhang > On Apr 18, 2016, at 10:08 PM, Reynold Xin <r...@databricks.com> wrote: > > Unless I'm really missing something I don't think so. As I said, it goes > throug

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Reynold Xin
, Reynold Xin <r...@databricks.com> wrote: > But doExecute is not called? > > On Mon, Apr 18, 2016 at 10:32 PM, Zhan Zhang <zzh...@hortonworks.com> > wrote: > >> Hi Reynold, >> >> I just check the code for CollectLimit, there is a shuffle happening to >

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Reynold Xin
it has to be part of the wholeStageCodeGen. > > Correct me if I am wrong. > > Thanks. > > Zhan Zhang > > On Apr 18, 2016, at 11:09 AM, Reynold Xin <r...@databricks.com> wrote: > > I could be wrong but I think we currently do that through whole stage > codegen.

Re: Possible deadlock in registering applications in the recovery mode

2016-04-17 Thread Reynold Xin
I haven't looked closely at this, but I think your proposal makes sense. On Sun, Apr 17, 2016 at 6:40 PM, Niranda Perera wrote: > Hi guys, > > Any update on this? > > Best > > On Tue, Apr 12, 2016 at 12:46 PM, Niranda Perera > wrote: > >>

Re: Cartesian join between DataFrames

2016-07-25 Thread Reynold Xin
DataFrame can do cartesian joins. On July 25, 2016 at 3:43:19 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: It appears that RDDs can do a cartesian join, but not DataFrames. Is there a fundamental reason why not, or is this just waiting for someone to implement? I know you can get
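
In the releases under discussion a join with no condition is planned as a cartesian product; a minimal sketch assuming a SparkSession named `spark` (note that later releases require an explicit crossJoin or the spark.sql.crossJoin.enabled flag):

    val a = spark.range(3).toDF("x")
    val b = spark.range(2).toDF("y")

    val cart = a.join(b)   // no join condition => cartesian product, 3 x 2 = 6 rows
    cart.count()           // 6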

[ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread Reynold Xin
Hi all, Apache Spark 2.0.0 is the first release of Spark 2.x line. It includes 2500+ patches from 300+ contributors. To download Spark 2.0, head over to the download page: http://spark.apache.org/downloads.html To view the release notes: http://spark.apache.org/releases/spark-release-2-0-0.html

[VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-14 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.0. The vote is open until Sunday, July 17, 2016 at 12:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.0 [ ] -1 Do not release this package because ...

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-14 Thread Reynold Xin
Updated documentation at http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs-updated/ On Thu, Jul 14, 2016 at 11:59 AM, Reynold Xin <r...@databricks.com> wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.0. The vote is open

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Reynold Xin
The presentation at Spark Summit SF was probably referring to Structured Streaming. The existing Spark Streaming (dstream) in Spark 2.0 has the same production stability level as Spark 1.6. There is also Kafka 0.10 support in dstream. On July 25, 2016 at 10:26:49 AM, Andy Davidson (
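
For reference, the dstream Kafka connectors for 2.0 ship as two separate artifacts, one per broker protocol version; sbt coordinates as published for 2.0.0:

    // build.sbt -- pick the artifact matching your Kafka broker version:
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8"  % "2.0.0"
    // or, for Kafka 0.10+ brokers / the new consumer API:
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"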

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-23 Thread Reynold Xin
data sources. On Jul 19, 2016, at 7:35 PM, Reynold Xin <r...@databricks.com> wrote: Please vote on releasing the following candidate as Apache Spark version 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Re

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-23 Thread Reynold Xin
The vote has passed with the following +1 votes and no -1 votes. I will work on packaging the new release next week. +1 Reynold Xin* Sean Owen* Shivaram Venkataraman* Jonathan Kelly Joseph E. Gonzalez* Krishna Sankar Dongjoon Hyun Ricardo Almeida Joseph Bradley* Matei Zaharia* Luciano Resende

Re: renaming "minor release" to "feature release"

2016-07-28 Thread Reynold Xin
…backwards/forwards compatible within a minor release, > generally fixes only > minor/feature release: backwards compatible within a major release, > not forward; generally also includes new features > major release: not backwards compatible and may remove or change > existin

renaming "minor release" to "feature release"

2016-07-28 Thread Reynold Xin
*tl;dr* I would like to propose renaming “minor release” to “feature release” in Apache Spark. *details* Apache Spark’s official versioning policy roughly follows semantic versioning. Each Spark release is versioned as [major].[minor].[maintenance]. That is to say, 1.0.0 and 2.0.0 are both

Re: [build system] jenkins downtime friday afternoon, july 29th 2016

2016-07-29 Thread Reynold Xin
Nice! Thanks! On Fri, Jul 29, 2016 at 1:45 PM, shane knapp wrote: > the move is complete and the machines powered back up right away, with > no problems. we're doing a quick update on the firewall, and then > we'll be done! > > On Fri, Jul 29, 2016 at 1:03 PM, shane knapp

Re: Spark Homepage

2016-07-13 Thread Reynold Xin
It's related to https://issues.apache.org/jira/servicedesk/agent/INFRA/issue/INFRA-12055 On Wed, Jul 13, 2016 at 12:07 PM, Holden Karau wrote: > This has also been reported on the user@ by a few people - other apache > projects (arrow & hadoop) don't seem to be affected

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Reynold Xin
…tions very fast from 11 secs to 3 secs, to 1.8 secs, to > 1.5 secs! Good work !!!] > 7.0. GraphX/Scala > 7.1. Create Graph (small and bigger dataset) OK > 7.2. Structure APIs - OK > 7.3. Social Network/Community APIs - OK > 7.4. Algorithms : PageRank of 2 datasets, aggregateMess

Re: transtition SQLContext to SparkSession

2016-07-18 Thread Reynold Xin
Good idea. https://github.com/apache/spark/pull/14252 On Mon, Jul 18, 2016 at 12:16 PM, Michael Armbrust wrote: > + dev, reynold > > Yeah, thats a good point. I wonder if SparkSession.sqlContext should be > public/deprecated? > > On Mon, Jul 18, 2016 at 8:37 AM,

Re: transtition SQLContext to SparkSession

2016-07-19 Thread Reynold Xin
Yes. But in order to access methods available only in HiveContext a user cast is required. On Tuesday, July 19, 2016, Maciej Bryński <mac...@brynski.pl> wrote: > @Reynold Xin, > How this will work with Hive Support ? > SparkSession.sqlContext return HiveContext ? > > 2016
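
The 2.0 pattern under discussion, sketched (HiveContext-only methods then require the cast mentioned above):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .enableHiveSupport()     // backs the session with the Hive catalog
      .getOrCreate()

    val sqlContext = spark.sqlContext   // SQLContext-compatible handle for old code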

Re: transtition SQLContext to SparkSession

2016-07-19 Thread Reynold Xin
…this is what I'm looking at: > > https://github.com/apache/spark/blob/24ea875198ffcef4a4c3ba28aba128d6d7d9a395/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L122 > > Michael > > > > On Jul 19, 2016, at 10:01 AM, Reynold Xin <r...@databricks.com> wrote:

[VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-19 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.0 [ ] -1 Do not release this package because ...

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-15 Thread Reynold Xin
…_metadata files > > SPARK-15703 Spark UI doesn't show all tasks as completed when it should > > SPARK-15944 Make spark.ml package backward compatible with spark.mllib > vectors > > SPARK-16032 Audit semantics of various insertion operations related to > > partitioned tables

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-14 Thread Reynold Xin
> -1 > > Pending resolution of https://issues.apache.org/jira/browse/SPARK-16522 > (diagnosing now) > > On Thu, Jul 14, 2016 at 11:59 AM, Reynold Xin <r...@databricks.com> wrote: > >> Please vote on releasing the following candidate as Apache Spark version >>

Re: ml and mllib persistence

2016-07-12 Thread Reynold Xin
Also Java serialization isn't great for cross-platform compatibility. On Tuesday, July 12, 2016, aka.fe2s wrote: > Okay, I think I found an answer to my question. Some models (for instance > org.apache.spark.mllib.recommendation.MatrixFactorizationModel) hold RDDs, > so just
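
For comparison, the spark.ml persistence API sidesteps Java serialization by writing models out as Parquet plus JSON metadata, which is readable across languages. A sketch (`training` is a placeholder DataFrame with label/features columns; the path is illustrative):

    import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

    val model = new LogisticRegression().fit(training)
    model.save("/tmp/lr-model")                                   // Parquet + JSON on disk
    val restored = LogisticRegressionModel.load("/tmp/lr-model")  // works across platforms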

Re: ml and mllib persistence

2016-07-12 Thread Reynold Xin
> On 12.07.2016 19:57, Reynold Xin wrote: > > Also Java serialization isn't great for cross-platform compatibility. > > On Tuesday, July 12, 2016, aka.fe2s <aka.f...@gmail.com> wrote: >> Okay,

Re: spark git commit: [SPARK-15204][SQL] improve nullability inference for Aggregator

2016-07-05 Thread Reynold Xin
Jacek, This is definitely not necessary, but I wouldn't waste cycles "fixing" things like this when they have virtually zero impact. Perhaps next time we update this code we can "fix" it. Also can you comment on the pull request directly? On Tue, Jul 5, 2016 at 1:07 PM, Jacek Laskowski

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-07-05 Thread Reynold Xin
Please consider this vote canceled and I will work on another RC soon. On Tue, Jun 21, 2016 at 6:26 PM, Reynold Xin <r...@databricks.com> wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.0. The vote is open until Friday, June 24, 2016

[VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-05 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.0. The vote is open until Friday, July 8, 2016 at 23:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.0 [ ] -1 Do not release this package because ...

Re: Expanded docs for the various storage levels

2016-07-07 Thread Reynold Xin
Please create a patch. Thanks! On Thu, Jul 7, 2016 at 12:07 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > I’m looking at the docs here: > > > http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel >
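
For reference, the levels being documented in use; a minimal sketch (assuming a live SparkContext `sc`):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000)
    // Deserialized objects in memory, spilling partitions to disk when needed;
    // MEMORY_ONLY_SER instead keeps serialized bytes to trade CPU for space.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()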
