Re: Latest spark release in the 1.4 branch

2016-07-06 Thread Reynold Xin
I think last time I tried I had some trouble releasing it because the release scripts no longer work with branch-1.4. You can build from the branch yourself, but it might be better to upgrade to the later versions. On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera wrote: > Hi guys, > > May I know

Re: Dataset and Aggregator API pain points

2016-07-05 Thread Reynold Xin
See https://issues.apache.org/jira/browse/SPARK-16390 On Sat, Jul 2, 2016 at 6:35 PM, Reynold Xin wrote: > Thanks, Koert, for the great email. They are all great points. > > We should probably create an umbrella JIRA for easier tracking. > > > On Saturday, July 2, 2016, Ko

[VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-05 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.0. The vote is open until Friday, July 8, 2016 at 23:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.0 [ ] -1 Do not release this package because ...

Re: Why's ds.foreachPartition(println) not possible?

2016-07-05 Thread Reynold Xin
> people will be asking about the reasons Spark does this. Where are > > such issues reported usually? > > > > Pozdrawiam, > > Jacek Laskowski > > > > https://medium.com/@jaceklaskowski/ > > Mastering Apache Spark http://bit.ly/mastering-apache-spark

Re: spark git commit: [SPARK-15204][SQL] improve nullability inference for Aggregator

2016-07-05 Thread Reynold Xin
Jacek, This is definitely not necessary, but I wouldn't waste cycles "fixing" things like this when they have virtually zero impact. Perhaps next time we update this code we can "fix" it. Also can you comment on the pull request directly? On Tue, Jul 5, 2016 at 1:07 PM, Jacek Laskowski wrote:

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-07-05 Thread Reynold Xin
Please consider this vote canceled and I will work on another RC soon. On Tue, Jun 21, 2016 at 6:26 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.0. The vote is open until Friday, June 24, 2016 at 19:00 PDT and passes > if a ma

Re: Why's ds.foreachPartition(println) not possible?

2016-07-05 Thread Reynold Xin
This seems like a Scala compiler bug. On Tuesday, July 5, 2016, Jacek Laskowski wrote: > Well, there is foreach for Java and another foreach for Scala. That's > what I can understand. But while supporting two language-specific APIs > -- Scala and Java -- Dataset API lost support for such simple
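For context, a minimal sketch of the usual workaround, assuming a Dataset[String]; spelling out the element type lets the compiler pick the Scala overload over the Java one, which the bare `foreachPartition(println)` cannot do.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("foreach-example").getOrCreate()
import spark.implicits._

val ds = Seq("a", "b", "c").toDS()

// Spelling out the iterator's element type disambiguates the Scala overload
// from the Java ForeachPartitionFunction one, so this compiles where the
// bare foreachPartition(println) does not.
ds.foreachPartition((it: Iterator[String]) => it.foreach(println))

// Same story for the per-element variant.
ds.foreach((s: String) => println(s))
```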

Re: Dataset and Aggregator API pain points

2016-07-02 Thread Reynold Xin
Thanks, Koert, for the great email. They are all great points. We should probably create an umbrella JIRA for easier tracking. On Saturday, July 2, 2016, Koert Kuipers wrote: > after working with the Dataset and Aggregator apis for a few weeks porting > some fairly complex RDD algos (an overall

Re: [jira] [Resolved] (SPARK-16345) Extract graphx programming guide example snippets from source files instead of hard code them

2016-07-02 Thread Reynold Xin
Because in that case you cannot merge anything meant for 2.1 until 2.0 is released. On Saturday, July 2, 2016, Jacek Laskowski wrote: > Hi, > > Always release from master. What could be the gotchas? > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache

Re: Code Style Formatting

2016-07-01 Thread Reynold Xin
There isn't one pre-made, but the default works out OK. The main things you'd need to update are spacing for function argument indentation and import ordering. On Fri, Jul 1, 2016 at 4:11 AM, Anton Okolnychyi wrote: > Hi, all. > > I've read the Spark code style guide. > I am wondering if
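For readers looking for the two conventions mentioned, a hedged sketch of what they look like in practice, loosely following the Spark code style guide; the names below are illustrative.

```scala
// Imports are grouped (java/javax, scala, third-party, org.apache.spark),
// with a blank line between groups.
import java.util.Locale

import scala.collection.mutable

import org.apache.spark.sql.SparkSession

object StyleExample {
  // Wrapped parameter lists are indented four spaces, not aligned with the
  // opening parenthesis.
  def buildSession(
      appName: String,
      master: String): SparkSession = {
    SparkSession.builder().appName(appName).master(master).getOrCreate()
  }
}
```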

Re: Jenkins networking / port contention

2016-07-01 Thread Reynold Xin
Multiple instances of test runs are usually running in parallel, so they would need to bind to different ports. On Friday, July 1, 2016, Cody Koeninger wrote: > Thanks for the response. I'm talking about test code that starts up > embedded network services for integration testing. > > KafkaTest

Re: Logical Plan

2016-06-30 Thread Reynold Xin
kes > lots of time. > > Not sure what could be done here. > > Thanks > > On Thu, Jun 30, 2016 at 10:10 PM, Reynold Xin wrote: > >> Which version are you using here? If the underlying files change, >> technically we should go through optimization again. >>

Re: Logical Plan

2016-06-30 Thread Reynold Xin
Which version are you using here? If the underlying files change, technically we should go through optimization again. Perhaps the real "fix" is to figure out why logical plan creation is so slow for 700 columns. On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh wrote: > Is there a way I can use

Re: Debugging Spark itself in standalone cluster mode

2016-06-30 Thread Reynold Xin
Yes, scheduling is centralized in the driver. For debugging, I think you'd want to set the executor JVM, not the worker JVM flags. On Thu, Jun 30, 2016 at 11:36 AM, cbruegg wrote: > Hello everyone, > > I'm a student assistant in research at the University of Paderborn, working > on integrating
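A hedged sketch of pointing a remote debugger at the executor JVMs rather than the worker daemon; the app name, port, and suspend setting are illustrative, not from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.extraJavaOptions adds flags to every executor JVM; the
// worker daemon itself is left untouched.
val conf = new SparkConf()
  .setAppName("debug-executors")
  .set("spark.executor.extraJavaOptions",
    "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005")
val sc = new SparkContext(conf)
```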

Re: Please add an unsubscribe link to the footer of user list email

2016-06-27 Thread Reynold Xin
If people want this to happen, please go comment on the INFRA ticket: https://issues.apache.org/jira/browse/INFRA-12185 Otherwise it will probably be dropped. On Mon, Jun 27, 2016 at 7:04 PM, Reynold Xin wrote: > Filed infra ticket: https://issues.apache.org/jira/browse/INFRA-12

Re: Please add an unsubscribe link to the footer of user list email

2016-06-27 Thread Reynold Xin
Filed infra ticket: https://issues.apache.org/jira/browse/INFRA-12185 On Mon, Jun 27, 2016 at 10:02 AM, Reynold Xin wrote: > Let me look into this... > > > On Monday, June 27, 2016, Nicholas Chammas > wrote: > >> Howdy, >> >> It seems like every week

[ANNOUNCE] Announcing Spark 1.6.2

2016-06-27 Thread Reynold Xin
We are happy to announce the availability of Spark 1.6.2! This maintenance release includes fixes across several areas of Spark. You can find the list of changes here: https://s.apache.org/spark-1.6.2 And download the release here: http://spark.apache.org/downloads.html

Re: SPARK-15982 breaks external DataSources

2016-06-27 Thread Reynold Xin
Yup this is bad. Can you create a JIRA ticket too? On Mon, Jun 27, 2016 at 12:22 PM, Koert Kuipers wrote: > hey, > > since SPARK-15982 was fixed (https://github.com/apache/spark/pull/13727) > i believe all external DataSources that rely on using .load(path) without > being a FileFormat themselv

Re: Please add an unsubscribe link to the footer of user list email

2016-06-27 Thread Reynold Xin
Let me look into this... On Monday, June 27, 2016, Nicholas Chammas wrote: > Howdy, > > It seems like every week we have at least a couple of people emailing the > user list in vain with "Unsubscribe" in the subject, the body, or both. > > I remember a while back that every email on the user lis

Re: [VOTE][RESULT] Release Apache Spark 1.6.2 (RC2)

2016-06-23 Thread Reynold Xin
Vote passed. Please see below. I will work on packaging the release. +1 (9 votes, 4 binding) Reynold Xin* Sean Owen* Tim Hunter Michael Armbrust* Sean McNamara* Kousuke Saruta Sameer Agarwal Krishna Sankar Vaquar Khan 0 none -1 Maciej Bryński * binding votes On Sun, Jun 19, 2016 at 9:24 PM

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-23 Thread Reynold Xin
5.0. Packages >> 5.1. com.databricks.spark.csv - read/write OK (--packages >> com.databricks:spark-csv_2.10:1.4.0) >> 6.0. DataFrames >> 6.1. cast,dtypes OK >> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK >> 6.3. All joins,sql,set operations,udf OK >>

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Reynold Xin
are people that develop for Spark on Windows. The > referenced issue is indeed Minor and has nothing to do with unit tests. > > > > *From:* Mark Hamstra [mailto:m...@clearstorydata.com] > *Sent:* Wednesday, June 22, 2016 4:09 PM > *To:* Marcelo Vanzin > *Cc:* Ul

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-22 Thread Reynold Xin
+1 myself On Wed, Jun 22, 2016 at 12:19 PM, Sean McNamara wrote: > +1 > > On Jun 22, 2016, at 1:14 PM, Michael Armbrust > wrote: > > +1 > > On Wed, Jun 22, 2016 at 11:33 AM, Jonathan Kelly > wrote: > >> +1 >> >> On Wed, Jun 22, 2016 at 10:41 AM Tim Hunter >> wrote: >> >>> +1 This release pas

Re: Question about Bloom Filter in Spark 2.0

2016-06-21 Thread Reynold Xin
SPARK-12818 is about building a bloom filter on existing data. It has nothing to do with the ORC bloom filter, which can be used to do predicate pushdown. On Tue, Jun 21, 2016 at 7:45 PM, BaiRan wrote: > Hi all, > > I have a question about bloom filter implementation in Spark-12818 issue. > If
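For reference, a small sketch of the SPARK-12818 side of that distinction (building a Bloom filter over existing data through DataFrameStatFunctions); the column and sizing parameters are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bloom-filter-example").getOrCreate()
val df = spark.range(0, 1000000).toDF("id")

// Arguments: column name, expected number of distinct items, false-positive rate.
val bf = df.stat.bloomFilter("id", 1000000L, 0.01)

bf.mightContain(42L)   // true for anything that was in the column
bf.mightContain(-1L)   // false, except with probability around the configured fpp
```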

[VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-21 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.0. The vote is open until Friday, June 24, 2016 at 19:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.0 [ ] -1 Do not release this package because ...

[VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-19 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.6.2. The vote is open until Wednesday, June 22, 2016 at 22:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.6.2 [ ] -1 Do not release this package because ...

Re: [VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-19 Thread Reynold Xin
.Assert.fail(Assert.java:86) > > at org.junit.Assert.assertTrue(Assert.java:41) > > at org.junit.Assert.assertNotNull(Assert.java:621) > > at org.junit.Assert.assertNotNull(Assert.java:631) > > at > org.apache.spark.launcher.LauncherServerSuite.testCommunication(LauncherSe

Re: Thanks For a Job Well Done !!!

2016-06-18 Thread Reynold Xin
Thanks for the kind words, Krishna! Please keep the feedback coming. On Saturday, June 18, 2016, Krishna Sankar wrote: > Hi all, >Just wanted to thank all for the dataset API - most of the times we see > only bugs in these lists ;o). > >- Putting some context, this weekend I was updating

Re: [VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-18 Thread Reynold Xin
Thu, Jun 16, 2016 at 9:49 PM, Reynold Xin wrote: > > Please vote on releasing the following candidate as Apache Spark version > > 1.6.2! > > > > The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a > > majority of at least 3 +1 PMC votes are cast.

Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Reynold Xin
Please go for it! On Friday, June 17, 2016, Pedro Rodriguez wrote: > I would be open to working on Dataset documentation if no one else is > already working on it. Thoughts? > > On Fri, Jun 17, 2016 at 11:44 PM, Cheng Lian > wrote: > >> As mentioned in the PR description, this is just an ini

testing the kafka 0.10 connector

2016-06-17 Thread Reynold Xin
Cody has graciously worked on a new connector for dstream for Kafka 0.10. Can people that use Kafka test this connector out? The patch is at https://github.com/apache/spark/pull/11863 Although we have stopped merging new features into branch-2.0, this connector is very decoupled from the rest of Spark

[VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-16 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.6.2! The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.6.2 [ ] -1 Do not release this package because ...

Re: Spark SQL Count Distinct

2016-06-16 Thread Reynold Xin
You should be fine in 1.6 onward. Count distinct doesn't require data to fit in memory there. On Thu, Jun 16, 2016 at 1:57 AM, Avshalom wrote: > Hi all, > > We would like to perform a count distinct query based on a certain filter. > e.g. our data is of the form: > > userId, Name, Restaurant na
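A minimal sketch of the use case being asked about (a distinct count behind a filter); the path, table, and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder().appName("count-distinct-example").getOrCreate()
import spark.implicits._

val visits = spark.read.parquet("/path/to/visits")  // userId, name, restaurant

// DataFrame API: filter first, then count distinct users.
visits.filter($"restaurant" === "Diner").agg(countDistinct($"userId")).show()

// Equivalent SQL.
visits.createOrReplaceTempView("visits")
spark.sql("SELECT COUNT(DISTINCT userId) FROM visits WHERE restaurant = 'Diner'").show()
```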

Re: Hive-on-Spark with access to Hive from Spark jobs

2016-06-15 Thread Reynold Xin
Are you running Spark on YARN, Mesos, Standalone? For all of them you can make the Hive dependency just part of your application, and then you can manage this pretty easily. On Wed, Jun 15, 2016 at 2:35 AM, Rostyslav Sotnychenko < r.sotnyche...@gmail.com> wrote: > Hello! > > I have a question re

cutting 1.6.2 rc and 2.0.0 rc this week?

2016-06-15 Thread Reynold Xin
It's been a while and we have accumulated quite a few bug fixes in branch-1.6. I'm thinking about cutting 1.6.2 rc this week. Any patches somebody wants to get in at the last minute? On a related note, I'm thinking about cutting 2.0.0 rc this week too. I looked at the 60 unresolved tickets and almost all

Re: Spark Assembly jar ?

2016-06-14 Thread Reynold Xin
You just need to run normal packaging and all the scripts are now set up to run without the assembly jars. On Tuesday, June 14, 2016, Franklyn D'souza wrote: > Just wondering where the spark-assembly jar has gone in 2.0. i've been > reading that its been removed but i'm not sure what the new work

Re: Return binary mode in ThriftServer

2016-06-13 Thread Reynold Xin
Thanks for the email. Things like this (and bugs) are exactly the reason the preview releases exist. It seems like enough people have run into problems with this one that maybe we should just bring it back for backward compatibility. On Monday, June 13, 2016, Egor Pahomov wrote: > In May due to t

Re: tpcds q1 - java.lang.NegativeArraySizeException

2016-06-13 Thread Reynold Xin
Did you try this on master? On Mon, Jun 13, 2016 at 11:26 AM, Ovidiu-Cristian MARCU < ovidiu-cristian.ma...@inria.fr> wrote: > Hi, > > Running the first query of tpcds on a standalone setup (4 nodes, tpcds2 > generated for scale 10 and transformed in parquet under hdfs) it results > in one exce

Re: Add Caller Context in Spark

2016-06-09 Thread Reynold Xin
Is this just to set some string? That makes sense. One thing you would need to make sure is that Spark should still work outside of Hadoop, and also in older versions of Hadoop. On Thu, Jun 9, 2016 at 4:37 PM, Weiqing Yang wrote: > Hi, > > Hadoop has implemented a feature of log tracing – caller

Re: Kryo registration for Tuples?

2016-06-08 Thread Reynold Xin
Yes you can :) On Wed, Jun 8, 2016 at 6:00 PM, Alexander Pivovarov wrote: > Can I just enable spark.kryo.registrationRequired and look at error > messages to get unregistered classes? > > On Wed, Jun 8, 2016 at 5:52 PM, Reynold Xin wrote: > >> Due to type erasure th

Re: Kryo registration for Tuples?

2016-06-08 Thread Reynold Xin
Due to type erasure they make no difference, although watch out for Scala tuple serialization. On Wednesday, June 8, 2016, Ted Yu wrote: > I think the second group (3 classOf's) should be used. > > Cheers > > On Wed, Jun 8, 2016 at 4:53 PM, Alexander Pivovarov > wrote: > >> if my RDD is RDD[(St
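Putting the two replies above together, a hedged sketch of the approach: require registration so Kryo reports unregistered classes, then register whatever the errors name. The registered classes are illustrative; as noted, erasure means the tuple's element types don't matter.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-registration-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Fail fast with an error listing any class that still needs registering.
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(
    classOf[(String, String)],         // registers Tuple2; element types are erased anyway
    classOf[Array[String]],
    classOf[Array[(String, String)]]))
```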

Re: Dataset API agg question

2016-06-07 Thread Reynold Xin
Take a look at the implementation of typed sum/avg: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/scalalang/typed.scala You can implement a typed max/min. On Tue, Jun 7, 2016 at 4:31 PM, Alexander Pivovarov wrote: > Ted, It does not work l
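Following that pointer, a hedged sketch of what a typed max could look like, modeled loosely on the typed sum/avg implementations linked above; the class name and usage are illustrative.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// A typed max over a Double extracted from each input record.
class TypedMax[IN](f: IN => Double) extends Aggregator[IN, Double, Double] {
  override def zero: Double = Double.NegativeInfinity
  override def reduce(b: Double, a: IN): Double = math.max(b, f(a))
  override def merge(b1: Double, b2: Double): Double = math.max(b1, b2)
  override def finish(reduction: Double): Double = reduction
  override def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage on a Dataset[(String, Double)] grouped by the first element:
// ds.groupByKey(_._1).agg(new TypedMax[(String, Double)](_._2).toColumn)
```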

Re: Add hot-deploy capability in Spark Shell

2016-06-06 Thread Reynold Xin
Thanks for the email. How do you deal with in-memory state that references the classes? This can happen in streaming, caching in RDDs, and temporary view creation in SQL. On Mon, Jun 6, 2016 at 3:40 PM, S. Kai Chen wrote: > Hi, > > We use spark-shell heavily for ad-hoc data analysis as well

Re: apologies for flaky BlacklistIntegrationSuite

2016-06-06 Thread Reynold Xin
Thanks for fixing it! On Mon, Jun 6, 2016 at 1:49 PM, Imran Rashid wrote: > Hi all, > > just a heads up, I introduced a flaky test, BlacklistIntegrationSuite, a > week ago or so. I *thought* I had solved the problems, but turns out there > was more flakiness remaining. for now I've just turne

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Reynold Xin
The bahir one was a good argument actually. I just clicked the button to push it into Maven central. On Mon, Jun 6, 2016 at 12:00 PM, Mark Hamstra wrote: > Fine. I don't feel strongly enough about it to continue to argue against > putting the artifacts on Maven Central. > > On Mon, Jun 6, 2016

Re: Welcoming Yanbo Liang as a committer

2016-06-04 Thread Reynold Xin
Congratulations, Yanbo! On Friday, June 3, 2016, Matei Zaharia wrote: > Hi all, > > The PMC recently voted to add Yanbo Liang as a committer. Yanbo has been a > super active contributor in many areas of MLlib. Please join me in > welcoming Yanbo! > > Matei > -

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-02 Thread Reynold Xin
Also what happens if we want to do a second preview release? The naming > doesn't seem to allow that unless we call it preview 2. > > Tom > > > On Wednesday, June 1, 2016 6:27 PM, Sean Owen wrote: > > > On Wed, Jun 1, 2016 at 5:58 PM, Reynold Xin wrote: > >

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-01 Thread Reynold Xin
Hi Sean, (writing this email with my Apache hat on only and not Databricks hat) The preview release is available here: http://spark.apache.org/downloads.html (there is an entire section dedicated to it and also there is a news link to it on the right). Again, I think this is a good opportunity t

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-01 Thread Reynold Xin
To play devil's advocate, previews are technically not RCs. They are actually voted releases. On Wed, Jun 1, 2016 at 1:46 PM, Michael Armbrust wrote: > Yeah, we don't usually publish RCs to central, right? > > On Wed, Jun 1, 2016 at 1:06 PM, Reynold Xin wrote: > >&

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-01 Thread Reynold Xin
They are here, ain't they? https://repository.apache.org/content/repositories/orgapachespark-1182/ Did you mean publishing them to maven central? My understanding is that publishing to maven central isn't a required step of doing these. This might be a good opportunity to discuss that. My thought

Re: Spark Streaming - Twitter on Python current status

2016-05-30 Thread Reynold Xin
I think your understanding is correct. There will be external libraries that allow you to use the twitter streaming dstream API even in 2.0 though. On Sat, May 28, 2016 at 8:37 AM, Ricardo Almeida < ricardo.alme...@actnowib.com> wrote: > As far as I could understand... > 1. Using Python (PySpark

Re: NegativeArraySizeException / segfault

2016-05-27 Thread Reynold Xin
They should get printed if you turn on debug level logging. On Fri, May 27, 2016 at 1:00 PM, Koert Kuipers wrote: > hello all, > after getting our unit tests to pass on spark 2.0.0-SNAPSHOT we are now > trying to run some algorithms at scale on our cluster. > unfortunately this means that when i
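For anyone chasing a similar codegen failure, a hedged sketch of turning that logging on; the logger name assumes the generated source is emitted by catalyst's codegen package.

```scala
import org.apache.log4j.{Level, Logger}

// Print the generated Java source when it is compiled.
Logger.getLogger("org.apache.spark.sql.catalyst.expressions.codegen")
  .setLevel(Level.DEBUG)

// Or, more coarsely, from an existing SparkContext:
// sc.setLogLevel("DEBUG")
```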

Re: Dataset reduceByKey

2016-05-26 Thread Reynold Xin
Here's a ticket: https://issues.apache.org/jira/browse/SPARK-15598 On Fri, May 20, 2016 at 12:35 AM, Reynold Xin wrote: > Andres - this is great feedback. Let me think about it a little bit more > and reply later. > > > On Thu, May 19, 2016 at 11:12 AM, Andres Perez

Re: changed behavior for csv datasource and quoting in spark 2.0.0-SNAPSHOT

2016-05-26 Thread Reynold Xin
Yup - but the reason we did the null handling that way was for Python, which also affects Scala. On Thu, May 26, 2016 at 4:17 PM, Koert Kuipers wrote: > ok, thanks for creating ticket. > > just to be clear: my example was in scala > > On Thu, May 26, 2016 at 7:07 PM, Rey

Re: changed behavior for csv datasource and quoting in spark 2.0.0-SNAPSHOT

2016-05-26 Thread Reynold Xin
This is unfortunately due to the way we handle default values in Python. I agree it doesn't follow the principle of least astonishment. Maybe the best thing to do here is to put the actual default values in the Python API for csv (and json, parquet, etc), rather than using None in Python. This

Re: JDBC Dialect for saving DataFrame into Vertica Table

2016-05-26 Thread Reynold Xin
It's probably a good idea to have the vertica dialect too, since it doesn't seem like it'd be too difficult to maintain. It is not going to be as performant as the native Vertica data source, but is going to be much lighter weight. On Thu, May 26, 2016 at 3:09 PM, Mohammed Guller wrote: > Verti

Re: Labeling Jiras

2016-05-25 Thread Reynold Xin
I think the risk is that if everybody starts following this, it will become unmanageable, given the number of organizations involved. The two main labels that we actually use are starter + releasenotes. On Wed, May 25, 2016 at 2:58 PM, Luciano Resende wrote: > > > On Wed, May 25, 2016 at

Re: feedback on dataset api explode

2016-05-25 Thread Reynold Xin
ily replaced by .flatMap (to do explosion) and > .select (to rename output columns) > > Cheng > > > On 5/25/16 12:30 PM, Reynold Xin wrote: > > Based on this discussion I'm thinking we should deprecate the two explode > functions. > > On Wednesday, May 25, 2016, Ko

Re: [ANNOUNCE] Apache Spark 2.0.0-preview release

2016-05-25 Thread Reynold Xin
, 2016 at 8:30 AM, Reynold Xin wrote: > Yup I have published it to maven. Will post the link in a bit. > > One thing is that for developers, it might be better to use the nightly > snapshot because that one probably has fewer bugs than the preview one. > > > On Wednesday, May 25,

Re: [ANNOUNCE] Apache Spark 2.0.0-preview release

2016-05-25 Thread Reynold Xin
helpful for preparing for the migration. Do you > plan to push 2.0.0-preview to Maven too? (I for one would appreciate the > convenience.) > > On Wed, May 25, 2016 at 8:44 AM, Reynold Xin > wrote: > >> In the past the Spark community have created preview packages (not >

[ANNOUNCE] Apache Spark 2.0.0-preview release

2016-05-24 Thread Reynold Xin
In the past the Spark community have created preview packages (not official releases) and used those as opportunities to ask community members to test the upcoming versions of Apache Spark. Several people in the Apache community have suggested we conduct votes for these preview packages and turn th

Re: ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-05-24 Thread Reynold Xin
Thanks, Koert. This is great. Please keep them coming. On Tue, May 24, 2016 at 9:27 AM, Koert Kuipers wrote: > https://issues.apache.org/jira/browse/SPARK-15507 > > On Tue, May 24, 2016 at 12:21 PM, Ted Yu wrote: > >> Please log a JIRA. >> >> Thanks >> >> On Tue, May 24, 2016 at 8:33 AM, Koert

Re: spark on kubernetes

2016-05-21 Thread Reynold Xin
Kubernetes itself already has facilities for http proxy, doesn't it? On Sat, May 21, 2016 at 9:30 AM, Gurvinder Singh wrote: > Hi, > > I am currently working on deploying Spark on kuberentes (K8s) and it is > working fine. I am running Spark with standalone mode and checkpointing > the state to

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-21 Thread Reynold Xin
This vote passes with 14 +1s (5 binding*) and no 0 or -1! Thanks to everyone who voted. I'll start work on publishing the release. +1: Reynold Xin* Sean Owen* Ovidiu-Cristian MARCU Krishna Sankar Michael Armbrust* Yin Huai Joseph Bradley* Xiangrui Meng* Herman van Hövell tot Westerflier V

Re: Quick question on spark performance

2016-05-20 Thread Reynold Xin
It's probably due to GC. On Fri, May 20, 2016 at 5:54 PM, Yash Sharma wrote: > Hi All, > I am here to get some expert advice on a use case I am working on. > > Cluster & job details below - > > Data - 6 Tb > Cluster - EMR - 15 Nodes C3-8xLarge (shared by other MR apps) > > Parameters- > --execut

Re: Dataset reduceByKey

2016-05-20 Thread Reynold Xin
Andres - this is great feedback. Let me think about it a little bit more and reply later. On Thu, May 19, 2016 at 11:12 AM, Andres Perez wrote: > Hi all, > > We were in the process of porting an RDD program to one which uses > Datasets. Most things were easy to transition, but one hole in > fun

Re: right outer joins on Datasets

2016-05-20 Thread Reynold Xin
I filed https://issues.apache.org/jira/browse/SPARK-15441 On Thu, May 19, 2016 at 8:48 AM, Andres Perez wrote: > Hi all, I'm getting some odd behavior when using the joinWith > functionality for Datasets. Here is a small test case: > > val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).
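A hedged reconstruction of the kind of test case truncated above; the right-hand side and join condition are made up to show the shape of the call.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("joinWith-example").getOrCreate()
import spark.implicits._

val left  = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS()
val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS()

// joinWith keeps both sides as typed values; on the outer side, missing rows
// come back as nulls, which is the behavior SPARK-15441 covers.
val joined = left.joinWith(right, left("_1") === right("_1"), "right_outer")
joined.show()
```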

Re: Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Reynold Xin
The old one is deprecated but should still work though. On Thu, May 19, 2016 at 3:51 PM, Arun Allamsetty wrote: > Hi Doug, > > If you look at the API docs here: > http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/api/scala/index.html#org.apache.spark.sql.hive.HiveContext,

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Reynold Xin
; at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541) > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(Del

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Reynold Xin
19, 2016, 2:43 AM Reynold Xin > wrote: > >> Users would be able to run this already with the 3 lines of code you >> supplied right? In general there are a lot of methods already on >> SparkContext and we lean towards the more conservative side in introducing >> new API

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Reynold Xin
Users would be able to run this already with the 3 lines of code you supplied right? In general there are a lot of methods already on SparkContext and we lean towards the more conservative side in introducing new API variants. Note that this is something we are doing automatically in Spark SQL for
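The point here is that no new SparkContext method is needed; a hedged sketch of the handful of lines in question (the helper name, path handling, and split size are illustrative).

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Read many small text files as combined splits without a dedicated API.
def combinedTextFile(sc: SparkContext, path: String): RDD[String] = {
  // Cap the combined split size (here 64 MB) so individual splits stay reasonable.
  sc.hadoopConfiguration.setLong(
    "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)
  sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
      classOf[LongWritable], classOf[Text])
    .map(_._2.toString)
}
```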

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-18 Thread Reynold Xin
rowse/SPARK-15370?jql=project%20=%20SPARK%20AND%20resolution%20=%20Unresolved%20AND%20affectedVersion%20=%202.0.0> > > To rephrase: for 2.0 do you have specific issues that are not a priority > and will released maybe with 2.1 for example? > > Keep up the good work! > > On 1

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-18 Thread Reynold Xin
or people using Spark with Mesos. > > Thanks! > Mike > > From: on behalf of Reynold Xin > Date: Wednesday, May 18, 2016 at 6:40 AM > To: "dev@spark.apache.org" > Subject: [vote] Apache Spark 2.0.0-preview release (rc1) > > Hi, > > In the past the Apac

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-18 Thread Reynold Xin
On 18 May 2016, at 16:28, Sean Owen wrote: > > I think it's a good idea. Although releases have been preceded before > by release candidates for developers, it would be good to get a formal > preview/beta release ratified for public consumption ahead of a new > major release. Bett

[vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-17 Thread Reynold Xin
Hi, In the past the Apache Spark community have created preview packages (not official releases) and used those as opportunities to ask community members to test the upcoming versions of Apache Spark. Several people in the Apache community have suggested we conduct votes for these preview packages

Re: CompileException for spark-sql generated code in 2.0.0-SNAPSHOT

2016-05-17 Thread Reynold Xin
It seems like the problem here is that we are not using unique names for mapelements_isNull? On Tue, May 17, 2016 at 3:29 PM, Koert Kuipers wrote: > hello all, we are slowly expanding our test coverage for spark > 2.0.0-SNAPSHOT to more in-house projects. today i ran into this issue... > > thi

Re: Nested/Chained case statements generate codegen over 64k exception

2016-05-14 Thread Reynold Xin
It might be best to fix this with fallback first, and then figure out how we can do it more intelligently. On Sat, May 14, 2016 at 2:29 AM, Jonathan Gray wrote: > Hi, > > I've raised JIRA SPARK-15258 (with code attached to re-produce problem) > and would like to have a go at fixing it but don'

Re: code change for adding takeSample of DataFrame

2016-05-12 Thread Reynold Xin
Sure, go for it. Thanks. On Thu, May 12, 2016 at 11:41 PM, 段石石 wrote: > Hi, all: > > > I have added takeSample to DataFrame, which samples a specified number of > rows. It has a similar version in RDD, but it is not supported in DataFrame > now. Local testing is done; is it OK to make a pr
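For context, a hedged sketch of what that proposal adds sugar for: today one can sample by fraction and take, or drop down to the RDD API for an exact count. The numbers are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("take-sample-example").getOrCreate()
val df = spark.range(0, 100000).toDF("id")

// Approximate: sample a fraction, then take a fixed number of rows.
val approx = df.sample(withReplacement = false, fraction = 0.01).take(100)

// Exact count of sampled rows, at the cost of leaving the DataFrame API.
val exact = df.rdd.takeSample(withReplacement = false, num = 100, seed = 42L)
```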

Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-12 Thread Reynold Xin
2, 2016 at 3:35 PM, Reynold Xin wrote: > > That's true. I think I want to differentiate end-user vs developer. > Public > > isn't the best word. Maybe EndUser? > > > > On Thu, May 12, 2016 at 3:34 PM, Shivaram Venkataraman > > wrote: > >> >

Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-12 Thread Reynold Xin
That's true. I think I want to differentiate end-user vs developer. Public isn't the best word. Maybe EndUser? On Thu, May 12, 2016 at 3:34 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > On Thu, May 12, 2016 at 2:29 PM, Reynold Xin wrote: > > We curre

[discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-05-12 Thread Reynold Xin
We currently have three levels of interface annotation: - unannotated: stable public API - DeveloperApi: A lower-level, unstable API intended for developers. - Experimental: An experimental user-facing API. After using this annotation for ~ 2 years, I would like to propose the following changes:

Re: Adding HDFS read-time metrics per task (RE: SPARK-1683)

2016-05-11 Thread Reynold Xin
Adding Kay On Wed, May 11, 2016 at 12:01 PM, Brian Cho wrote: > Hi, > > I'm interested in adding read-time (from HDFS) to Task Metrics. The > motivation is to help debug performance issues. After some digging, its > briefly mentioned in SPARK-1683 that this feature didn't make it due to > metri

Re: SQLContext and "stable identifier required"

2016-05-03 Thread Reynold Xin
Probably not. Want to submit a pull request? On Tuesday, May 3, 2016, Koert Kuipers wrote: > yes it works fine if i switch to using the implicits on the SparkSession > (which is a val) > > but do we want to break the old way of doing the import? > > On Tue, May 3, 2016 at 12:56 PM, Ted Yu > wro
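To spell out the "stable identifier required" point for anyone hitting the same error, a small illustration; the def is contrived.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("implicits-example").getOrCreate()
import spark.implicits._        // fine: `spark` is a val, i.e. a stable identifier

def session = spark
// import session.implicits._   // does not compile: "stable identifier required"
```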

Re: [build system] short downtime monday morning (5-2-16), 7-9am PDT

2016-05-02 Thread Reynold Xin
Thanks, Shane! On Monday, May 2, 2016, shane knapp wrote: > workers -01 and -04 are back up, as is -06 (as i hit the wrong power > button by accident). :) > > -01 and -04 got hung on shutdown, so i'll investigate them and see > what exactly happened. regardless, we should be building happily!

Re: spark 2 segfault

2016-05-02 Thread Reynold Xin
Definitely looks like a bug. Ted - are you looking at this? On Mon, May 2, 2016 at 7:15 AM, Koert Kuipers wrote: > Created issue: > https://issues.apache.org/jira/browse/SPARK-15062 > > On Mon, May 2, 2016 at 6:48 AM, Ted Yu wrote: > >> I tried the same statement using Spark 1.6.1 >> There wa

[ANNOUNCE] Spark branch-2.0

2016-05-01 Thread Reynold Xin
Hi devs, Three weeks ago I mentioned on the dev list creating branch-2.0 (effectively "feature freeze") in 2 - 3 weeks. I've just created Spark's branch-2.0 to form the basis of the 2.0 release. We have closed ~ 1700 issues. That's huge progress, and we should celebrate that. Compared with past r

Re: RDD.broadcast

2016-04-28 Thread Reynold Xin
This is a nice feature in broadcast join. It is just a little bit complicated to do and as a result hasn't been prioritized as highly yet. On Thu, Apr 28, 2016 at 5:51 AM, wrote: > I was aiming to show the operations with pseudo-code, but I apparently > failed, so Java it is J > > Assume the fo

Re: HDFS as Shuffle Service

2016-04-28 Thread Reynold Xin
Hm while this is an attractive idea in theory, in practice I think you are substantially overestimating HDFS' ability to handle a lot of small, ephemeral files. It has never really been optimized for that use case. On Thu, Apr 28, 2016 at 11:15 AM, Michael Gummelt wrote: > > if after a work-load

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Reynold Xin
Usually no - but sortByKey does because it needs the range boundary to be built in order to have the RDD. It is a long standing problem that's unfortunately very difficult to solve without breaking the RDD API. In DataFrame/Dataset we don't have this issue though. On Sun, Apr 24, 2016 at 10:54 P
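A tiny illustration of the exception described above; the data is made up, and the comments mark where the extra job comes from.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("sortByKey-example").setMaster("local[*]"))

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

// Building the RangePartitioner samples the data, so a job runs here,
// before any action has been called.
val sorted = pairs.sortByKey()

sorted.collect()   // the "real" job
```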

Re: Proposal of closing some PRs which at least one of committers suggested so

2016-04-23 Thread Reynold Xin
I pushed a commit to close all but the last one. On Sat, Apr 23, 2016 at 2:08 AM, Sean Owen wrote: > Except for the last one I think they're closeable. We can't close any > PR directly. It's possible to push an empty commit with comments like > "Closes #" to make the ASF processes close the

Re: YARN Shuffle service and its compatibility

2016-04-19 Thread Reynold Xin
>> >> Tom >> >> >> On Monday, April 18, 2016 5:23 PM, Marcelo Vanzin >> wrote: >> >> >> On Mon, Apr 18, 2016 at 3:09 PM, Reynold Xin wrote: >> > IIUC, the reason for that PR is that they found the string comparison to >> > increase t

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Reynold Xin
Ted - what's the "bq" thing? Are you using some 3rd party (e.g. Atlassian) syntax? They are not being rendered in email. On Tue, Apr 19, 2016 at 10:41 AM, Ted Yu wrote: > bq. it's actually in use right now in spite of not being in any upstream > HBase release > > If it is not in upstream, then

Re: RFC: Remove "HBaseTest" from examples?

2016-04-19 Thread Reynold Xin
Yea in general I feel examples that bring in a large number of dependencies should be outside Spark. On Tue, Apr 19, 2016 at 10:15 AM, Marcelo Vanzin wrote: > Hey all, > > Two reasons why I think we should remove that from the examples: > > - HBase now has Spark integration in its own repo, so

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Reynold Xin
35 PM, Reynold Xin wrote: > But doExecute is not called? > > On Mon, Apr 18, 2016 at 10:32 PM, Zhan Zhang > wrote: > >> Hi Reynold, >> >> I just check the code for CollectLimit, there is a shuffle happening to >> collect them in one partition. >> >

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Reynold Xin
think that is the reason. Do I understand correctly? > > Thanks. > > Zhan Zhang > On Apr 18, 2016, at 10:08 PM, Reynold Xin wrote: > > Unless I'm really missing something I don't think so. As I said, it goes > through an iterator and after processing each stre

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Reynold Xin
the wholeStageCodeGen. > > Correct me if I am wrong. > > Thanks. > > Zhan Zhang > > On Apr 18, 2016, at 11:09 AM, Reynold Xin wrote: > > I could be wrong but I think we currently do that through whole stage > codegen. After processing every row on the stream side,

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
>> bandwidth to check on things, will be very discouraging to new folks. >>>>> Doubly so for those inexperienced with opensource. Even if the message >>>>> says "feel free to reopen for so-and-so reason", new folks who lack >>>>> confidence are going to

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Reynold Xin
zin wrote: > On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin wrote: > > The bigger problem is that it is much easier to maintain backward > > compatibility rather than dictating forward compatibility. For example, > as > > Marcin said, if we come up with a slightly different shu

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Reynold Xin
mance, we wouldn't be able to do that if we want to allow Spark 1.6 shuffle service to read something generated by Spark 2.1. On Mon, Apr 18, 2016 at 1:59 PM, Marcelo Vanzin wrote: > On Mon, Apr 18, 2016 at 1:53 PM, Reynold Xin wrote: > > That's not the only one. For example,

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Reynold Xin
That's not the only one. For example, the hash shuffle manager has been off by default since Spark 1.2, and we'd like to remove it in 2.0: https://github.com/apache/spark/pull/12423 How difficult is it to just change the package name to, say, v2? On Mon, Apr 18, 2016 at 1:51 PM, Mark Grover wrot
