Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Imran Rashid
I'd like to add SPARK-24296, replicating large blocks over 2GB. It's been up for review for a while, and would end the 2GB block limit (well ... subject to a couple of caveats on SPARK-6235). On Mon, Jul 30, 2018 at 9:01 PM, Wenchen Fan wrote: > I went through the open JIRA tickets and here is

Re: Review notification bot

2018-07-31 Thread Hyukjin Kwon
(BTW, it doesn't seem to be turned off yet - https://github.com/apache/spark/pull/21939#issuecomment-409412805) On Wed, Aug 1, 2018 at 2:24 AM, Sean Owen wrote: > I haven't been pinged by this bot :( :) > > But I do like these comments on PRs, like > https://github.com/apache/spark/pull/21925#issuecomment-409035244

[build system] DOWNTIME jenkins unreachable overnight

2018-07-31 Thread shane knapp
our building is finally replacing the broken UPS that keeps biting us... ...which means another bit of downtime. :( it begins in 6 hours (11pm PDT) and will be finished tomorrow (august 1st) by ~8am PDT. shane -- Shane Knapp UC Berkeley EECS Research / RISELab Staff Technical Lead

Re: [DISCUSS] Multiple catalog support

2018-07-31 Thread Ryan Blue
Wenchen, I think the misunderstanding is around how the v2 API should work with multiple catalogs. Data sources are read/write implementations that resolve to a single JVM class. When we consider how these implementations should work with multiple table catalogs, I think it is clear that the

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Reynold Xin
I actually totally agree that we should make sure it has no impact on existing code if the feature is not used. On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson wrote: > I don't have a comprehensive knowledge of the project hydrogen PRs, > however I've perused them, and they make

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Erik Erlandson
I don't have a comprehensive knowledge of the project hydrogen PRs, however I've perused them, and they make substantial modifications to Spark's core DAG scheduler code. What I'm wondering is: how high is the confidence level that the "traditional" code paths are still stable? Put another way,

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Mark Hamstra
No reasonable amount of time is likely going to be sufficient to fully vet the code as a PR. I'm not entirely happy with the design and code as they currently are (and I'm still trying to find the time to more publicly express my thoughts and concerns), but I'm fine with them going into 2.4 much

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Erik Erlandson
Barrier mode seems like a high impact feature on Spark's core code: is one additional week enough time to properly vet this feature? On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres wrote: > Full continuous processing aggregation support ran into unanticipated > scalability and scheduling

Re: Review notification bot

2018-07-31 Thread Sean Owen
I haven't been pinged by this bot :( :) But I do like these comments on PRs, like https://github.com/apache/spark/pull/21925#issuecomment-409035244 Is the issue that @-mentions cause emails too? Is there any option to maybe only consider pinging someone if they've touched the code within the

DISCUSS: SPARK-24882 data source v2 API improvement

2018-07-31 Thread Wenchen Fan
Hi all, Data source v2 has been out for a while. During this release, we migrated most of the streaming sources to the v2 API (SPARK-22911), started to migrate file sources (SPARK-23817), started to

Re: Writing file

2018-07-31 Thread Wenchen Fan
It depends on how you deploy Spark. The writer just writes data to your specified path (HDFS or local path), but the writer runs on executors. If you deploy Spark in local mode, i.e. executor and driver are together, then you will see the output file on the driver node. If you deploy Spark
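The deployment-mode point in this reply can be illustrated with a toy model. The sketch below is plain Python, not Spark code: the function name, host names, and part-file names are all hypothetical, and it only assumes what the reply states, namely that each write task runs on an executor and writes to the filesystem that executor sees.

```python
from collections import defaultdict


def simulate_local_write(partition_ids, host_of, path):
    """Toy model: each task writes its part file on the host it runs on.

    Returns a dict mapping host name -> files on that host's local disk.
    File names are illustrative and do not follow Spark's real naming.
    """
    disks = defaultdict(list)
    for pid in partition_ids:
        host = host_of(pid)  # host where the task for this partition runs
        disks[host].append(f"{path}/part-{pid:05d}.parquet")
    return dict(disks)


# Local mode: driver and executor share one host, so every file lands there.
local = simulate_local_write(range(4), lambda pid: "driver", "/some/path")

# Cluster with two hypothetical workers: files are spread across the workers'
# local disks and none of them appears on the driver.
cluster = simulate_local_write(range(4), lambda pid: f"worker-{pid % 2}", "/some/path")
```

In this model a local path only yields a single complete output directory when all tasks share a host; a path on a shared filesystem such as HDFS does not have that limitation.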

Re: [DISCUSS] Multiple catalog support

2018-07-31 Thread Wenchen Fan
Here is my interpretation of your proposal; please correct me if something is wrong. End users can read/write a data source with its name and some options, e.g. `df.read.format("xyz").option(...).load`. This is currently the only end-user API for data source v2, and is widely used by Spark users

Writing file

2018-07-31 Thread mattbuttow
According to Stack Overflow (https://stackoverflow.com/q/40786093) it should be possible to write a file to a local path and the result should be available on the driver node. However, when I try this: `df.write.parquet("file:///some/path")` the data seems to be written on each node, not a

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Joseph Torres
Full continuous processing aggregation support ran into unanticipated scalability and scheduling problems. We’re planning to overcome those by using some of the barrier execution machinery, but since barrier execution itself is still in progress the full support isn’t going to make it into 2.4.

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Stavros Kontopoulos
I have a PR out for SPARK-14540 (Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner). This should allow us to add support for Scala 2.12; I think we can resolve this long-standing issue with 2.4. Best, Stavros On Tue, Jul 31, 2018 at 4:07 PM, Tomasz Gawęda wrote: > Hi, > > what

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Tomasz Gawęda
Hi, what is the status of Continuous Processing + Aggregations? As far as I remember, Jose Torres said it should be easy to perform aggregations if coalesce(1) works. IIRC it's already merged to master. Is this work in progress? If yes, it would be great to have full aggregation/join support

Re: [DISCUSS] Adaptive execution in Spark SQL

2018-07-31 Thread Yu, Yucai
Hi, I would like to share some experience from using AE in eBay’s data warehouse. 1. Saving a lot of manual setting and tuning effort. Setting shuffle.partition query by query is annoying; with AE, we just need to set a big number for all queries. 2. Saving memory. With AE, we can start less

Re: Data source V2

2018-07-31 Thread vaclavkosar
For streaming there is an event StreamingQueryProgress which provides the number of input rows for each source. The number of output rows written is currently not available in StreamingQueryProgress, but I submitted a PR for that here: https://github.com/apache/spark/pull/21919 If you are interested,
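As a rough sketch of how the per-source input counts mentioned here could be read, the snippet below works on a progress report in dictionary form, for example what PySpark returns from `query.lastProgress`. The helper name and the sample report are hypothetical; only the `sources`, `description`, and `numInputRows` fields are assumed from the StreamingQueryProgress JSON layout.

```python
def input_rows_per_source(progress):
    """Map each source's description to its numInputRows for one progress report."""
    return {
        source["description"]: source["numInputRows"]
        for source in progress.get("sources", [])
    }


# Hypothetical progress report with two sources; the values are made up
# purely to exercise the helper.
sample_progress = {
    "sources": [
        {"description": "source-a", "numInputRows": 120},
        {"description": "source-b", "numInputRows": 30},
    ]
}

rows = input_rows_per_source(sample_progress)
total_input_rows = sum(rows.values())
```

With a live query, the same helper could be applied to each report received by a streaming query listener or to the most recent report on the query handle.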

Re: Review notification bot

2018-07-31 Thread Holden Karau
Sure, I've turned off the bot for now. I'll look at giving it a preference for more recent contributors, switching it to suggest people to the user rather than doing the ping itself, and skipping the PMC list as well. On Tue, Jul 31, 2018 at 12:30 AM, Hyukjin Kwon wrote: >

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Petar Zečević
This one is important to us: https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join inner range optimization) but I think it could be useful to others too. It is finished and is ready to be merged (was ready a month ago at least). Do you think you could consider including it in

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Marco Gaido
Hi Wenchen, I think it would be great to also consider SPARK-24598 (Datatype overflow conditions gives incorrect result), as it is a correctness bug. What do you think? Thanks, Marco 2018-07-31 4:01 GMT+02:00 Wenchen Fan : > I went through

RE: [DISCUSS] Adaptive execution in Spark SQL

2018-07-31 Thread Wang, Carson
Thanks Marco and Wenchen for reviewing. It sounds good to target this for 3.0. I can also share more data on the benchmark. In the 100 TB TPC-DS benchmark we performed on a 100-node cluster, we saw that 90% of the 103 queries had a performance gain, and 46% of them were more than 1.1x faster.

Re: Why percentile and distinct are not done in one job?

2018-07-31 Thread 吴晓菊
I mean in AnalyzeColumnCommand.scala the first one to compute percentiles and the second one to compute columnStats. Chrysan Wu 吴晓菊 Phone:+86 17717640807 2018-07-30 23:28 GMT+08:00 Reynold Xin : > Which API are you talking about? > > On Mon, Jul 30, 2018 at 7:03 AM 吴晓菊 wrote: > >> I noticed

Re: Review notification bot

2018-07-31 Thread Hyukjin Kwon
> I originally did that, but GitHub told me I could only have one personal and one bot account. If someone else registered the spark-mention-bot I'd be happy to switch it to that. I have my own spare account for testing purposes (spark-test). https://github.com/spark-test I don't mind sharing it.

Re: [DISCUSS] Adaptive execution in Spark SQL

2018-07-31 Thread Marco Gaido
Hi all, I also like this idea very much and I think it may also bring other performance improvements in the future. Thanks to everybody who worked on this. I agree to target this feature for 3.0. Thanks everybody, Bests. Marco On Tue, 31 Jul 2018, 08:39 Wenchen Fan, wrote: > Hi Carson and

Re: [DISCUSS] Adaptive execution in Spark SQL

2018-07-31 Thread Wenchen Fan
Hi Carson and Yuanjian, Thanks for contributing to this project and sharing the production use cases! I believe the adaptive execution will be a very important feature of Spark SQL and will definitely benefit a lot of users. I went through the design docs and the high-level design totally makes

Re: Data source V2

2018-07-31 Thread Wenchen Fan
Hi assaf, Thanks for trying data source v2! Data source v2 is still evolving (we marked all the data source v2 interfaces as @Evolving), and we've already made a lot of API changes in this release (some renaming, switching to InternalRow, etc.). So I'd not encourage people to use data source v2 in

Re: [DISCUSS] Adaptive execution in Spark SQL

2018-07-31 Thread Yuanjian Li
Thanks Carson, great note! Actually Baidu has ported this patch to our internal fork. I collected some use cases and performance improvements from Baidu's internal usage of this patch, summarized in the following 3 scenarios: 1. SortMergeJoin to BroadcastJoin The SortMergeJoin transform to

Re: Review notification bot

2018-07-31 Thread Holden Karau
The activeness is a thing that came up in the Beam project POC I'm doing for the same bot (filtered it down to contributors active in the last year only). On Mon, Jul 30, 2018 at 11:08 PM, Jungtaek Lim wrote: > Sorry to chime in, just 2 cents on this since it looks like interesting > topic. > >

Re: Review notification bot

2018-07-31 Thread Jungtaek Lim
Sorry to chime in, just 2 cents on this since it looks like an interesting topic. Just to share my habit as one of the contributors (for various projects), I don't use "git history" or "git blame" to find authors of a file and ping for review. I just ping active committers who recently merged the

Data source V2

2018-07-31 Thread assaf.mendelson
Hi all, I am currently in the middle of developing a new data source (for an internal tool) using data source V2. I noticed that SPARK-24882 is planned for 2.4 and includes interface changes. I was wondering if those are planned in addition