Re: Beam's recent community development work

2018-07-02 Thread Matei Zaharia
I think telling people that they’re being considered as committers early on is a good idea, but AFAIK we’ve always had individual committers do that with contributors who were doing great work in various areas. We don’t have a centralized process for it though — it’s up to whoever wants to work

Re: Beam's recent community development work

2018-07-02 Thread Reynold Xin
That's fair, and it's great to find high-quality contributors. But I also feel the two projects have very different backgrounds and maturity phases. There are 1300+ contributors to Spark, and only 300 to Beam, with the vast majority of contributions coming from a single company for Beam (based on my

Re: Beam's recent community development work

2018-07-02 Thread Holden Karau
As someone who floats a bit between both projects (as a contributor) I'd love to see us adopt some of these techniques to be pro-active about growing our committer-ship (I think perhaps we could do this by also moving some of the newer committers into the PMC faster so there are more eyes out

Fwd: Beam's recent community development work

2018-07-02 Thread Sean Owen
Worth, I think, a read and consideration from Spark folks. I'd be interested in comments; I have a few reactions too. -- Forwarded message - From: Kenneth Knowles Date: Sat, Jun 30, 2018 at 1:15 AM Subject: Beam's recent community development work To: , , Griselda Cuevas <

[RESULT] [VOTE] Spark 2.2.2 (RC2)

2018-07-02 Thread Tom Graves
The vote passes. Thanks to all who helped with the release! I'll start publishing everything tomorrow, and an announcement will be sent when artifacts have propagated to the mirrors (probably early next week). +1 (* = binding): - Marcelo Vanzin * - Sean Owen * - Tom Graves * - Holden Karau * -

Re: [VOTE] Spark 2.2.2 (RC2)

2018-07-02 Thread Tom Graves
I forgot to post it, I'm +1. Tom On Monday, July 2, 2018, 12:19:08 AM CDT, Holden Karau wrote: Leaving documents aside (I think we should maybe have a thread on how we want to handle doc changes to existing releases on dev@) I'm +1 PySpark venv checks out. On Sun, Jul 1, 2018 at 9:40

Retraining with (each document as separate file) creates OOME

2018-07-02 Thread Jatin Puri
Maybe this is a bug. The source can be found at: https://github.com/purijatin/spark-retrain-bug *Issue:* The program takes as input a set of documents, where each document is in a separate file. The Spark program computes tf-idf of the terms (Tokenizer -> Stopword remover -> stemming -> tf -> tfidf). Once
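
[Editor's note] A minimal sketch of the kind of tf-idf pipeline described in that thread, assuming Spark ML's DataFrame API with one row per document read from separate files; column names are illustrative, and the stemming stage is omitted because Spark ML has no built-in stemmer (the linked project may use a third-party one). This is not the code from the linked repository.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Tokenizer, StopWordsRemover, HashingTF, IDF}
import org.apache.spark.sql.SparkSession

object TfIdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tfidf-sketch").getOrCreate()

    // Hypothetical input path: one document per file, each file becomes rows of text.
    val docs = spark.read.textFile("docs/*").toDF("text")

    // Tokenize -> remove stop words -> term frequencies -> inverse document frequency.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("tokens")
    val remover   = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
    val tf        = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
    val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")

    val pipeline = new Pipeline().setStages(Array(tokenizer, remover, tf, idf))
    val model    = pipeline.fit(docs)
    model.transform(docs).select("features").show(5)

    spark.stop()
  }
}

Retraining such a pipeline over many small files means many small partitions and repeated fitting; whether that alone explains the reported OOME would depend on the details in the linked repository.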