Re: Welcoming some new committers and PMC members

2019-09-09 Thread Mingjie Tang
Congratulations!

On Tue, Sep 10, 2019 at 9:11 AM Jungtaek Lim  wrote:

> Congratulations! Well deserved!
>
> On Tue, Sep 10, 2019 at 9:51 AM John Zhuge  wrote:
>
>> Congratulations!
>>
>> On Mon, Sep 9, 2019 at 5:45 PM Shane Knapp  wrote:
>>
>>> congrats everyone!  :)
>>>
>>> On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia 
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > The Spark PMC recently voted to add several new committers and one PMC
>>> member. Join me in welcoming them to their new roles!
>>> >
>>> > New PMC member: Dongjoon Hyun
>>> >
>>> > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming
>>> Wang, Weichen Xu, Ruifeng Zheng
>>> >
>>> > The new committers cover lots of important areas including ML, SQL,
>>> and data sources, so it’s great to have them here. All the best,
>>> >
>>> > Matei and the Spark PMC
>>> >
>>> >
>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>>
>>>
>>
>> --
>> John Zhuge
>>
>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>


Re: Feedback on MLlib roadmap process proposal

2017-01-19 Thread Mingjie Tang
+1 for general abstractions like distributed linear algebra.

On Thu, Jan 19, 2017 at 8:54 AM, Seth Hendrickson <
seth.hendrickso...@gmail.com> wrote:

> I think the proposal laid out in SPARK-18813 is well done, and I do think
> it is going to improve the process going forward. I also really like the
> idea of getting the community to vote on JIRAs to give some of them
> priority - provided that we listen to those votes, of course. The biggest
> problem I see is that we do have several active contributors and those who
> want to help implement these changes, but PRs are reviewed rather
> sporadically and I imagine it is very difficult for contributors to
> understand why some get reviewed and some do not. The most important thing
> we can do, given that MLlib currently has very limited committer review
> bandwidth, is to make clear which issues will definitely get reviewed if
> someone works on them. That is a hard thing to do in open source, no doubt,
> but even if we have to limit such a guarantee to a very small subset of
> issues, I think it is a gain for everyone.
>
> On a related note, I would love to hear some discussion on the higher
> level goal of Spark MLlib (if this derails the original discussion, please
> let me know and we can discuss in another thread). The roadmap does contain
> specific items that help to convey some of this (ML parity with MLlib,
> model persistence, etc...), but I'm interested in what the "mission" of
> Spark MLlib is. We often see PRs for brand new algorithms which are
> sometimes rejected and sometimes not. Do we aim to keep implementing more
> and more algorithms? Or is our focus really, now that we have a reasonable
> library of algorithms, to simply make the existing ones faster/better/more
> robust? Should we aim to make interfaces that developers can easily extend
> with their own custom code (e.g. custom optimization libraries), or do we
> want to restrict things to out-of-the-box
> algorithms? Should we focus on more flexible, general abstractions like
> distributed linear algebra?
>
> I was not involved in the project in the early days of MLlib when this
> discussion may have happened, but I think it would be useful to either
> revisit it or restate it here for some of the newer developers.
>
> On Tue, Jan 17, 2017 at 3:38 PM, Joseph Bradley 
> wrote:
>
>> Hi all,
>>
>> This is a general call for thoughts about the process for the MLlib
>> roadmap proposed in SPARK-18813.  See the section called "Roadmap process."
>>
>> Summary:
>> * This process is about committers indicating intention to shepherd and
>> review.
>> * The goal is to improve visibility and communication.
>> * This is fairly orthogonal to the SIP discussion since this proposal is
>> more about setting release targets than about proposing future plans.
>>
>> Thanks!
>> Joseph
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> http://databricks.com
>>
>
>
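
For readers newer to the codebase, here is a minimal sketch of the distributed linear algebra abstractions mentioned above, using MLlib's existing RowMatrix API. This assumes a spark-shell session where `sc` is defined; the matrix values are made up for illustration.

```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Rows live in an RDD[Vector], so the matrix can be larger than any
// single machine's memory.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)))
val mat = new RowMatrix(rows)

// Distributed computations behind ordinary method calls:
val gram = mat.computeGramianMatrix()         // A^T A as a local Matrix
val svd  = mat.computeSVD(2, computeU = true) // truncated SVD, distributed U
println(svd.s)                                // singular values
```

How far to generalize interfaces like this is exactly the kind of mission question the thread raises.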


Re: Question about SPARK-11374 (skip.header.line.count)

2016-12-10 Thread Mingjie Tang
+1, it is useful.

On Sat, Dec 10, 2016 at 9:28 PM, Dongjoon Hyun  wrote:

> Thank you for the opinion, Felix.
>
> Bests,
> Dongjoon.
>
> On Sat, Dec 10, 2016 at 11:00 AM, Felix Cheung 
> wrote:
>
>> +1 I think it's useful to have a pure SQL way to skip headers for the
>> plain text / CSV files that lots of companies have.
>>
>>
>> --
>> *From:* Dongjoon Hyun 
>> *Sent:* Friday, December 9, 2016 9:42:58 AM
>> *To:* Dongjin Lee; dev@spark.apache.org
>> *Subject:* Re: Question about SPARK-11374 (skip.header.line.count)
>>
>> Thank you for the opinion, Dongjin!
>>
>>
>> On Thu, Dec 8, 2016 at 21:56 Dongjin Lee  wrote:
>>
>>> +1 For this idea. I need it also.
>>>
>>> Regards,
>>> Dongjin
>>>
>>> On Fri, Dec 9, 2016 at 8:59 AM, Dongjoon Hyun 
>>> wrote:
>>>
>>> Hi, All.
>>>
>>> Could you give me some opinions?
>>>
>>> There is an old Spark issue, SPARK-11374, about removing header lines
>>> from a text file.
>>>
>>> Currently, Spark supports removing CSV header lines in the following way:
>>>
>>> ```
>>> scala> spark.read.option("header","true").csv("/data").show
>>> +---+---+
>>> | c1| c2|
>>> +---+---+
>>> |  1|  a|
>>> |  2|  b|
>>> +---+---+
>>> ```
>>>
>>> In the SQL world, we could support the same thing the Hive way, via
>>> `skip.header.line.count`:
>>>
>>> ```
>>> scala> sql("CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT
>>> DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data'
>>> TBLPROPERTIES('skip.header.line.count'='1')")
>>>
>>> scala> sql("SELECT * FROM t1").show
>>> +---+-----+
>>> | id|value|
>>> +---+-----+
>>> |  1|    a|
>>> |  2|    b|
>>> +---+-----+
>>> ```
>>>
>>> Although I made a PR for this based on the JIRA issue, I want to know
>>> whether this feature is really needed.
>>>
>>> Is it needed for your use cases? Or is it enough for you to remove the
>>> header lines in a preprocessing stage?
>>>
>>> If this is too old and no longer appropriate these days, I'll close the
>>> PR and the JIRA issue as WON'T FIX.
>>>
>>> Thank you all in advance!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> --
>>> Dongjin Lee
>>>
>>> Software developer in Line+. So interested in massive-scale machine
>>> learning.
>>> facebook: www.facebook.com/dongjin.lee.kr
>>> linkedin: kr.linkedin.com/in/dongjinleekr
>>> github: github.com/dongjinleekr
>>> twitter: www.twitter.com/dongjinleekr
>>>
>
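
As a point of reference for the preprocessing alternative Dongjoon mentions, here is a common RDD-level workaround sketch, assuming a spark-shell `sc` and the `/data` path from the example above:

```
// Skip the header without any table property: drop the first line of
// partition 0, which holds the start of the file.
val lines = sc.textFile("/data")
val noHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
```

Note that this drops only the first file's header; reproducing per-file `skip.header.line.count` semantics over a directory of files is harder, which is part of the motivation for the proposed feature.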


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-10 Thread Mingjie Tang
+1 (non-binding)

On Thu, Nov 10, 2016 at 6:06 PM, Tathagata Das 
wrote:

> +1 binding
>
> On Thu, Nov 10, 2016 at 6:05 PM, Kousuke Saruta  wrote:
>
>> +1 (non-binding)
>>
>>
>> On Nov 8, 2016 at 15:09, Reynold Xin wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
>>> a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v2.0.2-rc3
>>> (584354eaac02531c9584188b143367ba694b0c34)
>>>
>>> This release candidate resolves 84 issues:
>>> https://s.apache.org/spark-2.0.2-jira
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>>
>>>
>>> Q: How can I help test this release?
>>> A: If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions from 2.0.1.
>>>
>>> Q: What justifies a -1 vote for this release?
>>> A: This is a maintenance release in the 2.0.x series. Bugs already
>>> present in 2.0.1, missing features, or bugs related to new features will
>>> not necessarily block this release.
>>>
>>> Q: What fix version should I use for patches merging into branch-2.0
>>> from now on?
>>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>>
>>
>>
>>
>>
>
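
As an illustration of the release-testing suggestion above ("take an existing Spark workload and run it on this release candidate"), here is a tiny self-checking job one might paste into a spark-shell built from the RC (the workload itself is made up; any real job serves the same purpose):

```
// Minimal end-to-end smoke test: a shuffle-producing aggregation whose
// result is verified against a locally computed value.
val df = spark.range(0, 1000).selectExpr("id", "id % 7 AS bucket")
val counts = df.groupBy("bucket").count().collect()
assert(counts.map(_.getLong(1)).sum == 1000L,
  "row count mismatch: possible regression in this RC")
```

Any regression found this way should be reported relative to 2.0.1 behavior, per the instructions above.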