Re: [VOTE] Deprecate SparkR

2024-08-21 Thread Xiangrui Meng
+1

On Wed, Aug 21, 2024, 10:24 AM Mridul Muralidharan  wrote:

> +1
>
>
> Regards,
> Mridul
>
>
> On Wed, Aug 21, 2024 at 11:46 AM Reynold Xin 
> wrote:
>
>> +1
>>
>> On Wed, Aug 21, 2024 at 6:42 PM Shivaram Venkataraman <
>> shivaram.venkatara...@gmail.com> wrote:
>>
>>> Hi all
>>>
>>> Based on the previous discussion thread [1], I hereby call a vote to
>>> deprecate the SparkR module in Apache Spark with the upcoming Spark 4
>>> release and remove it in the next major release, Spark 5.
>>>
>>> [ ] +1: Accept the proposal
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because ..
>>>
>>> This vote will be open for the next 72 hours
>>>
>>> Thanks
>>> Shivaram
>>>
>>> [1] https://lists.apache.org/thread/qjgsgxklvpvyvbzsx1qr8o533j4zjlm5
>>>
>>


Re: [DISCUSS] Deprecating SparkR

2024-08-13 Thread Xiangrui Meng
+1

On Tue, Aug 13, 2024, 2:43 PM Jungtaek Lim 
wrote:

> +1
>
> Looks to be sufficient to VOTE?
>
> On Wed, Aug 14, 2024 at 1:10 AM Wenchen Fan wrote:
>
>> +1
>>
>> On Tue, Aug 13, 2024 at 10:50 PM L. C. Hsieh  wrote:
>>
>>> +1
>>>
>>> On Tue, Aug 13, 2024 at 2:54 AM Dongjoon Hyun 
>>> wrote:
>>> >
>>> > +1
>>> >
>>> > Dongjoon
>>> >
>>> > On Mon, Aug 12, 2024 at 17:52 Holden Karau 
>>> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> Are the sparklyr folks on this list?
>>> >>
>>> >> Twitter: https://twitter.com/holdenkarau
>>> >> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> >> Pronouns: she/her
>>> >>
>>> >>
>>> >> On Mon, Aug 12, 2024 at 5:22 PM Xiao Li  wrote:
>>> >>>
>>> >>> +1
>>> >>>
>>> >>> On Mon, Aug 12, 2024 at 4:18 PM Hyukjin Kwon  wrote:
>>> 
>>>  +1
>>> 
>>>  On Tue, Aug 13, 2024 at 7:04 AM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>> >
>>> > And just for the record, the stats that I screenshotted in that
>>> thread I linked to showed the following page views for each sub-section
>>> under `docs/latest/api/`:
>>> >
>>> > - python: 758K
>>> > - java: 66K
>>> > - sql: 39K
>>> > - scala: 35K
>>> > - r: <1K
>>> >
>>> > I don’t recall over what time period those stats were collected
>>> for, and there are certainly some factors of how the stats are gathered and
>>> how the various language API docs are accessed that impact those numbers.
>>> So it’s by no means a solid, objective measure. But I thought it was an
>>> interesting signal nonetheless.
>>> >
>>> >
>>> > On Aug 12, 2024, at 5:50 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>> >
>>> > Not an R user myself, but +1.
>>> >
>>> > I first wondered about the future of SparkR after noticing how low
>>> the visit stats were for the R API docs as compared to Python and Scala. (I
>>> can’t seem to find those visit stats for the API docs anymore.)
>>> >
>>> >
>>> > On Aug 12, 2024, at 11:47 AM, Shivaram Venkataraman <
>>> shivaram.venkatara...@gmail.com> wrote:
>>> >
>>> > Hi
>>> >
>>> > About ten years ago, I created the original SparkR package as part
>>> of my research at UC Berkeley [SPARK-5654]. After my PhD I started as a
>>> professor at UW-Madison and my contributions to SparkR have been in the
>>> background given my availability. I continue to be involved in the
>>> community and teach a popular course at UW-Madison which uses Apache Spark
>>> for programming assignments.
>>> >
>>> > As the original contributor and author of a research paper on
>>> SparkR, I also continue to get private emails from users. A common question
>>> I get is whether one should use SparkR in Apache Spark or the sparklyr
>>> package (built on top of Apache Spark). You can also see this in
>>> StackOverflow questions and other blog posts online:
>>> https://www.google.com/search?q=sparkr+vs+sparklyr . While I have
>>> encouraged users to choose the SparkR package as it is maintained by the
>>> Apache project, the more I looked into sparklyr, the more I was convinced
>>> that it is a better choice for R users who want to leverage the power of
>>> Spark:
>>> >
>>> > (1) sparklyr is developed by a community of developers who
>>> understand the R programming language deeply, and as a result is more
>>> idiomatic. In hindsight, sparklyr’s more idiomatic approach would have been
>>> a better choice than the Scala-like API we have in SparkR.
>>> >
>>> > (2) Contributions to SparkR have slowly decreased. Over the last
>>> two years, there have been 65 commits on the SparkR codebase (compared to
>>> ~2200 on the Spark Python codebase). In contrast, sparklyr has over 300
>>> commits in the same period.
>>> >
>>> > (3) Previously, using and deploying sparklyr had been cumbersome
>>> as it needed careful alignment of versions between Apache Spark and
>>> sparklyr. However, the sparklyr community has implemented a new Spark
>>> Connect based architecture which eliminates this issue.
>>> >
>>> > (4) The sparklyr community has maintained their package on CRAN –
>>> it takes some effort to do this as the CRAN release process requires
>>> passing a number of tests. While SparkR was on CRAN initially, we could not
>>> maintain that given our release process and cadence. This makes sparklyr
>>> much more accessible to the R community.
>>> >
>>> > So it is with a bittersweet feeling that I’m writing this email to
>>> propose that we deprecate SparkR, and recommend sparklyr as the R language
>>> binding for Spark. This will reduce complexity of our own codebase, and
>>> more importantly reduce confusion for users. As the sparklyr package is
>>> distributed using the same permissive license as Apache Spark, there should
>>> be no downside for existing SparkR users in adopting it.
>>> >
>>> > My proposal i

Re: [VOTE] SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Xiangrui Meng
+1

On Wed, Sep 21, 2022 at 6:53 PM Kent Yao  wrote:

> +1
>
> *Kent Yao*
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> *a spark enthusiast*
> *kyuubi: a unified multi-tenant JDBC interface for large-scale data
> processing and analytics, built on top of Apache Spark*
> *spark-authorizer: a Spark SQL extension which provides SQL Standard
> Authorization for Apache Spark*
> *spark-postgres: a library for reading data from and transferring data to
> Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster*
> *itatchi: a library that brings useful functions from various modern
> database management systems to Apache Spark*
>
>
>
>  Replied Message 
> From Hyukjin Kwon 
> Date 09/22/2022 09:43
> To dev 
> Subject Re: [VOTE] SPIP: Support Docker Official Image for Spark
> Starting with my +1.
>
> On Thu, 22 Sept 2022 at 10:41, Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I would like to start a vote for SPIP: "Support Docker Official Image
>> for Spark"
>>
>> The goal of the SPIP is to add a Docker Official Image (DOI) to ensure
>> the Spark Docker images meet the quality standards for Docker images,
>> and to provide these Docker images to users who want to use Apache Spark
>> via a Docker image.
>>
>> Please also refer to:
>>
>> - Previous discussion in dev mailing list: [DISCUSS] SPIP: Support
>> Docker Official Image for Spark
>> 
>> - SPIP doc: SPIP: Support Docker Official Image for Spark
>> 
>> - JIRA: SPARK-40513 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>


Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-28 Thread Xiangrui Meng
+1. And we should start testing 3.7 and maybe 3.8 in Jenkins.

On Thu, Oct 24, 2019 at 9:34 AM Dongjoon Hyun 
wrote:

> Thank you for starting the thread.
>
> In addition to that, we are currently testing only Python 3.6 in the
> Apache Spark Jenkins environment.
>
> Given that Python 3.8 is already out and Apache Spark 3.0.0 RC1 will start
> next January
> (https://spark.apache.org/versioning-policy.html), I'm +1 for the
> deprecation (Python < 3.6) at Apache Spark 3.0.0.
>
> It's just a deprecation to prepare the next-step development cycle.
>
> Bests,
> Dongjoon.
>
>
> On Thu, Oct 24, 2019 at 1:10 AM Maciej Szymkiewicz 
> wrote:
>
>> Hi everyone,
>>
>> While the deprecation of Python 2 in 3.0.0 has been announced,
>> there is no clear statement about continued support for specific
>> Python 3 versions.
>>
>> Specifically:
>>
>>- Python 3.4 has been retired this year.
>>- Python 3.5 is already in the "security fixes only" mode and should
>>be retired in the middle of 2020.
>>
>> Continued support for these two blocks the adoption of many new Python
>> features (PEP 468), and it is hard to justify beyond 2020.
>>
>> Should these two be deprecated in 3.0.0 as well?
>>
>> --
>> Best regards,
>> Maciej
>>
>>


Re: SparkGraph review process

2019-10-04 Thread Xiangrui Meng
Hi all,

I want to clarify my role first to avoid misunderstanding. I'm an
individual contributor here. My work on the graph SPIP, as well as other
Spark features I contributed to, is not associated with my employer. It
became quite challenging for me to keep track of the graph SPIP work
because I have less time available at home.

In retrospect, we should have involved more Spark devs and committers
early on so that there is no single point of failure, i.e., me. Hopefully
it is not too late to fix that. I summarize my thoughts here to help
onboard other reviewers:

1. On the technical side, my main concern is the runtime dependency on
org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
came up with the solution of shading a few Scala libraries to avoid
pollution. However, I'm not super confident that the approach is
sustainable, for two reasons: a) there is no proper shading library for
Scala, and b) we will have to wait for upgrades from those Scala libraries
before we can upgrade Spark to use a newer Scala version. So it would be
great if some Scala experts could help review the current implementation
and assess the risk.

2. Overloaded helper methods. MLlib used to have several overloaded helper
methods for each algorithm, which later became a major maintenance burden.
Builders and setters/getters are more maintainable (see the sketch after
this list). I will comment again on the PR.

3. The proposed API partitions a graph into sub-graphs, as described in the
property graph model. It is unclear to me how this would affect query
performance, because it requires the SQL optimizer to correctly recognize
data from the same source and make execution efficient.

4. The feature, although originally targeted for Spark 3.0, should not be a
Spark 3.0 release blocker because it doesn't require breaking changes. If
we miss the code freeze deadline, we can introduce a build flag to exclude
the module from the official release/distribution, and then include it by
default once the module is ready.

5. If, unfortunately, we still don't see sufficient committer reviews, I
think the best option would be submitting the work to the Apache Incubator
instead to unblock it. But maybe it is too early to discuss this option.
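
As a generic illustration of point 2 (not actual MLlib or SparkGraph code;
all names are made up), overloaded helpers tend to multiply as options are
added, while a builder keeps a single entry point:

# Overloaded-helper style: every new option tends to force another variant.
def train(data):
    return train_with_params(data, max_iter=100, reg_param=0.0)

def train_with_iter(data, max_iter):
    return train_with_params(data, max_iter, reg_param=0.0)

def train_with_params(data, max_iter, reg_param):
    return {"data": data, "max_iter": max_iter, "reg_param": reg_param}

# Builder/setter style: one entry point; new options are added as setters
# without breaking existing callers.
class Trainer:
    def __init__(self):
        self._max_iter = 100
        self._reg_param = 0.0

    def set_max_iter(self, value):
        self._max_iter = value
        return self

    def set_reg_param(self, value):
        self._reg_param = value
        return self

    def fit(self, data):
        return {"data": data, "max_iter": self._max_iter,
                "reg_param": self._reg_param}

result = Trainer().set_max_iter(50).set_reg_param(0.1).fit([1, 2, 3])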

It would be great if other committers can offer help on the review! Really
appreciated!

Best,
Xiangrui

On Fri, Oct 4, 2019 at 1:32 AM Mats Rydberg  wrote:

> Hello dear Spark community
>
> We are the developers behind the SparkGraph SPIP, which is a project
> created out of our work on openCypher Morpheus (
> https://github.com/opencypher/morpheus). During this year we have
> collaborated with mainly Xiangrui Meng of Databricks to define and develop
> a new SparkGraph module based on our experience from working on Morpheus.
> Morpheus - formerly known as "Cypher for Apache Spark" - has been in
> development for over 3 years and matured in its API and implementation.
>
> The SPIP work has been on hold for a period of time now, as priorities at
> Databricks have changed which has occupied Xiangrui's time (as well as
> other happenings). As you may know, the latest API PR (
> https://github.com/apache/spark/pull/24851) is blocking us from moving
> forward with the implementation.
>
> In an attempt to not lose track of this project we now reach out to you to
> ask whether there are any Spark committers in the community who would be
> prepared to commit to helping us review and merge our code contributions to
> Apache Spark? We are not asking for lots of direct development support, as
> we believe we have the implementation more or less completed already since
> early this year. There is a proof-of-concept PR (
> https://github.com/apache/spark/pull/24297) which contains the
> functionality.
>
> If you could offer such aid it would be greatly appreciated. None of us
> are Spark committers, which is hindering our ability to deliver this
> project in time for Spark 3.0.
>
> Sincerely
> the Neo4j Graph Analytics team
> Mats, Martin, Max, Sören, Jonatan
>
>


[ANNOUNCEMENT] Plan for dropping Python 2 support

2019-06-03 Thread Xiangrui Meng
Hi all,

Today we announced the plan for dropping Python 2 support
 [1]
in Apache Spark:

As many of you already know, the Python core development team and many
widely used Python packages like Pandas and NumPy will drop Python 2
support in or before 2020/01/01 [2]. Apache Spark has supported both Python
2 and 3 since the Spark 1.4 release in 2015. However,
maintaining Python 2/3 compatibility is an increasing burden and it
essentially limits the use of Python 3 features in Spark. Given the end of
life (EOL) of Python 2 is coming, we plan to eventually drop Python 2
support as well. The current plan is as follows:

* In the next major release in 2019, we will deprecate Python 2 support.
PySpark users will see a deprecation warning if Python 2 is used. We will
publish a migration guide for PySpark users to migrate to Python 3.
* We will drop Python 2 support in a future release (excluding patch
release) in 2020, after Python 2 EOL on 2020/01/01. PySpark users will see
an error if Python 2 is used.
* For releases that support Python 2, e.g., Spark 2.4, their patch releases
will continue supporting Python 2. However, after Python 2 EOL, we might
not take patches that are specific to Python 2.

Best,
Xiangrui

[1]: http://spark.apache.org/news/plan-for-dropping-python-2-support.html
[2]: https://python3statement.org/


Re: Should python-2 be supported in Spark 3.0?

2019-06-03 Thread Xiangrui Meng
I updated the Spark website and announced the plan for dropping Python 2
support there:
http://spark.apache.org/news/plan-for-dropping-python-2-support.html. I
will send an announcement email to user@ and dev@. -Xiangrui

On Fri, May 31, 2019 at 10:54 PM Felix Cheung 
wrote:

> Very subtle but someone might take
>
> “We will drop Python 2 support in a future release in 2020”
>
> To mean any / first release in 2020. Whereas the next statement indicates
> patch release is not included in above. Might help reorder the items or
> clarify the wording.
>
>
> --
> *From:* shane knapp 
> *Sent:* Friday, May 31, 2019 7:38:10 PM
> *To:* Denny Lee
> *Cc:* Holden Karau; Bryan Cutler; Erik Erlandson; Felix Cheung; Mark
> Hamstra; Matei Zaharia; Reynold Xin; Sean Owen; Wenchen Fen; Xiangrui Meng;
> dev; user
> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>
> +1000  ;)
>
> On Sat, Jun 1, 2019 at 6:53 AM Denny Lee  wrote:
>
>> +1
>>
>> On Fri, May 31, 2019 at 17:58 Holden Karau  wrote:
>>
>>> +1
>>>
>>> On Fri, May 31, 2019 at 5:41 PM Bryan Cutler  wrote:
>>>
>>>> +1 and the draft sounds good
>>>>
>>>> On Thu, May 30, 2019, 11:32 AM Xiangrui Meng  wrote:
>>>>
>>>>> Here is the draft announcement:
>>>>>
>>>>> ===
>>>>> Plan for dropping Python 2 support
>>>>>
>>>>> As many of you already knew, Python core development team and many
>>>>> utilized Python packages like Pandas and NumPy will drop Python 2 support
>>>>> in or before 2020/01/01. Apache Spark has supported both Python 2 and 3
>>>>> since Spark 1.4 release in 2015. However, maintaining Python 2/3
>>>>> compatibility is an increasing burden and it essentially limits the use of
>>>>> Python 3 features in Spark. Given the end of life (EOL) of Python 2 is
>>>>> coming, we plan to eventually drop Python 2 support as well. The current
>>>>> plan is as follows:
>>>>>
>>>>> * In the next major release in 2019, we will deprecate Python 2
>>>>> support. PySpark users will see a deprecation warning if Python 2 is used.
>>>>> We will publish a migration guide for PySpark users to migrate to Python 
>>>>> 3.
>>>>> * We will drop Python 2 support in a future release in 2020, after
>>>>> Python 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is
>>>>> used.
>>>>> * For releases that support Python 2, e.g., Spark 2.4, their patch
>>>>> releases will continue supporting Python 2. However, after Python 2 EOL, 
>>>>> we
>>>>> might not take patches that are specific to Python 2.
>>>>> ===
>>>>>
>>>>> Sean helped make a pass. If it looks good, I'm going to upload it to
>>>>> Spark website and announce it here. Let me know if you think we should do 
>>>>> a
>>>>> VOTE instead.
>>>>>
>>>>> On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng 
>>>>> wrote:
>>>>>
>>>>>> I created https://issues.apache.org/jira/browse/SPARK-27884 to track
>>>>>> the work.
>>>>>>
>>>>>> On Thu, May 30, 2019 at 2:18 AM Felix Cheung <
>>>>>> felixcheun...@hotmail.com> wrote:
>>>>>>
>>>>>>> We don’t usually reference a future release on website
>>>>>>>
>>>>>>> > Spark website and state that Python 2 is deprecated in Spark 3.0
>>>>>>>
>>>>>>> I suspect people will then ask when is Spark 3.0 coming out then.
>>>>>>> Might need to provide some clarity on that.
>>>>>>>
>>>>>>
>>>>>> We can say the "next major release in 2019" instead of Spark 3.0.
>>>>>> Spark 3.0 timeline certainly requires a new thread to discuss.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *From:* Reynold Xin 
>>>>>>> *Sent:* Thursday, May 30, 2019 12:59:14 AM
>>>>>>> *To:* shane knapp
>>>>>>> *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen;
>>>>>>> Wenchen Fen; Xiangrui Meng; dev; user
>>>>>>> *Subject:* Re: Should

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
Here is the draft announcement:

===
Plan for dropping Python 2 support

As many of you already know, the Python core development team and many
widely used Python packages like Pandas and NumPy will drop Python 2
support in or before 2020/01/01. Apache Spark has supported both Python 2
and 3 since the Spark 1.4 release in 2015. However, maintaining Python 2/3
compatibility is
an increasing burden and it essentially limits the use of Python 3 features
in Spark. Given the end of life (EOL) of Python 2 is coming, we plan to
eventually drop Python 2 support as well. The current plan is as follows:

* In the next major release in 2019, we will deprecate Python 2 support.
PySpark users will see a deprecation warning if Python 2 is used. We will
publish a migration guide for PySpark users to migrate to Python 3.
* We will drop Python 2 support in a future release in 2020, after Python 2
EOL on 2020/01/01. PySpark users will see an error if Python 2 is used.
* For releases that support Python 2, e.g., Spark 2.4, their patch releases
will continue supporting Python 2. However, after Python 2 EOL, we might
not take patches that are specific to Python 2.
===

Sean helped make a pass. If it looks good, I'm going to upload it to the Spark
website and announce it here. Let me know if you think we should do a VOTE
instead.

On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng  wrote:

> I created https://issues.apache.org/jira/browse/SPARK-27884 to track the
> work.
>
> On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
> wrote:
>
>> We don’t usually reference a future release on website
>>
>> > Spark website and state that Python 2 is deprecated in Spark 3.0
>>
>> I suspect people will then ask when is Spark 3.0 coming out then. Might
>> need to provide some clarity on that.
>>
>
> We can say the "next major release in 2019" instead of Spark 3.0. Spark
> 3.0 timeline certainly requires a new thread to discuss.
>
>
>>
>>
>> --
>> *From:* Reynold Xin 
>> *Sent:* Thursday, May 30, 2019 12:59:14 AM
>> *To:* shane knapp
>> *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen
>> Fen; Xiangrui Meng; dev; user
>> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>>
>> +1 on Xiangrui’s plan.
>>
>> On Thu, May 30, 2019 at 7:55 AM shane knapp  wrote:
>>
>>> I don't have a good sense of the overhead of continuing to support
>>>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>>>
>>>> from the build/test side, it will actually be pretty easy to continue
>>> support for python2.7 for spark 2.x as the feature sets won't be expanding.
>>>
>>
>>> that being said, i will be cracking a bottle of champagne when i can
>>> delete all of the ansible and anaconda configs for python2.x.  :)
>>>
>>
> On the development side, in a future release that drops Python 2 support
> we can remove code that maintains python 2/3 compatibility and start using
> python 3 only features, which is also quite exciting.
>
>
>>
>>> shane
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>


Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
I created https://issues.apache.org/jira/browse/SPARK-27884 to track the
work.

On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
wrote:

> We don’t usually reference a future release on website
>
> > Spark website and state that Python 2 is deprecated in Spark 3.0
>
> I suspect people will then ask when is Spark 3.0 coming out then. Might
> need to provide some clarity on that.
>

We can say the "next major release in 2019" instead of Spark 3.0. Spark 3.0
timeline certainly requires a new thread to discuss.


>
>
> --
> *From:* Reynold Xin 
> *Sent:* Thursday, May 30, 2019 12:59:14 AM
> *To:* shane knapp
> *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen
> Fen; Xiangrui Meng; dev; user
> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>
> +1 on Xiangrui’s plan.
>
> On Thu, May 30, 2019 at 7:55 AM shane knapp  wrote:
>
>> I don't have a good sense of the overhead of continuing to support
>>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>>
>>> from the build/test side, it will actually be pretty easy to continue
>> support for python2.7 for spark 2.x as the feature sets won't be expanding.
>>
>
>> that being said, i will be cracking a bottle of champagne when i can
>> delete all of the ansible and anaconda configs for python2.x.  :)
>>
>
On the development side, in a future release that drops Python 2 support we
can remove code that maintains Python 2/3 compatibility and start using
Python 3-only features, which is also quite exciting.


>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Re: Should python-2 be supported in Spark 3.0?

2019-05-29 Thread Xiangrui Meng
Hi all,

I want to revive this old thread since no action has been taken so far. If
we plan to mark Python 2 as deprecated in Spark 3.0, we should do it as
early as possible and let users know ahead of time. PySpark depends on
Python, numpy,
pandas, and pyarrow, all of which are sunsetting Python 2 support by
2020/01/01 per https://python3statement.org/. At that time we cannot really
support Python 2 because the dependent libraries do not plan to make new
releases, even for security reasons. So I suggest the following:

1. Update Spark website and state that Python 2 is deprecated in Spark 3.0
and its support will be removed in a release after 2020/01/01.
2. Make a formal announcement to dev@ and users@.
3. Add Apache Spark project to https://python3statement.org/ timeline.
4. Update PySpark to check the Python version and print a deprecation
warning if the version is below 3 (a rough sketch follows).
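
A minimal sketch of that check (the function name and warning text are
illustrative, not the actual PySpark change):

import sys
import warnings

# Hypothetical sketch of item 4: warn at startup when PySpark runs on Python 2.
def warn_if_python2():
    """Print a deprecation warning if the interpreter is Python < 3."""
    if sys.version_info[0] < 3:
        warnings.warn(
            "Python 2 support is deprecated in Spark 3.0 and will be removed "
            "in a future release after 2020/01/01. Please migrate to Python 3.",
            DeprecationWarning,
        )

warn_if_python2()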

Any thoughts and suggestions?

Best,
Xiangrui

On Mon, Sep 17, 2018 at 6:54 PM Erik Erlandson  wrote:

>
> I think that makes sense. The main benefit of deprecating *prior* to 3.0
> would be informational - making the community aware of the upcoming
> transition earlier. But there are other ways to start informing the
> community between now and 3.0, besides formal deprecation.
>
> I have some residual curiosity about what it might mean for a release like
> 2.4 to still be in its support lifetime after Py2 goes EOL. I asked Apache
> Legal  to comment. It is
> possible there are no issues with this at all.
>
>
> On Mon, Sep 17, 2018 at 4:26 PM, Reynold Xin  wrote:
>
>> i'd like to second that.
>>
>> if we want to communicate timeline, we can add to the release notes
>> saying py2 will be deprecated in 3.0, and removed in a 3.x release.
>>
>> --
>> excuse the brevity and lower case due to wrist injury
>>
>>
>> On Mon, Sep 17, 2018 at 4:24 PM Matei Zaharia 
>> wrote:
>>
>>> That’s a good point — I’d say there’s just a risk of creating a
>>> perception issue. First, some users might feel that this means they have to
>>> migrate now, which is before Python itself drops support; they might also
>>> be surprised that we did this in a minor release (e.g. might we drop Python
>>> 2 altogether in a Spark 2.5 if that later comes out?). Second, contributors
>>> might feel that this means new features no longer have to work with Python
>>> 2, which would be confusing. Maybe it’s OK on both fronts, but it just
>>> seems scarier for users to do this now if we do plan to have Spark 3.0 in
>>> the next 6 months anyway.
>>>
>>> Matei
>>>
>>> > On Sep 17, 2018, at 1:04 PM, Mark Hamstra 
>>> wrote:
>>> >
>>> > What is the disadvantage to deprecating now in 2.4.0? I mean, it
>>> doesn't change the code at all; it's just a notification that we will
>>> eventually cease supporting Py2. Wouldn't users prefer to get that
>>> notification sooner rather than later?
>>> >
>>> > On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia <
>>> matei.zaha...@gmail.com> wrote:
>>> > I’d like to understand the maintenance burden of Python 2 before
>>> deprecating it. Since it is not EOL yet, it might make sense to only
>>> deprecate it once it’s EOL (which is still over a year from now).
>>> Supporting Python 2+3 seems less burdensome than supporting, say, multiple
>>> Scala versions in the same codebase, so what are we losing out?
>>> >
>>> > The other thing is that even though Python core devs might not support
>>> 2.x later, it’s quite possible that various Linux distros will if moving
>>> from 2 to 3 remains painful. In that case, we may want Apache Spark to
>>> continue releasing for it despite the Python core devs not supporting it.
>>> >
>>> > Basically, I’d suggest to deprecate this in Spark 3.0 and then remove
>>> it later in 3.x instead of deprecating it in 2.4. I’d also consider looking
>>> at what other data science tools are doing before fully removing it: for
>>> example, if Pandas and TensorFlow no longer support Python 2 past some
>>> point, that might be a good point to remove it.
>>> >
>>> > Matei
>>> >
>>> > > On Sep 17, 2018, at 11:01 AM, Mark Hamstra 
>>> wrote:
>>> > >
>>> > > If we're going to do that, then we need to do it right now, since
>>> 2.4.0 is already in release candidates.
>>> > >
>>> > > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson 
>>> wrote:
>>> > > I like Mark’s concept for deprecating Py2 starting with 2.4: It may
>>> seem like a ways off but even now there may be some spark versions
>>> supporting Py2 past the point where Py2 is no longer receiving security
>>> patches
>>> > >
>>> > >
>>> > > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra <
>>> m...@clearstorydata.com> wrote:
>>> > > We could also deprecate Py2 already in the 2.4.0 release.
>>> > >
>>> > > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
>>> wrote:
>>> > > In case this didn't make it onto this thread:
>>> > >
>>> > > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>>> remove it entirely on a later 3.x release.
>>> > >
>>> > > On S

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-13 Thread Xiangrui Meng
My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't feel
strongly about it. I would still suggest doing the following:

1. Link the POC mentioned in Q4 so people can verify the POC result.
2. List the public APIs we plan to expose in Appendix A. I did a quick check.
Besides ColumnarBatch and ColumnarVector, we also need to make the following
public. People who are familiar with SQL internals should help assess the
risk:
* ColumnarArray
* ColumnarMap
* unsafe.types.CalendarInterval
* ColumnarRow
* UTF8String
* ArrayData
* ...
3. I still feel that using Pandas UDFs as the mid-term success criterion
doesn't match the purpose of this SPIP. It does make some code cleaner, but
I guess for ETL use cases it won't bring much value.


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Xiangrui Meng
ernal
> libraries even when those were widely popular. For example, we used Guava’s
> Optional for a while, which changed at some point, and we also had issues
> with Protobuf and Scala itself (especially how Scala’s APIs appear in
> Java). API breakage might not be as serious in dynamic languages like
> Python, where you can often keep compatibility with old behaviors, but it
> really hurts in Java and Scala.
> >
> > The problem is especially bad for us because of two aspects of how Spark
> is used:
> >
> > 1) Spark is used for production data transformation jobs that people
> need to keep running for a long time. Nobody wants to make changes to a job
> that’s been working fine and computing something correctly for years just
> to get a bug fix from the latest Spark release or whatever. It’s much
> better if they can upgrade Spark without editing every job.
> >
> > 2) Spark is often used as “glue” to combine data processing code in
> other libraries, and these might start to require different versions of our
> dependencies. For example, the Guava class exposed in Spark became a
> problem when third-party libraries started requiring a new version of
> Guava: those new libraries just couldn’t work with Spark. Protobuf was
> especially bad because some users wanted to read data stored as Protobufs
> (or in a format that uses Protobuf inside), so they needed a different
> version of the library in their main data processing code.
> >
> > If there was some guarantee that this stuff would remain
> > backward-compatible, we’d be in a much better place. It’s not that hard to
> keep a storage format backward-compatible: just document the format and
> extend it only in ways that don’t break the meaning of old data (for
> example, add new version numbers or field types that are read in a
> different way). It’s a bit harder for a Java API, but maybe Spark could
> just expose byte arrays directly and work on those if the API is not
> guaranteed to stay stable (that is, we’d still use our own classes to
> manipulate the data internally, and end users could use the Arrow library
> if they want it).
> >
> > Matei
> >
> > > On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
> > >
> > > I think you misunderstood the point of this SPIP. I responded to your
> comments in the SPIP JIRA.
> > >
> > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng 
> wrote:
> > > I posted my comment in the JIRA. Main concerns here:
> > >
> > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have
> 1.0 release someday.
> > > 2. ML/DL systems that can benefits from columnar format are mostly in
> Python.
> > > 3. Simple operations, though benefits vectorization, might not be
> worth the data exchange overhead.
> > >
> > > So would an improved Pandas UDF API would be good enough? For example,
> SPARK-26412 (UDF that takes an iterator of of Arrow batches).
> > >
> > > Sorry that I should join the discussion earlier! Hope it is not too
> late:)
> > >
> > > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > > +1 (non-binding) for better columnar data processing support.
> > >
> > >
> > >
> > > From: Jules Damji 
> > > Sent: Friday, April 19, 2019 12:21 PM
> > > To: Bryan Cutler 
> > > Cc: Dev 
> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
> > >
> > >
> > >
> > > + (non-binding)
> > >
> > > Sent from my iPhone
> > >
> > > Pardon the dumb thumb typos :)
> > >
> > >
> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
> > >
> > > +1 (non-binding)
> > >
> > >
> > >
> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
> > >
> > > +1 (non-binding).  Looking forward to seeing better support for
> processing columnar data.
> > >
> > >
> > >
> > > Jason
> > >
> > >
> > >
> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>  wrote:
> > >
> > > Hi everyone,
> > >
> > >
> > >
> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
> > >
> > >
> > >
> > > You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
> > >
> > >
> > >
> > > Please vote as early as you can, I will leave the vote open until next
> Monday (the 22nd), 2pm CST to give people plenty of time.
> > >
> > >
> > >
> > > [ ] +1: Accept the proposal as an official SPIP
> > >
> > > [ ] +0
> > >
> > > [ ] -1: I don't think this is a good idea because ...
> > >
> > >
> > >
> > >
> > >
> > > Thanks!
> > >
> > > Tom Graves
> > >
> >
>
>
>
>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread Xiangrui Meng
I posted my comment in the JIRA. Main concerns here:

1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 1.0
release someday.
2. ML/DL systems that can benefit from a columnar format are mostly in
Python.
3. Simple operations, though they benefit from vectorization, might not be
worth the data exchange overhead.

So would an improved Pandas UDF API be good enough? For example,
SPARK-26412 (a UDF that takes an iterator of Arrow batches).
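
For illustration, a minimal sketch of such an iterator-style Pandas UDF as
it eventually landed in Spark 3.x (assumes a Spark 3.x installation with
pyarrow; the session and column names are illustrative):

from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("iterator-pandas-udf-sketch").getOrCreate()

# The UDF receives an iterator of pandas Series, one per Arrow batch, so
# expensive per-partition setup (e.g., loading a model) can happen once
# before the loop instead of once per row.
@pandas_udf("long")
def plus_one(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for batch in batches:
        yield batch + 1

spark.range(10).select(plus_one("id")).show()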

Sorry that I didn't join the discussion earlier! Hope it is not too late :)

On Fri, Apr 19, 2019 at 1:20 PM  wrote:

> +1 (non-binding) for better columnar data processing support.
>
>
>
> *From:* Jules Damji 
> *Sent:* Friday, April 19, 2019 12:21 PM
> *To:* Bryan Cutler 
> *Cc:* Dev 
> *Subject:* Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
>
>
>
> + (non-binding)
>
> Sent from my iPhone
>
> Pardon the dumb thumb typos :)
>
>
> On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
>
> +1 (non-binding)
>
>
>
> On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
>
> +1 (non-binding).  Looking forward to seeing better support for processing
> columnar data.
>
>
>
> Jason
>
>
>
> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves 
> wrote:
>
> Hi everyone,
>
>
>
> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
>
>
>
> You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
>
>
>
> Please vote as early as you can, I will leave the vote open until next
> Monday (the 22nd), 2pm CST to give people plenty of time.
>
>
>
> [ ] +1: Accept the proposal as an official SPIP
>
> [ ] +0
>
> [ ] -1: I don't think this is a good idea because ...
>
>
>
>
>
> Thanks!
>
> Tom Graves
>
>


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
On Mon, Mar 25, 2019 at 8:07 PM Mark Hamstra 
wrote:

> Maybe.
>
> And I expect that we will end up doing something based on spark.task.cpus
> in the short term. I'd just rather that this SPIP not make it look like
> this is the way things should ideally be done. I'd prefer that we be quite
> explicit in recognizing that this approach is a significant compromise, and
> I'd like to see at least some references to the beginning of serious
> longer-term efforts to do something better in a deeper re-design of
> resource scheduling.
>

It is also a feature I desire as a user. How about suggesting it as
future work in the SPIP? It certainly requires someone who fully
understands the Spark scheduler to drive it. Shall we start with a Spark
JIRA? I don't know the scheduler as well as you do, but I can speak for DL
use cases. Maybe we just view it from different angles. To you, the
application-level request is a significant compromise. To me, it provides a
major milestone that brings GPUs to Spark workloads. I know many users who
tried to do DL on Spark and ended up doing hacks here and there, which was
a huge pain. The scope covered by the current SPIP makes those users much
happier. Tom and Andy from NVIDIA are certainly better calibrated on the
usefulness of the current proposal.


>
> On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng  wrote:
>
>> There are certainly use cases where different stages require different
>> number of CPUs or GPUs under an optimal setting. I don't think anyone
>> disagrees that ideally users should be able to do it. We are just dealing
>> with typical engineering trade-offs and see how we break it down into
>> smaller ones. I think it is fair to treat the task-level resource request
>> as a separate feature here because it also applies to CPUs alone without
>> GPUs, as Tom mentioned above. But having "spark.task.cpus" only for many
>> years Spark is still able to cover many many use cases. Otherwise we
>> shouldn't see many Spark users around now. Here we just apply similar
>> arguments to GPUs.
>>
>> Initially, I was the person who really wanted task-level requests because
>> it is ideal. In an offline discussion, Andy Feng pointed out an
>> application-level setting should fit common deep learning training and
>> inference cases and it greatly simplifies necessary changes required to
>> Spark job scheduler. With Imran's feedback to the initial design sketch,
>> the application-level approach became my first choice because it is still
>> very valuable but much less risky. If a feature brings great value to
>> users, we should add it even it is not ideal.
>>
>> Back to the default value discussion, let's forget GPUs and only consider
>> CPUs. Would an application-level default number of CPU cores disappear if
>> we added task-level requests? If yes, does it mean that users have to
>> explicitly state the resource requirements for every single stage? It is
>> tedious to do and who do not fully understand the impact would probably do
>> it wrong and waste even more resources. Then how many cores each task
>> should use if user didn't specify it? I do see "spark.task.cpus" is the
>> answer here. The point I want to make is that "spark.task.cpus", though
>> less ideal, is still needed when we have task-level requests for CPUs.
>>
>> On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra 
>> wrote:
>>
>>> I remain unconvinced that a default configuration at the application
>>> level makes sense even in that case. There may be some applications where
>>> you know a priori that almost all the tasks for all the stages for all the
>>> jobs will need some fixed number of gpus; but I think the more common cases
>>> will be dynamic configuration at the job or stage level. Stage level could
>>> have a lot of overlap with barrier mode scheduling -- barrier mode stages
>>> having a need for an inter-task channel resource, gpu-ified stages needing
>>> gpu resources, etc. Have I mentioned that I'm not a fan of the current
>>> barrier mode API, Xiangrui? :) Yes, I know: "Show me something better."
>>>
>>> On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng  wrote:
>>>
>>>> Say if we support per-task resource requests in the future, it would be
>>>> still inconvenient for users to declare the resource requirements for every
>>>> single task/stage. So there must be some default values defined somewhere
>>>> for task resource requirements. "spark.task.cpus" and
>>>> "spark.task.accelerator.gpu.count" could serve for this purpose without
>

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
There are certainly use cases where different stages require different
numbers of CPUs or GPUs under an optimal setting. I don't think anyone
disagrees that ideally users should be able to do that. We are just dealing
with typical engineering trade-offs and seeing how to break the problem
down into smaller pieces. I think it is fair to treat the task-level
resource request as a separate feature here because it also applies to CPUs
alone without GPUs, as Tom mentioned above. But even with only
"spark.task.cpus" for many years, Spark has still been able to cover many,
many use cases; otherwise we wouldn't see as many Spark users around now.
Here we just apply similar arguments to GPUs.

Initially, I was the person who really wanted task-level requests because
they are ideal. In an offline discussion, Andy Feng pointed out that an
application-level setting should fit common deep learning training and
inference cases, and that it greatly simplifies the changes required to the
Spark job scheduler. With Imran's feedback on the initial design sketch,
the application-level approach became my first choice because it is still
very valuable but much less risky. If a feature brings great value to
users, we should add it even if it is not ideal.

Back to the default value discussion, let's forget GPUs and only consider
CPUs. Would an application-level default number of CPU cores disappear if
we added task-level requests? If yes, does it mean that users have to
explicitly state the resource requirements for every single stage? That is
tedious to do, and users who do not fully understand the impact would
probably do it wrong and waste even more resources. Then how many cores
should each task use if the user didn't specify it? I do see
"spark.task.cpus" as the answer here. The point I want to make is that
"spark.task.cpus", though less than ideal, is still needed even when we
have task-level requests for CPUs.
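
For context, a minimal sketch of what application-level defaults look like
from the user side. It uses the configuration keys that eventually shipped
in Spark 3.x, which differ slightly from the draft names quoted in this
thread; the discovery-script path is a made-up example and a GPU-equipped
cluster is assumed:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("accelerator-aware-sketch")
    # Application-level defaults: each task asks for 1 CPU core and 1 GPU.
    .config("spark.task.cpus", "1")
    .config("spark.task.resource.gpu.amount", "1")
    # Each executor advertises 2 GPUs, located by a site-specific script
    # (hypothetical path).
    .config("spark.executor.resource.gpu.amount", "2")
    .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
    .getOrCreate()
)

With these defaults, an executor with 4 cores and 2 GPUs runs at most two
such tasks concurrently, which illustrates the application-level trade-off
discussed above.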

On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra 
wrote:

> I remain unconvinced that a default configuration at the application level
> makes sense even in that case. There may be some applications where you
> know a priori that almost all the tasks for all the stages for all the jobs
> will need some fixed number of gpus; but I think the more common cases will
> be dynamic configuration at the job or stage level. Stage level could have
> a lot of overlap with barrier mode scheduling -- barrier mode stages having
> a need for an inter-task channel resource, gpu-ified stages needing gpu
> resources, etc. Have I mentioned that I'm not a fan of the current barrier
> mode API, Xiangrui? :) Yes, I know: "Show me something better."
>
> On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng  wrote:
>
>> Say if we support per-task resource requests in the future, it would be
>> still inconvenient for users to declare the resource requirements for every
>> single task/stage. So there must be some default values defined somewhere
>> for task resource requirements. "spark.task.cpus" and
>> "spark.task.accelerator.gpu.count" could serve for this purpose without
>> introducing breaking changes. So I'm +1 on the updated SPIP. It fairly
>> separated necessary GPU support from risky scheduler changes.
>>
>> On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra 
>> wrote:
>>
>>> Of course there is an issue of the perfect becoming the enemy of the
>>> good, so I can understand the impulse to get something done. I am left
>>> wanting, however, at least something more of a roadmap to a task-level
>>> future than just a vague "we may choose to do something more in the
>>> future." At the risk of repeating myself, I don't think the
>>> existing spark.task.cpus is very good, and I think that building more on
>>> that weak foundation without a more clear path or stated intention to move
>>> to something better runs the risk of leaving Spark stuck in a bad
>>> neighborhood.
>>>
>>> On Thu, Mar 21, 2019 at 10:10 AM Tom Graves 
>>> wrote:
>>>
>>>> While I agree with you that it would be ideal to have the task level
>>>> resources and do a deeper redesign for the scheduler, I think that can be a
>>>> separate enhancement like was discussed earlier in the thread. That feature
>>>> is useful without GPU's.  I do realize that they overlap some but I think
>>>> the changes for this will be minimal to the scheduler, follow existing
>>>> conventions, and it is an improvement over what we have now. I know many
>>>> users will be happy to have this even without the task level scheduling as
>>>> many of the conventions used now to scheduler gpus can easily be broken by
>>>> one bad user. I think from

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
il.com> wrote:
>>
>>
>> Thanks for this SPIP.
>> I cannot comment on the docs, but just wanted to highlight one thing. In
>> page 5 of the SPIP, when we talk about DRA, I see:
>>
>> "For instance, if each executor consists 4 CPUs and 2 GPUs, and each
>> task requires 1 CPU and 1GPU, then we shall throw an error on application
>> start because we shall always have at least 2 idle CPUs per executor"
>>
>> I am not sure this is a correct behavior. We might have tasks requiring
>> only CPU running in parallel as well, hence that may make sense. I'd rather
>> emit a WARN or something similar. Anyway we just said we will keep GPU
>> scheduling on task level out of scope for the moment, right?
>>
>> Thanks,
>> Marco
>>
>> Il giorno gio 21 mar 2019 alle ore 01:26 Xiangrui Meng <
>> m...@databricks.com> ha scritto:
>>
>> Steve, the initial work would focus on GPUs, but we will keep the
>> interfaces general to support other accelerators in the future. This was
>> mentioned in the SPIP and draft design.
>>
>> Imran, you should have comment permission now. Thanks for making a pass!
>> I don't think the proposed 3.0 features should block Spark 3.0 release
>> either. It is just an estimate of what we could deliver. I will update the
>> doc to make it clear.
>>
>> Felix, it would be great if you can review the updated docs and let us
>> know your feedback.
>>
>> ** How about setting a tentative vote closing time to next Tue (Mar 26)?
>>
>> On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid 
>> wrote:
>>
>> Thanks for sending the updated docs.  Can you please give everyone the
>> ability to comment?  I have some comments, but overall I think this is a
>> good proposal and addresses my prior concerns.
>>
>> My only real concern is that I notice some mention of "must dos" for
>> spark 3.0.  I don't want to make any commitment to holding spark 3.0 for
>> parts of this, I think that is an entirely separate decision.  However I'm
>> guessing this is just a minor wording issue, and you really mean that's a
>> minimal set of features you are aiming for, which is reasonable.
>>
>> On Mon, Mar 18, 2019 at 12:56 PM Xingbo Jiang 
>> wrote:
>>
>> Hi all,
>>
>> I updated the SPIP doc
>> <https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit#>
>> and stories
>> <https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit#heading=h.udyua28eu3sg>,
>> I hope it now contains clear scope of the changes and enough details for
>> SPIP vote.
>> Please review the updated docs, thanks!
>>
>> On Wed, Mar 6, 2019 at 8:35 AM Xiangrui Meng  wrote:
>>
>> How about letting Xingbo make a major revision to the SPIP doc to make it
>> clear what proposed are? I like Felix's suggestion to switch to the new
>> Heilmeier template, which helps clarify what are proposed and what are not.
>> Then let's review the new SPIP and resume the vote.
>>
>> On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid  wrote:
>>
>> OK, I suppose then we are getting bogged down into what a vote on an SPIP
>> means then anyway, which I guess we can set aside for now.  With the level
>> of detail in this proposal, I feel like there is a reasonable chance I'd
>> still -1 the design or implementation.
>>
>> And the other thing you're implicitly asking the community for is to
>> prioritize this feature for continued review and maintenance.  There is
>> already work to be done in things like making barrier mode support dynamic
>> allocation (SPARK-24942), bugs in failure handling (eg. SPARK-25250), and
>> general efficiency of failure handling (eg. SPARK-25341, SPARK-20178).  I'm
>> very concerned about getting spread too thin.
>>
>>
>> But if this is really just a vote on (1) is better gpu support important
>> for spark, in some form, in some release? and (2) is it *possible* to do
>> this in a safe way?  then I will vote +0.
>>
>> On Tue, Mar 5, 2019 at 8:25 AM Tom Graves  wrote:
>>
>> So to me most of the questions here are implementation/design questions,
>> I've had this issue in the past with SPIP's where I expected to have more
>> high level design details but was basically told that belongs in the design
>> jira follow on. This makes me think we need to revisit what a SPIP really
>> need to contain, which should be done in a separate thread.  Note
>> personally I would be for having more high l

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-20 Thread Xiangrui Meng
Steve, the initial work would focus on GPUs, but we will keep the
interfaces general to support other accelerators in the future. This was
mentioned in the SPIP and draft design.

Imran, you should have comment permission now. Thanks for making a pass! I
don't think the proposed 3.0 features should block the Spark 3.0 release
either. It is just an estimate of what we could deliver. I will update the
doc to make that clear.

Felix, it would be great if you can review the updated docs and let us know
your feedback.

** How about setting a tentative vote closing time to next Tue (Mar 26)?

On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid  wrote:

> Thanks for sending the updated docs.  Can you please give everyone the
> ability to comment?  I have some comments, but overall I think this is a
> good proposal and addresses my prior concerns.
>
> My only real concern is that I notice some mention of "must dos" for spark
> 3.0.  I don't want to make any commitment to holding spark 3.0 for parts of
> this, I think that is an entirely separate decision.  However I'm guessing
> this is just a minor wording issue, and you really mean that's a minimal
> set of features you are aiming for, which is reasonable.
>
> On Mon, Mar 18, 2019 at 12:56 PM Xingbo Jiang 
> wrote:
>
>> Hi all,
>>
>> I updated the SPIP doc
>> <https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit#>
>> and stories
>> <https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit#heading=h.udyua28eu3sg>,
>> I hope it now contains clear scope of the changes and enough details for
>> SPIP vote.
>> Please review the updated docs, thanks!
>>
>> On Wed, Mar 6, 2019 at 8:35 AM Xiangrui Meng  wrote:
>>
>>> How about letting Xingbo make a major revision to the SPIP doc to make
>>> it clear what proposed are? I like Felix's suggestion to switch to the new
>>> Heilmeier template, which helps clarify what are proposed and what are not.
>>> Then let's review the new SPIP and resume the vote.
>>>
>>> On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid 
>>> wrote:
>>>
>>>> OK, I suppose then we are getting bogged down into what a vote on an
>>>> SPIP means then anyway, which I guess we can set aside for now.  With the
>>>> level of detail in this proposal, I feel like there is a reasonable chance
>>>> I'd still -1 the design or implementation.
>>>>
>>>> And the other thing you're implicitly asking the community for is to
>>>> prioritize this feature for continued review and maintenance.  There is
>>>> already work to be done in things like making barrier mode support dynamic
>>>> allocation (SPARK-24942), bugs in failure handling (eg. SPARK-25250), and
>>>> general efficiency of failure handling (eg. SPARK-25341, SPARK-20178).  I'm
>>>> very concerned about getting spread too thin.
>>>>
>>>
>>>> But if this is really just a vote on (1) is better gpu support
>>>> important for spark, in some form, in some release? and (2) is it
>>>> *possible* to do this in a safe way?  then I will vote +0.
>>>>
>>>> On Tue, Mar 5, 2019 at 8:25 AM Tom Graves  wrote:
>>>>
>>>>> So to me most of the questions here are implementation/design
>>>>> questions, I've had this issue in the past with SPIP's where I expected to
>>>>> have more high level design details but was basically told that belongs in
>>>>> the design jira follow on. This makes me think we need to revisit what a
>>>>> SPIP really need to contain, which should be done in a separate thread.
>>>>> Note personally I would be for having more high level details in it.
>>>>> But the way I read our documentation on a SPIP right now that detail
>>>>> is all optional, now maybe we could argue its based on what reviewers
>>>>> request, but really perhaps we should make the wording of that more
>>>>> required.  thoughts?  We should probably separate that discussion if 
>>>>> people
>>>>> want to talk about that.
>>>>>
>>>>> For this SPIP in particular the reason I +1 it is because it came down
>>>>> to 2 questions:
>>>>>
>>>>> 1) do I think spark should support this -> my answer is yes, I think
>>>>> this would improve spark, users have been requesting both better GPUs
>>>>> support and support for controlling container requests at a finer
>>&

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-19 Thread Xiangrui Meng
Sean, thanks for your input and making a pass on the updated SPIP!

As the next step, how about having a remote meeting to discuss the
remaining topics? I started a doodle poll here
<https://doodle.com/poll/33cthyc6f8i8naya>. Due to time constraints, I
suggest limiting the attendees to committers and posting the meeting
summary to JIRA afterward.

On Tue, Mar 19, 2019 at 10:16 AM Sean Owen  wrote:

> This looks like a great level of detail. The broad strokes look good to me.
>
> I'm happy with just about any story around what to do with Mesos GPU
> support now, but might at least deserve a mention: does the existing
> Mesos config simply become a deprecated alias for the
> spark.executor.accelerator.gpu.count? and no further support is added
> to Mesos? that seems entirely coherent, and if that's agreeable, could
> be worth a line here.
>

I would go with the deprecated alias option. But I would defer the decision
to a committer who is willing to shepherd the Mesos sub-project.


>
> I think it could go into Spark 3 but need not block it. This doesn't
> say it does, merely says it's desirable to have it ready for 3.0 if
> possible. That seems like a fine position.
>
> On Mon, Mar 18, 2019 at 1:56 PM Xingbo Jiang 
> wrote:
> >
> > Hi all,
> >
> > I updated the SPIP doc and stories, I hope it now contains clear scope
> of the changes and enough details for SPIP vote.
> > Please review the updated docs, thanks!
> >
> >> On Wed, Mar 6, 2019 at 8:35 AM Xiangrui Meng  wrote:
> >>
> >> How about letting Xingbo make a major revision to the SPIP doc to make
> it clear what proposed are? I like Felix's suggestion to switch to the new
> Heilmeier template, which helps clarify what are proposed and what are not.
> Then let's review the new SPIP and resume the vote.
> >>
> >> On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid 
> wrote:
> >>>
> >>> OK, I suppose then we are getting bogged down into what a vote on an
> SPIP means then anyway, which I guess we can set aside for now.  With the
> level of detail in this proposal, I feel like there is a reasonable chance
> I'd still -1 the design or implementation.
> >>>
> >>> And the other thing you're implicitly asking the community for is to
> prioritize this feature for continued review and maintenance.  There is
> already work to be done in things like making barrier mode support dynamic
> allocation (SPARK-24942), bugs in failure handling (eg. SPARK-25250), and
> general efficiency of failure handling (eg. SPARK-25341, SPARK-20178).  I'm
> very concerned about getting spread too thin.
> >>>
> >>>
> >>> But if this is really just a vote on (1) is better gpu support
> important for spark, in some form, in some release? and (2) is it
> *possible* to do this in a safe way?  then I will vote +0.
> >>>
> >>> On Tue, Mar 5, 2019 at 8:25 AM Tom Graves 
> wrote:
> >>>>
> >>>> So to me most of the questions here are implementation/design
> questions, I've had this issue in the past with SPIP's where I expected to
> have more high level design details but was basically told that belongs in
> the design jira follow on. This makes me think we need to revisit what a
> SPIP really need to contain, which should be done in a separate thread.
> Note personally I would be for having more high level details in it.
> >>>> But the way I read our documentation on a SPIP right now that detail
> is all optional, now maybe we could argue its based on what reviewers
> request, but really perhaps we should make the wording of that more
> required.  thoughts?  We should probably separate that discussion if people
> want to talk about that.
> >>>>
> >>>> For this SPIP in particular the reason I +1 it is because it came
> down to 2 questions:
> >>>>
> >>>> 1) do I think spark should support this -> my answer is yes, I think
> this would improve spark, users have been requesting both better GPUs
> support and support for controlling container requests at a finer
> granularity for a while.  If spark doesn't support this then users may go
> to something else, so I think it we should support it
> >>>>
> >>>> 2) do I think its possible to design and implement it without causing
> large instabilities?   My opinion here again is yes. I agree with Imran and
> others that the scheduler piece needs to be looked at very closely as we
> have had a lot of issues there and that is why I was asking for more
> details in the design jira:
> https://issues.apache.org/jira/browse/SPARK-27005.  But I

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-05 Thread Xiangrui Meng
How about letting Xingbo make a major revision to the SPIP doc to make it
clear what is proposed? I like Felix's suggestion to switch to the new
Heilmeier template, which helps clarify what is proposed and what is not.
Then let's review the new SPIP and resume the vote.

On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid  wrote:

> OK, I suppose then we are getting bogged down into what a vote on an SPIP
> means then anyway, which I guess we can set aside for now.  With the level
> of detail in this proposal, I feel like there is a reasonable chance I'd
> still -1 the design or implementation.
>
> And the other thing you're implicitly asking the community for is to
> prioritize this feature for continued review and maintenance.  There is
> already work to be done in things like making barrier mode support dynamic
> allocation (SPARK-24942), bugs in failure handling (eg. SPARK-25250), and
> general efficiency of failure handling (eg. SPARK-25341, SPARK-20178).  I'm
> very concerned about getting spread too thin.
>

> But if this is really just a vote on (1) is better gpu support important
> for spark, in some form, in some release? and (2) is it *possible* to do
> this in a safe way?  then I will vote +0.
>
> On Tue, Mar 5, 2019 at 8:25 AM Tom Graves  wrote:
>
>> So to me most of the questions here are implementation/design questions,
>> I've had this issue in the past with SPIP's where I expected to have more
>> high level design details but was basically told that belongs in the design
>> jira follow on. This makes me think we need to revisit what a SPIP really
>> need to contain, which should be done in a separate thread.  Note
>> personally I would be for having more high level details in it.
>> But the way I read our documentation on a SPIP right now that detail is
>> all optional, now maybe we could argue its based on what reviewers request,
>> but really perhaps we should make the wording of that more required.
>>  thoughts?  We should probably separate that discussion if people want to
>> talk about that.
>>
>> For this SPIP in particular the reason I +1 it is because it came down to
>> 2 questions:
>>
>> 1) do I think spark should support this -> my answer is yes, I think this
>> would improve spark, users have been requesting both better GPUs support
>> and support for controlling container requests at a finer granularity for a
>> while.  If spark doesn't support this then users may go to something else,
>> so I think it we should support it
>>
>> 2) do I think its possible to design and implement it without causing
>> large instabilities?   My opinion here again is yes. I agree with Imran and
>> others that the scheduler piece needs to be looked at very closely as we
>> have had a lot of issues there and that is why I was asking for more
>> details in the design jira:
>> https://issues.apache.org/jira/browse/SPARK-27005.  But I do believe its
>> possible to do.
>>
>> If others have reservations on similar questions then I think we should
>> resolve here or take the discussion of what a SPIP is to a different thread
>> and then come back to this, thoughts?
>>
>> Note there is a high level design for at least the core piece, which is
>> what people seem concerned with, already so including it in the SPIP should
>> be straight forward.
>>
>> Tom
>>
>> On Monday, March 4, 2019, 2:52:43 PM CST, Imran Rashid <
>> im...@therashids.com> wrote:
>>
>>
>> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng  wrote:
>>
>> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung 
>> wrote:
>>
>> IMO upfront allocation is less useful. Specifically too expensive for
>> large jobs.
>>
>>
>> This is also an API/design discussion.
>>
>>
>> I agree with Felix -- this is more than just an API question.  It has a
>> huge impact on the complexity of what you're proposing.  You might be
>> proposing big changes to a core and brittle part of spark, which is already
>> short of experts.
>>
>> I don't see any value in having a vote on "does feature X sound cool?"
>> We have to evaluate the potential benefit against the risks the feature
>> brings and the continued maintenance cost.  We don't need super low-level
>> details, but we have to a sketch of the design to be able to make that
>> tradeoff.
>>
>


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
On Mon, Mar 4, 2019 at 3:10 PM Mark Hamstra  wrote:

> :) Sorry, that was ambiguous. I was seconding Imran's comment.
>

Could you also help review Xingbo's design sketch and help evaluate the
cost?


>
> On Mon, Mar 4, 2019 at 3:09 PM Xiangrui Meng  wrote:
>
>>
>>
>> On Mon, Mar 4, 2019 at 1:56 PM Mark Hamstra 
>> wrote:
>>
>>> +1
>>>
>>
>> Mark, just to be clear, are you +1 on the SPIP or Imran's point?
>>
>>
>>>
>>> On Mon, Mar 4, 2019 at 12:52 PM Imran Rashid 
>>> wrote:
>>>
>>>> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng  wrote:
>>>>
>>>>> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung <
>>>>> felixcheun...@hotmail.com> wrote:
>>>>>
>>>>>> IMO upfront allocation is less useful. Specifically too expensive for
>>>>>> large jobs.
>>>>>>
>>>>>
>>>>> This is also an API/design discussion.
>>>>>
>>>>
>>>> I agree with Felix -- this is more than just an API question.  It has a
>>>> huge impact on the complexity of what you're proposing.  You might be
>>>> proposing big changes to a core and brittle part of spark, which is already
>>>> short of experts.
>>>>
>>>
>> To my understanding, Felix's comment is mostly on the user interfaces,
>> stating upfront allocation is less useful, specially for large jobs. I
>> agree that for large jobs we better have dynamic allocation, which was
>> mentioned in the YARN support section in the companion scoping doc. We
>> restrict the new container type to initially requested to keep things
>> simple. However upfront allocation already meets the requirements of basic
>> workflows like data + DL training/inference + data. Saying "it is less
>> useful specifically for large jobs" kinda missed the fact that "it is super
>> useful for basic use cases".
>>
>> Your comment is mostly on the implementation side, which IMHO it is the
>> KEY question to conclude this vote: does the design sketch sufficiently
>> demonstrate that the internal changes to Spark scheduler is manageable? I
>> read Xingbo's design sketch and I think it is doable, which led to my +1.
>> But I'm not an expert on the scheduler. So I would feel more confident if
>> the design was reviewed by some scheduler experts. I also read the design
>> sketch to support different cluster managers, which I think is less
>> critical than the internal scheduler changes.
>>
>>
>>>
>>>> I don't see any value in having a vote on "does feature X sound cool?"
>>>>
>>>
>> I believe no one would disagree. To prepare the companion doc, we went
>> through several rounds of discussions to provide concrete stories such that
>> the proposal is not just "cool".
>>
>>
>>>
>>>>
>>> We have to evaluate the potential benefit against the risks the feature
>>>> brings and the continued maintenance cost.  We don't need super low-level
>>>> details, but we have to a sketch of the design to be able to make that
>>>> tradeoff.
>>>>
>>>
>> Could you review the design sketch from Xingbo, help evaluate the cost,
>> and provide feedback?
>>
>>
>


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
On Mon, Mar 4, 2019 at 1:56 PM Mark Hamstra  wrote:

> +1
>

Mark, just to be clear, are you +1 on the SPIP or Imran's point?


>
> On Mon, Mar 4, 2019 at 12:52 PM Imran Rashid  wrote:
>
>> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng  wrote:
>>
>>> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung 
>>> wrote:
>>>
>>>> IMO upfront allocation is less useful. Specifically too expensive for
>>>> large jobs.
>>>>
>>>
>>> This is also an API/design discussion.
>>>
>>
>> I agree with Felix -- this is more than just an API question.  It has a
>> huge impact on the complexity of what you're proposing.  You might be
>> proposing big changes to a core and brittle part of spark, which is already
>> short of experts.
>>
>
To my understanding, Felix's comment is mostly about the user interfaces,
stating that upfront allocation is less useful, especially for large jobs. I
agree that for large jobs we'd better have dynamic allocation, which was
mentioned in the YARN support section of the companion scoping doc. We
restrict the new container type to the initial request to keep things
simple. However, upfront allocation already meets the requirements of basic
workflows like data + DL training/inference + data. Saying "it is less
useful, specifically for large jobs" misses the fact that it is super
useful for basic use cases.

Your comment is mostly on the implementation side, which IMHO is the KEY
question for concluding this vote: does the design sketch sufficiently
demonstrate that the internal changes to the Spark scheduler are manageable?
I read Xingbo's design sketch and I think it is doable, which led to my +1.
But I'm not an expert on the scheduler, so I would feel more confident if
the design were reviewed by some scheduler experts. I also read the design
sketch for supporting different cluster managers, which I think is less
critical than the internal scheduler changes.


>
>> I don't see any value in having a vote on "does feature X sound cool?"
>>
>
I believe no one would disagree. To prepare the companion doc, we went
through several rounds of discussions to provide concrete stories such that
the proposal is not just "cool".


>
>>
> We have to evaluate the potential benefit against the risks the feature
>> brings and the continued maintenance cost.  We don't need super low-level
>> details, but we have to a sketch of the design to be able to make that
>> tradeoff.
>>
>
Could you review the design sketch from Xingbo, help evaluate the cost, and
provide feedback?


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
On Mon, Mar 4, 2019 at 8:23 AM Xiangrui Meng  wrote:

>
>
> On Mon, Mar 4, 2019 at 7:24 AM Sean Owen  wrote:
>
>> To be clear, those goals sound fine to me. I don't think voting on
>> those two broad points is meaningful, but, does no harm per se. If you
>> mean this is just a check to see if people believe this is broadly
>> worthwhile, then +1 from me. Yes it is.
>>
>> That means we'd want to review something more detailed later, whether
>> it's a a) design doc we vote on or b) a series of pull requests. Given
>> the number of questions this leaves open, a) sounds better and I think
>> what you're suggesting. I'd call that the SPIP, but, so what, it's
>> just a name. The thing is, a) seems already mostly done, in the second
>> document that was attached.
>
>
> It is far from done. We still need to review the APIs and the design for
> each major component:
>
> * Internal changes to Spark job scheduler.
> * Interfaces exposed to users.
> * Interfaces exposed to cluster managers.
> * Standalone / auto-discovery.
> * YARN
> * K8s
> * Mesos
> * Jenkins
>
> I try to avoid discussing each of them in this thread because they require
> different domain experts. After we have a high-level agreement on adding
> accelerator support to Spark. We can kick off the work in parallel. If any
> committer thinks a follow-up work still needs an SPIP, we just follow the
> SPIP process to resolve it.
>
>
>> I'm hesitating because i'm not sure why
>> it's important to not discuss that level of detail here, as it's
>> already available. Just too much noise?
>
>
> Yes. If we go down one or two levels, we might have to pull in different
> domain experts for different questions.
>
>
>> but voting for this seems like
>> endorsing those decisions, as I can only assume the proposer is going
>> to continue the design with those decisions in mind.
>>
>
> That is certainly not the purpose, which was why there were two docs, not
> just one SPIP. The purpose of the companion doc is just to give some
> concrete stories and estimate what could be done in Spark 3.0. Maybe we
> should update the SPIP doc and make it clear that certain features are
> pending follow-up discussions.
>
>
>>
>> What's the next step in your view, after this, and before it's
>> implemented? as long as there is one, sure, let's punt. Seems like we
>> could begin that conversation nowish.
>>
>
> We should assign each major component an "owner" who can lead the
> follow-up work, e.g.,
>
> * Internal changes to Spark scheduler
> * Interfaces to cluster managers and users
> * Standalone support
> * YARN support
> * K8s support
> * Mesos support
> * Test infrastructure
> * FPGA
>
> Again, for each component the question we should answer first is "Is it
> important?" and then "How to implement it?". Community members who are
> interested in each discussion should subscribe to the corresponding JIRA.
> If some committer think we need a follow-up SPIP, either to make more
> members aware of the changes or to reach agreement, feel free to call it
> out.
>
>
>>
>> Many of those questions you list are _fine_ for a SPIP, in my opinion.
>> (Of course, I'd add what cluster managers are in/out of scope.)
>>
>
> I think the two requires more discussion are Mesos and K8s. Let me follow
> what I suggested above and try to answer two questions for each:
>
> Mesos:
> * Is it important? There are certainly Spark/Mesos users but the overall
> usage is going downhill. See the attached Google Trend snapshot.
>

[image: Screen Shot 2019-03-04 at 8.10.50 AM.png]


> * How to implement it? I believe it is doable, similarly to other cluster
> managers. However, we need to find someone from our community to do the
> work. If we cannot find such a person, it would indicate that the feature
> is not that important.
>
> K8s:
> * Is it important? K8s is the fastest growing manager. But the current
> Spark support is experimental. Building features on top would add
> additional cost if we want to make changes.
> * How to implement it? There is a sketch in the companion doc. Yinan
> mentioned three options to expose the inferences to users. We need to
> finalize the design and discuss which option is the best to go.
>
> You see that such discussions can be done in parallel. It is not efficient
> if we block the work on K8s because we cannot decide whether we should
> support Mesos.
>
>
>>
>>
>> On Mon, Mar 4, 2019 at 9:07 AM Xiangrui Meng  wrote:
>> >

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
On Mon, Mar 4, 2019 at 7:24 AM Sean Owen  wrote:

> To be clear, those goals sound fine to me. I don't think voting on
> those two broad points is meaningful, but, does no harm per se. If you
> mean this is just a check to see if people believe this is broadly
> worthwhile, then +1 from me. Yes it is.
>
> That means we'd want to review something more detailed later, whether
> it's a a) design doc we vote on or b) a series of pull requests. Given
> the number of questions this leaves open, a) sounds better and I think
> what you're suggesting. I'd call that the SPIP, but, so what, it's
> just a name. The thing is, a) seems already mostly done, in the second
> document that was attached.


It is far from done. We still need to review the APIs and the design for
each major component:

* Internal changes to Spark job scheduler.
* Interfaces exposed to users.
* Interfaces exposed to cluster managers.
* Standalone / auto-discovery.
* YARN
* K8s
* Mesos
* Jenkins

I try to avoid discussing each of them in this thread because they require
different domain experts. After we have a high-level agreement on adding
accelerator support to Spark, we can kick off the work in parallel. If any
committer thinks some follow-up work still needs an SPIP, we just follow the
SPIP process to resolve it.


> I'm hesitating because i'm not sure why
> it's important to not discuss that level of detail here, as it's
> already available. Just too much noise?


Yes. If we go down one or two levels, we might have to pull in different
domain experts for different questions.


> but voting for this seems like
> endorsing those decisions, as I can only assume the proposer is going
> to continue the design with those decisions in mind.
>

That is certainly not the purpose, which was why there were two docs, not
just one SPIP. The purpose of the companion doc is just to give some
concrete stories and estimate what could be done in Spark 3.0. Maybe we
should update the SPIP doc and make it clear that certain features are
pending follow-up discussions.


>
> What's the next step in your view, after this, and before it's
> implemented? as long as there is one, sure, let's punt. Seems like we
> could begin that conversation nowish.
>

We should assign each major component an "owner" who can lead the follow-up
work, e.g.,

* Internal changes to Spark scheduler
* Interfaces to cluster managers and users
* Standalone support
* YARN support
* K8s support
* Mesos support
* Test infrastructure
* FPGA

Again, for each component the question we should answer first is "Is it
important?" and then "How to implement it?". Community members who are
interested in each discussion should subscribe to the corresponding JIRA.
If some committer thinks we need a follow-up SPIP, either to make more
members aware of the changes or to reach agreement, feel free to call it
out.


>
> Many of those questions you list are _fine_ for a SPIP, in my opinion.
> (Of course, I'd add what cluster managers are in/out of scope.)
>

I think the two that require more discussion are Mesos and K8s. Let me
follow what I suggested above and try to answer the two questions for each:

Mesos:
* Is it important? There are certainly Spark/Mesos users but the overall
usage is going downhill. See the attached Google Trend snapshot.
* How to implement it? I believe it is doable, similarly to other cluster
managers. However, we need to find someone from our community to do the
work. If we cannot find such a person, it would indicate that the feature
is not that important.

K8s:
* Is it important? K8s is the fastest growing manager. But the current
Spark support is experimental. Building features on top would add
additional cost if we want to make changes.
* How to implement it? There is a sketch in the companion doc. Yinan
mentioned three options for exposing the interfaces to users. We need to
finalize the design and discuss which option is the best way to go.

You see that such discussions can be done in parallel. It is not efficient
if we block the work on K8s because we cannot decide whether we should
support Mesos.


>
>
> On Mon, Mar 4, 2019 at 9:07 AM Xiangrui Meng  wrote:
> >
> > What finer "high level" goals do you recommend? To make progress on the
> vote, it would be great if you can articulate more. Current SPIP proposes
> two high-level changes to make Spark accelerator-aware:
> >
> > At cluster manager level, we update or upgrade cluster managers to
> include GPU support. Then we expose user interfaces for Spark to request
> GPUs from them.
> > Within Spark, we update its scheduler to understand available GPUs
> allocated to executors, user task requests, and assign GPUs to tasks
> properly.
> >
> > How do you want to change or refine them?

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
What finer "high level" goals do you recommend? To make progress on the
vote, it would be great if you could articulate them more. The current SPIP
proposes two high-level changes to make Spark accelerator-aware:

   - At cluster manager level, we update or upgrade cluster managers to
   include GPU support. Then we expose user interfaces for Spark to request
   GPUs from them.
   - Within Spark, we update its scheduler to understand available GPUs
   allocated to executors, user task requests, and assign GPUs to tasks
   properly.

How do you want to change or refine them? I saw you raised questions around
Horovod requirements and GPU/memory allocation. But there are tens of
questions at the same or even higher level. E.g., in preparing the
companion scoping doc we saw the following questions:

* How to test GPU support on Jenkins?
* Does the solution proposed also work for FPGA? What are the diffs?
* How to make standalone workers auto-discover GPU resources?
* Do we want to allow users to request GPU resources in Pandas UDF?
* How does user pass the GPU requests to K8s, spark-submit command-line or
pod template?
* Do we create a separate queue for GPU task scheduling so it doesn't cause
regression on normal jobs?
* How to monitor the utilization of GPU? At what levels?
* Do we want to support GPU-backed physical operators?
* Do we allow users to request both non-default number of CPUs and GPUs?
* ...

IMHO, we cannot, nor should we, answer questions at this level in this vote.
The vote is mainly on whether we should make Spark accelerator-aware to
help unify big data and AI solutions, specifically whether Spark should
provide proper support for deep learning model training and inference, where
accelerators are essential. My +1 vote is based on the following logic:

* It is important for Spark to become the de facto solution for connecting
big data and AI.
* The work is doable given the design sketch and the early
investigation/scoping.

To me, "-1" means either that it is not important for Spark to support such
use cases or that we certainly cannot afford to implement such support. This
is my understanding of the SPIP and the vote. It would be great if you could
elaborate on what changes you want to make or what answers you want to see.

On Sun, Mar 3, 2019 at 11:13 PM Felix Cheung 
wrote:

> Once again, I’d have to agree with Sean.
>
> Let’s table the meaning of SPIP for another time, say. I think a few of us
> are trying to understand what does “accelerator resource aware” mean.
>
> As far as I know, no one is discussing API here. But on google doc, JIRA
> and on email and off list, I have seen questions, questions that are
> greatly concerning, like “oh scheduler is allocating GPU, but how does it
> affect memory” and many more, and so I think finer “high level” goals
> should be defined.
>
>
>
>
> --
> *From:* Sean Owen 
> *Sent:* Sunday, March 3, 2019 5:24 PM
> *To:* Xiangrui Meng
> *Cc:* Felix Cheung; Xingbo Jiang; Yinan Li; dev; Weichen Xu; Marco Gaido
> *Subject:* Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling
>
> I think treating SPIPs as this high-level takes away much of the point
> of VOTEing on them. I'm not sure that's even what Reynold is
> suggesting elsewhere; we're nowhere near discussing APIs here, just
> what 'accelerator aware' even generally means. If the scope isn't
> specified, what are we trying to bind with a formal VOTE? The worst I
> can say is that this doesn't mean much, so the outcome of the vote
> doesn't matter. The general ideas seems fine to me and I support
> _something_ like this.
>
> I think the subtext concern is that SPIPs become a way to request
> cover to make a bunch of decisions separately, later. This is, to some
> extent, how it has to work. A small number of interested parties need
> to decide the details coherently, not design the whole thing by
> committee, with occasional check-ins for feedback. There's a balance
> between that, and using the SPIP as a license to go finish a design
> and proclaim it later. That's not anyone's bad-faith intention, just
> the risk of deferring so much.
>
> Mesos support is not a big deal by itself but a fine illustration of
> the point. That seems like a fine question of scope now, even if the
> 'how' or some of the 'what' can be decided later. I raised an eyebrow
> here at the reply that this was already judged out-of-scope: how much
> are we on the same page about this being a point to consider feedback?
>
> If one wants to VOTE on more details, then this vote just doesn't
> matter much. Is a future step to VOTE on some more detailed design
> doc? Then that's what I call a "SPIP" and it's practically just
> semantics.
>
>
> On Sun, Mar 3

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-03 Thread Xiangrui Meng
Hi Felix,

Just to clarify, we are voting on the SPIP, not the companion scoping doc.
What is proposed and what we are voting on is to make Spark
accelerator-aware. The companion scoping doc and the design sketch are to
help demonstrate what features could be implemented based on the use
cases and dev resources the co-authors are aware of. The exact scoping and
design would require more community involvement; by no means are we
finalizing them in this vote thread.

I think copying the goals and non-goals from the companion scoping doc to
the SPIP caused the confusion. As mentioned in the SPIP, we proposed to
make two major changes at a high level:

   - At cluster manager level, we update or upgrade cluster managers to
   include GPU support. Then we expose user interfaces for Spark to request
   GPUs from them.
   - Within Spark, we update its scheduler to understand available GPUs
   allocated to executors, user task requests, and assign GPUs to tasks
   properly.

We should keep our vote discussion at this level. It doesn't exclude
Mesos/Windows/TPU/FPGA, nor does it commit to supporting YARN/K8s. Through
the initial scoping work, we found that we certainly need domain experts to
discuss the support for each cluster manager and each accelerator type. But
adding more details on Mesos or FPGA doesn't change the SPIP at a high
level. So we concluded the initial scoping, shared the docs, and started
this vote.

I suggest updating the goals and non-goals in the SPIP so we don't turn the
vote into a discussion of support or non-support for a specific cluster
manager.
After we reach a high-level agreement, the work can be fairly distributed.
If there are both strong demand and dev resources from the community for a
specific cluster manager or an accelerator type, I don't see why we should
block the work. If the work requires more discussion, we can start a new
SPIP thread.

Also see my inline comments below.

On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung 
wrote:

> Great points Sean.
>
> Here’s what I’d like to suggest to move forward.
> Split the SPIP.
>
> If we want to propose upfront homogeneous allocation (aka
> spark.task.gpus), this should be one on its own and for instance,
>

This is more of an API/design discussion, which can be done after the
vote. I don't think the feature alone needs a separate SPIP thread. At a
high level, Spark users should be able to request and use GPUs properly.
How to implement that is pending the design.
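
For illustration only, here is a minimal sketch of what an upfront,
homogeneous GPU request could look like from the user side, using the
configuration names floated in this thread (spark.task.gpus and
spark.executor.accelerator.gpu.count). These keys are assumptions taken from
the discussion, not a finalized API.

~~~
from pyspark.sql import SparkSession

# Hypothetical upfront allocation: every executor asks the cluster manager
# for two GPUs and every task is assumed to need one. The config keys are
# the names discussed in this thread, used here only to make the idea
# concrete.
spark = (
    SparkSession.builder
    .appName("accelerator-aware-sketch")
    .config("spark.executor.accelerator.gpu.count", "2")  # per-executor request
    .config("spark.task.gpus", "1")                       # per-task demand
    .getOrCreate()
)

# The scheduler would then place a task on an executor only if that executor
# still has an unassigned GPU, analogous to how spark.task.cpus gates CPU
# slots today.
~~~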


> I really agree with Sean (like I did in the discuss thread) that we can’t
> simply non-goal Mesos. We have enough maintenance issue as it is. And IIRC
> there was a PR proposed for K8S that I’d like to see bring that discussion
> here as well.
>

+1. As I mentioned above, discussing support for each cluster manager
requires domain experts. The goals and non-goals in the SPIP caused this
confusion. I suggest updating the goals and non-goals and then having a
separate discussion for each that doesn't block the main SPIP vote. It
would be great if you or Sean could lead the discussion on Mesos support.


>
> IMO upfront allocation is less useful. Specifically too expensive for
> large jobs.
>

This is also an API/design discussion.


>
> If we want per-stage resource request, this should a full SPIP with a lot
> more details to be hashed out. Our work with Horovod brings a few specific
> and critical requirements on how this should work with distributed DL and I
> would like to see those addressed.
>

The SPIP is designed to not have a lot of details. I agree with what Reynold said
on the Table Metadata thread:

"""
In general it'd be better to have the SPIPs be higher level, and put the
detailed APIs in a separate doc. Alternatively, put them in the SPIP but
explicitly vote on the high level stuff and not the detailed APIs.
"""

Could you create a JIRA and document the list of requirements from Horovod
use cases?


>
> In any case I’d like to see more consensus before moving forward, until
> then I’m going to -1 this.
>
>
>
> --
> *From:* Sean Owen 
> *Sent:* Sunday, March 3, 2019 8:15 AM
> *To:* Felix Cheung
> *Cc:* Xingbo Jiang; Yinan Li; dev; Weichen Xu; Marco Gaido
> *Subject:* Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling
>
> I'm for this in general, at least a +0. I do think this has to have a
> story for what to do with the existing Mesos GPU support, which sounds
> entirely like the spark.task.gpus config here. Maybe it's just a
> synonym? that kind of thing.
>
> Requesting different types of GPUs might be a bridge too far, but,
> that's a P2 detail that can be hashed out later. (For example, if a
> v100 is available and k80 was requested, do you use it or fail? is the
> right level of resource control GPU RAM and cores?)
>
> The per-stage resource requirements sounds like the biggest change;
> you can even change CPU cores requested per pandas UDF? and what about
> memory then? We'll see how that shakes out. That's the only thing I'm
> kind of unsure about 

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Xiangrui Meng
+1

Btw, as Ryan pointed out last time, +0 doesn't mean "Don't really care."
Official definitions here:

https://www.apache.org/foundation/voting.html#expressing-votes-1-0-1-and-fractions


   - +0: 'I don't feel strongly about it, but I'm okay with this.'
   - -0: 'I won't get in the way, but I'd rather we didn't do this.'


On Fri, Mar 1, 2019 at 6:27 AM Mingjie  wrote:

> +1
>
> mingjie
>
> On Mar 1, 2019, at 10:18 PM, Xingbo Jiang  wrote:
>
> Start with +1 from myself.
>
> Xingbo Jiang  wrote on Fri, Mar 1, 2019 at 10:14 PM:
>
>> Hi all,
>>
>> I want to call for a vote of SPARK-24615
>> . It improves Spark
>> by making it aware of GPUs exposed by cluster managers, and hence Spark can
>> match GPU resources with user task requests properly. The proposal
>> 
>>  and production doc
>> 
>>  was
>> made available on dev@ to collect input. Your can also find a design
>> sketch at SPARK-27005 
>> .
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thank you!
>>
>> Xingbo
>>
>


Re: SPIP: Accelerator-aware Scheduling

2019-02-26 Thread Xiangrui Meng
In case there are issues accessing the Google docs, I attached PDF files to
the JIRA.

On Tue, Feb 26, 2019 at 7:41 AM Xingbo Jiang  wrote:

> Hi all,
>
> I want send a revised SPIP on implementing Accelerator(GPU)-aware
> Scheduling. It improves Spark by making it aware of GPUs exposed by cluster
> managers, and hence Spark can match GPU resources with user task requests
> properly. If you have scenarios that need to run workloads(DL/ML/Signal
> Processing etc.) on Spark cluster with GPU nodes, please help review and
> check how it fits into your use cases. Your feedback would be greatly
> appreciated!
>
> # Links to SPIP and Product doc:
>
> * Jira issue for the SPIP:
> https://issues.apache.org/jira/browse/SPARK-24615
> * Google Doc:
> https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit?usp=sharing
> * Product Doc:
> https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit?usp=sharing
>
> Thank you!
>
> Xingbo
>


[VOTE] [RESULT] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-12 Thread Xiangrui Meng
Hi all,

The vote passed with the following +1s (* = binding) and no 0s/-1s:

* Denny Lee
* Jules Damji
* Xiao Li*
* Dongjoon Hyun
* Mingjie Tang
* Yanbo Liang*
* Marco Gaido
* Joseph Bradley*
* Xiangrui Meng*

Please watch SPARK-25994 and join future discussions there. Thanks!

Best,
Xiangrui


Re: [VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-12 Thread Xiangrui Meng
+1 from myself.

The vote passed with the following +1s and no -1s:

* Denny Lee
* Jules Damji
* Xiao Li*
* Dongjoon Hyun
* Mingjie Tang
* Yanbo Liang*
* Marco Gaido
* Joseph Bradley*
* Xiangrui Meng*

I will send a result email soon. Please watch SPARK-25994 for future
discussions. Thanks!

Best,
Xiangrui


On Mon, Feb 11, 2019 at 10:14 PM Joseph Bradley 
wrote:

> +1  This will be a great long-term investment for Spark.
>
> On Wed, Feb 6, 2019 at 8:44 AM Marco Gaido  wrote:
>
>> +1 from me as well.
>>
>> On Wed, Feb 6, 2019 at 4:58 PM Yanbo Liang  wrote:
>>
>>> +1 for the proposal
>>>
>>>
>>>
>>> On Thu, Jan 31, 2019 at 12:46 PM Mingjie Tang 
>>> wrote:
>>>
>>>> +1, this is a very very important feature.
>>>>
>>>> Mingjie
>>>>
>>>> On Thu, Jan 31, 2019 at 12:42 AM Xiao Li  wrote:
>>>>
>>>>> Change my vote from +1 to ++1
>>>>>
>>>>> Xiangrui Meng  wrote on Wed, Jan 30, 2019 at 6:20 AM:
>>>>>
>>>>>> Correction: +0 vote doesn't mean "Don't really care". Thanks Ryan for
>>>>>> the offline reminder! Below is the Apache official interpretation
>>>>>> <https://www.apache.org/foundation/voting.html#expressing-votes-1-0-1-and-fractions>
>>>>>> of fraction values:
>>>>>>
>>>>>> The in-between values are indicative of how strongly the voting
>>>>>> individual feels. Here are some examples of fractional votes and ways in
>>>>>> which they might be intended and interpreted:
>>>>>> +0: 'I don't feel strongly about it, but I'm okay with this.'
>>>>>> -0: 'I won't get in the way, but I'd rather we didn't do this.'
>>>>>> -0.5: 'I don't like this idea, but I can't find any rational
>>>>>> justification for my feelings.'
>>>>>> ++1: 'Wow! I like this! Let's do it!'
>>>>>> -0.9: 'I really don't like this, but I'm not going to stand in the
>>>>>> way if everyone else wants to go ahead with it.'
>>>>>> +0.9: 'This is a cool idea and i like it, but I don't have time/the
>>>>>> skills necessary to help out.'
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 30, 2019 at 12:31 AM Martin Junghanns
>>>>>>  wrote:
>>>>>>
>>>>>>> Hi Dongjoon,
>>>>>>>
>>>>>>> Thanks for the hint! I updated the SPIP accordingly.
>>>>>>>
>>>>>>> I also changed the access permissions for the SPIP and design sketch
>>>>>>> docs so that anyone can comment.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Martin
>>>>>>> On 29.01.19 18:59, Dongjoon Hyun wrote:
>>>>>>>
>>>>>>> Hi, Xiangrui Meng.
>>>>>>>
>>>>>>> +1 for the proposal.
>>>>>>>
>>>>>>> However, please update the following section for this vote. As we
>>>>>>> see, it seems to be inaccurate because today is Jan. 29th. (Almost
>>>>>>> February).
>>>>>>> (Since I cannot comment on the SPIP, I replied here.)
>>>>>>>
>>>>>>> Q7. How long will it take?
>>>>>>>
>>>>>>> - If accepted by the community by the end of December 2018, we
>>>>>>>   predict to be feature complete by mid-end March, allowing for QA
>>>>>>>   during April 2019, making the SPIP part of the next major Spark
>>>>>>>   release (3.0, ETA May, 2019).
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>> On Tue, Jan 29, 2019 at 8:52 AM Xiao Li 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Jules Damji  wrote on Tue, Jan 29, 2019 at 8:14 AM:
>>>>>>>>
>>>>>>>>> +1 (non-binding)
>>>>>>>>> (Heard their proposed tech-talk at Spark + A.I s

Re: [VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-30 Thread Xiangrui Meng
Correction: +0 vote doesn't mean "Don't really care". Thanks Ryan for the
offline reminder! Below is the Apache official interpretation
<https://www.apache.org/foundation/voting.html#expressing-votes-1-0-1-and-fractions>
of fraction values:

The in-between values are indicative of how strongly the voting individual
feels. Here are some examples of fractional votes and ways in which they
might be intended and interpreted:
+0: 'I don't feel strongly about it, but I'm okay with this.'
-0: 'I won't get in the way, but I'd rather we didn't do this.'
-0.5: 'I don't like this idea, but I can't find any rational justification
for my feelings.'
++1: 'Wow! I like this! Let's do it!'
-0.9: 'I really don't like this, but I'm not going to stand in the way if
everyone else wants to go ahead with it.'
+0.9: 'This is a cool idea and i like it, but I don't have time/the skills
necessary to help out.'


On Wed, Jan 30, 2019 at 12:31 AM Martin Junghanns
 wrote:

> Hi Dongjoon,
>
> Thanks for the hint! I updated the SPIP accordingly.
>
> I also changed the access permissions for the SPIP and design sketch docs
> so that anyone can comment.
>
> Best,
>
> Martin
> On 29.01.19 18:59, Dongjoon Hyun wrote:
>
> Hi, Xiangrui Meng.
>
> +1 for the proposal.
>
> However, please update the following section for this vote. As we see, it
> seems to be inaccurate because today is Jan. 29th. (Almost February).
> (Since I cannot comment on the SPIP, I replied here.)
>
> Q7. How long will it take?
>
> - If accepted by the community by the end of December 2018, we predict
>   to be feature complete by mid-end March, allowing for QA during April
>   2019, making the SPIP part of the next major Spark release (3.0, ETA
>   May, 2019).
>
> Bests,
> Dongjoon.
>
> On Tue, Jan 29, 2019 at 8:52 AM Xiao Li  wrote:
>
>> +1
>>
>> Jules Damji  wrote on Tue, Jan 29, 2019 at 8:14 AM:
>>
>>> +1 (non-binding)
>>> (Heard their proposed tech-talk at Spark + A.I summit in London. Well
>>> attended & well received.)
>>>
>>> —
>>> Sent from my iPhone
>>> Pardon the dumb thumb typos :)
>>>
>>> On Jan 29, 2019, at 7:30 AM, Denny Lee  wrote:
>>>
>>> +1
>>>
>>> yay - let's do it!
>>>
>>> On Tue, Jan 29, 2019 at 6:28 AM Xiangrui Meng  wrote:
>>>
>>>> Hi all,
>>>>
>>>> I want to call for a vote of SPARK-25994
>>>> <https://issues.apache.org/jira/browse/SPARK-25994>. It introduces a
>>>> new DataFrame-based component to Spark, which supports property graph
>>>> construction, Cypher queries, and graph algorithms. The proposal
>>>> <https://docs.google.com/document/d/1ljqVsAh2wxTZS8XqwDQgRT6i_mania3ffYSYpEgLx9k/edit>
>>>> was made available on user@
>>>> <https://lists.apache.org/thread.html/269cbffb04a0fbfe2ec298c3e95f01c05b47b5a72838004d27b74169@%3Cuser.spark.apache.org%3E>
>>>> and dev@
>>>> <https://lists.apache.org/thread.html/c4c9c9d31caa4a9be3dd99444e597b43f7cd2823e456be9f108e8193@%3Cdev.spark.apache.org%3E>
>>>>  to
>>>> collect input. You can also find a sketch design doc attached to
>>>> SPARK-26028 <https://issues.apache.org/jira/browse/SPARK-26028>.
>>>>
>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>
>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>> +0: Don't really care.
>>>> -1: I don't think this is a good idea because of the following
>>>> technical reasons.
>>>>
>>>> Best,
>>>> Xiangrui
>>>>
>>>


[VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-29 Thread Xiangrui Meng
Hi all,

I want to call for a vote of SPARK-25994
. It introduces a new
DataFrame-based component to Spark, which supports property graph
construction, Cypher queries, and graph algorithms. The proposal

was made available on user@

and dev@

to
collect input. You can also find a sketch design doc attached to SPARK-26028
.

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical
reasons.

Best,
Xiangrui


SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-15 Thread Xiangrui Meng
Hi all,

I want to re-send the previous SPIP on introducing a DataFrame-based graph
component to collect more feedback. It supports property graphs, Cypher
graph queries, and graph algorithms built on top of the DataFrame API. If
you are a GraphX user or your workload is essentially graph queries, please
help review and check how it fits into your use cases. Your feedback would
be greatly appreciated!

# Links to SPIP and design sketch:

* Jira issue for the SPIP: https://issues.apache.org/jira/browse/SPARK-25994
* Google Doc:
https://docs.google.com/document/d/1ljqVsAh2wxTZS8XqwDQgRT6i_mania3ffYSYpEgLx9k/edit?usp=sharing
* Jira issue for a first design sketch:
https://issues.apache.org/jira/browse/SPARK-26028
* Google Doc:
https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI/edit?usp=sharing

# Sample code:

~~~
val graph = ...

// query
val result = graph.cypher("""
  MATCH (p:Person)-[r:STUDY_AT]->(u:University)
  RETURN p.name, r.since, u.name
""")

// algorithms
val ranks = graph.pageRank.run()
~~~

Best,
Xiangrui


Re: barrier execution mode with DataFrame and dynamic allocation

2018-12-19 Thread Xiangrui Meng
(don't know why your email ends with ".invalid")

On Wed, Dec 19, 2018 at 9:13 AM Xiangrui Meng  wrote:

>
>
> On Wed, Dec 19, 2018 at 7:34 AM Ilya Matiach 
> wrote:
> >
> > [Note: I sent this earlier but it looks like the email was blocked
> because I had another email group on the CC line]
> >
> > Hi Spark Dev,
> >
> > I would like to use the new barrier execution mode introduced in spark
> 2.4 with LightGBM in the spark package mmlspark but I ran into some issues
> and I had a couple questions.
> >
> > Currently, the LightGBM distributed learner tries to figure out the
> number of cores on the cluster and then does a coalesce and a
> mapPartitions, and inside the mapPartitions we do a NetworkInit (where the
> address:port of all workers needs to be passed in the constructor) and pass
> the data in-memory to the native layer of the distributed lightgbm learner.
> >
> >
> >
> > With barrier execution mode, I think the code would become much more
> robust.  However, there are several issues that I am running into when
> trying to move my code over to the new barrier execution mode scheduler:
> >
> > Does not support dynamic allocation – however, I think it would be
> convenient if it restarted the job when the number of workers has decreased
> and allowed the dev to decide whether to restart the job if the number of
> workers increased
>
> How does mmlspark handle dynamic allocation? Do you have a watch thread on
> the driver to restart the job if there are more workers? And when the
> number of workers decrease, can training continue without driver involved?
>
> > Does not work with DataFrame or Dataset API, but I think it would be
> much more convenient if it did
>
> DataFrame/Dataset do not have APIs to let users scan through the entire
> partition. The closest is Pandas UDF, which scans data per batch. I'm
> thinking about the following:
>
> If we change Pandas UDF to take an iterator of record batches (instead of
> a single batch), and per contract we say this iterator will iterate through
> the entire partition. So you only need to do NetworkInit once.
>
> > How does barrier execution mode deal with #partitions > #tasks?  If the
> number of partitions is larger than the number of “tasks” or workers, can
> barrier execution mode automatically coalesce the dataset to have #
> partitions == # tasks?
>
> It will hang there and print warning messages. We didn't assume user code
> can correctly handle dynamic worker sizes.
>
> > It would be convenient to be able to get network information about all
> other workers in the cluster that are in the same barrier execution, eg the
> host address and some task # or identifier of all workers
>
> See getTaskInfos() at
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.BarrierTaskContext
> .
>
> We also provide a barrier() method there to assist simple coordination
> among workers.
>
> >
> >
> >
> > I would love to hear more about this new feature – also I had trouble
> finding documentation (JIRA:
> https://issues.apache.org/jira/browse/SPARK-24374, High level design:
> https://www.slideshare.net/hadoop/the-zoo-expands?qid=b2efbd75-97af-4f71-9add-abf84970eaef&v=&b=&from_search=1),
> are there any good examples of spark packages that have moved to use the
> new barrier execution mode in spark 2.4?
>
> Databricks (which I'm an employee of) implemented HorovodRunner
> <https://databricks.com/blog/2018/11/19/introducing-horovodrunner-for-distributed-deep-learning-training.html>,
> which fully utilizes barrier execution mode. There is also a
> work-in-process open-source integration of Horovod/PySpark from Horovod
> author. Doing distributed deep learning training was the main use case
> considered in the design.
>
> Shall we have an offline meeting or open a JIRA to discuss more details
> about integrating mmlspark w/ barrier execution mode?
>
> >
> >
> >
> > Thank you, Ilya
>


Re: barrier execution mode with DataFrame and dynamic allocation

2018-12-19 Thread Xiangrui Meng
On Wed, Dec 19, 2018 at 7:34 AM Ilya Matiach 
wrote:
>
> [Note: I sent this earlier but it looks like the email was blocked
because I had another email group on the CC line]
>
> Hi Spark Dev,
>
> I would like to use the new barrier execution mode introduced in spark
2.4 with LightGBM in the spark package mmlspark but I ran into some issues
and I had a couple questions.
>
> Currently, the LightGBM distributed learner tries to figure out the
number of cores on the cluster and then does a coalesce and a
mapPartitions, and inside the mapPartitions we do a NetworkInit (where the
address:port of all workers needs to be passed in the constructor) and pass
the data in-memory to the native layer of the distributed lightgbm learner.
>
>
>
> With barrier execution mode, I think the code would become much more
robust.  However, there are several issues that I am running into when
trying to move my code over to the new barrier execution mode scheduler:
>
> Does not support dynamic allocation – however, I think it would be
convenient if it restarted the job when the number of workers has decreased
and allowed the dev to decide whether to restart the job if the number of
workers increased

How does mmlspark handle dynamic allocation? Do you have a watch thread on
the driver to restart the job if there are more workers? And when the
number of workers decreases, can training continue without the driver
involved?

> Does not work with DataFrame or Dataset API, but I think it would be much
more convenient if it did

DataFrame/Dataset do not have APIs to let users scan through the entire
partition. The closest is Pandas UDF, which scans data per batch. I'm
thinking about the following:

If we change Pandas UDF to take an iterator of record batches (instead of a
single batch), then per the contract this iterator will iterate through the
entire partition, so you would only need to do NetworkInit once.
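
To make that contract concrete, here is a minimal sketch of such an
iterator-of-batches function. It is purely illustrative (plain pandas,
runnable without Spark), and the mapInPandas name in the trailing comment is
an assumed entry point for wiring it into DataFrames, not an existing API.

~~~
import pandas as pd

# Illustrative sketch: the per-partition function receives an iterator of
# pandas DataFrames (record batches) and yields result batches. Because the
# iterator spans the whole partition, one-time setup (e.g. a NetworkInit-style
# call) can run once per partition instead of once per batch.
def per_partition(batches):
    # one-time per-partition setup would go here
    rows_seen = 0
    for batch in batches:                 # batch is a pandas.DataFrame
        rows_seen += len(batch)
        yield batch.assign(rows_seen=rows_seen)
    # one-time per-partition teardown would go here

# With a hypothetical DataFrame method accepting such a function, usage
# could look like:
#   result = df.mapInPandas(per_partition, schema="id long, rows_seen long")
~~~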

> How does barrier execution mode deal with #partitions > #tasks?  If the
number of partitions is larger than the number of “tasks” or workers, can
barrier execution mode automatically coalesce the dataset to have #
partitions == # tasks?

It will hang there and print warning messages. We didn't assume user code
can correctly handle dynamic worker sizes.

> It would be convenient to be able to get network information about all
other workers in the cluster that are in the same barrier execution, eg the
host address and some task # or identifier of all workers

See getTaskInfos() at
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.BarrierTaskContext
.

We also provide a barrier() method there to assist simple coordination
among workers.
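
For reference, a minimal PySpark sketch built on the barrier APIs referenced
above (BarrierTaskContext.get(), getTaskInfos(), barrier()), showing how a
partition could gather every peer's address and synchronize before doing a
NetworkInit-style setup. The training logic itself is omitted.

~~~
from pyspark import BarrierTaskContext

def setup_partition(rows):
    ctx = BarrierTaskContext.get()
    # Addresses of all tasks in this barrier stage, e.g. to build the worker
    # list that a NetworkInit-style call expects.
    addresses = [info.address for info in ctx.getTaskInfos()]
    ctx.barrier()  # wait until every task in the stage reaches this point
    # ... per-partition work against `addresses` would go here ...
    yield (ctx.partitionId(), addresses)

# Usage, assuming `rdd` has been coalesced to one partition per worker slot:
#   rdd.barrier().mapPartitions(setup_partition).collect()
~~~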

>
>
>
> I would love to hear more about this new feature – also I had trouble
finding documentation (JIRA:
https://issues.apache.org/jira/browse/SPARK-24374, High level design:
https://www.slideshare.net/hadoop/the-zoo-expands?qid=b2efbd75-97af-4f71-9add-abf84970eaef&v=&b=&from_search=1),
are there any good examples of spark packages that have moved to use the
new barrier execution mode in spark 2.4?

Databricks (which I'm an employee of) implemented HorovodRunner
<https://databricks.com/blog/2018/11/19/introducing-horovodrunner-for-distributed-deep-learning-training.html>,
which fully utilizes barrier execution mode. There is also a
work-in-progress open-source integration of Horovod/PySpark from the
Horovod author. Distributed deep learning training was the main use case
considered in the design.

Shall we have an offline meeting or open a JIRA to discuss more details
about integrating mmlspark w/ barrier execution mode?

>
>
>
> Thank you, Ilya


Re: SPIP: Property Graphs, Cypher Queries, and Algorithms

2018-11-13 Thread Xiangrui Meng
+Joseph Gonzalez  +Ankur Dave


On Tue, Nov 13, 2018 at 2:55 AM Martin Junghanns 
wrote:

> Hi Spark community,
>
> We would like to propose a new graph module for Apache Spark with support
> for Property Graphs, Cypher graph queries and graph algorithms built on top
> of the DataFrame API.
>
> Jira issue for the SPIP: https://issues.apache.org/jira/browse/SPARK-25994
> Google Doc:
> https://docs.google.com/document/d/1ljqVsAh2wxTZS8XqwDQgRT6i_mania3ffYSYpEgLx9k/edit?usp=sharing
>
> Jira issue for a first design sketch:
> https://issues.apache.org/jira/browse/SPARK-26028
> Google Doc:
> https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI/edit?usp=sharing
>
> Thanks,
>
> Martin (on behalf of the Neo4j Cypher for Apache Spark team)
>
-- 

Xiangrui Meng

Software Engineer

Databricks Inc. <http://databricks.com/>


Re: [VOTE] SPARK 2.4.0 (RC5)

2018-11-01 Thread Xiangrui Meng
Just FYI, not to block the release. We found an issue with PySpark barrier
execution mode: https://issues.apache.org/jira/browse/SPARK-25921. We
should list it as a known issue in the release notes and get it fixed in
2.4.1. -Xiangrui

On Thu, Nov 1, 2018 at 12:19 AM Wenchen Fan  wrote:

> This vote passes! I'll follow up with a formal release announcement soon.
>
> +1:
> Xiao Li (binding)
> Sean Owen (binding)
> Gengliang Wang
> Hyukjin Kwon
> Wenchen Fan (binding)
> Ryan Blue
> Bryan Cutler
> Marcelo Vanzin (binding)
> Reynold Xin (binding)
> Chitral Verma
> Dilip Biswal
> Denny Lee
> Felix Cheung (binding)
> Dongjoon Hyun
>
> +0:
> DB Tsai (binding)
>
> -1: None
>
> Thanks, everyone!
>
> On Thu, Nov 1, 2018 at 1:26 PM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> Cheers,
>> Dongjoon.
>>
> --

Xiangrui Meng

Software Engineer

Databricks Inc. <http://databricks.com/>


Re: [VOTE] SPARK 2.4.0 (RC2)

2018-10-01 Thread Xiangrui Meng
>>>>> > If you are a Spark user, you can help us test this release by taking
>>>>> > an existing Spark workload and running on this release candidate,
>>>>> then
>>>>> > reporting any regressions.
>>>>> >
>>>>> > If you're working in PySpark you can set up a virtual env and install
>>>>> > the current RC and see if anything important breaks, in the
>>>>> Java/Scala
>>>>> > you can add the staging repository to your projects resolvers and
>>>>> test
>>>>> > with the RC (make sure to clean up the artifact cache before/after so
>>>>> > you don't end up building with a out of date RC going forward).
>>>>> >
>>>>> > ===
>>>>> > What should happen to JIRA tickets still targeting 2.4.0?
>>>>> > ===
>>>>> >
>>>>> > The current list of open tickets targeted at 2.4.0 can be found at:
>>>>> > https://issues.apache.org/jira/projects/SPARK and search for
>>>>> "Target Version/s" = 2.4.0
>>>>> >
>>>>> > Committers should look at those and triage. Extremely important bug
>>>>> > fixes, documentation, and API tweaks that impact compatibility should
>>>>> > be worked on immediately. Everything else please retarget to an
>>>>> > appropriate release.
>>>>> >
>>>>> > ==
>>>>> > But my bug isn't fixed?
>>>>> > ==
>>>>> >
>>>>> > In order to make timely releases, we will typically not hold the
>>>>> > release unless the bug in question is a regression from the previous
>>>>> > release. That being said, if there is something which is a regression
>>>>> > that has not been correctly targeted please ping me or a committer to
>>>>> > help target the issue.
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>> --

Xiangrui Meng

Software Engineer

Databricks Inc. <http://databricks.com/>


Re: 2.4.0 Blockers, Critical, etc

2018-09-21 Thread Xiangrui Meng
Sean, thanks for checking! The MLlib blockers were resolved today by
reverting breaking API changes. We still have some documentation work to
wrap up. -Xiangrui

+Weichen Xu 

On Fri, Sep 21, 2018 at 6:54 AM Sean Owen  wrote:

> Yes, documentation for 2.4 has to be done before the 2.4 release. Or
> else it's not for 2.4. Likewise auditing that must happen before 2.4,
> must happen before 2.4 is released.
> "Foo for 2.4" as Blocker for 2.4 needs to be resolved for 2.4, by
> definition. Or else it's not a Blocker, not for 2.4.
>
>  I know we've had this discussion before and agree to disagree about
> the semantics. But we won't, say, release 2.4.0 and then go
> retroactively patch the 2.4.0 released docs with docs for 2.4.
>
> Really, I'm just asking if all the things those items mean to cover
> are done? even if for whatever reason the JIRA is not resolved.
>
> We have a new blocker thought, FWIW:
> https://issues.apache.org/jira/browse/SPARK-25495
> On Fri, Sep 21, 2018 at 3:02 AM Felix Cheung 
> wrote:
> >
> > I think the point is we actually need to do these validation before
> completing the release...
> >
> >
> > 
> > From: Wenchen Fan 
> > Sent: Friday, September 21, 2018 12:02 AM
> > To: Sean Owen
> > Cc: Spark dev list
> > Subject: Re: 2.4.0 Blockers, Critical, etc
> >
> > Sean thanks for checking them!
> >
> > I made one pass and re-targeted/closed some of them. Most of them are
> documentation and auditing, do we need to block the release for them?
> >
> > On Fri, Sep 21, 2018 at 6:01 AM Sean Owen  wrote:
> >>
> >> Because we're into 2.4 release candidates, I thought I'd look at
> >> what's still open and targeted at 2.4.0. I presume the Blockers are
> >> the usual umbrellas that don't themselves block anything, but,
> >> confirming, there is nothing left to do there?
> >>
> >> I think that's mostly a question for Joseph and Weichen.
> >>
> >> As ever, anyone who knows these items are a) done or b) not going to
> >> be in 2.4, go ahead and update them.
> >>
> >>
> >> Blocker:
> >>
> >> SPARK-25321 ML, Graph 2.4 QA: API: New Scala APIs, docs
> >> SPARK-25324 ML 2.4 QA: API: Java compatibility, docs
> >> SPARK-25323 ML 2.4 QA: API: Python API coverage
> >> SPARK-25320 ML, Graph 2.4 QA: API: Binary incompatible changes
> >>
> >> Critical:
> >>
> >> SPARK-25319 Spark MLlib, GraphX 2.4 QA umbrella
> >> SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
> >> SPARK-25327 Update MLlib, GraphX websites for 2.4
> >> SPARK-25325 ML, Graph 2.4 QA: Update user guide for new features & APIs
> >> SPARK-25326 ML, Graph 2.4 QA: Programming guide update and migration
> guide
> >>
> >> Other:
> >>
> >> SPARK-25346 Document Spark builtin data sources
> >> SPARK-25347 Document image data source in doc site
> >> SPARK-12978 Skip unnecessary final group-by when input data already
> >> clustered with group-by keys
> >> SPARK-20184 performance regression for complex/long sql when enable
> >> whole stage codegen
> >> SPARK-16196 Optimize in-memory scan performance using ColumnarBatches
> >> SPARK-15693 Write schema definition out for file-based data sources to
> >> avoid schema inference
> >> SPARK-23597 Audit Spark SQL code base for non-interpreted expressions
> >> SPARK-25179 Document the features that require Pyarrow 0.10
> >> SPARK-25110 make sure Flume streaming connector works with Spark 2.4
> >> SPARK-21318 The exception message thrown by `lookupFunction` is
> ambiguous.
> >> SPARK-24464 Unit tests for MLlib's Instrumentation
> >> SPARK-23197 Flaky test:
> spark.streaming.ReceiverSuite."receiver_life_cycle"
> >> SPARK-22809 pyspark is sensitive to imports with dots
> >> SPARK-22739 Additional Expression Support for Objects
> >> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
> >> list of structures
> >> SPARK-21030 extend hint syntax to support any expression for Python and
> R
> >> SPARK-22386 Data Source V2 improvements
> >> SPARK-15117 Generate code that get a value in each compressed column
> >> from CachedBatch when DataFrame.cache() is called
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --

Xiangrui Meng

Software Engineer

Databricks Inc. [image: http://databricks.com] <http://databricks.com/>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Xiangrui Meng
>>>>>> https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join
>>>>>> inner range optimization) but I think it could be useful to others too.
>>>>>> >
>>>>>> > It is finished and is ready to be merged (was ready a month ago at
>>>>>> least).
>>>>>> >
>>>>>> > Do you think you could consider including it in 2.4?
>>>>>> >
>>>>>> > Petar
>>>>>> >
>>>>>> >
>>>>>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>>>>>> >
>>>>>> >> I went through the open JIRA tickets and here is a list that we
>>>>>> should consider for Spark 2.4:
>>>>>> >>
>>>>>> >> High Priority:
>>>>>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>>>>>> >> This one is critical to the Spark ecosystem for deep learning. It
>>>>>> only has a few remaining works and I think we should have it in Spark 
>>>>>> 2.4.
>>>>>> >>
>>>>>> >> Middle Priority:
>>>>>> >> SPARK-23899: Built-in SQL Function Improvement
>>>>>> >> We've already added a lot of built-in functions in this release,
>>>>>> but there are a few useful higher-order functions in progress, like
>>>>>> `array_except`, `transform`, etc. It would be great if we can get them in
>>>>>> Spark 2.4.
>>>>>> >>
>>>>>> >> SPARK-14220: Build and test Spark against Scala 2.12
>>>>>> >> Very close to finishing, great to have it in Spark 2.4.
>>>>>> >>
>>>>>> >> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>>>>>> >> This one is there for years (thanks for your patience Michael!),
>>>>>> and is also close to finishing. Great to have it in 2.4.
>>>>>> >>
>>>>>> >> SPARK-24882: data source v2 API improvement
>>>>>> >> This is to improve the data source v2 API based on what we learned
>>>>>> during this release. From the migration of existing sources and design of
>>>>>> new features, we found some problems in the API and want to address 
>>>>>> them. I
>>>>>> believe this should be
>>>>>> >> the last significant API change to data source v2, so great to
>>>>>> have in Spark 2.4. I'll send a discuss email about it later.
>>>>>> >>
>>>>>> >> SPARK-24252: Add catalog support in Data Source V2
>>>>>> >> This is a very important feature for data source v2, and is
>>>>>> currently being discussed in the dev list.
>>>>>> >>
>>>>>> >> SPARK-24768: Have a built-in AVRO data source implementation
>>>>>> >> Most of it is done, but date/timestamp support is still missing.
>>>>>> Great to have in 2.4.
>>>>>> >>
>>>>>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect
>>>>>> answers
>>>>>> >> This is a long-standing correctness bug, great to have in 2.4.
>>>>>> >>
>>>>>> >> There are some other important features like the adaptive
>>>>>> execution, streaming SQL, etc., not in the list, since I think we are not
>>>>>> able to finish them before 2.4.
>>>>>> >>
>>>>>> >> Feel free to add more things if you think they are important to
>>>>>> Spark 2.4 by replying to this email.
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Wenchen
>>>>>> >>
>>>>>> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen 
>>>>>> wrote:
>>>>>> >>
>>>>>> >>   In theory releases happen on a time-based cadence, so it's
>>>>>> pretty much wrap up what's ready by the code freeze and ship it. In
>>>>>> practice, the cadence slips frequently, and it's very much a negotiation
>>>>>> about what features should push the
>>>>>> >>   code freeze out a few weeks every time. So, kind of a hybrid
>>>>>> approach here that works OK.
>>>>>> >>
>>>>>> >>   Certainly speak up if you think there's something that really
>>>>>> needs to get into 2.4. This is that discuss thread.
>>>>>> >>
>>>>>> >>   (BTW I updated the page you mention just yesterday, to reflect
>>>>>> the plan suggested in this thread.)
>>>>>> >>
>>>>>> >>   On Mon, Jul 30, 2018 at 9:51 AM Tom Graves
>>>>>>  wrote:
>>>>>> >>
>>>>>> >>   Shouldn't this be a discuss thread?
>>>>>> >>
>>>>>> >>   I'm also happy to see more release managers and agree the time
>>>>>> is getting close, but we should see what features are in progress and see
>>>>>> how close things are and propose a date based on that.  Cutting a branch 
>>>>>> to
>>>>>> soon just creates
>>>>>> >>   more work for committers to push to more branches.
>>>>>> >>
>>>>>> >>http://spark.apache.org/versioning-policy.html mentioned the
>>>>>> code freeze and release branch cut mid-august.
>>>>>> >>
>>>>>> >>   Tom
>>>>>> >
>>>>>> >
>>>>>> -
>>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>> >
>>>>>>
>>>>>>
>>>>
>> --

Xiangrui Meng

Software Engineer

Databricks Inc. [image: http://databricks.com] <http://databricks.com/>


[SPARK-24579] SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-06-18 Thread Xiangrui Meng
Hi all,

I posted a new SPIP on optimized data exchange between Spark and DL/AI
frameworks at SPARK-24579
<https://issues.apache.org/jira/browse/SPARK-24579>. It incorporates input from
offline conversations with several Spark committers and contributors at the
Spark+AI Summit conference. Please take a look and let me know your
thoughts in JIRA comments. Thanks!

Best,
Xiangrui
-- 

Xiangrui Meng

Software Engineer

Databricks Inc. [image: http://databricks.com] <http://databricks.com/>


Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Xiangrui Meng
+1 from myself.

The vote passed with the following +1s:

* Susham kumar reddy Yerabolu
* Xingbo Jiang
* Xiao Li*
* Weichen Xu
* Joseph Bradley*
* Henry Robinson
* Xiangrui Meng*
* Wenchen Fan*

Henry, you can find a design sketch at
https://issues.apache.org/jira/browse/SPARK-24375. To help discuss the
design, Xingbo submitted a prototype PR today at
https://github.com/apache/spark/pull/21494.

Best,
Xiangrui

On Mon, Jun 4, 2018 at 12:41 PM Wenchen Fan  wrote:

> +1
>
> On Tue, Jun 5, 2018 at 1:20 AM, Henry Robinson  wrote:
>
>> +1
>>
>> (I hope there will be a fuller design document to review, since the SPIP
>> is really light on details).
>>
>> On 4 June 2018 at 10:17, Joseph Bradley  wrote:
>>
>>> +1
>>>
>>> On Sun, Jun 3, 2018 at 9:59 AM, Weichen Xu 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Fri, Jun 1, 2018 at 3:41 PM, Xiao Li  wrote:
>>>>
>>>>> +1
>>>>>
>>>>> 2018-06-01 15:41 GMT-07:00 Xingbo Jiang :
>>>>>
>>>>>> +1
>>>>>>
>>>>>> 2018-06-01 9:21 GMT-07:00 Xiangrui Meng :
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I want to call for a vote of SPARK-24374
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-24374>. It introduces
>>>>>>> a new execution mode to Spark, which would help both integration with
>>>>>>> external DL/AI frameworks and MLlib algorithm performance. This is one 
>>>>>>> of
>>>>>>> the follow-ups from a previous discussion on dev@
>>>>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Integrating-ML-DL-frameworks-with-Spark-td23913.html>
>>>>>>> .
>>>>>>>
>>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>>> vote:
>>>>>>>
>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>> +0: Don't really care.
>>>>>>> -1: I don't think this is a good idea because of the following
>>>>>>> technical reasons.
>>>>>>>
>>>>>>> Best,
>>>>>>> Xiangrui
>>>>>>> --
>>>>>>>
>>>>>>> Xiangrui Meng
>>>>>>>
>>>>>>> Software Engineer
>>>>>>>
>>>>>>> Databricks Inc. [image: http://databricks.com]
>>>>>>> <http://databricks.com/>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>> [image: http://databricks.com] <http://databricks.com/>
>>>
>>
>>
> --

Xiangrui Meng

Software Engineer

Databricks Inc. [image: http://databricks.com] <http://databricks.com/>


Re: [VOTE] SPIP ML Pipelines in R

2018-06-01 Thread Xiangrui Meng
+1.

On Thu, May 31, 2018 at 2:28 PM Joseph Bradley 
wrote:

> Hossein might be slow to respond (OOO), but I just commented on the JIRA.
> I'd recommend we follow the same process as the SparkR package.
>
> +1 on this from me (and I'll be happy to help shepherd it, though Felix
> and Shivaram are the experts in this area).  CRAN presents challenges, but
> this is a good step towards making R a first-class citizen for ML use cases
> of Spark.
>
> On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Hossein -- Can you clarify what the resolution on the repository /
>> release issue discussed on SPIP ?
>>
>> Shivaram
>>
>> On Thu, May 31, 2018 at 9:06 AM, Felix Cheung 
>> wrote:
>> > +1
>> > With my concerns in the SPIP discussion.
>> >
>> > 
>> > From: Hossein 
>> > Sent: Wednesday, May 30, 2018 2:03:03 PM
>> > To: dev@spark.apache.org
>> > Subject: [VOTE] SPIP ML Pipelines in R
>> >
>> > Hi,
>> >
>> > I started discussion thread for a new R package to expose MLlib
>> pipelines in
>> > R.
>> >
>> > To summarize we will work on utilities to generate R wrappers for MLlib
>> > pipeline API for a new R package. This will lower the burden for
>> exposing
>> > new API in future.
>> >
>> > Following the SPIP process, I am proposing the SPIP for a vote.
>> >
>> > +1: Let's go ahead and implement the SPIP.
>> > +0: Don't really care.
>> > -1: I do not think this is a good idea for the following reasons.
>> >
>> > Thanks,
>> > --Hossein
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] <http://databricks.com/>
>
-- 

Xiangrui Meng

Software Engineer

Databricks Inc. [image: http://databricks.com] <http://databricks.com/>


[VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-01 Thread Xiangrui Meng
Hi all,

I want to call for a vote of SPARK-24374
<https://issues.apache.org/jira/browse/SPARK-24374>. It introduces a new
execution mode to Spark, which would help both integration with external
DL/AI frameworks and MLlib algorithm performance. This is one of the
follow-ups from a previous discussion on dev@
<http://apache-spark-developers-list.1001551.n3.nabble.com/Integrating-ML-DL-frameworks-with-Spark-td23913.html>
.

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical
reasons.

Best,
Xiangrui
-- 

Xiangrui Meng

Software Engineer

Databricks Inc. [image: http://databricks.com] <http://databricks.com/>


Re: Integrating ML/DL frameworks with Spark

2018-05-23 Thread Xiangrui Meng
Hi all,

Thanks for your feedback! I uploaded a SPIP doc for the barrier scheduling
feature at https://issues.apache.org/jira/browse/SPARK-24374. Please take a
look and leave your comments there. I had some offline discussion with +Xingbo
Jiang  to help me design the APIs. He is quite
familiar with the Spark job scheduler and he will share some design ideas on
the JIRA.

I will work on SPIPs for the other two proposals: 1) fast data exchange, 2)
accelerator-aware scheduling. I definitely need some help with the second
one because I'm not familiar with YARN/Mesos/k8s.

Best,
Xiangrui

On Sun, May 20, 2018 at 8:19 PM Felix Cheung 
wrote:

> Very cool. We would be very interested in this.
>
> What is the plan forward to make progress in each of the three areas?
>
>
> --
> *From:* Bryan Cutler 
> *Sent:* Monday, May 14, 2018 11:37:20 PM
> *To:* Xiangrui Meng
> *Cc:* Reynold Xin; dev
>
> *Subject:* Re: Integrating ML/DL frameworks with Spark
> Thanks for starting this discussion, I'd also like to see some
> improvements in this area and glad to hear that the Pandas UDFs / Arrow
> functionality might be useful.  I'm wondering if from your initial
> investigations you found anything lacking from the Arrow format or possible
> improvements that would simplify the data representation?  Also, while data
> could be handed off in a UDF, would it make sense to also discuss a more
> formal way to externalize the data in a way that would also work for the
> Scala API?
>
> Thanks,
> Bryan
>
> On Wed, May 9, 2018 at 4:31 PM, Xiangrui Meng  wrote:
>
>> Shivaram: Yes, we can call it "gang scheduling" or "barrier
>> synchronization". Spark doesn't support it now. The proposal is to have a
>> proper support in Spark's job scheduler, so we can integrate well with
>> MPI-like frameworks.
>>
>>
>> On Tue, May 8, 2018 at 11:17 AM Nan Zhu  wrote:
>>
>>> .how I skipped the last part
>>>
>>> On Tue, May 8, 2018 at 11:16 AM, Reynold Xin 
>>> wrote:
>>>
>>>> Yes, Nan, totally agree. To be on the same page, that's exactly what I
>>>> wrote wasn't it?
>>>>
>>>> On Tue, May 8, 2018 at 11:14 AM Nan Zhu  wrote:
>>>>
>>>>> besides that, one of the things which is needed by multiple frameworks
>>>>> is to schedule tasks in a single wave
>>>>>
>>>>> i.e.
>>>>>
>>>>> if some frameworks like xgboost/mxnet requires 50 parallel workers,
>>>>> Spark is desired to provide a capability to ensure that either we run 50
>>>>> tasks at once, or we should quit the complete application/job after some
>>>>> timeout period
>>>>>
>>>>> Best,
>>>>>
>>>>> Nan
>>>>>
>>>>> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin 
>>>>> wrote:
>>>>>
>>>>>> I think that's what Xiangrui was referring to. Instead of retrying a
>>>>>> single task, retry the entire stage, and the entire stage of tasks need 
>>>>>> to
>>>>>> be scheduled all at once.
>>>>>>
>>>>>>
>>>>>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
>>>>>> shiva...@eecs.berkeley.edu> wrote:
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>- Fault tolerance and execution model: Spark assumes
>>>>>>>>>fine-grained task recovery, i.e. if something fails, only that 
>>>>>>>>> task is
>>>>>>>>>rerun. This doesn’t match the execution model of distributed ML/DL
>>>>>>>>>frameworks that are typically MPI-based, and rerunning a single 
>>>>>>>>> task would
>>>>>>>>>lead to the entire system hanging. A whole stage needs to be 
>>>>>>>>> re-run.
>>>>>>>>>
>>>>>>>>> This is not only useful for integrating with 3rd-party frameworks,
>>>>>>>> but also useful for scaling MLlib algorithms. One of my earliest 
>>>>>>>> attempts
>>>>>>>> in Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended
>>>>>>>> up with some compromised solutions. With the new execution model, we 
>>>>>>>> can
>>>>>>>> set up a hybrid cluster and do all-reduce properly.
>>>>>>>>
>>>>>>>>
>>>>>>> Is there a particular new execution model you are referring to or do
>>>>>>> we plan to investigate a new execution model ?  For the MPI-like model, 
>>>>>>> we
>>>>>>> also need gang scheduling (i.e. schedule all tasks at once or none of 
>>>>>>> them)
>>>>>>> and I dont think we have support for that in the scheduler right now.
>>>>>>>
>>>>>>>>
>>>>>>>>> --
>>>>>>>>
>>>>>>>> Xiangrui Meng
>>>>>>>>
>>>>>>>> Software Engineer
>>>>>>>>
>>>>>>>> Databricks Inc. [image: http://databricks.com]
>>>>>>>> <http://databricks.com/>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>> --
>>
>> Xiangrui Meng
>>
>> Software Engineer
>>
>> Databricks Inc. [image: http://databricks.com] <http://databricks.com/>
>>
>
> --

Xiangrui Meng

Software Engineer

Databricks Inc. [image: http://databricks.com] <http://databricks.com/>


Re: Integrating ML/DL frameworks with Spark

2018-05-09 Thread Xiangrui Meng
Shivaram: Yes, we can call it "gang scheduling" or "barrier
synchronization". Spark doesn't support it now. The proposal is to have a
proper support in Spark's job scheduler, so we can integrate well with
MPI-like frameworks.
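
As a concrete illustration of the gang-scheduling idea, here is a minimal PySpark
sketch written against the RDD.barrier() / BarrierTaskContext API that Spark 2.4
eventually shipped; run_step and the MPI hand-off are hypothetical placeholders,
not something settled in this thread:

# Sketch only: gang scheduling via barrier execution (API as shipped in Spark 2.4+).
from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

def run_step(iterator):
    ctx = BarrierTaskContext.get()
    ctx.barrier()                                    # all tasks start and sync together
    hosts = [info.address for info in ctx.getTaskInfos()]
    # a real job would bootstrap an MPI-like worker ring across `hosts` here
    yield (ctx.partitionId(), len(hosts))

spark = SparkSession.builder.appName("barrier-sketch").getOrCreate()
rdd = spark.sparkContext.parallelize(range(8), 4)
print(rdd.barrier().mapPartitions(run_step).collect())
spark.stop()

The relevant property is the one described above: the whole stage is launched
together and, on failure, retried as a unit rather than task by task.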

On Tue, May 8, 2018 at 11:17 AM Nan Zhu  wrote:

> .how I skipped the last part
>
> On Tue, May 8, 2018 at 11:16 AM, Reynold Xin  wrote:
>
>> Yes, Nan, totally agree. To be on the same page, that's exactly what I
>> wrote wasn't it?
>>
>> On Tue, May 8, 2018 at 11:14 AM Nan Zhu  wrote:
>>
>>> besides that, one of the things which is needed by multiple frameworks
>>> is to schedule tasks in a single wave
>>>
>>> i.e.
>>>
>>> if some frameworks like xgboost/mxnet requires 50 parallel workers,
>>> Spark is desired to provide a capability to ensure that either we run 50
>>> tasks at once, or we should quit the complete application/job after some
>>> timeout period
>>>
>>> Best,
>>>
>>> Nan
>>>
>>> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin 
>>> wrote:
>>>
>>>> I think that's what Xiangrui was referring to. Instead of retrying a
>>>> single task, retry the entire stage, and the entire stage of tasks need to
>>>> be scheduled all at once.
>>>>
>>>>
>>>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
>>>> shiva...@eecs.berkeley.edu> wrote:
>>>>
>>>>>
>>>>>>
>>>>>>>- Fault tolerance and execution model: Spark assumes
>>>>>>>fine-grained task recovery, i.e. if something fails, only that task 
>>>>>>> is
>>>>>>>rerun. This doesn’t match the execution model of distributed ML/DL
>>>>>>>frameworks that are typically MPI-based, and rerunning a single task 
>>>>>>> would
>>>>>>>lead to the entire system hanging. A whole stage needs to be re-run.
>>>>>>>
>>>>>>> This is not only useful for integrating with 3rd-party frameworks,
>>>>>> but also useful for scaling MLlib algorithms. One of my earliest attempts
>>>>>> in Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>>>>>> <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up
>>>>>> with some compromised solutions. With the new execution model, we can set
>>>>>> up a hybrid cluster and do all-reduce properly.
>>>>>>
>>>>>>
>>>>> Is there a particular new execution model you are referring to or do
>>>>> we plan to investigate a new execution model ?  For the MPI-like model, we
>>>>> also need gang scheduling (i.e. schedule all tasks at once or none of 
>>>>> them)
>>>>> and I dont think we have support for that in the scheduler right now.
>>>>>
>>>>>>
>>>>>>> --
>>>>>>
>>>>>> Xiangrui Meng
>>>>>>
>>>>>> Software Engineer
>>>>>>
>>>>>> Databricks Inc. [image: http://databricks.com]
>>>>>> <http://databricks.com/>
>>>>>>
>>>>>
>>>>>
>>>
> --

Xiangrui Meng

Software Engineer

Databricks Inc. [image: http://databricks.com] <http://databricks.com/>


Re: Integrating ML/DL frameworks with Spark

2018-05-07 Thread Xiangrui Meng
Thanks Reynold for summarizing the offline discussion! I added a few
comments inline. -Xiangrui

On Mon, May 7, 2018 at 5:37 PM Reynold Xin  wrote:

> Hi all,
>
> Xiangrui and I were discussing with a heavy Apache Spark user last week on
> their experiences integrating machine learning (and deep learning)
> frameworks with Spark and some of their pain points. Couple things were
> obvious and I wanted to share our learnings with the list.
>
> (1) Most organizations already use Spark for data plumbing and want to be
> able to run their ML part of the stack on Spark as well (not necessarily
> re-implementing all the algorithms but by integrating various frameworks
> like tensorflow, mxnet with Spark).
>
> (2) The integration is however painful, from the systems perspective:
>
>
>- Performance: data exchange between Spark and other frameworks are
>slow, because UDFs across process boundaries (with native code) are slow.
>This works much better now with Pandas UDFs (given a lot of the ML/DL
>frameworks are in Python). However, there might be some low hanging fruit
>gaps here.
>
The Arrow support behind Pandas UDFs can be reused to exchange data with
other frameworks. One possible performance improvement is to support
pipelining when supplying data to other frameworks. For example, while
Spark is pumping data from external sources into TensorFlow, TensorFlow
starts the computation on GPUs. This would significantly improve speed and
resource utilization.
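
As a rough sketch of that exchange path, a scalar Pandas UDF already receives
Arrow-backed pandas batches that could be handed to an external framework;
score_batch below is a hypothetical placeholder for such a hand-off, not a real
framework call:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

def score_batch(batch: pd.Series) -> pd.Series:
    # placeholder for handing the Arrow/pandas batch to TensorFlow, MXNet, etc.
    return batch * 2.0

@pandas_udf("double")
def score(values: pd.Series) -> pd.Series:
    return score_batch(values)

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(float(i),) for i in range(100)], ["x"])
df.select(score("x").alias("scored")).show(5)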

>
>- Fault tolerance and execution model: Spark assumes fine-grained task
>recovery, i.e. if something fails, only that task is rerun. This doesn’t
>match the execution model of distributed ML/DL frameworks that are
>typically MPI-based, and rerunning a single task would lead to the entire
>system hanging. A whole stage needs to be re-run.
>
> This is not only useful for integrating with 3rd-party frameworks, but
also useful for scaling MLlib algorithms. One of my earliest attempts in
Spark MLlib was to implement All-Reduce primitive (SPARK-1485
<https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up with
some compromised solutions. With the new execution model, we can set up a
hybrid cluster and do all-reduce properly.


>
>- Accelerator-aware scheduling: The DL frameworks leverage GPUs and
>sometimes FPGAs as accelerators for speedup, and Spark’s scheduler isn’t
>aware of those resources, leading to either over-utilizing the accelerators
>or under-utilizing the CPUs.
>
>
> The good thing is that none of these seem very difficult to address (and
> we have already made progress on one of them). Xiangrui has graciously
> accepted the challenge to come up with solutions and SPIP to these.
>
>
I will do more homework, exploring existing JIRAs or creating new JIRAs
for the proposals. We'd like to hear your feedback, and about past efforts in
those directions if they were not fully captured by our JIRAs.


> Xiangrui - please also chime in if I didn’t capture everything.
>
>
> --

Xiangrui Meng

Software Engineer

Databricks Inc. [image: http://databricks.com] <http://databricks.com/>


Re: Welcoming Yanbo Liang as a committer

2016-06-07 Thread Xiangrui Meng
Congrats!!

On Mon, Jun 6, 2016, 8:12 AM Gayathri Murali 
wrote:

> Congratulations Yanbo Liang! Well deserved.
>
>
> On Sun, Jun 5, 2016 at 7:10 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Congrats, Yanbo!
>>
>> On Sun, Jun 5, 2016 at 6:25 PM, Liwei Lin  wrote:
>>
>>> Congratulations Yanbo!
>>>
>>> On Mon, Jun 6, 2016 at 7:07 AM, Bryan Cutler  wrote:
>>>
 Congratulations Yanbo!
 On Jun 5, 2016 4:03 AM, "Kousuke Saruta" 
 wrote:

> Congratulations Yanbo!
>
>
> - Kousuke
>
> On 2016/06/04 11:48, Matei Zaharia wrote:
>
>> Hi all,
>>
>> The PMC recently voted to add Yanbo Liang as a committer. Yanbo has
>> been a super active contributor in many areas of MLlib. Please join me in
>> welcoming Yanbo!
>>
>> Matei
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>>>
>>
>


Re: SparkR dataframe error

2016-05-19 Thread Xiangrui Meng
We no longer have `SparkRWrappers` in Spark 2.0. So if you are testing the
latest branch-2.0, there could be an issue with your SparkR installation.
Did you try `R/install-dev.sh`?

On Thu, May 19, 2016 at 11:42 AM Gayathri Murali <
gayathri.m.sof...@gmail.com> wrote:

> This is on Spark 2.0. I see the following on the unit-tests.log when I run
> the R/run-tests.sh. This on a single MAC laptop, on the recently rebased
> master. R version is 3.3.0.
>
> 16/05/19 11:28:13.863 Executor task launch worker-1 ERROR Executor:
> Exception in task 0.0 in stage 5186.0 (TID 10370)
> 1384595 org.apache.spark.SparkException: R computation failed with
> 1384596
> 1384597 Execution halted
> 1384598
> 1384599 Execution halted
> 1384600
> 1384601 Execution halted
> 1384602 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
> 1384603 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
> 1384604 at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> 1384605 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> 1384606 at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> 1384607 at org.apache.spark.scheduler.Task.run(Task.scala:85)
> 1384608 at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 1384609 at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 1384610 at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 1384611 at java.lang.Thread.run(Thread.java:745)
> 1384612 16/05/19 11:28:13.864 Thread-1 INFO ContextHandler: Stopped
> o.s.j.s.ServletContextHandler@22f76fa8{/jobs/json,null,UNAVAILABLE}
> 1384613 16/05/19 11:28:13.869 Thread-1 INFO ContextHandler: Stopped
> o.s.j.s.ServletContextHandler@afe0d9f{/jobs,null,UNAVAILABLE}
> 1384614 16/05/19 11:28:13.869 Thread-1 INFO SparkUI: Stopped Spark web UI
> at http://localhost:4040
> 1384615 16/05/19 11:28:13.871 Executor task launch worker-4 ERROR
> Executor: Exception in task 1.0 in stage 5186.0 (TID 10371)
> 1384616 org.apache.spark.SparkException: R computation failed with
> 1384617
> 1384618 Execution halted
> 1384619
> 1384620 Execution halted
> 1384621
> 1384622 Execution halted
> 1384623 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
> 1384624 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
> 1384625 at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> 1384626 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> 1384627 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 1384630 at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 1384631 at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 1384632 at java.lang.Thread.run(Thread.java:745)
> 1384633 16/05/19 11:28:13.874 nioEventLoopGroup-2-1 INFO DAGScheduler: Job
> 5183 failed: collect at null:-1, took 0.211674 s
> 1384634 16/05/19 11:28:13.875 nioEventLoopGroup-2-1 ERROR RBackendHandler:
> collect on 26345 failed
> 1384635 16/05/19 11:28:13.876 Thread-1 INFO DAGScheduler: ResultStage 5186
> (collect at null:-1) failed in 0.210 s
> 1384636 16/05/19 11:28:13.877 Thread-1 ERROR LiveListenerBus:
> SparkListenerBus has already stopped! Dropping event
> SparkListenerStageCompleted(org.apache.spark.scheduler.StageIn
>  fo@413da307)
> 1384637 16/05/19 11:28:13.878 Thread-1 ERROR LiveListenerBus:
> SparkListenerBus has already stopped! Dropping event
> SparkListenerJobEnd(5183,1463682493877,JobFailed(org.apache.sp
>  ark.SparkException: Job 5183 cancelled because SparkContext was shut down))
> 1384638 16/05/19 11:28:13.880 dispatcher-event-loop-1 INFO
> MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
> 1384639 16/05/19 11:28:13.904 Thread-1 INFO MemoryStore: MemoryStore
> cleared
> 1384640 16/05/19 11:28:13.904 Thread-1 INFO BlockManager: BlockManager
> stopped
> 1384641 16/05/19 11:28:13.904 Thread-1 INFO BlockManagerMaster:
> BlockManagerMaster stopped
> 1384642 16/05/19 11:28:13.905 dispatcher-event-loop-0 INFO
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
> OutputCommitCoordinator stopped!
> 1384643 16/05/19 11:28:13.909 Thread-1 INFO SparkContext: Successfully
> stopped SparkContext
> 1384644 16/05/19 11:28:13.910 Thread-1 INFO ShutdownHookManager: Shutdown
> hook called
> 1384645 16/05/19 11:28:13.911 Thread-1 INFO ShutdownHookManager: Deleting
> directory
> /private/var/folders/xy/qc35m0y55vq83dsqzg066_c4gn/T/spark-dfafdddc-fd25-4eb4-bb1d-565915
>1c8231
>
>
> On Thu, May 19, 2016 at 8:46 A

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Xiangrui Meng
+1

On Thu, May 19, 2016 at 9:18 AM Joseph Bradley 
wrote:

> +1
>
> On Wed, May 18, 2016 at 10:49 AM, Reynold Xin  wrote:
>
>> Hi Ovidiu-Cristian ,
>>
>> The best source of truth is change the filter with target version to
>> 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we
>> get closer to 2.0 release, more will be retargeted at 2.1.0.
>>
>>
>>
>> On Wed, May 18, 2016 at 10:43 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> Yes, I can filter..
>>> Did that and for example:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-15370?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20affectedVersion%20%3D%202.0.0
>>> 
>>>
>>> To rephrase: for 2.0 do you have specific issues that are not a priority
>>> and will released maybe with 2.1 for example?
>>>
>>> Keep up the good work!
>>>
>>> On 18 May 2016, at 18:19, Reynold Xin  wrote:
>>>
>>> You can find that by changing the filter to target version = 2.0.0.
>>> Cheers.
>>>
>>> On Wed, May 18, 2016 at 9:00 AM, Ovidiu-Cristian MARCU <
>>> ovidiu-cristian.ma...@inria.fr> wrote:
>>>
 +1 Great, I see the list of resolved issues, do you have a list of
 known issue you plan to stay with this release?

 with
 build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Phive
 -Phive-thriftserver -DskipTests clean package

 mvn -version
 Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5;
 2015-11-10T17:41:47+01:00)
 Maven home: /Users/omarcu/tools/apache-maven-3.3.9
 Java version: 1.7.0_80, vendor: Oracle Corporation
 Java home:
 /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home/jre
 Default locale: en_US, platform encoding: UTF-8
 OS name: "mac os x", version: "10.11.5", arch: "x86_64", family: “mac"

 [INFO] Reactor Summary:
 [INFO]
 [INFO] Spark Project Parent POM ... SUCCESS [
 2.635 s]
 [INFO] Spark Project Tags . SUCCESS [
 1.896 s]
 [INFO] Spark Project Sketch ... SUCCESS [
 2.560 s]
 [INFO] Spark Project Networking ... SUCCESS [
 6.533 s]
 [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
 4.176 s]
 [INFO] Spark Project Unsafe ... SUCCESS [
 4.809 s]
 [INFO] Spark Project Launcher . SUCCESS [
 6.242 s]
 [INFO] Spark Project Core . SUCCESS
 [01:20 min]
 [INFO] Spark Project GraphX ... SUCCESS [
 9.148 s]
 [INFO] Spark Project Streaming  SUCCESS [
 22.760 s]
 [INFO] Spark Project Catalyst . SUCCESS [
 50.783 s]
 [INFO] Spark Project SQL .. SUCCESS
 [01:05 min]
 [INFO] Spark Project ML Local Library . SUCCESS [
 4.281 s]
 [INFO] Spark Project ML Library ... SUCCESS [
 54.537 s]
 [INFO] Spark Project Tools  SUCCESS [
 0.747 s]
 [INFO] Spark Project Hive . SUCCESS [
 33.032 s]
 [INFO] Spark Project HiveContext Compatibility  SUCCESS [
 3.198 s]
 [INFO] Spark Project REPL . SUCCESS [
 3.573 s]
 [INFO] Spark Project YARN Shuffle Service . SUCCESS [
 4.617 s]
 [INFO] Spark Project YARN . SUCCESS [
 7.321 s]
 [INFO] Spark Project Hive Thrift Server ... SUCCESS [
 16.496 s]
 [INFO] Spark Project Assembly . SUCCESS [
 2.300 s]
 [INFO] Spark Project External Flume Sink .. SUCCESS [
 4.219 s]
 [INFO] Spark Project External Flume ... SUCCESS [
 6.987 s]
 [INFO] Spark Project External Flume Assembly .. SUCCESS [
 1.465 s]
 [INFO] Spark Integration for Kafka 0.8  SUCCESS [
 6.891 s]
 [INFO] Spark Project Examples . SUCCESS [
 13.465 s]
 [INFO] Spark Project External Kafka Assembly .. SUCCESS [
 2.815 s]
 [INFO]
 
 [INFO] BUILD SUCCESS
 [INFO]
 
 [INFO] Total time: 07:04 min
 [INFO] Finished at: 2016-05-18T17:55:33+02:00
 [INFO] Final Memory: 90M/824M
 [INFO]
 

 On 18 May 2016, at 16:28, Sean Owen  wrote:

 I t

Re: SparkR dataframe error

2016-05-19 Thread Xiangrui Meng
Is it on 1.6.x?

On Wed, May 18, 2016, 6:57 PM Sun Rui  wrote:

> I saw it, but I can’t see the complete error message on it.
> I mean the part after “error in invokingJava(…)”
>
> On May 19, 2016, at 08:37, Gayathri Murali 
> wrote:
>
> There was a screenshot attached to my original email. If you did not get
> it, attaching here again.
>
> On Wed, May 18, 2016 at 5:27 PM, Sun Rui  wrote:
>
>> It’s wrong behaviour that head(df) outputs no row
>> Could you send a screenshot displaying whole error message?
>>
>> On May 19, 2016, at 08:12, Gayathri Murali 
>> wrote:
>>
>> I am trying to run a basic example on Interactive R shell and run into
>> the following error. Also note that head(df) does not display any rows. Can
>> someone please help if I am missing something?
>>
>> 
>>
>>  Thanks
>> Gayathri
>>
>> [attachment: Screen Shot 2016-05-18 at 5.09.29 PM.png (155K)]
>>
>>
>>
> [attachment: Screen Shot 2016-05-18 at 5.09.29 PM.png (155K)]
> 
>
>
>


Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Xiangrui Meng
Not exactly the same as the one you suggested, but you can chain it with
flatMap to get what you want, if each file is not huge.
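
For example, a PySpark sketch of that chaining (the input path and the
minPartitions value are placeholders):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# wholeTextFiles yields one (path, content) record per file; flatMap restores
# one record per line, with the partition count controlled via minPartitions.
lines = (sc.wholeTextFiles("hdfs:///data/small-files/", minPartitions=8)
           .flatMap(lambda kv: kv[1].splitlines()))
print(lines.take(5))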

On Thu, May 19, 2016, 8:41 AM Xiangrui Meng  wrote:

> This was implemented as sc.wholeTextFiles.
>
> On Thu, May 19, 2016, 2:43 AM Reynold Xin  wrote:
>
>> Users would be able to run this already with the 3 lines of code you
>> supplied right? In general there are a lot of methods already on
>> SparkContext and we lean towards the more conservative side in introducing
>> new API variants.
>>
>> Note that this is something we are doing automatically in Spark SQL for
>> file sources (Dataset/DataFrame).
>>
>>
>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <
>> apivova...@gmail.com> wrote:
>>
>>> Hello Everyone
>>>
>>> Do you think it would be useful to add combinedTextFile method (which
>>> uses CombineTextInputFormat) to SparkContext?
>>>
>>> It allows one task to read data from multiple text files and control
>>> number of RDD partitions by setting
>>> mapreduce.input.fileinputformat.split.maxsize
>>>
>>>
>>>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>> val conf = sc.hadoopConfiguration
>>> sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>> classOf[LongWritable], classOf[Text], conf).
>>>   map(pair => pair._2.toString).setName(path)
>>>   }
>>>
>>>
>>> Alex
>>>
>>
>>


Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Xiangrui Meng
This was implemented as sc.wholeTextFiles.

On Thu, May 19, 2016, 2:43 AM Reynold Xin  wrote:

> Users would be able to run this already with the 3 lines of code you
> supplied right? In general there are a lot of methods already on
> SparkContext and we lean towards the more conservative side in introducing
> new API variants.
>
> Note that this is something we are doing automatically in Spark SQL for
> file sources (Dataset/DataFrame).
>
>
> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov  > wrote:
>
>> Hello Everyone
>>
>> Do you think it would be useful to add combinedTextFile method (which
>> uses CombineTextInputFormat) to SparkContext?
>>
>> It allows one task to read data from multiple text files and control
>> number of RDD partitions by setting
>> mapreduce.input.fileinputformat.split.maxsize
>>
>>
>>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>> val conf = sc.hadoopConfiguration
>> sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>> classOf[LongWritable], classOf[Text], conf).
>>   map(pair => pair._2.toString).setName(path)
>>   }
>>
>>
>> Alex
>>
>
>


Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Xiangrui Meng
Yes, DB (cc'ed) is working on porting the local linear algebra library over
(SPARK-13944). There are also frequent pattern mining algorithms we need to
port over in order to reach feature parity. -Xiangrui

On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Overall this sounds good to me. One question I have is that in
> addition to the ML algorithms we have a number of linear algebra
> (various distributed matrices) and statistical methods in the
> spark.mllib package. Is the plan to port or move these to the spark.ml
> namespace in the 2.x series ?
>
> Thanks
> Shivaram
>
> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
> > FWIW, all of that sounds like a good plan to me. Developing one API is
> > certainly better than two.
> >
> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  wrote:
> >> Hi all,
> >>
> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
> built
> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
> API has
> >> been developed under the spark.ml package, while the old RDD-based API
> has
> >> been developed in parallel under the spark.mllib package. While it was
> >> easier to implement and experiment with new APIs under a new package, it
> >> became harder and harder to maintain as both packages grew bigger and
> >> bigger. And new users are often confused by having two sets of APIs with
> >> overlapped functions.
> >>
> >> We started to recommend the DataFrame-based API over the RDD-based API
> in
> >> Spark 1.5 for its versatility and flexibility, and we saw the
> development
> >> and the usage gradually shifting to the DataFrame-based API. Just
> counting
> >> the lines of Scala code, from 1.5 to the current master we added ~1
> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> >> gather more resources on the development of the DataFrame-based API and
> to
> >> help users migrate over sooner, I want to propose switching RDD-based
> MLlib
> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
> >>
> >> * We do not accept new features in the RDD-based spark.mllib package,
> unless
> >> they block implementing new features in the DataFrame-based spark.ml
> >> package.
> >> * We still accept bug fixes in the RDD-based API.
> >> * We will add more features to the DataFrame-based API in the 2.x
> series to
> >> reach feature parity with the RDD-based API.
> >> * Once we reach feature parity (possibly in Spark 2.2), we will
> deprecate
> >> the RDD-based API.
> >> * We will remove the RDD-based API from the main Spark repo in Spark
> 3.0.
> >>
> >> Though the RDD-based API is already in de facto maintenance mode, this
> >> announcement will make it clear and hence important to both MLlib
> developers
> >> and users. So we’d greatly appreciate your feedback!
> >>
> >> (As a side note, people sometimes use “Spark ML” to refer to the
> >> DataFrame-based API or even the entire MLlib component. This also causes
> >> confusion. To be clear, “Spark ML” is not an official name and there
> are no
> >> plans to rename MLlib to “Spark ML” at this time.)
> >>
> >> Best,
> >> Xiangrui
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>


Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Xiangrui Meng
Hi all,

More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API
has been developed under the spark.ml package, while the old RDD-based API
has been developed in parallel under the spark.mllib package. While it was
easier to implement and experiment with new APIs under a new package, it
became harder and harder to maintain as both packages grew bigger and
bigger. And new users are often confused by having two sets of APIs with
overlapped functions.
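
To make the two APIs concrete, here is a small illustrative sketch (toy data,
not a migration guide) of the same logistic regression written first against the
RDD-based spark.mllib API and then against the DataFrame-based spark.ml API:

from pyspark.sql import SparkSession
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# RDD-based API (spark.mllib): RDD[LabeledPoint] in, model out.
points = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]), LabeledPoint(1.0, [1.0, 0.0])])
mllib_model = LogisticRegressionWithLBFGS.train(points, iterations=10)
print(mllib_model.predict([1.0, 0.0]))

# DataFrame-based API (spark.ml): "label"/"features" columns, estimator/transformer style.
df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.0)), (1.0, Vectors.dense(1.0, 0.0))],
    ["label", "features"])
ml_model = LogisticRegression(maxIter=10).fit(df)
ml_model.transform(df).select("label", "prediction").show()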

We started to recommend the DataFrame-based API over the RDD-based API in
Spark 1.5 for its versatility and flexibility, and we saw the development
and the usage gradually shifting to the DataFrame-based API. Just counting
the lines of Scala code, from 1.5 to the current master we added ~1
lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
gather more resources on the development of the DataFrame-based API and to
help users migrate over sooner, I want to propose switching RDD-based MLlib
APIs to maintenance mode in Spark 2.0. What does it mean exactly?

* We do not accept new features in the RDD-based spark.mllib package,
unless they block implementing new features in the DataFrame-based spark.ml
package.
* We still accept bug fixes in the RDD-based API.
* We will add more features to the DataFrame-based API in the 2.x series to
reach feature parity with the RDD-based API.
* Once we reach feature parity (possibly in Spark 2.2), we will deprecate
the RDD-based API.
* We will remove the RDD-based API from the main Spark repo in Spark 3.0.

Though the RDD-based API is already in de facto maintenance mode, this
announcement will make it clear and hence important to both MLlib
developers and users. So we’d greatly appreciate your feedback!

(As a side note, people sometimes use “Spark ML” to refer to the
DataFrame-based API or even the entire MLlib component. This also causes
confusion. To be clear, “Spark ML” is not an official name and there are no
plans to rename MLlib to “Spark ML” at this time.)

Best,
Xiangrui


Re: Various forks

2016-03-19 Thread Xiangrui Meng
We made that fork to hide package private classes/members in the generated
Java API doc. Otherwise, the Java API doc is very messy. The patch is to
map all private[*] to the default scope in the generated Java code.
However, this might not be the expected behavior for other packages. So it
didn't get merged into the official genjavadoc repo. The proposal is to
have a flag in genjavadoc settings to enable this mapping, but it was
delayed. This is the JIRA for this issue:
https://issues.apache.org/jira/browse/SPARK-7992. -Xiangrui

On Tue, Mar 15, 2016 at 10:50 AM Reynold Xin  wrote:

> +Xiangrui
>
> On Tue, Mar 15, 2016 at 10:24 AM, Sean Owen  wrote:
>
>> Picking up this old thread, since we have the same problem updating to
>> Scala 2.11.8
>>
>> https://github.com/apache/spark/pull/11681#issuecomment-196932777
>>
>> We can see the org.spark-project packages here:
>>
>> http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.spark-project%22
>>
>> I've forgotten who maintains the custom fork builds, and I don't know
>> the reasons we needed a fork of genjavadoc. Is it still relevant?
>>
>> Heh, there's no plugin for 2.11.8 from the upstream project either anyway:
>>
>> http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22com.typesafe.genjavadoc%22
>>
>> This may be blocked for now
>>
>> On Thu, Jun 25, 2015 at 2:18 PM, Iulian Dragoș
>>  wrote:
>> > Could someone point the source of the Spark-fork used to build
>> > genjavadoc-plugin? Even more important it would be to know the reasoning
>> > behind this fork.
>> >
>> > Ironically, this hinders my attempts at removing another fork, the Spark
>> > REPL fork (and the upgrade to Scala 2.11.7). See here. Since genjavadoc
>> is a
>> > compiler plugin, it is cross-compiled with the full Scala version,
>> meaning
>> > someone needs to publish a new version for 2.11.7.
>> >
>> > Ideally, we'd have a list of all forks maintained by the Spark project.
>> I
>> > know about:
>> >
>> > - org.spark-project/akka
>> > - org.spark-project/hive
>> > - org.spark-project/genjavadoc-plugin
>> >
>> > Are there more? Where are they hosted, and what's the release process
>> around
>> > them?
>> >
>> > thanks,
>> > iulian
>> >
>> > --
>> >
>> > --
>> > Iulian Dragos
>> >
>> > --
>> > Reactive Apps on the JVM
>> > www.typesafe.com
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Xiangrui Meng
+1. Checked user guide and API doc, and ran some MLlib and SparkR
examples. -Xiangrui

On Thu, Sep 24, 2015 at 2:54 PM, Reynold Xin  wrote:
> I'm going to +1 this myself. Tested on my laptop.
>
>
>
> On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin  wrote:
>>
>> I forked a new thread for this. Please discuss NOTICE file related things
>> there so it doesn't hijack this thread.
>>
>>
>> On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen  wrote:
>>>
>>> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas 
>>> wrote:
>>> > Under your guidance, I would be happy to help compile a NOTICE file
>>> > which
>>> > follows the pattern used by Derby and the JDK. This effort might
>>> > proceed in
>>> > parallel with vetting 1.5.1 and could be targeted at a later release
>>> > vehicle. I don't think that the ASF's exposure is greatly increased by
>>> > one
>>> > more release which follows the old pattern.
>>>
>>> I'd prefer to use the ASF's preferred pattern, no? That's what we've
>>> been trying to do and seems like we're even required to do so, not
>>> follow a different convention. There is some specific guidance there
>>> about what to add, and not add, to these files. Specifically, because
>>> the AL2 requires downstream projects to embed the contents of NOTICE,
>>> the guidance is to only include elements in NOTICE that must appear
>>> there.
>>>
>>> Put it this way -- what would you like to change specifically? (you
>>> can start another thread for that)
>>>
>>> >> My assessment (just looked before I saw Sean's email) is the same as
>>> >> his. The NOTICE file embeds other projects' licenses.
>>> >
>>> > This may be where our perspectives diverge. I did not find those
>>> > licenses
>>> > embedded in the NOTICE file. As I see it, the licenses are cited but
>>> > not
>>> > included.
>>>
>>> Pretty sure that was meant to say that NOTICE embeds other projects'
>>> "notices", not licenses. And those notices can have all kinds of
>>> stuff, including licenses.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Are These Issues Suitable for our Senior Project?

2015-07-09 Thread Xiangrui Meng
Hi Emrehan,

Thanks for asking! There are actually many TODOs for MLlib. I would
recommend starting with small tasks before picking a topic for your
senior project. Please check
https://issues.apache.org/jira/browse/SPARK-8445 for the 1.5 roadmap
and see whether there are ones you are interested in. Thanks!

Best,
Xiangrui

On Thu, Jul 9, 2015 at 1:20 PM, Feynman Liang  wrote:
> Exciting, thanks for the contribution! I'm currently aware of:
>
> SPARK-8499 is currently in progress (in a duplicate issue); I updated the
> JIRA to reflect that.
> SPARK-5992 has a spark package linked but I'm unclear on whether there is
> any progress there.
>
> Feynman
>
> On Thu, Jul 9, 2015 at 1:04 PM, emrehan  wrote:
>>
>> Hi all,
>>
>> We could contribute to a feature to Spark MLlib by May 2016 and make it
>> count as our undergraduate senior project. The following list of issues
>> seem
>> interesting to us:
>>
>> *  https://issues.apache.org/jira/browse/SPARK-2273
>> –  Online learning
>> algorithms: Passive Aggressive
>> *  https://issues.apache.org/jira/browse/SPARK-2335
>> –  K-Nearest
>> Neighbor
>> classification and regression for MLLib
>> *  https://issues.apache.org/jira/browse/SPARK-2401
>> –  AdaBoost.MH, a
>> multi-class multi-label classifier
>> *  https://issues.apache.org/jira/browse/SPARK-4251
>> –  Add Restricted
>> Boltzmann machine(RBM) algorithm to MLlib
>> *  https://issues.apache.org/jira/browse/SPARK-4752
>> –  Classifier based
>> on
>> artificial neural network
>> *  https://issues.apache.org/jira/browse/SPARK-5575
>> –  Artificial neural
>> networks for MLlib deep learning
>> *  https://issues.apache.org/jira/browse/SPARK-5992
>> –  Locality
>> Sensitive
>> Hashing (LSH) for MLlib
>> *  https://issues.apache.org/jira/browse/SPARK-6425
>> –  Add parallel
>> Q-learning algorithm to MLLib
>> *  https://issues.apache.org/jira/browse/SPARK-6442
>> –  Local Linear
>> Algebra Package
>> *  https://issues.apache.org/jira/browse/SPARK-8499
>> –  NaiveBayes
>> implementation for MLPipeline
>>
>> All of these tickets are marked unassigned but have some work done on
>> them.
>> Are any of these issues are unsuitable for us as a senior project?
>>
>> Kind regards,
>> Can Giracoglu, Emrehan Tuzun, Remzi Can Aksoy, Saygin Dogu
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Are-These-Issues-Suitable-for-our-Senior-Project-tp13119.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [mllib] Refactoring some spark.mllib model classes in Python not inheriting JavaModelWrapper

2015-06-18 Thread Xiangrui Meng
Hi Yu,

Reducing the code complexity on the Python side is certainly what we
want to see:) We didn't call Java directly in Python models because
Java methods don't work inside RDD closures, e.g.,

rdd.map(lambda x: model.predict(x[1]))

But I agree that for model save/load the implementation should be
simplified. Could you submit a PR and see how much code we can save?
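
A minimal sketch of that constraint, with a hypothetical ToyModelWrapper standing
in for a real MLlib model wrapper (sc._jvm is an internal handle, used here only
to stand in for a JVM-side model object): the Py4J handle can back driver-side
calls such as save(), but anything referenced inside an RDD closure must be
plain, picklable Python state.

from pyspark import SparkContext

class ToyModelWrapper(object):
    """Hypothetical wrapper: a JVM handle for driver-side calls, plain Python state otherwise."""
    def __init__(self, sc, weights):
        self._weights = list(weights)                    # picklable, safe in closures
        self._jvm_obj = sc._jvm.java.util.ArrayList()    # Py4J handle, driver side only

    def predict(self, features):
        return sum(w * x for w, x in zip(self._weights, features))

    def save(self, path):
        self._jvm_obj.add(path)   # stand-in for delegating save() to the JVM side
        return path

sc = SparkContext.getOrCreate()
model = ToyModelWrapper(sc, [0.5, -1.0])
rdd = sc.parallelize([[1.0, 2.0], [3.0, 4.0]])

# rdd.map(lambda x: model.predict(x)) would fail here: pickling the closure would
# try to serialize the Py4J handle.  Capture only the plain-Python state instead:
w = model._weights
print(rdd.map(lambda x: sum(wi * xi for wi, xi in zip(w, x))).collect())
model.save("/tmp/toy-model")   # driver-side call through the JVM handle is fine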

Thanks,
Xiangrui

On Wed, Jun 17, 2015 at 8:15 PM, Yu Ishikawa
 wrote:
> Hi all,
>
> I think we should refactor some machine learning model classes in Python to
> improve the software's maintainability.
> By inheriting the JavaModelWrapper class, we can easily and directly call the
> Scala API for the model without going through PythonMLlibAPI.
>
> In some cases, a machine learning model class in Python has complicated
> variables. That is, it is a little hard to implement import/export methods,
> and it is also a little troublesome to implement the same function in both
> Scala and Python. I also think standardizing how to create a model class in
> Python is important.
>
> What do you think about that?
>
> Thanks,
> Yu
>
>
>
> -
> -- Yu Ishikawa
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Refactoring-some-spark-mllib-model-classes-in-Python-not-inheriting-JavaModelWrapper-tp12781.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)

2015-06-17 Thread Xiangrui Meng
Hi Eron,

Please register your Spark Package on http://spark-packages.org, which
helps users find your work. Do you have some performance benchmark to
share? Thanks!

Best,
Xiangrui

On Wed, Jun 10, 2015 at 10:48 PM, Nick Pentreath
 wrote:
> Looks very interesting, thanks for sharing this.
>
> I haven't had much chance to do more than a quick glance over the code.
> Quick question - are the Word2Vec and GLOVE implementations fully parallel
> on Spark?
>
> On Mon, Jun 8, 2015 at 6:20 PM, Eron Wright  wrote:
>>
>>
>> The deeplearning4j framework provides a variety of distributed, neural
>> network-based learning algorithms, including convolutional nets, deep
>> auto-encoders, deep-belief nets, and recurrent nets.  We’re working on
>> integration with the Spark ML pipeline, leveraging the developer API.   This
>> announcement is to share some code and get feedback from the Spark
>> community.
>>
>> The integration code is located in the dl4j-spark-ml module in the
>> deeplearning4j repository.
>>
>> Major aspects of the integration work:
>>
>> ML algorithms.  To bind the dl4j algorithms to the ML pipeline, we
>> developed a new classifier and a new unsupervised learning estimator.
>> ML attributes. We strove to interoperate well with other pipeline
>> components.   ML Attributes are column-level metadata enabling information
>> sharing between pipeline components.See here how the classifier reads
>> label metadata from a column provided by the new StringIndexer.
>> Large binary data.  It is challenging to work with large binary data in
>> Spark.   An effective approach is to leverage PrunedScan and to carefully
>> control partition sizes.  Here we explored this with a custom data source
>> based on the new relation API.
>> Column-based record readers.  Here we explored how to construct rows from
>> a Hadoop input split by composing a number of column-level readers, with
>> pruning support.
>> UDTs.   With Spark SQL it is possible to introduce new data types.   We
>> prototyped an experimental Tensor type, here.
>> Spark Package.   We developed a spark package to make it easy to use the
>> dl4j framework in spark-shell and with spark-submit.  See the
>> deeplearning4j/dl4j-spark-ml repository for useful snippets involving the
>> sbt-spark-package plugin.
>> Example code.   Examples demonstrate how the standardized ML API
>> simplifies interoperability, such as with label preprocessing and feature
>> scaling.   See the deeplearning4j/dl4j-spark-ml-examples repository for an
>> expanding set of example pipelines.
>>
>> Hope this proves useful to the community as we transition to exciting new
>> concepts in Spark SQL and Spark ML.   Meanwhile, we have Spark working with
>> multiple GPUs on AWS and we're looking forward to optimizations that will
>> speed neural net training even more.
>>
>> Eron Wright
>> Contributor | deeplearning4j.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman
 wrote:
> I dont know much about Python style, but I think the point Wes made about
> usability on the JIRA is pretty powerful. IMHO the number of methods on a
> Spark DataFrame might not be much more compared to Pandas. Given that it
> looks like users are okay with the possibility of collisions in Pandas I
> think sticking (1) is not a bad idea.
>

This is true for interactive work. Spark's DataFrames can handle
really large datasets, which might be used in production workflows. So
I think it is reasonable for us to care more about compatibility
issues than Pandas does.

> Also is it possible to detect such collisions in Python ? A (4)th option
> might be to detect that `df` contains a column named `name` and print a
> warning in `df.name` which tells the user that the method is overriding the
> column.

Maybe we can inspect the frame in which `df.name` gets called and warn users in
`df.select(df.name)` but not in `name = df.name`. This could be tricky
to implement.
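
A pure-Python sketch of that option (4), using a toy MiniFrame class instead of
the real DataFrame, just to show where such a collision check could live:

import warnings

class MiniFrame(object):
    """Toy stand-in for DataFrame, only to illustrate the warning idea."""
    def __init__(self, columns):
        self.columns = list(columns)

    def name(self):               # hypothetical future method that shadows a column
        return "my_frame"

    def __getattribute__(self, item):
        cols = object.__getattribute__(self, "__dict__").get("columns", [])
        if item in cols and hasattr(type(self), item):
            warnings.warn("df.%s resolves to a DataFrame attribute, not the column %r; "
                          "use df[%r] for the column" % (item, item, item))
        return object.__getattribute__(self, item)

df = MiniFrame(["name", "age"])
_ = df.name   # emits the warning: the method shadows the column

Since PySpark's DataFrame resolves columns in __getattr__, which only fires when
normal attribute lookup fails, a real check would have to hook attribute access
more aggressively, which is part of why it could be tricky to implement.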

-Xiangrui

>
> Thanks
> Shivaram
>
>
> On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng  wrote:
>>
>> Hi all,
>>
>> In PySpark, a DataFrame column can be referenced using df["abcd"]
>> (__getitem__) and df.abcd (__getattr__). There is a discussion on
>> SPARK-7035 on compatibility issues with the __getattr__ approach, and
>> I want to collect more inputs on this.
>>
>> Basically, if in the future we introduce a new method to DataFrame, it
>> may break user code that uses the same attr to reference a column or
>> silently changes its behavior. For example, if we add name() to
>> DataFrame in the next release, all existing code using `df.name` to
>> reference a column called "name" will break. If we add `name()` as a
>> property instead of a method, all existing code using `df.name` may
>> still work but with a different meaning. `df.select(df.name)` no
>> longer selects the column called "name" but the column that has the
>> same name as `df.name`.
>>
>> There are several proposed solutions:
>>
>> 1. Keep both df.abcd and df["abcd"], and encourage users to use the
>> latter, which is future-proof. This is the current solution in master
>> (https://github.com/apache/spark/pull/5971). But I think users may
>> still be unaware of the compatibility issue and prefer `df.abcd` to
>> `df["abcd"]` because the former could be auto-completed.
>> 2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
>> JIRA page: "I actually dragged my feet on the _getattr_ issue for
>> several months back in the day, then finally added it (and tab
>> completion in IPython with _dir_), and immediately noticed a huge
>> quality-of-life improvement when using pandas for actual (esp.
>> interactive) work."
>> 3. Replace df.abcd by df.abcd_ (with a suffix "_"). Both df.abcd_ and
>> df["abcd"] would be future proof, and df.abcd_ could be
>> auto-completed. The tradeoff is apparently the extra "_" appearing in
>> the code.
>>
>> My preference is 3 > 1 > 2. Your inputs would be greatly appreciated.
>> Thanks!
>>
>> Best,
>> Xiangrui
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
Hi all,

In PySpark, a DataFrame column can be referenced using df["abcd"]
(__getitem__) and df.abcd (__getattr__). There is a discussion on
SPARK-7035 on compatibility issues with the __getattr__ approach, and
I want to collect more inputs on this.

Basically, if in the future we introduce a new method to DataFrame, it
may break user code that uses the same attr to reference a column or
silently change its behavior. For example, if we add name() to
DataFrame in the next release, all existing code using `df.name` to
reference a column called "name" will break. If we add `name()` as a
property instead of a method, all existing code using `df.name` may
still work but with a different meaning. `df.select(df.name)` no
longer selects the column called "name" but the column that has the
same name as `df.name`.
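
To make the failure mode concrete, here is a minimal PySpark sketch (Spark 1.3-era API; the input file `people.json` with a "name" field is hypothetical):

~~~
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "getattr-demo")
sqlContext = SQLContext(sc)
df = sqlContext.jsonFile("people.json")  # assumed to contain a "name" field

# Today both forms select the same column:
df.select(df["name"]).show()
df.select(df.name).show()

# If a later release adds a `name` method or property to DataFrame, the
# second line silently stops referring to the column, while df["name"]
# keeps working.
~~~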

There are several proposed solutions:

1. Keep both df.abcd and df["abcd"], and encourage users to use the
latter, which is future-proof. This is the current solution in master
(https://github.com/apache/spark/pull/5971). But I think users may
still be unaware of the compatibility issue and prefer `df.abcd` to
`df["abcd"]` because the former could be auto-completed.
2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
JIRA page: "I actually dragged my feet on the _getattr_ issue for
several months back in the day, then finally added it (and tab
completion in IPython with _dir_), and immediately noticed a huge
quality-of-life improvement when using pandas for actual (esp.
interactive) work."
3. Replace df.abcd by df.abcd_ (with a suffix "_"). Both df.abcd_ and
df["abcd"] would be future proof, and df.abcd_ could be
auto-completed. The tradeoff is apparently the extra "_" appearing in
the code.

My preference is 3 > 1 > 2. Your inputs would be greatly appreciated. Thanks!

Best,
Xiangrui

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] ending support for Java 6?

2015-05-05 Thread Xiangrui Meng
+1. One issue with dropping Java 6: if we use Java 7 to build the
assembly jar, it will use zip64. Would Python 2.x (or even 3.x) be
able to load zip64 files on PYTHONPATH? -Xiangrui
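
A small probe one could run to check this locally. This is only a sketch: the archive name and module are made up, and writing ~70k tiny entries is just a cheap way to force the zip into zip64 territory:

~~~
import sys
import zipfile

path = "zip64_probe.zip"
with zipfile.ZipFile(path, "w", allowZip64=True) as zf:
    zf.writestr("probe_module.py", "VALUE = 42\n")
    # More than 65535 entries pushes the archive past the classic zip
    # limits, so the central directory is written in zip64 format.
    for i in range(70000):
        zf.writestr("filler/f%d.txt" % i, "x")

sys.path.insert(0, path)
try:
    import probe_module
    print("zipimport handled the zip64 archive: VALUE=%d" % probe_module.VALUE)
except ImportError as e:
    print("zipimport could not read the zip64 archive: %s" % e)
~~~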

On Tue, May 5, 2015 at 3:25 PM, Reynold Xin  wrote:
> OK I sent an email.
>
>
> On Tue, May 5, 2015 at 2:47 PM, shane knapp  wrote:
>
>> +1 to an announce to user and dev.  java6 is so old and sad.
>>
>> On Tue, May 5, 2015 at 2:24 PM, Tom Graves  wrote:
>>
>>> +1. I haven't seen major objections here so I would say send announcement
>>> and see if any users have objections
>>>
>>> Tom
>>>
>>>
>>>
>>>   On Tuesday, May 5, 2015 5:09 AM, Patrick Wendell 
>>> wrote:
>>>
>>>
>>> If there is broad consensus here to drop Java 1.6 in Spark 1.5, should
>>> we do an ANNOUNCE to user and dev?
>>>
>>> On Mon, May 4, 2015 at 7:24 PM, shane knapp  wrote:
>>> > sgtm
>>> >
>>> > On Mon, May 4, 2015 at 11:23 AM, Patrick Wendell 
>>> wrote:
>>> >>
>>> >> If we just set JAVA_HOME in dev/run-test-jenkins, I think it should
>>> work.
>>> >>
>>> >> On Mon, May 4, 2015 at 7:20 PM, shane knapp 
>>> wrote:
>>> >> > ...and now the workers all have java6 installed.
>>> >> >
>>> >> > https://issues.apache.org/jira/browse/SPARK-1437
>>> >> >
>>> >> > sadly, the built-in jenkins jdk management doesn't allow us to
>>> choose a
>>> >> > JDK
>>> >> > version within matrix projects...  so we need to manage this stuff
>>> >> > manually.
>>> >> >
>>> >> > On Sun, May 3, 2015 at 8:57 AM, shane knapp 
>>> wrote:
>>> >> >
>>> >> >> that bug predates my time at the amplab...  :)
>>> >> >>
>>> >> >> anyways, just to restate: jenkins currently only builds w/java 7.
>>> if
>>> >> >> you
>>> >> >> folks need 6, i can make it happen, but it will be a (smallish) bit
>>> of
>>> >> >> work.
>>> >> >>
>>> >> >> shane
>>> >> >>
>>> >> >> On Sun, May 3, 2015 at 2:14 AM, Sean Owen 
>>> wrote:
>>> >> >>
>>> >> >>> Should be, but isn't what Jenkins does.
>>> >> >>> https://issues.apache.org/jira/browse/SPARK-1437
>>> >> >>>
>>> >> >>> At this point it might be simpler to just decide that 1.5 will
>>> require
>>> >> >>> Java 7 and then the Jenkins setup is correct.
>>> >> >>>
>>> >> >>> (NB: you can also solve this by setting bootclasspath to JDK 6 libs
>>> >> >>> even when using javac 7+ but I think this is overly complicated.)
>>> >> >>>
>>> >> >>> On Sun, May 3, 2015 at 5:52 AM, Mridul Muralidharan <
>>> mri...@gmail.com>
>>> >> >>> wrote:
>>> >> >>> > Hi Shane,
>>> >> >>> >
>>> >> >>> >  Since we are still maintaining support for jdk6, jenkins should
>>> be
>>> >> >>> > using jdk6 [1] to ensure we do not inadvertently use jdk7 or
>>> higher
>>> >> >>> > api which breaks source level compat.
>>> >> >>> > -source and -target is insufficient to ensure api usage is
>>> >> >>> > conformant
>>> >> >>> > with the minimum jdk version we are supporting.
>>> >> >>> >
>>> >> >>> > Regards,
>>> >> >>> > Mridul
>>> >> >>> >
>>> >> >>> > [1] Not jdk7 as you mentioned
>>> >> >>> >
>>> >> >>> > On Sat, May 2, 2015 at 8:53 PM, shane knapp >> >
>>> >> >>> wrote:
>>> >> >>> >> that's kinda what we're doing right now, java 7 is the
>>> >> >>> default/standard on
>>> >> >>> >> our jenkins.
>>> >> >>> >>
>>> >> >>> >> or, i vote we buy a butler's outfit for thomas and have a second
>>> >> >>> jenkins
>>> >> >>> >> instance...  ;)
>>>
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >
>>> >
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: OOM error with GMMs on 4GB dataset

2015-05-05 Thread Xiangrui Meng
Did you set `--driver-memory` with spark-submit? -Xiangrui

On Mon, May 4, 2015 at 5:16 PM, Vinay Muttineni  wrote:
> Hi, I am training a GMM with 10 gaussians on a 4 GB dataset(720,000 * 760).
> The spark (1.3.1) job is allocated 120 executors with 6GB each and the
> driver also has 6GB.
> Spark Config Params:
>
> .set("spark.hadoop.validateOutputSpecs",
> "false").set("spark.dynamicAllocation.enabled",
> "false").set("spark.driver.maxResultSize",
> "4g").set("spark.default.parallelism", "300").set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer").set("spark.kryoserializer.buffer.mb",
> "500").set("spark.akka.frameSize", "256").set("spark.akka.timeout", "300")
>
> However, at the aggregate step (Line 168)
> val sums = breezeData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
>
> I get OOM error and the application hangs indefinitely. Is this an issue or
> am I missing something?
> java.lang.OutOfMemoryError: Java heap space
> at akka.util.CompactByteString$.apply(ByteString.scala:410)
> at akka.util.ByteString$.apply(ByteString.scala:22)
> at
> akka.remote.transport.netty.TcpHandlers$class.onMessage(TcpSupport.scala:45)
> at
> akka.remote.transport.netty.TcpServerHandler.onMessage(TcpSupport.scala:57)
> at
> akka.remote.transport.netty.NettyServerHelpers$class.messageReceived(NettyHelpers.scala:43)
> at
> akka.remote.transport.netty.ServerHandler.messageReceived(NettyTransport.scala:180)
> at
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
> at
> org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
> at
> org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
> at
> org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310)
> at
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
> at
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
> at
> org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
> at
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
> at
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
> at
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
> at
> org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> 15/05/04 16:23:38 ERROR util.Utils: Uncaught exception in thread
> task-result-getter-2
> java.lang.OutOfMemoryError: Java heap space
> Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: Java
> heap space
> 15/05/04 16:23:45 INFO scheduler.TaskSetManager: Finished task 1070.0 in
> stage 6.0 (TID 8276) in 382069 ms on [] (160/3600)
> 15/05/04 16:23:54 WARN channel.DefaultChannelPipeline: An exception was
> thrown by a user handler while handling an exception event ([id: 0xc57da871,
> ] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
> java.lang.OutOfMemoryError: Java heap space
> 15/05/04 16:23:55 WARN channel.DefaultChannelPipeline: An exception was
> thrown by a user handler while handling an exception event ([id: 0x3c3dbb0c,
> ] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
> 15/05/04 16:24:45 ERROR actor.ActorSystemImpl: Uncaught fatal error from
> thread [sparkDriver-akka.remote.default-remote-dispatcher-6] shutting down
> ActorSystem [sparkDriver]
>
>
>
> Thanks!
> Vinay

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Pickling error when attempting to add a method in pyspark

2015-05-05 Thread Xiangrui Meng
Hi Stephen,

I think it would be easier to see what you implemented if you shared
the branch diff link on GitHub. There are a couple of utility classes
that make Rating work between Scala and Python:

1. serializer: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1163
2. mark it as picklable:
https://github.com/apache/spark/blob/master/python/pyspark/mllib/common.py#L56

However, I don't recommend following this approach. It is much
simpler to use DataFrames for serialization. You can find an example
here:

https://github.com/apache/spark/blob/master/python/pyspark/mllib/evaluation.py#L23
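
Roughly, the Python side of that approach looks like the sketch below. Names such as MatrixEntry come from the earlier message, and the local SparkContext setup is an assumption for illustration:

~~~
from collections import namedtuple

from pyspark import SparkContext
from pyspark.sql import SQLContext

MatrixEntry = namedtuple("MatrixEntry", ["x", "y", "weight"])

sc = SparkContext("local", "dataframe-serde-demo")
sqlContext = SQLContext(sc)

entries = sc.parallelize([(1, 2, 3.0), (4, 5, 6.0)]) \
    .map(lambda t: MatrixEntry(*t))
# The schema (x: bigint, y: bigint, weight: double) is inferred from the
# namedtuple fields; DataFrames already share a representation between
# Python and the JVM, so no custom pickler registration is needed before
# handing the data to a JVM helper (e.g. via callMLlibFunc).
df = sqlContext.createDataFrame(entries)
df.show()
~~~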

Best,
Xiangrui

On Thu, Apr 30, 2015 at 12:07 PM, Stephen Boesch  wrote:
> Bumping this.  Anyone of you having some familiarity with py4j interface in
> pyspark?
>
> thanks
>
>
> 2015-04-27 22:09 GMT-07:00 Stephen Boesch :
>
>>
>> My intention is to add pyspark support for certain mllib spark methods.  I
>> have been unable to resolve pickling errors of the form
>>
>>Pyspark py4j PickleException: “expected zero arguments for
>> construction of ClassDict”
>> 
>>
>> These are occurring during python to java conversion of python named
>> tuples.  The details are rather hard to provide here so I have created an
>> SOF question
>>
>>
>> http://stackoverflow.com/questions/29910708/pyspark-py4j-pickleexception-expected-zero-arguments-for-construction-of-class
>>
>> In any case I have included the text here. The SOF is easier to read
>> though ;)
>>
>> --
>>
>> This question is directed towards persons familiar with py4j - and can
>> help to resolve a pickling error. I am trying to add a method to the
>> pyspark PythonMLLibAPI that accepts an RDD of a namedtuple, does some work,
>> and returns a result in the form of an RDD.
>>
>> This method is modeled after the PYthonMLLibAPI.trainALSModel() method,
>> whose analogous *existing* relevant portions are:
>>
>>   def trainALSModel(
>> ratingsJRDD: JavaRDD[Rating],
>> .. )
>>
>> The *existing* python Rating class used to model the new code is:
>>
>> class Rating(namedtuple("Rating", ["user", "product", "rating"])):
>>     def __reduce__(self):
>>         return Rating, (int(self.user), int(self.product), float(self.rating))
>>
>> Here is the attempt So here are the relevant classes:
>>
>> *New* python class pyspark.mllib.clustering.MatrixEntry:
>>
>> from collections import namedtuple
>>
>> class MatrixEntry(namedtuple("MatrixEntry", ["x","y","weight"])):
>>     def __reduce__(self):
>>         return MatrixEntry, (long(self.x), long(self.y), float(self.weight))
>>
>> *New* method *foobarRDD* In PythonMLLibAPI:
>>
>>   def foobarRdd(
>>       data: JavaRDD[MatrixEntry]): RDD[FooBarResult] = {
>>     val rdd = data.rdd.map { d => FooBarResult(d.i, d.j, d.value, d.i * 100 + d.j * 10 + d.value) }
>>     rdd
>>   }
>>
>> Now let us try it out:
>>
>> from pyspark.mllib.clustering import MatrixEntry
>> def convert_to_MatrixEntry(tuple):
>>   return MatrixEntry(*tuple)
>> from pyspark.mllib.clustering import *
>> pic = PowerIterationClusteringModel(2)
>> tups = [(1,2,3),(4,5,6),(12,13,14),(15,7,8),(16,17,16.5)]
>> trdd = sc.parallelize(map(convert_to_MatrixEntry,tups))
>> # print out the RDD on python side just for validation
>> print "%s" % (repr(trdd.collect()))
>> from pyspark.mllib.common import callMLlibFunc
>> pic = callMLlibFunc("foobar", trdd)
>>
>> Relevant portions of results:
>>
>> [(1,2)=3.0, (4,5)=6.0, (12,13)=14.0, (15,7)=8.0, (16,17)=16.5]
>>
>> which shows the input rdd is 'whole'. However the pickling was unhappy:
>>
>> 5/04/27 21:15:44 ERROR Executor: Exception in task 6.0 in stage 1.0 (TID 14)
>> net.razorvine.pickle.PickleException: expected zero arguments for 
>> construction of ClassDict(for pyspark.mllib.clustering.MatrixEntry)
>> at 
>> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
>> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
>> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
>> at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
>> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
>> at 
>> org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1167)
>> at 
>> org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1166)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> at 
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>> at 
>> scala.collection.mutabl

Re: Support parallelized online matrix factorization for Collaborative Filtering

2015-04-06 Thread Xiangrui Meng
This is being discussed in
https://issues.apache.org/jira/browse/SPARK-6407. Let's move the
discussion there. Thanks for providing references! -Xiangrui

On Sun, Apr 5, 2015 at 11:48 PM, Chunnan Yao  wrote:
> On-line Collaborative Filtering(CF) has been widely used and studied. To
> re-train a CF model from scratch every time when new data comes in is very
> inefficient
> (http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model).
> However, in Spark community we see few discussion about collaborative
> filtering on streaming data. Given streaming k-means, streaming logistic
> regression, and the on-going incremental model training of Naive Bayes
> Classifier (SPARK-4144), we think it is meaningful to consider streaming
> Collaborative Filtering support on MLlib.
>
> I've created an issue on JIRA (SPARK-6711) for possible discussions. We
> suggest to refer to this paper
> (https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on
> SGD instead of ALS, which is easier to be tackled under streaming data.
>
> Fortunately, the authors of this paper have implemented their algorithm as a
> Github Project, based on Storm:
> https://github.com/MrChrisJohnson/CollabStream
>
> Please don't hesitate to give your opinions on this issue and our planned
> approach. We'd like to work on this in the next few weeks.
>
>
>
> -
> Feel the sparking Spark!
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Support-parallelized-online-matrix-factorization-for-Collaborative-Filtering-tp11413.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Stochastic gradient descent performance

2015-04-06 Thread Xiangrui Meng
Gap sampling is triggered when the sampling probability is small and
the directly underlying storage has constant-time lookups, in
particular, ArrayBuffer. This is a very strict requirement. If the RDD
is cached in memory, we use ArrayBuffer to store its elements and
rdd.sample will trigger gap sampling. However, if we call rdd2 =
rdd.map(x => x), we can no longer tell whether the storage is backed
by an ArrayBuffer, and hence gap sampling is not enabled. We could use
Scala's drop(k) and let Scala decide whether this is an O(1) operation
or an O(k) operation, but unfortunately, due to a Scala bug, this
could become an O(k^2) operation, so we didn't use that approach.
Please check the comments in the gap sampling PR.

For SGD, I think we should either assume the input data is randomized
or randomize it ourselves (and eat that one-time cost), and then walk
through the mini-batches sequentially. The key is to balance the batch
size against the communication cost of the model updates.
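
A minimal PySpark sketch of that idea (the names are made up): tag each element with a batch id once, using a deterministic per-partition seed, so that every mini-batch is a cheap filter over the cached RDD instead of a fresh rdd.sample call.

~~~
import random


def make_mini_batches(rdd, num_batches, seed=17):
    """Tag each element with a pseudo-random batch id, once, up front."""
    def tag(split_index, iterator):
        rng = random.Random(seed + split_index)
        for x in iterator:
            yield (rng.randrange(num_batches), x)

    tagged = rdd.mapPartitionsWithIndex(tag).cache()
    # Each batch is a filtered view of the tagged RDD; iterating over the
    # batches sequentially gives deterministic mini-batches for SGD.
    return [tagged.filter(lambda kv, i=i: kv[0] == i).values()
            for i in range(num_batches)]

# Usage sketch:
# batches = make_mini_batches(data, num_batches=100)
# for batch in batches:
#     ...run one SGD step on `batch` and broadcast the updated model...
~~~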

Best,
Xiangrui

On Mon, Apr 6, 2015 at 10:38 AM, Ulanov, Alexander
 wrote:
> Batch size impacts convergence, so bigger batch means more iterations. There 
> are some approaches to deal with it (such as 
> http://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf), but they need to be 
> implemented and tested.
>
> Nonetheless, could you share your thoughts regarding reducing this overhead 
> in Spark (or probably a workaround)? Sorry for repeating it, but I think this 
> is crucial for MLlib in Spark, because Spark is intended for bigger amounts 
> of data. Machine learning with bigger data usually requires SGD (vs batch 
> GD), SGD requires a lot of updates, and “Spark overhead” times “many updates” 
> equals impractical time needed for learning.
>
>
> From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
> Sent: Sunday, April 05, 2015 7:13 PM
> To: Ulanov, Alexander
> Cc: shiva...@eecs.berkeley.edu; Joseph Bradley; dev@spark.apache.org
> Subject: Re: Stochastic gradient descent performance
>
> Yeah, a simple way to estimate the time for an iterative algorithms is number 
> of iterations required * time per iteration. The time per iteration will 
> depend on the batch size, computation required and the fixed overheads I 
> mentioned before. The number of iterations of course depends on the 
> convergence rate for the problem being solved.
>
> Thanks
> Shivaram
>
> On Thu, Apr 2, 2015 at 2:19 PM, Ulanov, Alexander 
> mailto:alexander.ula...@hp.com>> wrote:
> Hi Shivaram,
>
> It sounds really interesting! With this time we can estimate if it worth 
> considering to run an iterative algorithm on Spark. For example, for SGD on 
> Imagenet (450K samples) we will spend 450K*50ms=62.5 hours to traverse all 
> data by one example not considering the data loading, computation and update 
> times. One may need to traverse all data a number of times to converge. Let’s 
> say this number is equal to the batch size. So, we remain with 62.5 hours 
> overhead. Is it reasonable?
>
> Best regards, Alexander
>
> From: Shivaram Venkataraman 
> [mailto:shiva...@eecs.berkeley.edu]
> Sent: Thursday, April 02, 2015 1:26 PM
> To: Joseph Bradley
> Cc: Ulanov, Alexander; dev@spark.apache.org
> Subject: Re: Stochastic gradient descent performance
>
> I haven't looked closely at the sampling issues, but regarding the 
> aggregation latency, there are fixed overheads (in local and distributed 
> mode) with the way aggregation is done in Spark. Launching a stage of tasks, 
> fetching outputs from the previous stage etc. all have overhead, so I would 
> say its not efficient / recommended to run stages where computation is less 
> than 500ms or so. You could increase your batch size based on this and 
> hopefully that will help.
>
> Regarding reducing these overheads by an order of magnitude it is a 
> challenging problem given the architecture in Spark -- I have some ideas for 
> this, but they are very much at a research stage.
>
> Thanks
> Shivaram
>
> On Thu, Apr 2, 2015 at 12:00 PM, Joseph Bradley 
> mailto:jos...@databricks.com>> wrote:
> When you say "It seems that instead of sample it is better to shuffle data
> and then access it sequentially by mini-batches," are you sure that holds
> true for a big dataset in a cluster?  As far as implementing it, I haven't
> looked carefully at GapSamplingIterator (in RandomSampler.scala) myself,
> but that looks like it could be modified to be deterministic.
>
> Hopefully someone else can comment on aggregation in local mode.  I'm not
> sure how much effort has gone into optimizing for local mode.
>
> Joseph
>
> On Thu, Apr 2, 2015 at 11:33 AM, Ulanov, Alexander 
> mailto:alexander.ula...@hp.com>>
> wrote:
>
>>  Hi Joseph,
>>
>>
>>
>> Thank you for suggestion!
>>
>> It seems that instead of sample it is better to shuffle data and then
>> access it sequentially by mini-batches. Could you suggest how to implement
>> it?
>>
>>
>>
>> With regards to aggregate (reduce), I am 

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-05 Thread Xiangrui Meng
+1 Verified some MLlib bug fixes on OS X. -Xiangrui

On Sun, Apr 5, 2015 at 1:24 AM, Sean Owen  wrote:
> Signatures and hashes are good.
> LICENSE, NOTICE still check out.
> Compiles for a Hadoop 2.6 + YARN + Hive profile.
>
> I still see the UISeleniumSuite test failure observed in 1.3.0, which
> is minor and already fixed. I don't know why I didn't back-port it:
> https://issues.apache.org/jira/browse/SPARK-6205
>
> If we roll another, let's get this easy fix in, but it is only an
> issue with tests.
>
>
> On JIRA, I checked open issues with Fix Version = 1.3.0 or 1.3.1 and
> all look legitimate (e.g. reopened or in progress)
>
>
> There is 1 open Blocker for 1.3.1 per Andrew:
> https://issues.apache.org/jira/browse/SPARK-6673 spark-shell.cmd can't
> start even when spark was built in Windows
>
> I believe this can be resolved quickly but as a matter of hygiene
> should be fixed or demoted before release.
>
>
> FYI there are 16 Critical issues marked for 1.3.0 / 1.3.1; worth
> examining before release to see how critical they are:
>
> SPARK-6701,Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python
> application,,Open,4/3/15
> SPARK-6484,"Ganglia metrics xml reporter doesn't escape
> correctly",Josh Rosen,Open,3/24/15
> SPARK-6270,Standalone Master hangs when streaming job completes,,Open,3/11/15
> SPARK-6209,ExecutorClassLoader can leak connections after failing to
> load classes from the REPL class server,Josh Rosen,In Progress,4/2/15
> SPARK-5113,Audit and document use of hostnames and IP addresses in
> Spark,,Open,3/24/15
> SPARK-5098,Number of running tasks become negative after tasks
> lost,,Open,1/14/15
> SPARK-4925,Publish Spark SQL hive-thriftserver maven artifact,Patrick
> Wendell,Reopened,3/23/15
> SPARK-4922,Support dynamic allocation for coarse-grained Mesos,,Open,3/31/15
> SPARK-4888,"Spark EC2 doesn't mount local disks for i2.8xlarge
> instances",,Open,1/27/15
> SPARK-4879,Missing output partitions after job completes with
> speculative execution,Josh Rosen,Open,3/5/15
> SPARK-4751,Support dynamic allocation for standalone mode,Andrew
> Or,Open,12/22/14
> SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15
> SPARK-4452,Shuffle data structures can starve others on the same
> thread for memory,Tianshuo Deng,Open,1/24/15
> SPARK-4352,Incorporate locality preferences in dynamic allocation
> requests,,Open,1/26/15
> SPARK-4227,Document external shuffle service,,Open,3/23/15
> SPARK-3650,Triangle Count handles reverse edges incorrectly,,Open,2/23/15
>
> On Sun, Apr 5, 2015 at 1:09 AM, Patrick Wendell  wrote:
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.3.1!
>>
>> The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
>>
>> The list of fixes present in this release can be found at:
>> http://bit.ly/1C2nVPY
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.1-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1080
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.3.1!
>>
>> The vote is open until Wednesday, April 08, at 01:10 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.3.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Using CUDA within Spark / boosting linear algebra

2015-04-02 Thread Xiangrui Meng
This is great! Thanks! -Xiangrui

On Wed, Apr 1, 2015 at 12:11 PM, Ulanov, Alexander
 wrote:
> FYI, I've added instructions to Netlib-java wiki, Sam added the link to them 
> from the project's readme.md
> https://github.com/fommil/netlib-java/wiki/NVBLAS
>
> Best regards, Alexander
> -Original Message-
> From: Xiangrui Meng [mailto:men...@gmail.com]
> Sent: Monday, March 30, 2015 2:43 PM
> To: Sean Owen
> Cc: Evan R. Sparks; Sam Halliday; dev@spark.apache.org; Ulanov, Alexander; 
> jfcanny
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Hi Alex,
>
> Since it is non-trivial to make nvblas work with netlib-java, it would be 
> great if you can send the instructions to netlib-java as part of the README. 
> Hopefully we don't need to modify netlib-java code to use nvblas.
>
> Best,
> Xiangrui
>
> On Thu, Mar 26, 2015 at 9:54 AM, Sean Owen  wrote:
>> The license issue is with libgfortran, rather than OpenBLAS.
>>
>> (FWIW I am going through the motions to get OpenBLAS set up by default
>> on CDH in the near future, and the hard part is just handling
>> libgfortran.)
>>
>> On Thu, Mar 26, 2015 at 4:07 PM, Evan R. Sparks  
>> wrote:
>>> Alright Sam - you are the expert here. If the GPL issues are
>>> unavoidable, that's fine - what is the exact bit of code that is GPL?
>>>
>>> The suggestion to use OpenBLAS is not to say it's the best option,
>>> but that it's a *free, reasonable default* for many users - keep in
>>> mind the most common deployment for Spark/MLlib is on 64-bit linux on 
>>> EC2[1].
>>> Additionally, for many of the problems we're targeting, this
>>> reasonable default can provide a 1-2 orders of magnitude improvement
>>> in performance over the f2jblas implementation that netlib-java falls back 
>>> on.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For
>> additional commands, e-mail: dev-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Using CUDA within Spark / boosting linear algebra

2015-03-30 Thread Xiangrui Meng
Hi Alex,

Since it is non-trivial to make nvblas work with netlib-java, it would
be great if you can send the instructions to netlib-java as part of
the README. Hopefully we don't need to modify netlib-java code to use
nvblas.

Best,
Xiangrui

On Thu, Mar 26, 2015 at 9:54 AM, Sean Owen  wrote:
> The license issue is with libgfortran, rather than OpenBLAS.
>
> (FWIW I am going through the motions to get OpenBLAS set up by default
> on CDH in the near future, and the hard part is just handling
> libgfortran.)
>
> On Thu, Mar 26, 2015 at 4:07 PM, Evan R. Sparks  wrote:
>> Alright Sam - you are the expert here. If the GPL issues are unavoidable,
>> that's fine - what is the exact bit of code that is GPL?
>>
>> The suggestion to use OpenBLAS is not to say it's the best option, but that
>> it's a *free, reasonable default* for many users - keep in mind the most
>> common deployment for Spark/MLlib is on 64-bit linux on EC2[1].
>> Additionally, for many of the problems we're targeting, this reasonable
>> default can provide a 1-2 orders of magnitude improvement in performance
>> over the f2jblas implementation that netlib-java falls back on.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: mllib.recommendation Design

2015-03-30 Thread Xiangrui Meng
On Wed, Mar 25, 2015 at 7:59 AM, Debasish Das  wrote:
> Hi Xiangrui,
>
> I am facing some minor issues in implementing Alternating Nonlinear
> Minimization as documented in this JIRA due to the ALS code being in ml
> package: https://issues.apache.org/jira/browse/SPARK-6323
>
> I need to use Vectors.fromBreeze / Vectors.toBreeze but they are package
> private on mllib. For now I removed private but not sure that's the correct
> way...

We don't expose 3rd-party types in our public APIs. You can either
implement your algorithm under org.apache.spark or copy the
fromBreeze/toBreeze code over.

>
> I also need to re-use lot of building blocks from ml.ALS and so I am writing
> ALM in ml package...
>

That sounds okay.

> I thought the plan was to still write core algorithms in mllib and pipeline
> integration in ml...It will be great if you can move the ALS object from ml
> to mllib and that way I can also move ALM to mllib (which I feel is the
> right place)...Of course the Pipeline based flow will stay in ml package...
>

It breaks compatibility if we move it. I think we can be quite
flexible about where we put the implementation.

> We can decide later if ALM needs to be in recommendation or a better place
> is package called factorization but the idea is that ALM will support MAP
> (and may be KL divergence loss) with sparsity constraints (probability
> simplex and bounds are fine for what I am focused at right now)...
>

I'm really sorry about the late response on this. It is partially
because I'm still not sure whether there are many applications that
need this feature. Please do list some public work and help us
understand the need.

> Thanks.
> Deb
>
> On Tue, Feb 17, 2015 at 4:40 PM, Debasish Das 
> wrote:
>>
>> There is a usability difference...I am not sure if recommendation.ALS
>> would like to add both userConstraint and productConstraint ? GraphLab CF
>> for example has it and we are ready to support all the features for modest
>> ranks where gram matrices can be made...
>>
>> For large ranks I am still working on the code
>>
>> On Tue, Feb 17, 2015 at 3:19 PM, Xiangrui Meng  wrote:
>>>
>>> The current ALS implementation allow pluggable solvers for
>>> NormalEquation, where we put CholeskeySolver and NNLS solver. Please
>>> check the current implementation and let us know how your constraint
>>> solver would fit. For a general matrix factorization package, let's
>>> make a JIRA and move our discussion there. -Xiangrui
>>>
>>> On Fri, Feb 13, 2015 at 7:46 AM, Debasish Das 
>>> wrote:
>>> > Hi,
>>> >
>>> > I am bit confused on the mllib design in the master. I thought that
>>> > core
>>> > algorithms will stay in mllib and ml will define the pipelines over the
>>> > core algorithm but looks like in master ALS is moved from mllib to
>>> > ml...
>>> >
>>> > I am refactoring my PR to a factorization package and I want to build
>>> > it on
>>> > top of ml.recommendation.ALS (possibly extend from
>>> > ml.recommendation.ALS
>>> > since first version will use very similar RDD handling as ALS and a
>>> > proximal solver that's being added to breeze)
>>> >
>>> > https://issues.apache.org/jira/browse/SPARK-2426
>>> > https://github.com/scalanlp/breeze/pull/321
>>> >
>>> > Basically I am not sure if we should merge it with recommendation.ALS
>>> > since
>>> > this is more generic than recommendation. I am considering calling it
>>> > ConstrainedALS where user can specify different constraint for user and
>>> > product factors (Similar to GraphLab CF structure).
>>> >
>>> > I am also working on ConstrainedALM where the underlying algorithm is
>>> > no
>>> > longer ALS but nonlinear alternating minimization with constraints.
>>> > https://github.com/scalanlp/breeze/pull/364
>>> > This will let us do large rank matrix completion where there is no need
>>> > to
>>> > construct gram matrices. I will open up the JIRA soon after getting
>>> > initial
>>> > results
>>> >
>>> > I am bit confused that where should I add the factorization package. It
>>> > will use the current ALS test-cases and I have to construct more
>>> > test-cases
>>> > for sparse coding and PLSA formulations.
>>> >
>>> > Thanks.
>>> > Deb
>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: enum-like types in Spark

2015-03-17 Thread Xiangrui Meng
Let me give a quick summary. #4 got the majority vote, with CamelCase
but not UPPERCASE. The following is a minimal implementation that
works for both Scala and Java. In Python, we use strings for enums.
This proposal is only for new public APIs. We are not going to change
existing ones. -Xiangrui

~~~
sealed abstract class StorageLevel

object StorageLevel {

  def fromString(name: String): StorageLevel = ???

  val MemoryOnly: StorageLevel = {
    case object MemoryOnly extends StorageLevel
    MemoryOnly
  }

  val DiskOnly: StorageLevel = {
    case object DiskOnly extends StorageLevel
    DiskOnly
  }
}
~~~

On Mon, Mar 16, 2015 at 3:04 PM, Aaron Davidson  wrote:
> It's unrelated to the proposal, but Enum#ordinal() should be much faster,
> assuming it's not serialized to JVMs with different versions of the enum :)
>
> On Mon, Mar 16, 2015 at 12:12 PM, Kevin Markey 
> wrote:
>
>> In some applications, I have rather heavy use of Java enums which are
>> needed for related Java APIs that the application uses.  And unfortunately,
>> they are also used as keys.  As such, using the native hashcodes makes any
>> function over keys unstable and unpredictable, so we now use Enum.name() as
>> the key instead.  Oh well.  But it works and seems to work well.
>>
>> Kevin
>>
>>
>> On 03/05/2015 09:49 PM, Mridul Muralidharan wrote:
>>
>>>I have a strong dislike for java enum's due to the fact that they
>>> are not stable across JVM's - if it undergoes serde, you end up with
>>> unpredictable results at times [1].
>>> One of the reasons why we prevent enum's from being key : though it is
>>> highly possible users might depend on it internally and shoot
>>> themselves in the foot.
>>>
>>> Would be better to keep away from them in general and use something more
>>> stable.
>>>
>>> Regards,
>>> Mridul
>>>
>>> [1] Having had to debug this issue for 2 weeks - I really really hate it.
>>>
>>>
>>> On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid 
>>> wrote:
>>>
>>>> I have a very strong dislike for #1 (scala enumerations).   I'm ok with
>>>> #4
>>>> (with Xiangrui's final suggestion, especially making it sealed &
>>>> available
>>>> in Java), but I really think #2, java enums, are the best option.
>>>>
>>>> Java enums actually have some very real advantages over the other
>>>> approaches -- you get values(), valueOf(), EnumSet, and EnumMap.  There
>>>> has
>>>> been endless debate in the Scala community about the problems with the
>>>> approaches in Scala.  Very smart, level-headed Scala gurus have
>>>> complained
>>>> about their short-comings (Rex Kerr's name is coming to mind, though I'm
>>>> not positive about that); there have been numerous well-thought out
>>>> proposals to give Scala a better enum.  But the powers-that-be in Scala
>>>> always reject them.  IIRC the explanation for rejecting is basically that
>>>> (a) enums aren't important enough for introducing some new special
>>>> feature,
>>>> scala's got bigger things to work on and (b) if you really need a good
>>>> enum, just use java's enum.
>>>>
>>>> I doubt it really matters that much for Spark internals, which is why I
>>>> think #4 is fine.  But I figured I'd give my spiel, because every
>>>> developer
>>>> loves language wars :)
>>>>
>>>> Imran
>>>>
>>>>
>>>>
>>>> On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng  wrote:
>>>>
>>>>  `case object` inside an `object` doesn't show up in Java. This is the
>>>>> minimal code I found to make everything show up correctly in both
>>>>> Scala and Java:
>>>>>
>>>>> sealed abstract class StorageLevel // cannot be a trait
>>>>>
>>>>> object StorageLevel {
>>>>>private[this] case object _MemoryOnly extends StorageLevel
>>>>>final val MemoryOnly: StorageLevel = _MemoryOnly
>>>>>
>>>>>private[this] case object _DiskOnly extends StorageLevel
>>>>>final val DiskOnly: StorageLevel = _DiskOnly
>>>>> }
>>>>>
>>>>> On Wed, Mar 4, 2015 at 8:10 PM, Patrick Wendell 
>>>>> wrote:
>>>>>
>>>>>> I like #4 as well and agree with Aaron's suggestion.
>>&

Re: enum-like types in Spark

2015-03-16 Thread Xiangrui Meng
In MLlib, we use strings for enum-like types in Python APIs, which is
quite common in Python and easy for py4j. On the JVM side, we
implement `fromString` to convert them back to enums. -Xiangrui
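
As a rough sketch of that convention (all names below are hypothetical, not an actual MLlib API): the Python wrapper takes a plain string, validates it eagerly for a friendly error, and hands it to the JVM side, which maps it back with a fromString() helper.

~~~
_SUPPORTED_STORAGE_LEVELS = ("MemoryOnly", "DiskOnly")


def _check_storage_level(name):
    """Validate the string eagerly so users get a friendly Python error."""
    if name not in _SUPPORTED_STORAGE_LEVELS:
        raise ValueError("Unknown storage level %r; expected one of %s"
                         % (name, ", ".join(_SUPPORTED_STORAGE_LEVELS)))
    return name


def set_storage_level(java_model, name):
    # java_model is assumed to be a py4j handle whose Scala side accepts a
    # String and converts it with StorageLevel.fromString(name).
    java_model.setStorageLevel(_check_storage_level(name))
~~~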

On Wed, Mar 11, 2015 at 12:56 PM, RJ Nowling  wrote:
> How do these proposals affect PySpark?  I think compatibility with PySpark
> through Py4J should be considered.
>
> On Mon, Mar 9, 2015 at 8:39 PM, Patrick Wendell  wrote:
>
>> Does this matter for our own internal types in Spark? I don't think
>> any of these types are designed to be used in RDD records, for
>> instance.
>>
>> On Mon, Mar 9, 2015 at 6:25 PM, Aaron Davidson  wrote:
>> > Perhaps the problem with Java enums that was brought up was actually that
>> > their hashCode is not stable across JVMs, as it depends on the memory
>> > location of the enum itself.
>> >
>> > On Mon, Mar 9, 2015 at 6:15 PM, Imran Rashid 
>> wrote:
>> >
>> >> Can you expand on the serde issues w/ java enum's at all?  I haven't
>> heard
>> >> of any problems specific to enums.  The java object serialization rules
>> >> seem very clear and it doesn't seem like different jvms should have a
>> >> choice on what they do:
>> >>
>> >>
>> >>
>> http://docs.oracle.com/javase/6/docs/platform/serialization/spec/serial-arch.html#6469
>> >>
>> >> (in a nutshell, serialization must use enum.name())
>> >>
>> >> of course there are plenty of ways the user could screw this up(eg.
>> rename
>> >> the enums, or change their meaning, or remove them).  But then again,
>> all
>> >> of java serialization has issues w/ serialization the user has to be
>> aware
>> >> of.  Eg., if we go with case objects, than java serialization blows up
>> if
>> >> you add another helper method, even if that helper method is completely
>> >> compatible.
>> >>
>> >> Some prior debate in the scala community:
>> >>
>> >>
>> https://groups.google.com/d/msg/scala-internals/8RWkccSRBxQ/AN5F_ZbdKIsJ
>> >>
>> >> SO post on which version to use in scala:
>> >>
>> >>
>> >>
>> http://stackoverflow.com/questions/1321745/how-to-model-type-safe-enum-types
>> >>
>> >> SO post about the macro-craziness people try to add to scala to make
>> them
>> >> almost as good as a simple java enum:
>> >> (NB: the accepted answer doesn't actually work in all cases ...)
>> >>
>> >>
>> >>
>> http://stackoverflow.com/questions/20089920/custom-scala-enum-most-elegant-version-searched
>> >>
>> >> Another proposal to add better enums built into scala ... but seems to
>> be
>> >> dormant:
>> >>
>> >> https://groups.google.com/forum/#!topic/scala-sips/Bf82LxK02Kk
>> >>
>> >>
>> >>
>> >> On Thu, Mar 5, 2015 at 10:49 PM, Mridul Muralidharan 
>> >> wrote:
>> >>
>> >> >   I have a strong dislike for java enum's due to the fact that they
>> >> > are not stable across JVM's - if it undergoes serde, you end up with
>> >> > unpredictable results at times [1].
>> >> > One of the reasons why we prevent enum's from being key : though it is
>> >> > highly possible users might depend on it internally and shoot
>> >> > themselves in the foot.
>> >> >
>> >> > Would be better to keep away from them in general and use something
>> more
>> >> > stable.
>> >> >
>> >> > Regards,
>> >> > Mridul
>> >> >
>> >> > [1] Having had to debug this issue for 2 weeks - I really really hate
>> it.
>> >> >
>> >> >
>> >> > On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid 
>> >> wrote:
>> >> > > I have a very strong dislike for #1 (scala enumerations).   I'm ok
>> with
>> >> > #4
>> >> > > (with Xiangrui's final suggestion, especially making it sealed &
>> >> > available
>> >> > > in Java), but I really think #2, java enums, are the best option.
>> >> > >
>> >> > > Java enums actually have some very real advantages over the other
>> >> > > approaches -- you get values(), valueOf(), EnumSet, and EnumMap.
>> There
>> >> 

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Xiangrui Meng
Krishna, I tested your linear regression example. For linear
regression, we changed the objective function from 1/n * \|Ax - b\|_2^2
to 1/(2n) * \|Ax - b\|_2^2 to be consistent with common least squares
formulations. It means you can reproduce the same result by multiplying
the step size by 2. This is not a problem if both runs continue until
convergence (assuming neither blows up). However, in your example, a
very small step size is chosen and it didn't converge in 100
iterations. In this case, the step size matters. I will put a note in
the migration guide. Thanks! -Xiangrui
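
For reference, a sketch of the compensation in PySpark; `training_rdd` and `old_step_size` are placeholders for whatever the original 1.2-era script used:

~~~
from pyspark.mllib.regression import LinearRegressionWithSGD


def train_like_spark_1_2(training_rdd, old_step_size, iterations=100):
    # The gradient of 1/(2n) * ||Ax - b||^2 is half the gradient of
    # 1/n * ||Ax - b||^2, so doubling the step size reproduces the old
    # updates when the run stops after a fixed number of iterations.
    return LinearRegressionWithSGD.train(
        training_rdd, iterations=iterations, step=2 * old_step_size)
~~~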

On Mon, Mar 9, 2015 at 1:38 PM, Sean Owen  wrote:
> I'm +1 as I have not heard of any one else seeing the Hive test
> failure, which is likely a test issue rather than code issue anyway,
> and not a blocker.
>
> On Fri, Mar 6, 2015 at 9:36 PM, Sean Owen  wrote:
>> Although the problem is small, especially if indeed the essential docs
>> changes are following just a couple days behind the final release, I
>> mean, why the rush if they're essential? wait a couple days, finish
>> them, make the release.
>>
>> Answer is, I think these changes aren't actually essential given the
>> comment from tdas, so: just mark these Critical? (although ... they do
>> say they're changes for the 1.3 release, so kind of funny to get to
>> them for 1.3.x or 1.4, but that's not important now.)
>>
>> I thought that Blocker really meant Blocker in this project, as I've
>> been encouraged to use it to mean "don't release without this." I
>> think we should use it that way. Just thinking of it as "extra
>> Critical" doesn't add anything. I don't think Documentation should be
>> special-cased as less important, and I don't think there's confusion
>> if Blocker means what it says, so I'd 'fix' that way.
>>
>> If nobody sees the Hive failure I observed, and if we can just zap
>> those "Blockers" one way or the other, +1
>>
>>
>> On Fri, Mar 6, 2015 at 9:17 PM, Patrick Wendell  wrote:
>>> Sean,
>>>
>>> The docs are distributed and consumed in a fundamentally different way
>>> than Spark code itself. So we've always considered the "deadline" for
>>> doc changes to be when the release is finally posted.
>>>
>>> If there are small inconsistencies with the docs present in the source
>>> code for that release tag, IMO that doesn't matter much since we don't
>>> even distribute the docs with Spark's binary releases and virtually no
>>> one builds and hosts the docs on their own (that I am aware of, at
>>> least). Perhaps we can recommend if people want to build the doc
>>> sources that they should always grab the head of the most recent
>>> release branch, to set expectations accordingly.
>>>
>>> In the past we haven't considered it worth holding up the release
>>> process for the purpose of the docs. It just doesn't make sense since
>>> they are consumed "as a service". If we decide to change this
>>> convention, it would mean shipping our releases later, since we
>>> could't pipeline the doc finalization with voting.
>>>
>>> - Patrick
>>>
>>> On Fri, Mar 6, 2015 at 11:02 AM, Sean Owen  wrote:
 Given the title and tagging, it sounds like there could be some
 must-have doc changes to go with what is being released as 1.3. It can
 be finished later, and published later, but then the docs source
 shipped with the release doesn't match the site, and until then, 1.3
 is released without some "must-have" docs for 1.3 on the site.

 The real question to me is: are there any further, absolutely
 essential doc changes that need to accompany 1.3 or not?

 If not, just resolve these. If there are, then it seems like the
 release has to block on them. If there are some docs that should have
 gone in for 1.3, but didn't, but aren't essential, well I suppose it
 bears thinking about how to not slip as much work, but it doesn't
 block.

 I think Documentation issues certainly can be a blocker and shouldn't
 be specially ignored.


 BTW the UISeleniumSuite issue is a real failure, but I do not think it
 is serious: http://issues.apache.org/jira/browse/SPARK-6205  It isn't
 a regression from 1.2.x, but only affects tests, and only affects a
 subset of build profiles.




 On Fri, Mar 6, 2015 at 6:43 PM, Patrick Wendell  wrote:
> Hey Sean,
>
>> SPARK-5310 Update SQL programming guide for 1.3
>> SPARK-5183 Document data source API
>> SPARK-6128 Update Spark Streaming Guide for Spark 1.3
>
> For these, the issue is that they are documentation JIRA's, which
> don't need to be timed exactly with the release vote, since we can
> update the documentation on the website whenever we want. In the past
> I've just mentally filtered these out when considering RC's. I see a
> few options here:
>
> 1. We downgrade such issues away from Blocker (more clear, but we risk
> loosing them in the fray if they really are things we want to have
> before 

Re: Loading previously serialized object to Spark

2015-03-09 Thread Xiangrui Meng
Well, it is the standard "hacky" way for model save/load in MLlib. We
have SPARK-4587 and SPARK-5991 to provide save/load for all MLlib
models, in an exchangeable format. -Xiangrui

On Mon, Mar 9, 2015 at 12:25 PM, Ulanov, Alexander
 wrote:
> Thanks so much! It works! Is it the standard way for Mllib models to be 
> serialized?
>
> Btw. The example I pasted below works if one implements a TestSuite with 
> MLlibTestSparkContext.
>
> -Original Message-
> From: Xiangrui Meng [mailto:men...@gmail.com]
> Sent: Monday, March 09, 2015 12:10 PM
> To: Ulanov, Alexander
> Cc: Akhil Das; dev
> Subject: Re: Loading previously serialized object to Spark
>
> Could you try `sc.objectFile` instead?
>
> sc.parallelize(Seq(model), 1).saveAsObjectFile("path") val sameModel = 
> sc.objectFile[NaiveBayesModel]("path").first()
>
> -Xiangrui
>
> On Mon, Mar 9, 2015 at 11:52 AM, Ulanov, Alexander  
> wrote:
>> Just tried, the same happens if I use the internal Spark serializer:
>> val serializer = SparkEnv.get.closureSerializer.newInstance
>>
>>
>> -Original Message-
>> From: Ulanov, Alexander
>> Sent: Monday, March 09, 2015 10:37 AM
>> To: Akhil Das
>> Cc: dev
>> Subject: RE: Loading previously serialized object to Spark
>>
>> Below is the code with standard MLlib class. Apparently this issue can 
>> happen in the same Spark instance.
>>
>> import java.io._
>>
>> import org.apache.spark.mllib.classification.NaiveBayes
>> import org.apache.spark.mllib.classification.NaiveBayesModel
>> import org.apache.spark.mllib.util.MLUtils
>>
>> val data = MLUtils.loadLibSVMFile(sc,
>> "hdfs://myserver:9000/data/mnist.scale")
>> val nb = NaiveBayes.train(data)
>> // RDD map works fine
>> val predictionAndLabels = data.map( lp =>
>> (nb.classifierModel.predict(lp.features), lp.label))
>>
>> // serialize the model to file and immediately load it val oos = new
>> ObjectOutputStream(new FileOutputStream("/home/myuser/nb.bin"))
>> oos.writeObject(nb)
>> oos.close
>> val ois = new ObjectInputStream(new
>> FileInputStream("/home/myuser/nb.bin"))
>> val nbSerialized = ois.readObject.asInstanceOf[NaiveBayesModel]
>> ois.close
>> // RDD map fails
>> val predictionAndLabels = data.map( lp =>
>> (nbSerialized.predict(lp.features), lp.label))
>> org.apache.spark.SparkException: Task not serializable
>> at 
>> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
>> at 
>> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>> at org.apache.spark.SparkContext.clean(SparkContext.scala:1453)
>> at org.apache.spark.rdd.RDD.map(RDD.scala:273)
>>
>>
>> From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
>> Sent: Sunday, March 08, 2015 3:17 AM
>> To: Ulanov, Alexander
>> Cc: dev
>> Subject: Re: Loading previously serialized object to Spark
>>
>> Can you paste the complete code?
>>
>> Thanks
>> Best Regards
>>
>> On Sat, Mar 7, 2015 at 2:25 AM, Ulanov, Alexander 
>> mailto:alexander.ula...@hp.com>> wrote:
>> Hi,
>>
>> I've implemented class MyClass in MLlib that does some operation on 
>> LabeledPoint. MyClass extends serializable, so I can map this operation on 
>> data of RDD[LabeledPoints], such as data.map(lp => MyClass.operate(lp)). I 
>> write this class in file with ObjectOutputStream.writeObject. Then I stop 
>> and restart Spark. I load this class from file with 
>> ObjectInputStream.readObject.asInstanceOf[MyClass]. When I try to map the 
>> same operation of this class to RDD, Spark throws not serializable exception:
>> org.apache.spark.SparkException: Task not serializable
>> at 
>> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
>> at 
>> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>> at org.apache.spark.SparkContext.clean(SparkContext.scala:1453)
>> at org.apache.spark.rdd.RDD.map(RDD.scala:273)
>>
>> Could you suggest why it throws this exception while MyClass is serializable 
>> by definition?
>>
>> Best regards, Alexander
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Loading previously serialized object to Spark

2015-03-09 Thread Xiangrui Meng
Could you try `sc.objectFile` instead?

sc.parallelize(Seq(model), 1).saveAsObjectFile("path")
val sameModel = sc.objectFile[NaiveBayesModel]("path").first()

-Xiangrui

On Mon, Mar 9, 2015 at 11:52 AM, Ulanov, Alexander
 wrote:
> Just tried, the same happens if I use the internal Spark serializer:
> val serializer = SparkEnv.get.closureSerializer.newInstance
>
>
> -Original Message-
> From: Ulanov, Alexander
> Sent: Monday, March 09, 2015 10:37 AM
> To: Akhil Das
> Cc: dev
> Subject: RE: Loading previously serialized object to Spark
>
> Below is the code with standard MLlib class. Apparently this issue can happen 
> in the same Spark instance.
>
> import java.io._
>
> import org.apache.spark.mllib.classification.NaiveBayes
> import org.apache.spark.mllib.classification.NaiveBayesModel
> import org.apache.spark.mllib.util.MLUtils
>
> val data = MLUtils.loadLibSVMFile(sc, "hdfs://myserver:9000/data/mnist.scale")
> val nb = NaiveBayes.train(data)
> // RDD map works fine
> val predictionAndLabels = data.map( lp => 
> (nb.classifierModel.predict(lp.features), lp.label))
>
> // serialize the model to file and immediately load it val oos = new 
> ObjectOutputStream(new FileOutputStream("/home/myuser/nb.bin"))
> oos.writeObject(nb)
> oos.close
> val ois = new ObjectInputStream(new FileInputStream("/home/myuser/nb.bin"))
> val nbSerialized = ois.readObject.asInstanceOf[NaiveBayesModel]
> ois.close
> // RDD map fails
> val predictionAndLabels = data.map( lp => (nbSerialized.predict(lp.features), 
> lp.label))
> org.apache.spark.SparkException: Task not serializable
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
> at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1453)
> at org.apache.spark.rdd.RDD.map(RDD.scala:273)
>
>
> From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
> Sent: Sunday, March 08, 2015 3:17 AM
> To: Ulanov, Alexander
> Cc: dev
> Subject: Re: Loading previously serialized object to Spark
>
> Can you paste the complete code?
>
> Thanks
> Best Regards
>
> On Sat, Mar 7, 2015 at 2:25 AM, Ulanov, Alexander 
> mailto:alexander.ula...@hp.com>> wrote:
> Hi,
>
> I've implemented class MyClass in MLlib that does some operation on 
> LabeledPoint. MyClass extends serializable, so I can map this operation on 
> data of RDD[LabeledPoints], such as data.map(lp => MyClass.operate(lp)). I 
> write this class in file with ObjectOutputStream.writeObject. Then I stop and 
> restart Spark. I load this class from file with 
> ObjectInputStream.readObject.asInstanceOf[MyClass]. When I try to map the 
> same operation of this class to RDD, Spark throws not serializable exception:
> org.apache.spark.SparkException: Task not serializable
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
> at 
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1453)
> at org.apache.spark.rdd.RDD.map(RDD.scala:273)
>
> Could you suggest why it throws this exception while MyClass is serializable 
> by definition?
>
> Best regards, Alexander
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: enum-like types in Spark

2015-03-05 Thread Xiangrui Meng
For #4, my previous proposal may confuse IDEs with the additional
types generated by the case objects, and their toString output
contains the underscore. The following works better:

sealed abstract class StorageLevel

object StorageLevel {
  final val MemoryOnly: StorageLevel = {
    case object MemoryOnly extends StorageLevel
    MemoryOnly
  }

  final val DiskOnly: StorageLevel = {
    case object DiskOnly extends StorageLevel
    DiskOnly
  }
}

MemoryOnly and DiskOnly can be used in pattern matching. If people are
okay with this approach, I can add it to the code style guide.

Imran, this is not just for internal APIs, which are relatively
flexible. It would be good to use the same approach to implement
public enum-like types from now on.

Best,
Xiangrui

On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid  wrote:
> I have a very strong dislike for #1 (scala enumerations).   I'm ok with #4
> (with Xiangrui's final suggestion, especially making it sealed & available
> in Java), but I really think #2, java enums, are the best option.
>
> Java enums actually have some very real advantages over the other
> approaches -- you get values(), valueOf(), EnumSet, and EnumMap.  There has
> been endless debate in the Scala community about the problems with the
> approaches in Scala.  Very smart, level-headed Scala gurus have complained
> about their short-comings (Rex Kerr's name is coming to mind, though I'm
> not positive about that); there have been numerous well-thought out
> proposals to give Scala a better enum.  But the powers-that-be in Scala
> always reject them.  IIRC the explanation for rejecting is basically that
> (a) enums aren't important enough for introducing some new special feature,
> scala's got bigger things to work on and (b) if you really need a good
> enum, just use java's enum.
>
> I doubt it really matters that much for Spark internals, which is why I
> think #4 is fine.  But I figured I'd give my spiel, because every developer
> loves language wars :)
>
> Imran
>
>
>
> On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng  wrote:
>
>> `case object` inside an `object` doesn't show up in Java. This is the
>> minimal code I found to make everything show up correctly in both
>> Scala and Java:
>>
>> sealed abstract class StorageLevel // cannot be a trait
>>
>> object StorageLevel {
>>   private[this] case object _MemoryOnly extends StorageLevel
>>   final val MemoryOnly: StorageLevel = _MemoryOnly
>>
>>   private[this] case object _DiskOnly extends StorageLevel
>>   final val DiskOnly: StorageLevel = _DiskOnly
>> }
>>
>> On Wed, Mar 4, 2015 at 8:10 PM, Patrick Wendell 
>> wrote:
>> > I like #4 as well and agree with Aaron's suggestion.
>> >
>> > - Patrick
>> >
>> > On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson 
>> wrote:
>> >> I'm cool with #4 as well, but make sure we dictate that the values
>> should
>> >> be defined within an object with the same name as the enumeration (like
>> we
>> >> do for StorageLevel). Otherwise we may pollute a higher namespace.
>> >>
>> >> e.g. we SHOULD do:
>> >>
>> >> trait StorageLevel
>> >> object StorageLevel {
>> >>   case object MemoryOnly extends StorageLevel
>> >>   case object DiskOnly extends StorageLevel
>> >> }
>> >>
>> >> On Wed, Mar 4, 2015 at 5:37 PM, Michael Armbrust <
>> mich...@databricks.com>
>> >> wrote:
>> >>
>> >>> #4 with a preference for CamelCaseEnums
>> >>>
>> >>> On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley 
>> >>> wrote:
>> >>>
>> >>> > another vote for #4
>> >>> > People are already used to adding "()" in Java.
>> >>> >
>> >>> >
>> >>> > On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch 
>> >>> wrote:
>> >>> >
>> >>> > > #4 but with MemoryOnly (more scala-like)
>> >>> > >
>> >>> > > http://docs.scala-lang.org/style/naming-conventions.html
>> >>> > >
>> >>> > > Constants, Values, Variable and Methods
>> >>> > >
>> >>> > > Constant names should be in upper camel case. That is, if the
>> member is
>> >>> > > final, immutable and it belongs to a package object or an object,
>> it
>> >>> may
>> >>> > be
>> >>> > > considered a const

Re: enum-like types in Spark

2015-03-04 Thread Xiangrui Meng
`case object` inside an `object` doesn't show up in Java. This is the
minimal code I found to make everything show up correctly in both
Scala and Java:

sealed abstract class StorageLevel // cannot be a trait

object StorageLevel {
  private[this] case object _MemoryOnly extends StorageLevel
  final val MemoryOnly: StorageLevel = _MemoryOnly

  private[this] case object _DiskOnly extends StorageLevel
  final val DiskOnly: StorageLevel = _DiskOnly
}

On Wed, Mar 4, 2015 at 8:10 PM, Patrick Wendell  wrote:
> I like #4 as well and agree with Aaron's suggestion.
>
> - Patrick
>
> On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson  wrote:
>> I'm cool with #4 as well, but make sure we dictate that the values should
>> be defined within an object with the same name as the enumeration (like we
>> do for StorageLevel). Otherwise we may pollute a higher namespace.
>>
>> e.g. we SHOULD do:
>>
>> trait StorageLevel
>> object StorageLevel {
>>   case object MemoryOnly extends StorageLevel
>>   case object DiskOnly extends StorageLevel
>> }
>>
>> On Wed, Mar 4, 2015 at 5:37 PM, Michael Armbrust 
>> wrote:
>>
>>> #4 with a preference for CamelCaseEnums
>>>
>>> On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley 
>>> wrote:
>>>
>>> > another vote for #4
>>> > People are already used to adding "()" in Java.
>>> >
>>> >
>>> > On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch 
>>> wrote:
>>> >
>>> > > #4 but with MemoryOnly (more scala-like)
>>> > >
>>> > > http://docs.scala-lang.org/style/naming-conventions.html
>>> > >
>>> > > Constants, Values, Variable and Methods
>>> > >
>>> > > Constant names should be in upper camel case. That is, if the member is
>>> > > final, immutable and it belongs to a package object or an object, it
>>> may
>>> > be
>>> > > considered a constant (similar to Java's static final members):
>>> > >
>>> > >
>>> > >1. object Container {
>>> > >2. val MyConstant = ...
>>> > >3. }
>>> > >
>>> > >
>>> > > 2015-03-04 17:11 GMT-08:00 Xiangrui Meng :
>>> > >
>>> > > > Hi all,
>>> > > >
>>> > > > There are many places where we use enum-like types in Spark, but in
>>> > > > different ways. Every approach has both pros and cons. I wonder
>>> > > > whether there should be an "official" approach for enum-like types in
>>> > > > Spark.
>>> > > >
>>> > > > 1. Scala's Enumeration (e.g., SchedulingMode, WorkerState, etc)
>>> > > >
>>> > > > * All types show up as Enumeration.Value in Java.
>>> > > >
>>> > > >
>>> > >
>>> >
>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html
>>> > > >
>>> > > > 2. Java's Enum (e.g., SaveMode, IOMode)
>>> > > >
>>> > > > * Implementation must be in a Java file.
>>> > > > * Values don't show up in the ScalaDoc:
>>> > > >
>>> > > >
>>> > >
>>> >
>>> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode
>>> > > >
>>> > > > 3. Static fields in Java (e.g., TripletFields)
>>> > > >
>>> > > > * Implementation must be in a Java file.
>>> > > > * Doesn't need "()" in Java code.
>>> > > > * Values don't show up in the ScalaDoc:
>>> > > >
>>> > > >
>>> > >
>>> >
>>> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields
>>> > > >
>>> > > > 4. Objects in Scala. (e.g., StorageLevel)
>>> > > >
>>> > > > * Needs "()" in Java code.
>>> > > > * Values show up in both ScalaDoc and JavaDoc:
>>> > > >
>>> > > >
>>> > >
>>> >
>>> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
>>> > > >
>>> > > >
>>> > >
>>> >
>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html
>>> > > >
>>> > > > It would be great if we have an "official" approach for this as well
>>> > > > as the naming convention for enum-like values ("MEMORY_ONLY" or
>>> > > > "MemoryOnly"). Personally, I like 4) with "MEMORY_ONLY". Any
>>> thoughts?
>>> > > >
>>> > > > Best,
>>> > > > Xiangrui
>>> > > >
>>> > > > -
>>> > > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> > > > For additional commands, e-mail: dev-h...@spark.apache.org
>>> > > >
>>> > > >
>>> > >
>>> >
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



enum-like types in Spark

2015-03-04 Thread Xiangrui Meng
Hi all,

There are many places where we use enum-like types in Spark, but in
different ways. Every approach has both pros and cons. I wonder
whether there should be an “official” approach for enum-like types in
Spark.

1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc)

* All types show up as Enumeration.Value in Java.
http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html

2. Java’s Enum (e.g., SaveMode, IOMode)

* Implementation must be in a Java file.
* Values don't show up in the ScalaDoc:
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode

3. Static fields in Java (e.g., TripletFields)

* Implementation must be in a Java file.
* Doesn’t need “()” in Java code.
* Values don't show up in the ScalaDoc:
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields

4. Objects in Scala. (e.g., StorageLevel)

* Needs “()” in Java code.
* Values show up in both ScalaDoc and JavaDoc:
  
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
  
http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html
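
For reference, a minimal sketch of what option 1 above (Scala's Enumeration) looks like, roughly how Spark's SchedulingMode is defined; the individual values surface in Java only as Enumeration.Value:

object SchedulingMode extends Enumeration {
  type SchedulingMode = Value
  val FAIR, FIFO, NONE = Value
}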

It would be great if we have an “official” approach for this as well
as the naming convention for enum-like values (“MEMORY_ONLY” or
“MemoryOnly”). Personally, I like 4) with “MEMORY_ONLY”. Any thoughts?

Best,
Xiangrui

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-03 Thread Xiangrui Meng
On Tue, Mar 3, 2015 at 11:15 PM, Krishna Sankar  wrote:
> +1 (non-binding, of course)
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 13:53 min
>  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
> -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
> 2. Tested pyspark, mllib - running as well as comparing results with 1.1.x &
> 1.2.x
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Lasso Regression OK
> But MSE has increased from 40.81 to 105.86. Has some refactoring happened
> on SGD/linear models? Or do we have some extra parameters, or a change of
> defaults?

Could you share the code you used? I don't remember any changes in
linear regression. Thanks! -Xiangrui
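
For reference, a typical way the MSE in item 2.2 would be computed with MLlib (illustrative sketch only; the tester's actual script was not shared, and trainingData/testData here are assumed RDD[LabeledPoint] splits). A change in defaults such as step size or iteration count would show up directly in this number:

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val model = LinearRegressionWithSGD.train(trainingData, 100)
val mse = testData.map { lp =>
  val err = model.predict(lp.features) - lp.label
  err * err
}.mean()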

> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>Center And Scale OK
>WSSSE has come down slightly
> 2.5. rdd operations OK
>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>Model evaluation/optimization (rank, numIter, lmbda) with itertools
> OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWIthSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> 4.0. SQL from Python
> 4.1. result = sqlContext.sql("SELECT * from Employees WHERE State = 'WA'")
> OK
>
> Cheers
> 
>
> On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.3.0!
>>
>> The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc2/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> Staging repositories for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1074/
>> (published with version '1.3.0')
>> https://repository.apache.org/content/repositories/orgapachespark-1075/
>> (published with version '1.3.0-rc2')
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.3.0!
>>
>> The vote is open until Saturday, March 07, at 04:17 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == How does this compare to RC1 ==
>> This patch includes a variety of bug fixes found in RC1.
>>
>> == How can I help test this release? ==
>> If you are a Spark user, you can help us test this release by
>> taking a Spark 1.2 workload and running on this release candidate,
>> then reporting any regressions.
>>
>> If you are happy with this release based on your own testing, give a +1
>> vote.
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening towards the end of the 1.3 QA period,
>> so -1 votes should only occur for significant regressions from 1.2.1.
>> Bugs already present in 1.2.X, minor regressions, or bugs related
>> to new features will not block this release.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Using CUDA within Spark / boosting linear algebra

2015-03-02 Thread Xiangrui Meng
On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday  wrote:
> Also, check the JNILoader output.
>
> Remember, for netlib-java to use your system libblas all you need to do is
> setup libblas.so.3 like any native application would expect.
>
> I haven't ever used the cublas "real BLAS"  implementation, so I'd be
> interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to check
> that all the runtime links are in order.
>

There are two shared libraries in this hybrid setup. nvblas.so must be
loaded before libblas.so to intercept level 3 routines using GPU. More
details are at: http://docs.nvidia.com/cuda/nvblas/index.html#Usage

> Btw, I have some DGEMM wrappers in my netlib-java performance module... and
> I also planned to write more in MultiBLAS (until I mothballed the project
> for the hardware to catch up, which it probably has, and now I just need a
> reason to look at it)
>
> On 27 Feb 2015 20:26, "Xiangrui Meng"  wrote:
>>
>> Hey Sam,
>>
>> The running times are not "big O" estimates:
>>
>> > The CPU version finished in 12 seconds.
>> > The CPU->GPU->CPU version finished in 2.2 seconds.
>> > The GPU version finished in 1.7 seconds.
>>
>> I think there is something wrong with the netlib/cublas combination.
>> Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
>> interfaces. I checked the CUDA doc and it seems that to use GPU BLAS
>> through the CPU BLAS interface we need to use NVBLAS, which intercepts
>> some Level 3 CPU BLAS calls (including GEMM). So we need to load
>> nvblas.so first and then some CPU BLAS library in JNI. I wonder
>> whether the setup was correct.
>>
>> Alexander, could you check whether GPU is used in the netlib-cublas
>> experiments? You can tell it by watching CPU/GPU usage.
>>
>> Best,
>> Xiangrui
>>
>> On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday 
>> wrote:
>> > Don't use "big O" estimates, always measure. It used to work back in the
>> > days when double multiplication was a bottleneck. The computation cost
>> > is
>> > effectively free on both the CPU and GPU and you're seeing pure copying
>> > costs. Also, I'm dubious that cublas is doing what you think it is. Can
>> > you
>> > link me to the source code for DGEMM?
>> >
>> > I show all of this in my talk, with explanations, I can't stress enough
>> > how
>> > much I recommend that you watch it if you want to understand high
>> > performance hardware acceleration for linear algebra :-)
>> >
>> > On 27 Feb 2015 01:42, "Xiangrui Meng"  wrote:
>> >>
>> >> The copying overhead should be quadratic on n, while the computation
>> >> cost is cubic on n. I can understand that netlib-cublas is slower than
>> >> netlib-openblas on small problems. But I'm surprised to see that it is
>> still 20x slower on 10000x10000. I did the following on a g2.2xlarge
>> >> instance with BIDMat:
>> >>
>> val n = 10000
>> >>
>> >> val f = rand(n, n)
>> >> flip; f*f; val rf = flop
>> >>
>> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
>> >>
>> >> flip; g*g; val rgg = flop
>> >>
>> >> The CPU version finished in 12 seconds.
>> >> The CPU->GPU->CPU version finished in 2.2 seconds.
>> >> The GPU version finished in 1.7 seconds.
>> >>
>> >> I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
>> >> path. But based on the result, the data copying overhead is definitely
>> not as big as 20x at n = 10000.
>> >>
>> >> Best,
>> >> Xiangrui
>> >>
>> >>
>> >> On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday 
>> >> wrote:
>> >> > I've had some email exchanges with the author of BIDMat: it does
>> >> > exactly
>> >> > what you need to get the GPU benefit and writes higher level
>> >> > algorithms
>> >> > entirely in the GPU kernels so that the memory stays there as long as
>> >> > possible. The restriction with this approach is that it is only
>> >> > offering
>> >> > high-level algorithms so is not a toolkit for applied mathematics
>> >> > research and development --- but it works well as a toolkit for
>> >> > higher

Re: Using CUDA within Spark / boosting linear algebra

2015-02-27 Thread Xiangrui Meng
Hey Sam,

The running times are not "big O" estimates:

> The CPU version finished in 12 seconds.
> The CPU->GPU->CPU version finished in 2.2 seconds.
> The GPU version finished in 1.7 seconds.

I think there is something wrong with the netlib/cublas combination.
Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
interfaces. I checked the CUDA doc and it seems that to use GPU BLAS
through the CPU BLAS interface we need to use NVBLAS, which intercepts
some Level 3 CPU BLAS calls (including GEMM). So we need to load
nvblas.so first and then some CPU BLAS library in JNI. I wonder
whether the setup was correct.

Alexander, could you check whether GPU is used in the netlib-cublas
experiments? You can tell it by watching CPU/GPU usage.

Best,
Xiangrui

On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday  wrote:
> Don't use "big O" estimates, always measure. It used to work back in the
> days when double multiplication was a bottleneck. The computation cost is
> effectively free on both the CPU and GPU and you're seeing pure copying
> costs. Also, I'm dubious that cublas is doing what you think it is. Can you
> link me to the source code for DGEMM?
>
> I show all of this in my talk, with explanations, I can't stress enough how
> much I recommend that you watch it if you want to understand high
> performance hardware acceleration for linear algebra :-)
>
> On 27 Feb 2015 01:42, "Xiangrui Meng"  wrote:
>>
>> The copying overhead should be quadratic on n, while the computation
>> cost is cubic on n. I can understand that netlib-cublas is slower than
>> netlib-openblas on small problems. But I'm surprised to see that it is
>> still 20x slower on 10000x10000. I did the following on a g2.2xlarge
>> instance with BIDMat:
>>
>> val n = 10000
>>
>> val f = rand(n, n)
>> flip; f*f; val rf = flop
>>
>> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
>>
>> flip; g*g; val rgg = flop
>>
>> The CPU version finished in 12 seconds.
>> The CPU->GPU->CPU version finished in 2.2 seconds.
>> The GPU version finished in 1.7 seconds.
>>
>> I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
>> path. But based on the result, the data copying overhead is definitely
>> not as big as 20x at n = 10000.
>>
>> Best,
>> Xiangrui
>>
>>
>> On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday 
>> wrote:
>> > I've had some email exchanges with the author of BIDMat: it does exactly
>> > what you need to get the GPU benefit and writes higher level algorithms
>> > entirely in the GPU kernels so that the memory stays there as long as
>> > possible. The restriction with this approach is that it is only offering
>> > high-level algorithms so is not a toolkit for applied mathematics
>> > research and development --- but it works well as a toolkit for higher
>> > level analysis (e.g. for analysts and practitioners).
>> >
>> > I believe BIDMat's approach is the best way to get performance out of
>> > GPU hardware at the moment but I also have strong evidence to suggest
>> > that the hardware will catch up and the memory transfer costs between
>> > CPU/GPU will disappear meaning that there will be no need for custom GPU
>> > kernel implementations. i.e. please continue to use BLAS primitives when
>> > writing new algorithms and only go to the GPU for an alternative
>> > optimised implementation.
>> >
>> > Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
>> > an API that looks like BLAS but takes pointers to special regions in the
>> > GPU memory region. Somebody has written a wrapper around CUDA to create
>> > a proper BLAS library but it only gives marginal performance over the
>> > CPU because of the memory transfer overhead.
>> >
>> > This slide from my talk
>> >
>> >   http://fommil.github.io/scalax14/#/11/2
>> >
>> > says it all. X axis is matrix size, Y axis is logarithmic time to do
>> > DGEMM. Black line is the "cheating" time for the GPU and the green line
>> > is after copying the memory to/from the GPU memory. APUs have the
>> > potential to eliminate the green line.
>> >
>> > Best regards,
>> > Sam
>> >
>> >
>> >
>> > "Ulanov, Alexander"  writes:
>> >
>> >> Evan, thank you for the summary. I would like to add some more
>> >> observations. The GPU that I used is 2.5 times cheaper than the

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Xiangrui Meng
The copying overhead should be quadratic on n, while the computation
cost is cubic on n. I can understand that netlib-cublas is slower than
netlib-openblas on small problems. But I'm surprised to see that it is
still 20x slower on 10000x10000. I did the following on a g2.2xlarge
instance with BIDMat:

val n = 10000

val f = rand(n, n)
flip; f*f; val rf = flop

flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop

flip; g*g; val rgg = flop

The CPU version finished in 12 seconds.
The CPU->GPU->CPU version finished in 2.2 seconds.
The GPU version finished in 1.7 seconds.

I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
path. But based on the result, the data copying overhead is definitely
not as big as 20x at n = 10000.
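
For a rough sense of the quadratic-vs-cubic argument above, here is the back-of-the-envelope arithmetic at n = 10000 (illustrative numbers only):

val n = 10000L
val flops = 2L * n * n * n          // ~2.0e12 floating-point operations for one DGEMM
val bytesMoved = 3L * n * n * 8L    // ~2.4 GB to copy A, B and the result as doubles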

Best,
Xiangrui


On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday  wrote:
> I've had some email exchanges with the author of BIDMat: it does exactly
> what you need to get the GPU benefit and writes higher level algorithms
> entirely in the GPU kernels so that the memory stays there as long as
> possible. The restriction with this approach is that it is only offering
> high-level algorithms so is not a toolkit for applied mathematics
> research and development --- but it works well as a toolkit for higher
> level analysis (e.g. for analysts and practitioners).
>
> I believe BIDMat's approach is the best way to get performance out of
> GPU hardware at the moment but I also have strong evidence to suggest
> that the hardware will catch up and the memory transfer costs between
> CPU/GPU will disappear meaning that there will be no need for custom GPU
> kernel implementations. i.e. please continue to use BLAS primitives when
> writing new algorithms and only go to the GPU for an alternative
> optimised implementation.
>
> Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
> an API that looks like BLAS but takes pointers to special regions in the
> GPU memory region. Somebody has written a wrapper around CUDA to create
> a proper BLAS library but it only gives marginal performance over the
> CPU because of the memory transfer overhead.
>
> This slide from my talk
>
>   http://fommil.github.io/scalax14/#/11/2
>
> says it all. X axis is matrix size, Y axis is logarithmic time to do
> DGEMM. Black line is the "cheating" time for the GPU and the green line
> is after copying the memory to/from the GPU memory. APUs have the
> potential to eliminate the green line.
>
> Best regards,
> Sam
>
>
>
> "Ulanov, Alexander"  writes:
>
>> Evan, thank you for the summary. I would like to add some more observations. 
>> The GPU that I used is 2.5 times cheaper than the CPU ($250 vs $100). They 
>> both are 3 years old. I also did a small test with modern hardware, and
>> the new GPU (nVidia Titan) was slightly more than an order of magnitude faster
>> than an Intel E5-2650 v2 on the same tests. However, it costs as much as the CPU
>> ($1200). My takeaway is that GPUs are making better price/performance progress.
>>
>>
>>
>> Xiangrui, I was also surprised that BIDMat-cuda was faster than netlib-cuda 
>> and the most reasonable explanation is that it holds the result in GPU 
>> memory, as Sam suggested. At the same time, it is OK because you can copy 
>> the result back from GPU only when needed. However, to be sure, I am going 
>> to ask the developer of BIDMat on his upcoming talk.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>> From: Sam Halliday [mailto:sam.halli...@gmail.com]
>> Sent: Thursday, February 26, 2015 1:56 PM
>> To: Xiangrui Meng
>> Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>>
>> Btw, I wish people would stop cheating when comparing CPU and GPU timings 
>> for things like matrix multiply :-P
>>
>> Please always compare apples with apples and include the time it takes to 
>> set up the matrices, send it to the processing unit, doing the calculation 
>> AND copying it back to where you need to see the results.
>>
>> Ignoring this method will make you believe that your GPU is thousands of 
>> times faster than it really is. Again, jump to the end of my talk for graphs 
>> and more discussion  especially the bit about me being keen on funding 
>> to investigate APU hardware further ;-) (I believe it will solve the problem)
>> On 26 Feb 2015 21:16, "Xiangrui Meng" 
>> mailto:men...@gmail.com>> wrote:
>> Hey Alexander,
>>
>> I don't quite understand the part where netlib-cublas is about 20x
>> slower t

Re: Google Summer of Code - ideas

2015-02-26 Thread Xiangrui Meng
There are a couple of things available in Scala/Java but missing in the Python API:

1. model import/export
2. evaluation metrics
3. distributed linear algebra
4. streaming algorithms

If you are interested, we can list/create target JIRAs and hunt them
down one by one.

Best,
Xiangrui

On Wed, Feb 25, 2015 at 7:37 PM, Manoj Kumar
 wrote:
> Hi,
>
> I think that would be really good. Are there any specific issues that should
> be implemented first, in order of priority?

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Xiangrui Meng
Hey Alexander,

I don't quite understand the part where netlib-cublas is about 20x
slower than netlib-openblas. What is the overhead of using a GPU BLAS
with netlib-java?

CC'ed Sam, the author of netlib-java.

Best,
Xiangrui

On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley  wrote:
> Better documentation for linking would be very helpful!  Here's a JIRA:
> https://issues.apache.org/jira/browse/SPARK-6019
>
>
> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks 
> wrote:
>
>> Thanks for compiling all the data and running these benchmarks, Alex. The
>> big takeaways here can be seen with this chart:
>>
>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>
>> 1) A properly configured GPU matrix multiply implementation (e.g.
>> BIDMat+GPU) can provide substantial (but less than an order of magnitude)
>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>> netlib-java+openblas-compiled).
>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse
>> than a well-tuned CPU implementation, particularly for larger matrices.
>> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
>> basically agrees with the authors own benchmarks (
>> https://github.com/fommil/netlib-java)
>>
>> I think that most of our users are in a situation where using GPUs may not
>> be practical - although we could consider having a good GPU backend
>> available as an option. However, *ALL* users of MLlib could benefit
>> (potentially tremendously) from using a well-tuned CPU-based BLAS
>> implementation. Perhaps we should consider updating the mllib guide with a
>> more complete section for enabling high performance binaries on OSX and
>> Linux? Or better, figure out a way for the system to fetch these
>> automatically.
>>
>> - Evan
>>
>>
>>
>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>> alexander.ula...@hp.com> wrote:
>>
>>> Just to summarize this thread, I was finally able to make all performance
>>> comparisons that we discussed. It turns out that:
>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>>
>>> Below is the link to the spreadsheet with full results.
>>>
>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>
>>> One thing still needs exploration: does BIDMat-cublas perform copying
>>> to/from machine’s RAM?
>>>
>>> -Original Message-
>>> From: Ulanov, Alexander
>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>> To: Evan R. Sparks
>>> Cc: Joseph Bradley; dev@spark.apache.org
>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>
>>> Thanks, Evan! It seems that ticket was marked as duplicate though the
>>> original one discusses slightly different topic. I was able to link netlib
>>> with MKL from BIDMat binaries. Indeed, MKL is statically linked inside a
>>> 60MB library.
>>>
>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>> +-------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>>> | 100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
>>> | 1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
>>> | 10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>>>
>>> It turns out that pre-compiled MKL is faster than pre-compiled OpenBLAS on
>>> my machine. I'll probably add two more columns with locally compiled
>>> OpenBLAS and CUDA.
>>>
>>> Alexander
>>>
>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>> Sent: Monday, February 09, 2015 6:06 PM
>>> To: Ulanov, Alexander
>>> Cc: Joseph Bradley; dev@spark.apache.org
>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>
>>> Great - perhaps we can move this discussion off-list and onto a JIRA
>>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>>>
>>> It seems like this is going to be somewhat exploratory for a while (and
>>> there's probably only a handful of us who really care about fast linear
>>> algebra!)
>>>
>>> - Evan
>>>
>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>>> alexander.ula...@hp.com> wrote:
>>> Hi Evan,
>>>
>>> Thank you for explanation and useful link. I am going to build OpenBLAS,
>>> link it with Netlib-java and perform benchmark again.
>>>
>>> Do I understand correctly that BIDMat binaries contain statically linked
>>> Intel MKL BLAS? It might be the reason why I am able to run BIDMat not
>>> having MKL BLAS installed on my server. If it is true, I wonder if it is OK
>>> because Intel sells this library. Nevertheless, it seems that in my case
>>> precompiled MKL BLAS performs better than precompiled OpenBLAS given that
>>> BIDMat and Netlib-java are supposed to be on par with JNI overheads.
>>>
>>> Though, it might be inter

Re: Help vote for Spark talks at the Hadoop Summit

2015-02-25 Thread Xiangrui Meng
Made 3 votes for each of the talks. Looking forward to seeing them at the
Hadoop Summit :) -Xiangrui

On Tue, Feb 24, 2015 at 9:54 PM, Reynold Xin  wrote:
> Hi all,
>
> The Hadoop Summit uses community choice voting to decide which talks to
> feature. It would be great if the community could help vote for Spark talks
> so that Spark has a good showing at this event. You can make three votes on
> each track. Below I've listed 3 talks that are important to Spark's
> roadmap. Please give 3 votes to each of the following talks.
>
> Committer Track: Lessons from Running Ultra Large Scale Spark Workloads on
> Hadoop
> https://hadoopsummit.uservoice.com/forums/283260-committer-track/suggestions/7074016
>
> Data Science track: DataFrames: large-scale data science on Hadoop data
> with Spark
> https://hadoopsummit.uservoice.com/forums/283261-data-science-and-hadoop/suggestions/7074147
>
> Future of Hadoop track: Online Approximate OLAP in SparkSQL
> https://hadoopsummit.uservoice.com/forums/283266-the-future-of-apache-hadoop/suggestions/7074424
>
>
> Thanks!

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Google Summer of Code - ideas

2015-02-24 Thread Xiangrui Meng
Would you be interested in working on MLlib's Python API during the
summer? We want everything we implement in Scala to be usable in both
Java and Python, but we are not there yet. It would be great if
someone is willing to help. -Xiangrui

On Sat, Feb 21, 2015 at 11:24 AM, Manoj Kumar
 wrote:
> Hello,
>
> I've been working on the Spark codebase for quite some time right now,
> especially on issues related to MLlib and a very small amount of PySpark
> and SparkSQL (https://github.com/apache/spark/pulls/MechCoder) .
>
> I would like to extend my work with Spark as a Google Summer of Code
> project.
> I want to know if there are specific projects related to MLlib that people
> would like to see. (I notice, there is no idea page for GSoC yet). There
> are a number of issues related to DecisionTrees, Ensembles, LDA (in the
> issue tracker) that I find really interesting and that could probably be clubbed into
> a project, but if the spark community has anything else in mind, I could
> work on the other issues pre-GSoC and try out something new during GSoC.
>
> Looking forward!
> --
> Godspeed,
> Manoj Kumar,
> http://manojbits.wordpress.com
> 
> http://github.com/MechCoder

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Batch prediction for ALS

2015-02-18 Thread Xiangrui Meng
Please create a JIRA for it and we should discuss the APIs first
before updating the code. -Xiangrui

On Tue, Feb 17, 2015 at 4:10 PM, Debasish Das  wrote:
> It will really help us if we merge it, but I guess it has already diverged
> from the new ALS... I will also take a look at it again and try to update it
> against the new ALS...
>
> On Tue, Feb 17, 2015 at 3:22 PM, Xiangrui Meng  wrote:
>>
>> It may be too late to merge it into 1.3. I'm going to make another
>> pass on your PR today. -Xiangrui
>>
>> On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das 
>> wrote:
>> > Hi,
>> >
>> > Will it be possible to merge this PR to 1.3 ?
>> >
>> > https://github.com/apache/spark/pull/3098
>> >
>> > The batch prediction API in ALS will be useful for us who want to cross
>> > validate on prec@k and MAP...
>> >
>> > Thanks.
>> > Deb
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Batch prediction for ALS

2015-02-17 Thread Xiangrui Meng
It may be too late to merge it into 1.3. I'm going to make another
pass on your PR today. -Xiangrui

On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das  wrote:
> Hi,
>
> Will it be possible to merge this PR to 1.3 ?
>
> https://github.com/apache/spark/pull/3098
>
> The batch prediction API in ALS will be useful for us who want to cross
> validate on prec@k and MAP...
>
> Thanks.
> Deb

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: mllib.recommendation Design

2015-02-17 Thread Xiangrui Meng
The current ALS implementation allows pluggable solvers for
NormalEquation, where we have a CholeskySolver and an NNLS solver. Please
check the current implementation and let us know how your constraint
solver would fit. For a general matrix factorization package, let's
make a JIRA and move our discussion there. -Xiangrui
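
For readers following along, a rough sketch of the kind of pluggable normal-equation solver being described. The names and signatures here are illustrative and do not match Spark's private internals, and the non-negativity step is a crude stand-in for a real NNLS or proximal solver:

import breeze.linalg.{diag, DenseMatrix, DenseVector}

// ALS accumulates A^T A and A^T b per user/item factor; a solver turns these into a factor vector.
trait NormalEquationSolver extends Serializable {
  def solve(ata: DenseMatrix[Double], atb: DenseVector[Double], lambda: Double): DenseVector[Double]
}

// Unconstrained least squares via a direct solve (Spark's built-in solver uses Cholesky).
class DirectSolver extends NormalEquationSolver {
  def solve(ata: DenseMatrix[Double], atb: DenseVector[Double], lambda: Double): DenseVector[Double] = {
    val k = atb.length
    (ata + diag(DenseVector.fill(k)(lambda))) \ atb
  }
}

// A constrained variant plugs in here, e.g. clipping to non-negative factors.
class NonNegativeSolver(inner: NormalEquationSolver) extends NormalEquationSolver {
  def solve(ata: DenseMatrix[Double], atb: DenseVector[Double], lambda: Double): DenseVector[Double] =
    inner.solve(ata, atb, lambda).map(math.max(_, 0.0))
}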

On Fri, Feb 13, 2015 at 7:46 AM, Debasish Das  wrote:
> Hi,
>
> I am a bit confused about the mllib design in master. I thought that the core
> algorithms would stay in mllib and ml would define the pipelines over the
> core algorithms, but it looks like in master ALS has been moved from mllib to ml...
>
> I am refactoring my PR to a factorization package and I want to build it on
> top of ml.recommendation.ALS (possibly extend from ml.recommendation.ALS
> since first version will use very similar RDD handling as ALS and a
> proximal solver that's being added to breeze)
>
> https://issues.apache.org/jira/browse/SPARK-2426
> https://github.com/scalanlp/breeze/pull/321
>
> Basically I am not sure if we should merge it with recommendation.ALS since
> this is more generic than recommendation. I am considering calling it
> ConstrainedALS where user can specify different constraint for user and
> product factors (Similar to GraphLab CF structure).
>
> I am also working on ConstrainedALM where the underlying algorithm is no
> longer ALS but nonlinear alternating minimization with constraints.
> https://github.com/scalanlp/breeze/pull/364
> This will let us do large rank matrix completion where there is no need to
> construct gram matrices. I will open up the JIRA soon after getting initial
> results
>
> I am a bit unsure about where I should add the factorization package. It
> will use the current ALS test cases, and I have to construct more test cases
> for sparse coding and PLSA formulations.
>
> Thanks.
> Deb

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [ml] Lost persistence for fold in crossvalidation.

2015-02-17 Thread Xiangrui Meng
There are three different regParams defined in the grid and there are
three folds. For simplicity, we didn't split the dataset into three and
reuse the splits, but redo the split for each fold. Then we need to cache 3*3
times. Note that the pipeline API is not yet optimized for
performance. It would be nice to optimize its performance in 1.4.
-Xiangrui

On Wed, Feb 11, 2015 at 11:13 AM, Peter Rudenko  wrote:
> Hi, I have a problem using Spark 1.2 with Pipeline + GridSearch +
> LogisticRegression. I've reimplemented the LogisticRegression.fit method and
> commented out instances.unpersist():
>
> override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
>   println(s"Fitting dataset ${dataset.take(1000).toSeq.hashCode()} with ParamMap $paramMap.")
>   transformSchema(dataset.schema, paramMap, logging = true)
>   import dataset.sqlContext._
>   val map = this.paramMap ++ paramMap
>   val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
>     .map {
>       case Row(label: Double, features: Vector) =>
>         LabeledPoint(label, features)
>     }
>
>   if (instances.getStorageLevel == StorageLevel.NONE) {
>     println("Instances not persisted")
>     instances.persist(StorageLevel.MEMORY_AND_DISK)
>   }
>
>   val lr = (new LogisticRegressionWithLBFGS)
>     .setValidateData(false)
>     .setIntercept(true)
>   lr.optimizer
>     .setRegParam(map(regParam))
>     .setNumIterations(map(maxIter))
>   val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
>   // instances.unpersist()
>   // copy model params
>   Params.inheritValues(map, this, lrm)
>   lrm
> }
>
> CrossValidator feeds the same SchemaRDD for each parameter (same hash code),
> but somewhere the cache is being flushed. There is enough memory. Here's the output:
>
> Fitting dataset 2051470010 with ParamMap {
> DRLogisticRegression-f35ae4d3-regParam: 0.1
> }.
> Instances not persisted
> Fitting dataset 2051470010 with ParamMap {
> DRLogisticRegression-f35ae4d3-regParam: 0.01
> }.
> Instances not persisted
> Fitting dataset 2051470010 with ParamMap {
> DRLogisticRegression-f35ae4d3-regParam: 0.001
> }.
> Instances not persisted
> Fitting dataset 802615223 with ParamMap {
> DRLogisticRegression-f35ae4d3-regParam: 0.1
> }.
> Instances not persisted
> Fitting dataset 802615223 with ParamMap {
> DRLogisticRegression-f35ae4d3-regParam: 0.01
> }.
> Instances not persisted
>
> I have 3 parameters in GridSearch and 3 folds for CrossValidation:
>
> val paramGrid = new ParamGridBuilder()
>   .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
>   .build()
>
> crossval.setEstimatorParamMaps(paramGrid)
> crossval.setNumFolds(3)
>
> I assume that the data should be read and cached 3 times, i.e. (1 to
> numFolds).combinations(2), and be independent of the number of parameters. But
> I see the data being read and cached 9 times.
>
> Thanks,
> Peter Rudenko
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: multi-line comment style

2015-02-09 Thread Xiangrui Meng
Btw, I think allowing `/* ... */` without the leading `*` in lines is
also useful. Check this line:
https://github.com/apache/spark/pull/4259/files#diff-e9dcb3b5f3de77fc31b3aff7831110eaR55,
where we put the R commands that can reproduce the test result. It is
easier if we write in the following style:

~~~
/*
 Using the following R code to load the data and train the model using
glmnet package.

 library("glmnet")
 data <- read.csv("path", header=FALSE, stringsAsFactors=FALSE)
 features <- as.matrix(data.frame(as.numeric(data$V2), as.numeric(data$V3)))
 label <- as.numeric(data$V1)
 weights <- coef(glmnet(features, label, family="gaussian", alpha = 0, lambda = 0))
 */
~~~

So people can copy & paste the R commands directly.

Xiangrui

On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng  wrote:
> I like the `/* .. */` style more, because it is easier for IDEs to
> recognize it as a block comment. If you press Enter in a comment
> block written in the `//` style, IDEs won't add the `//` for you. -Xiangrui
>
> On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin  wrote:
>> We should update the style doc to reflect what we have in most places
>> (which I think is //).
>>
>>
>>
>> On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> FWIW I like the multi-line // over /* */ from a purely style standpoint.
>>> The Google Java style guide[1] has some comment about code formatting tools
>>> working better with /* */ but there doesn't seem to be any strong arguments
>>> for one over the other I can find
>>>
>>> Thanks
>>> Shivaram
>>>
>>> [1]
>>>
>>> https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style
>>>
>>> On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell 
>>> wrote:
>>>
>>> > Personally I have no opinion, but agree it would be nice to standardize.
>>> >
>>> > - Patrick
>>> >
>>> > On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen  wrote:
>>> > > One thing Marcelo pointed out to me is that the // style does not
>>> > > interfere with commenting out blocks of code with /* */, which is a
>>> > > small good thing. I am also accustomed to // style for multiline, and
>>> > > reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */ style
>>> > > inline always looks a little funny to me.
>>> > >
>>> > > On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout <
>>> kayousterh...@gmail.com>
>>> > wrote:
>>> > >> Hi all,
>>> > >>
>>> > >> The Spark Style Guide
>>> > >> <
>>> > https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
>>> >
>>> > >> says multi-line comments should formatted as:
>>> > >>
>>> > >> /*
>>> > >>  * This is a
>>> > >>  * very
>>> > >>  * long comment.
>>> > >>  */
>>> > >>
>>> > >> But in my experience, we almost always use "//" for multi-line
>>> comments:
>>> > >>
>>> > >> // This is a
>>> > >> // very
>>> > >> // long comment.
>>> > >>
>>> > >> Here are some examples:
>>> > >>
>>> > >>- Recent commit by Reynold, king of style:
>>> > >>
>>> >
>>> https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58
>>> > >>- RDD.scala:
>>> > >>
>>> >
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L361
>>> > >>- DAGScheduler.scala:
>>> > >>
>>> >
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L281
>>> > >>
>>> > >>
>>> > >> Any objections to me updating the style guide to reflect this?  As
>>> with
>>> > >> other style issues, I think consistency here is helpful (and
>>> formatting
>>> > >> multi-line comments as "//" does nicely visually distinguish code
>>> > comments
>>> > >> from doc comments).
>>> > >>
>>> > >> -Kay
>>> > >
>>> > > -
>>> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> > > For additional commands, e-mail: dev-h...@spark.apache.org
>>> > >
>>> >
>>> > -
>>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: dev-h...@spark.apache.org
>>> >
>>> >
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: multi-line comment style

2015-02-09 Thread Xiangrui Meng
I like the `/* .. */` style more, because it is easier for IDEs to
recognize it as a block comment. If you press Enter in a comment
block written in the `//` style, IDEs won't add the `//` for you. -Xiangrui

On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin  wrote:
> We should update the style doc to reflect what we have in most places
> (which I think is //).
>
>
>
> On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> FWIW I like the multi-line // over /* */ from a purely style standpoint.
>> The Google Java style guide[1] has some comment about code formatting tools
>> working better with /* */ but there doesn't seem to be any strong arguments
>> for one over the other I can find
>>
>> Thanks
>> Shivaram
>>
>> [1]
>>
>> https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style
>>
>> On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell 
>> wrote:
>>
>> > Personally I have no opinion, but agree it would be nice to standardize.
>> >
>> > - Patrick
>> >
>> > On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen  wrote:
>> > > One thing Marcelo pointed out to me is that the // style does not
>> > > interfere with commenting out blocks of code with /* */, which is a
>> > > small good thing. I am also accustomed to // style for multiline, and
>> > > reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */ style
>> > > inline always looks a little funny to me.
>> > >
>> > > On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout <
>> kayousterh...@gmail.com>
>> > wrote:
>> > >> Hi all,
>> > >>
>> > >> The Spark Style Guide
>> > >> <
>> > https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
>> >
>> > >> says multi-line comments should formatted as:
>> > >>
>> > >> /*
>> > >>  * This is a
>> > >>  * very
>> > >>  * long comment.
>> > >>  */
>> > >>
>> > >> But in my experience, we almost always use "//" for multi-line
>> comments:
>> > >>
>> > >> // This is a
>> > >> // very
>> > >> // long comment.
>> > >>
>> > >> Here are some examples:
>> > >>
>> > >>- Recent commit by Reynold, king of style:
>> > >>
>> >
>> https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58
>> > >>- RDD.scala:
>> > >>
>> >
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L361
>> > >>- DAGScheduler.scala:
>> > >>
>> >
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L281
>> > >>
>> > >>
>> > >> Any objections to me updating the style guide to reflect this?  As
>> with
>> > >> other style issues, I think consistency here is helpful (and
>> formatting
>> > >> multi-line comments as "//" does nicely visually distinguish code
>> > comments
>> > >> from doc comments).
>> > >>
>> > >> -Kay
>> > >
>> > > -
>> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > > For additional commands, e-mail: dev-h...@spark.apache.org
>> > >
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: IDF for ml pipeline

2015-02-03 Thread Xiangrui Meng
Yes, we need a wrapper under spark.ml. Feel free to create a JIRA for
it. -Xiangrui

On Mon, Feb 2, 2015 at 8:56 PM, masaki rikitoku  wrote:
> Hi all
>
> I am trying the ml pipeline for text classification now.
>
> Recently, I succeeded in executing pipeline processing in the ml package,
> consisting of an original Japanese tokenizer, HashingTF,
> and LogisticRegression.
>
> Then I failed to execute the pipeline with the IDF from the mllib package directly.
>
> To use the IDF feature in the ml package,
> do I have to implement a wrapper for IDF in the ml package, like HashingTF?
>
> best
>
> Masaki Rikitoku
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Maximum size of vector that reduce can handle

2015-01-27 Thread Xiangrui Meng
60m-vector costs 480MB memory. You have 12 of them to be reduced to the
driver. So you need ~6GB memory not counting the temp vectors generated
from '_+_'. You need to increase driver memory to make it work. That being
said, ~10^7 hits the limit for the current impl of glm. -Xiangrui
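
For reference, a minimal sketch of the treeReduce variant discussed in the quoted reply below (Spark 1.2-style, where treeReduce comes from mllib's RDDFunctions; the depth of 3 is illustrative, and the driver still needs enough memory for a handful of these vectors):

import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._

val n = 60000000
val p = 12
val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
// The second argument is the aggregation depth; larger values reduce how much
// data the driver has to merge in the final step.
val sum = vv.treeReduce(_ + _, 3)
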
On Jan 23, 2015 2:19 PM, "DB Tsai"  wrote:

> Hi Alexander,
>
> For `reduce`, it's an action that will collect all the data from
> mapper to driver, and perform the aggregation in driver. As a result,
> if the output from the mapper is very large, and the numbers of
> partitions in mapper are large, it might cause a problem.
>
> For `treeReduce`, as the name indicates, the way it works is in the
> first layer, it aggregates the output of the mappers two by two
> resulting half of the numbers of output. And then, we continuously do
> the aggregation layer by layer. The final aggregation will be done in
> driver but in this time, the numbers of data are small.
>
> By default, depth 2 is used, so if you have so many partitions of
> large vector, this may still cause issue. You can increase the depth
> into higher numbers such that in the final reduce in driver, the
> number of partitions are very small.
>
> Sincerely,
>
> DB Tsai
> ---
> Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
>
> On Fri, Jan 23, 2015 at 12:07 PM, Ulanov, Alexander
>  wrote:
> > Hi DB Tsai,
> >
> > Thank you for your suggestion. Actually, I've started my experiments
> with "treeReduce". Originally, I had "vv.treeReduce(_ + _, 2)" in my script
> exactly because MLlib optimizers are using it, as you pointed out with
> LBFGS. However, it leads to the same problems as "reduce", but presumably
> not so directly. As far as I understand, treeReduce limits the number of
> communications between workers and master forcing workers to partially
> compute the reduce operation.
> >
> > Are you sure that driver will first collect all results (or all partial
> results in treeReduce) and ONLY then perform aggregation? If that is the
> problem, then how to force it to do aggregation after receiving each
> portion of data from Workers?
> >
> > Best regards, Alexander
> >
> > -Original Message-
> > From: DB Tsai [mailto:dbt...@dbtsai.com]
> > Sent: Friday, January 23, 2015 11:53 AM
> > To: Ulanov, Alexander
> > Cc: dev@spark.apache.org
> > Subject: Re: Maximum size of vector that reduce can handle
> >
> > Hi Alexander,
> >
> > When you use `reduce` to aggregate the vectors, those will actually be
> pulled into driver, and merged over there. Obviously, it's not scaleable
> given you are doing deep neural networks which have so many coefficients.
> >
> > Please try treeReduce instead which is what we do in linear regression
> and logistic regression.
> >
> > See
> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala
> > for example.
> >
> > val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))(
> >   seqOp = (c, v) => (c, v) match {
> >     case ((grad, loss), (label, features)) =>
> >       val l = localGradient.compute(features, label, bcW.value, grad)
> >       (grad, loss + l)
> >   },
> >   combOp = (c1, c2) => (c1, c2) match {
> >     case ((grad1, loss1), (grad2, loss2)) =>
> >       axpy(1.0, grad2, grad1)
> >       (grad1, loss1 + loss2)
> >   })
> >
> > Sincerely,
> >
> > DB Tsai
> > ---
> > Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> >
> > On Fri, Jan 23, 2015 at 10:00 AM, Ulanov, Alexander <
> alexander.ula...@hp.com> wrote:
> >> Dear Spark developers,
> >>
> >> I am trying to measure the Spark reduce performance for big vectors. My
> motivation is related to machine learning gradient. Gradient is a vector
> that is computed on each worker and then all results need to be summed up
> and broadcasted back to workers. For example, present machine learning
> applications involve very long parameter vectors, for deep neural networks
> it can be up to 2Billions. So, I want to measure the time that is needed
> for this operation depending on the size of vector and number of workers. I
> wrote few lines of code that assume that Spark will distribute partitions
> among all available workers. I have 6-machine cluster (Xeon 3.3GHz 4 cores,
> 16GB RAM), each runs 2 Workers.
> >>
> >> import org.apache.spark.mllib.rdd.RDDFunctions._
> >> import breeze.linalg._
> >> import org.apache.log4j._
> >> Logger.getRootLogger.setLevel(Level.OFF)
> >> val n = 60000000
> >> val p = 12
> >> val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
> >> vv.reduce(_ + _)
> >>
> >> When executing in shell with 60M vector it crashes after some period of
> time. One of the node contains the following in stdout:
> >> Java HotSpot(TM) 64-Bit Server VM warning: INFO:
> >> os::commit_memory(0x00075550, 2863661056, 0) failed;
> >> error='Cannot allocate memory' (errno=1

Re: Any interest in 'weighting' VectorTransformer which does component-wise scaling?

2015-01-27 Thread Xiangrui Meng
I would call it Scaler. You might want to add it to the spark.ml pipeline
API. Please check the spark.ml.HashingTF implementation. Note that this
should handle sparse vectors efficiently.
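
For illustration, a minimal sketch of such a transformer against the mllib VectorTransformer trait (the names Scaler and scalingVector are placeholders, not a final API); the sparse branch touches only the non-zero entries:

import org.apache.spark.mllib.feature.VectorTransformer
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

class Scaler(val scalingVector: Vector) extends VectorTransformer {
  override def transform(vector: Vector): Vector = {
    require(vector.size == scalingVector.size, "vector sizes must match")
    vector match {
      case dv: DenseVector =>
        // Scale every component of a dense vector.
        val values = dv.values.clone()
        var i = 0
        while (i < values.length) {
          values(i) *= scalingVector(i)
          i += 1
        }
        Vectors.dense(values)
      case sv: SparseVector =>
        // Scale only the stored (non-zero) entries, preserving sparsity.
        val values = sv.values.clone()
        var i = 0
        while (i < values.length) {
          values(i) *= scalingVector(sv.indices(i))
          i += 1
        }
        Vectors.sparse(sv.size, sv.indices, values)
    }
  }
}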

Hadamard and FFTs are quite useful. If you are interested, make sure that
we call an FFT library that is license-compatible with Apache.

-Xiangrui
On Jan 24, 2015 8:27 AM, "Octavian Geagla"  wrote:

> Hello,
>
> I found it useful to implement the Hadamard Product as
> a VectorTransformer.  It can be applied to scale (by a constant) a certain
> dimension (column) of the data set.
>
> Since I've already implemented it and am using it, I thought I'd see if
> there's interest in this feature going in as Experimental.  I'm not sold on
> the name 'Weighter', either.
>
> Here's the current branch with the work (docs, impl, tests).
> 
>
> The implementation was heavily inspired by those of StandardScalerModel and
> Normalizer.
>
> Thanks
> Octavian
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-tp10265.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: KNN for large data set

2015-01-21 Thread Xiangrui Meng
For large datasets, you need hashing in order to compute k-nearest
neighbors locally. You can start with LSH + k-nearest in Google
scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui
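
As a toy illustration of the hashing idea (not from the thread, and far from a tuned implementation): sign random projections bucket vectors so that exact k-NN only needs to run within each bucket.

import scala.util.Random

val dim = 100
val numBits = 16
val rng = new Random(17L)
// One random hyperplane per signature bit.
val planes = Array.fill(numBits, dim)(rng.nextGaussian())

def signature(v: Array[Double]): Int =
  (0 until numBits).foldLeft(0) { (bits, i) =>
    val dot = planes(i).zip(v).map { case (a, b) => a * b }.sum
    if (dot >= 0) bits | (1 << i) else bits
  }

// Assuming points: RDD[(Long, Array[Double])], group candidates by bucket and
// run exact k-NN inside each group:
// val buckets = points.map { case (id, v) => (signature(v), (id, v)) }.groupByKey()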

On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S.  wrote:
> Hi all,
>
> Please help me find the best way to do k-nearest neighbors using Spark for
> large data sets.
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spectral clustering

2015-01-20 Thread Xiangrui Meng
Fan and Stephen (cc'ed) are working on this feature. They will update
the JIRA page and report progress soon. -Xiangrui

On Fri, Jan 16, 2015 at 12:04 PM, Andrew Musselman
 wrote:
> Hi, thinking of picking up this Jira ticket:
> https://issues.apache.org/jira/browse/SPARK-4259
>
> Anyone done any work on this to date?  Any thoughts on it before we go too
> far in?
>
> Thanks!
>
> Best
> Andrew

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: DBSCAN for MLlib

2015-01-14 Thread Xiangrui Meng
Please find my comments on the JIRA page. -Xiangrui

On Tue, Jan 13, 2015 at 1:49 PM, Muhammad Ali A'råby
 wrote:
> I have to say, I have created a Jira task for it:
> [SPARK-5226] Add DBSCAN Clustering Algorithm to MLlib - ASF JIRA
>
>
>
>
>  On Wednesday, January 14, 2015 1:09 AM, Muhammad Ali A'råby 
>  wrote:
>
>
>  Dear all,
> I think MLlib needs more clustering algorithms and DBSCAN is my first 
> candidate. I am starting to implement it. Any advice?
> Muhammad-Ali
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Re-use scaling means and variances from StandardScalerModel

2015-01-09 Thread Xiangrui Meng
Feel free to create a JIRA for this issue. We might need to discuss
what to put in the public constructors. In the meantime, you can use
Java serialization to save/load the model:

sc.parallelize(Seq(model), 1).saveAsObjectFile("/tmp/model")
val model = sc.objectFile[StandardScalerModel]("/tmp/model").first()

-Xiangrui

On Fri, Jan 9, 2015 at 12:21 PM, ogeagla  wrote:
> Hello,
>
> I would like to re-use the means and variances computed by the fit function
> in the StandardScaler, as I persist them and my use case requires consisted
> scaling of data based on some initial data set.  The StandardScalerModel's
> constructor takes means and variances, but is private[mllib].
> Forking/compiling Spark or copy/pasting the class into my project are both
> options, but  I'd like to stay away from them.  Any chance there is interest
> in a PR to allow this re-use via removal of private from the the
> constructor?  Or perhaps an alternative solution exists?
>
> Thanks,
> Octavian
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-tp10073.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Announcing Spark Packages

2014-12-22 Thread Xiangrui Meng
Dear Spark users and developers,

I’m happy to announce Spark Packages (http://spark-packages.org), a
community package index to track the growing number of open source
packages and libraries that work with Apache Spark. Spark Packages
makes it easy for users to find, discuss, rate, and install packages
for any version of Spark, and makes it easy for developers to
contribute packages.

Spark Packages will feature integrations with various data sources,
management tools, higher level domain-specific libraries, machine
learning algorithms, code samples, and other Spark content. Thanks to
the package authors, the initial listing of packages includes
scientific computing libraries, a job execution server, a connector
for importing Avro data, tools for launching Spark on Google Compute
Engine, and many others.

I’d like to invite you to contribute and use Spark Packages and
provide feedback! As a disclaimer: Spark Packages is a community index
maintained by Databricks and (by design) will include packages outside
of the ASF Spark project. We are excited to help showcase and support
all of the great work going on in the broader Spark community!

Cheers,
Xiangrui

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org


