Re: eager execution and debuggability

2018-05-09 Thread Tim Hunter
The repr() trick is neat when working on a notebook. When working in a
library, I used to use an evaluate(dataframe) -> DataFrame function that
simply forces the materialization of a dataframe. As Reynold mentions, this
is very convenient when working on a lot of chained UDFs, and it is a
standard trick in lazy environments and languages.
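
Here is a minimal sketch of both the evaluate() helper and the repr() trick,
assuming an active SparkSession; the helper name and the wrapper class below
are illustrative and not part of any Spark API:

from pyspark.sql import DataFrame

def evaluate(df: DataFrame) -> DataFrame:
    """Force materialization so that errors in chained UDFs surface here,
    rather than at some later action."""
    df.cache()   # keep the materialized result around for later use
    df.count()   # trigger execution of the whole lineage
    return df

class EagerDisplay(object):
    """The repr() trick: returning this from a notebook cell executes the
    plan and renders a small sample as an HTML table."""
    def __init__(self, df, n=20):
        self.df, self.n = df, n

    def _repr_html_(self):
        # toPandas() triggers execution; pandas provides the HTML rendering.
        return self.df.limit(self.n).toPandas()._repr_html_()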

Tim

On Wed, May 9, 2018 at 3:26 AM, Reynold Xin  wrote:

> Yes would be great if possible but it’s non trivial (might be impossible
> to do in general; we already have stacktraces that point to line numbers
> when an error occurs in UDFs but clearly that’s not sufficient). Also in
> environments like REPL it’s still more useful to show error as soon as it
> occurs, rather than showing it potentially 30 lines later.
>
> On Tue, May 8, 2018 at 7:22 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> This may be technically impractical, but it would be fantastic if we
>> could make it easier to debug Spark programs without needing to rely on
>> eager execution. Sprinkling .count() and .checkpoint() at various points
>> in my code is still a debugging technique I use, but it always makes me
>> wish Spark could point more directly to the offending transformation when
>> something goes wrong.
>>
>> Is it somehow possible to have each individual operator (is that the
>> correct term?) in a DAG include metadata pointing back to the line of code
>> that generated the operator? That way when an action triggers an error, the
>> failing operation can point to the relevant line of code — even if it’s a
>> transformation — and not just the action on the tail end that triggered the
>> error.
>>
>> I don’t know how feasible this is, but addressing it would directly solve
>> the issue of linking failures to the responsible transformation, as opposed
>> to leaving the user to break up a chain of transformations with several
>> debug actions. And this would benefit new and experienced users alike.
>>
>> Nick
>>
>> On Tue, May 8, 2018 at 7:09 PM, Ryan Blue rb...@netflix.com.invalid wrote:
>>
>>> I've opened SPARK-24215 to track this.
>>>
>>> On Tue, May 8, 2018 at 3:58 PM, Reynold Xin  wrote:
>>>
 Yup. Sounds great. This is something simple Spark can do and provide
 huge value to the end users.


 On Tue, May 8, 2018 at 3:53 PM Ryan Blue  wrote:

> Would be great if it is something more turn-key.
>
> We can easily add the __repr__ and _repr_html_ methods and behavior
> to PySpark classes. We could also add a configuration property to 
> determine
> whether the dataset evaluation is eager or not. That would make it 
> turn-key
> for anyone running PySpark in Jupyter.
>
> For JVM languages, we could also add a dependency on jvm-repr and do
> the same thing.
>
> rb
> ​
>
> On Tue, May 8, 2018 at 3:47 PM, Reynold Xin 
> wrote:
>
>> s/underestimated/overestimated/
>>
>> On Tue, May 8, 2018 at 3:44 PM Reynold Xin 
>> wrote:
>>
>>> Marco,
>>>
>>> There is understanding how Spark works, and there is finding bugs
>>> early in one's own program. One can perfectly understand how Spark works
>>> and still find it valuable to get feedback as soon as possible, and that's why we built
>>> eager analysis in the first place.
>>>
>>> Also I'm afraid you've significantly underestimated the level of
>>> technical sophistication of users. In many cases they struggle to get
>>> anything to work, and performance optimization of their programs is
>>> secondary to getting things working. As John Ousterhout says, "the 
>>> greatest
>>> performance improvement of all is when a system goes from not-working to
>>> working".
>>>
>>> I really like Ryan's approach. Would be great if it is something
>>> more turn-key.
>>>
>>>
>>> On Tue, May 8, 2018 at 2:35 PM Marco Gaido 
>>> wrote:
>>>
 I am not sure how this is useful. For students, it is important to
 understand how Spark works. This can be critical in many decisions they
 have to take (whether and what to cache, for instance) in order to write
 performant Spark applications. Eager execution can probably help them get
 something running more easily, but it also lets them use Spark while
 knowing less about how it works, so they are likely to write worse
 applications and to have more problems debugging any kind of issue that
 may occur later (in production), which in turn affects their experience
 with the tool.

 Moreover, as Ryan also mentioned, there are tools/ways to force the
 execution, helping in the debugging phase. So they can 

[ml] Deep learning talks at the Spark Summit Europe

2017-10-10 Thread Tim Hunter
Hello all,
following the last Summit, there will be a couple of exciting talks
about deep learning and Spark at the next Spark Summit in Dublin.
 - Deep Dive Into Deep Learning Pipelines, in which we will go even
deeper into the technical aspects for an hour-long session
 - Apache Spark and TensorFlow as a service, by Jim Dowling

If you have not gotten your ticket yet, there is still time! You can
use the promo code DatabricksEU for a 15% discount.

Looking forward to meeting the dev community on the East side of the Atlantic.

Tim




Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-28 Thread Tim Hunter
Thank you everyone for the comments and the votes. We will follow up
shortly with a pull request.

On Wed, Sep 27, 2017 at 6:32 PM, Joseph Bradley <jos...@databricks.com>
wrote:

> This vote passes with 11 +1s (4 binding) and no +0s or -1s.
>
> +1:
> Sean Owen (binding)
> Holden Karau
> Denny Lee
> Reynold Xin (binding)
> Joseph Bradley (binding)
> Noman Khan
> Weichen Xu
> Yanbo Liang
> Dongjoon Hyun
> Matei Zaharia (binding)
> Vaquar Khan
>
> Thanks everyone!
> Joseph
>
> On Sat, Sep 23, 2017 at 4:23 PM, vaquar khan <vaquar.k...@gmail.com>
> wrote:
>
>> +1 looks good,
>>
>> Regards,
>> Vaquar khan
>>
>> On Sat, Sep 23, 2017 at 12:22 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>>> +1; we should consider something similar for multi-dimensional tensors
>>> too.
>>>
>>> Matei
>>>
>>> > On Sep 23, 2017, at 7:27 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>>> >
>>> > +1
>>> >
>>> > On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan <nomanbp...@live.com>
>>> wrote:
>>> > +1
>>> >
>>> > Regards
>>> > Noman
>>> > From: Denny Lee <denny.g@gmail.com>
>>> > Sent: Friday, September 22, 2017 2:59:33 AM
>>> > To: Apache Spark Dev; Sean Owen; Tim Hunter
>>> > Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
>>> > Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
>>> >
>>> > +1
>>> >
>>> > On Thu, Sep 21, 2017 at 11:15 Sean Owen <so...@cloudera.com> wrote:
>>> > Am I right that this doesn't mean other packages would use this
>>> representation, but that they could?
>>> >
>>> > The representation looked fine to me w.r.t. what DL frameworks need.
>>> >
>>> > My previous comment was that this is actually quite lightweight. It's
>>> kind of like how I/O support is provided for CSV and JSON, so makes enough
>>> sense to add to Spark. It doesn't really preclude other solutions.
>>> >
>>> > For those reasons I think it's fine. +1
>>> >
>>> > On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter <timhun...@databricks.com>
>>> wrote:
>>> > Hello community,
>>> >
>>> > I would like to call for a vote on SPARK-21866. It is a short proposal
>>> that has important applications for image processing and deep learning.
>>> Joseph Bradley has offered to be the shepherd.
>>> >
>>> > JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
>>> > PDF version: https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
>>> >
>>> > Background and motivation
>>> > As Apache Spark is being used more and more in the industry, some new
>>> use cases are emerging for different data formats beyond the traditional
>>> SQL types or the numerical types (vectors and matrices). Deep Learning
>>> applications commonly deal with image processing. A number of projects add
>>> some Deep Learning capabilities to Spark (see list below), but they
>>> struggle to communicate with each other or with MLlib pipelines because
>>> there is no standard way to represent an image in Spark DataFrames. We
>>> propose to federate efforts for representing images in Spark by defining a
>>> representation that caters to the most common needs of users and library
>>> developers.
>>> > This SPIP proposes a specification to represent images in Spark
>>> DataFrames and Datasets (based on existing industrial standards), and an
>>> interface for loading sources of images. It is not meant to be a
>>> full-fledged image processing library, but rather the core description that
>>> other libraries and users can rely on. Several packages already offer
>>> various processing facilities for transforming images or doing more complex
>>> operations, and each has various design tradeoffs that make them better as
>>> standalone solutions.
>>> > This project is a joint collaboration between Microsoft and
>>> Databricks, which have been testing this design in two open source
>>> packages: MMLSpark and Deep Learning Pipelines.
>>> > The proposed image format is an in-memory, decompressed representation
>>> that targets low-level applications. It is significantly more liberal

[VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-21 Thread Tim Hunter
Hello community,

I would like to call for a vote on SPARK-21866. It is a short proposal that
has important applications for image processing and deep learning. Joseph
Bradley has offered to be the shepherd.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
PDF version: https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf

Background and motivation

As Apache Spark is being used more and more in the industry, some new use
cases are emerging for different data formats beyond the traditional SQL
types or the numerical types (vectors and matrices). Deep Learning
applications commonly deal with image processing. A number of projects add
some Deep Learning capabilities to Spark (see list below), but they
struggle to communicate with each other or with MLlib pipelines because
there is no standard way to represent an image in Spark DataFrames. We
propose to federate efforts for representing images in Spark by defining a
representation that caters to the most common needs of users and library
developers.

This SPIP proposes a specification to represent images in Spark DataFrames
and Datasets (based on existing industrial standards), and an interface for
loading sources of images. It is not meant to be a full-fledged image
processing library, but rather the core description that other libraries
and users can rely on. Several packages already offer various processing
facilities for transforming images or doing more complex operations, and
each has various design tradeoffs that make them better as standalone
solutions.

This project is a joint collaboration between Microsoft and Databricks,
which have been testing this design in two open source packages: MMLSpark
and Deep Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that
targets low-level applications. It is significantly more liberal in memory
usage than compressed image representations such as JPEG, PNG, etc., but it
allows easy communication with popular image processing libraries and has
no decoding overhead.
Target users and personas:

Data scientists, data engineers, library developers.
The following libraries define primitives for loading and representing
images, and will gain from a common interchange format (in alphabetical
order):

   - BigDL
   - DeepLearning4J
   - Deep Learning Pipelines
   - MMLSpark
   - TensorFlow (Spark connector)
   - TensorFlowOnSpark
   - TensorFrames
   - Thunder

Goals:

   - Simple representation of images in Spark DataFrames, based on
   pre-existing industrial standards (OpenCV)
   - This format should eventually allow the development of
   high-performance integration points with image processing libraries such as
   libOpenCV, Google TensorFlow, CNTK, and other C libraries.
   - The reader should be able to read popular formats of images from
   distributed sources.

Non-Goals:

Images are a versatile medium and encompass a very wide range of formats
and representations. This SPIP explicitly aims at the most common use case
in the industry currently: multi-channel matrices of binary, int32, int64,
float or double data that can fit comfortably in the heap of the JVM:

   - the total size of an image should be restricted to less than 2GB
   (roughly)
   - the meaning of color channels is application-specific and is not
   mandated by the standard (in line with the OpenCV standard)
   - specialized formats used in meteorology, the medical field, etc. are
   not supported
   - this format is specialized to images and does not attempt to solve the
   more general problem of representing n-dimensional tensors in Spark

Proposed API changes

We propose to add a new package in the package structure, under the MLlib
project:
org.apache.spark.image
Data format

We propose to add the following structure:

imageSchema = StructType([

   - StructField("mode", StringType(), False),
  - The exact representation of the data.
  - The values are described in the following OpenCV convention.
  Basically, the type has both "depth" and "number of channels" info: in
  particular, type "CV_8UC3" means "3 channel unsigned bytes". BGRA format
  would be CV_8UC4 (value 32 in the table) with the channel order specified
  by convention.
  - The exact channel ordering and meaning of each channel is dictated
  by convention. By default, the order is RGB (3 channels) and BGRA (4
  channels).
  - If the image failed to load, the value is the empty string "".


   - StructField("origin", StringType(), True),
  - Some information about the origin of the image. The content of this
  is application-specific.
  - When the image is loaded from files, users should expect to find
  the file name in this field.


   - StructField("height", IntegerType(), False),
  - The height of the image, in pixels.
  - If the image fails to load, the value is -1.


   - StructField("width", 
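
As a reading aid, the fields listed above can be assembled into the following
PySpark sketch; the archived message is truncated at this point, so the type
of "width" is an assumption by analogy with "height", and any remaining fields
of the struct are omitted:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

imageSchema = StructType([
    # OpenCV-style type string, e.g. "CV_8UC3"; "" if the image failed to load.
    StructField("mode", StringType(), False),
    # Application-specific origin, typically the source file name.
    StructField("origin", StringType(), True),
    # Height of the image, in pixels; -1 if the image failed to load.
    StructField("height", IntegerType(), False),
    # Width of the image, in pixels (type assumed by analogy with "height").
    StructField("width", IntegerType(), False),
])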

SPIP: SPARK-21866 Image support in Apache Spark

2017-09-05 Thread Tim Hunter
Hello community,

I would like to start a discussion about adding support for images in
Spark. We will follow up with a formal vote in two weeks. Please feel free
to comment on the JIRA ticket too.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
PDF version:
https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf

Background and motivation

As Apache Spark is being used more and more in the industry, some new use
cases are emerging for different data formats beyond the traditional SQL
types or the numerical types (vectors and matrices). Deep Learning
applications commonly deal with image processing. A number of projects add
some Deep Learning capabilities to Spark (see list below), but they
struggle to communicate with each other or with MLlib pipelines because
there is no standard way to represent an image in Spark DataFrames. We
propose to federate efforts for representing images in Spark by defining a
representation that caters to the most common needs of users and library
developers.

This SPIP proposes a specification to represent images in Spark DataFrames
and Datasets (based on existing industrial standards), and an interface for
loading sources of images. It is not meant to be a full-fledged image
processing library, but rather the core description that other libraries
and users can rely on. Several packages already offer various processing
facilities for transforming images or doing more complex operations, and
each has various design tradeoffs that make them better as standalone
solutions.

This project is a joint collaboration between Microsoft and Databricks,
which have been testing this design in two open source packages: MMLSpark
and Deep Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that
targets low-level applications. It is significantly more liberal in memory
usage than compressed image representations such as JPEG, PNG, etc., but it
allows easy communication with popular image processing libraries and has
no decoding overhead.
Target users and personas:

Data scientists, data engineers, library developers.
The following libraries define primitives for loading and representing
images, and will gain from a common interchange format (in alphabetical
order):

   - BigDL
   - DeepLearning4J
   - Deep Learning Pipelines
   - MMLSpark
   - TensorFlow (Spark connector)
   - TensorFlowOnSpark
   - TensorFrames
   - Thunder

Goals:

   - Simple representation of images in Spark DataFrames, based on
   pre-existing industrial standards (OpenCV)
   - This format should eventually allow the development of
   high-performance integration points with image processing libraries such as
   libOpenCV, Google TensorFlow, CNTK, and other C libraries.
   - The reader should be able to read popular formats of images from
   distributed sources.

Non-Goals:

Images are a versatile medium and encompass a very wide range of formats
and representations. This SPIP explicitly aims at the most common use case
in the industry currently: multi-channel matrices of binary, int32, int64,
float or double data that can fit comfortably in the heap of the JVM:

   - the total size of an image should be restricted to less than 2GB
   (roughly)
   - the meaning of color channels is application-specific and is not
   mandated by the standard (in line with the OpenCV standard)
   - specialized formats used in meteorology, the medical field, etc. are
   not supported
   - this format is specialized to images and does not attempt to solve the
   more general problem of representing n-dimensional tensors in Spark

Proposed API changes

We propose to add a new package in the package structure, under the MLlib
project:
org.apache.spark.image
Data format

We propose to add the following structure:

imageSchema = StructType([

   - StructField("mode", StringType(), False),
  - The exact representation of the data.
  - The values are described in the following OpenCV convention.
  Basically, the type has both "depth" and "number of channels" info: in
  particular, type "CV_8UC3" means "3 channel unsigned bytes". BGRA format
  would be CV_8UC4 (value 32 in the table) with the channel order specified
  by convention.
  - The exact channel ordering and meaning of each channel is dictated
  by convention. By default, the order is RGB (3 channels) and BGRA (4
  channels).
  - If the image failed to load, the value is the empty string "".


   - StructField("origin", StringType(), True),
  - Some information about the origin of the image. The content of this
  is application-specific.
  - When the image is loaded from files, users should expect to find
  the file name in this field.


   - StructField("height", IntegerType(), False),
  - The height of the image, in pixels.
  - If the image fails to load, the value is -1.


   - StructField("width", 

Re: Question on Spark's graph libraries roadmap

2017-03-13 Thread Tim Hunter
Hello Enzo,

since this question is also relevant to Spark, I will answer it here. The
goal of GraphFrames is to provide graph capabilities along with excellent
integration to the rest of the Spark ecosystem (using modern APIs such as
DataFrames). As you seem to be well aware, a large number of graph
algorithms can be implemented in terms of a small subset of graph
primitives. These graph primitives can be translated to Spark operations,
but we feel that some important low-level optimizations should be added to
the Catalyst engine in order to realize the true potential of GraphFrames.
You can find a flavor of this work in this presentation by Ankur Dave [1].
This is still an area of collaboration with the Spark core team, and we
would like to merge GraphFrames into Spark 2.x eventually.

Where does it leave us for the time being? GraphFrames is actively
supported, and we implemented a highly scalable version of GraphFrames in
November. As you mentioned, there are a number of distributed Graph
frameworks out there, but to my knowledge they are not as easy to integrate
with Spark. The current approach has been to reach parity with GraphX first
and then add new algorithms based on popular demand. Along these lines,
GraphBLAS could be added on top of it if someone is willing to step up.

Tim

[1]
https://spark-summit.org/east-2016/events/graphframes-graph-queries-in-spark-sql/

On Mon, Mar 13, 2017 at 2:58 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Since GraphFrames is not part of the Spark project, your
> GraphFrames-specific questions are probably better directed at the
> GraphFrames issue tracker:
>
> https://github.com/graphframes/graphframes/issues
>
> As far as I know, GraphFrames is an active project, though not as active
> as Spark of course. There will be lulls in development since the people
> driving that project forward also have major commitments to other projects.
> This is natural.
>
> If you post on GitHub, I would wager someone there (maybe Joseph or Tim?)
> should be able to answer your questions about GraphFrames.
>
>
>1. The page you linked refers to a *plan* to move GraphFrames to the
>standard Spark release cycle. Is this *plan* publicly available /
>visible?
>
> I didn’t see any such reference to a plan in the page I linked you to.
> Rather, the page says:
>
> The current plan is to keep GraphFrames separate from core Apache Spark
> for the time being.
>
> Nick
> ​
>
> On Mon, Mar 13, 2017 at 5:46 PM enzo 
> wrote:
>
>> Nick
>>
>> Thanks for the quick answer :)
>>
>> Sadly, the comment in the page doesn’t answer my questions. More
>> specifically:
>>
>> 1. GraphFrames' last activity on GitHub was 2 months ago, and the last release was on 12
>> Nov 2016.  Until recently, 2 months was close to a Spark release cycle.
>> Why has there been no major development since mid-November?
>>
>> 2. The page you linked refers to a *plan* to move GraphFrames to the
>> standard Spark release cycle.  Is this *plan* publicly available / visible?
>>
>> 3. I couldn’t find any statement of intent to preserve either one or the
>> other APIs, or just merge them: in other words, there seems to be no
>> overarching plan for a cohesive & comprehensive graph API (I apologise in
>> advance if I’m wrong).
>>
>> 4. I was initially impressed by GraphFrames' syntax, in places similar to
>> Neo4J Cypher (now open source), but later I understood it was an incomplete,
>> lightweight experiment (with no intention to move to full compatibility,
>> perhaps for good reasons).  To me it sort of gave the wrong message.
>>
>> 5. In the meantime, the world of graphs is changing. The GraphBLAS forum
>> seems to be gaining some traction: a library based on GraphBLAS has been made
>> available on Accumulo (Graphulo).  Assuming that Spark is NOT going to
>> adopt similar lines, nor to follow Datastax with TinkerPop and Gremlin,
>> again, what is the new,  cohesive & comprehensive API that Spark is going
>> to deliver?
>>
>>
>> Sadly, the API uncertainty may force developers to more stable kind of
>> API / platforms & roadmaps.
>>
>>
>>
>> Thanks Enzo
>>
>> On 13 Mar 2017, at 22:09, Nicholas Chammas 
>> wrote:
>>
>> Your question is answered here under "Will GraphFrames be part of Apache
>> Spark?", no?
>>
>> http://graphframes.github.io/#what-are-graphframes
>>
>> Nick
>>
>> On Mon, Mar 13, 2017 at 4:56 PM enzo 
>> wrote:
>>
>> Please see this email  trail:  no answer so far on the user@spark
>> board.  Trying the developer board for better luck
>>
>> The question:
>>
>> I am a bit confused by the current roadmap for graph and graph analytics
>> in Apache Spark.
>>
>> I understand that we have had for some time two libraries (the following
>> is my understanding - please amend as appropriate!):
>>
>> . GraphX, part of Spark 

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-02-24 Thread Tim Hunter
Regarding logging, Graphframes makes a simple wrapper this way:

https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/Logging.scala

Regarding the UDTs, they have been hidden to be reworked for Datasets, the
reasons being detailed here [1]. Can you describe your use case in more
details? You may be better off copy/pasting the UDT code outside of Spark,
depending on your use case.

[1] https://issues.apache.org/jira/browse/SPARK-14155

On Thu, Feb 23, 2017 at 3:42 PM, Joseph Bradley 
wrote:

> +1 for Nick's comment about discussing APIs which need to be made public
> in https://issues.apache.org/jira/browse/SPARK-19498 !
>
> On Thu, Feb 23, 2017 at 2:36 AM, Steve Loughran 
> wrote:
>
>>
>> On 22 Feb 2017, at 20:51, Shouheng Yi 
>> wrote:
>>
>> Hi Spark developers,
>>
>> Currently my team at Microsoft is extending Spark’s machine learning
>> functionalities to include new learners and transformers. We would like
>> users to use these within spark pipelines so that they can mix and match
>> with existing Spark learners/transformers, and overall have a native spark
>> experience. We cannot accomplish this using a non-“org.apache” namespace
>> with the current implementation, and we don’t want to release code inside
>> the apache namespace because it’s confusing and there could be naming
>> rights issues.
>>
>>
>> This isn't actually something the ASF has a strong stance against; it's more left to
>> projects themselves. After all: the source is licensed by the ASF, and the
>> license doesn't say you can't.
>>
>> Indeed, there's a bit of org.apache.hive in the Spark codebase where the
>> hive team kept stuff package private. Though that's really a sign that
>> things could be improved there.
>>
>> Where it is problematic is that stack traces end up blaming the wrong group;
>> nobody likes getting a bug report which doesn't actually exist in your
>> codebase., not least because you have to waste time to even work it out.
>>
>> You also have to expect absolutely no stability guarantees, so you'd
>> better set your nightly build to work against trunk
>>
>> Apache Bahir does put some stuff into org.apache.spark.stream, but
>> they've sort of inherited that right when they picked up the code from
>> Spark. New stuff is going into org.apache.bahir.
>>
>>
>> We need to extend several classes from spark which happen to have
>> “private[spark].” For example, one of our class extends VectorUDT[0] which
>> has private[spark] class VectorUDT as its access modifier. This
>> unfortunately puts us in a strange scenario that forces us to work under the
>> namespace org.apache.spark.
>>
>> To be specific, currently the private classes/traits we need to use to
>> create new Spark learners & Transformers are HasInputCol, VectorUDT and
>> Logging. We will expand this list as we develop more.
>>
>>
>> I do think it's a shame that logging went from public to private.
>>
>> One thing that could be done there is to copy the logging into Bahir,
>> under an org.apache.bahir package, for yourself and others to use. That'd
>> be beneficial to me too.
>>
>> For the ML stuff, that might be a place to work too, if you are going to
>> open source the code.
>>
>>
>>
>> Is there a way to avoid this namespace issue? What do other
>> people/companies do in this scenario? Thank you for your help!
>>
>>
>> I've hit this problem in the past.  Scala code tends to force your hand
>> here precisely because of that (very nice) private feature. While it offers
>> the ability of a project to guarantee that implementation details aren't
>> picked up where they weren't intended to be, in OSS dev, all that
>> implementation is visible and for lower level integration,
>>
>> What I tend to do is keep my own code in its own package and try to build as
>> thin a bridge over to it from the [private] scope as possible. It's also important to
>> name things obviously, say,  org.apache.spark.microsoft , so stack traces
>> in bug reports can be dealt with more easily
>>
>>
>> [0]: https://github.com/apache/spark/blob/master/mllib/src/
>> main/scala/org/apache/spark/ml/linalg/VectorUDT.scala
>>
>> Best,
>> Shouheng
>>
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> http://databricks.com
>


Re: Feedback on MLlib roadmap process proposal

2017-02-23 Thread Tim Hunter
As Sean wrote very nicely above, the changes made to Spark are decided in
an organic fashion based on the interests and motivations of the committers
and contributors. The case of deep learning is a good example. There is a
lot of interest, and the core algorithms could be implemented without too
much trouble in a few thousand lines of Scala code. However, the
performance of such a simple implementation would be one to two orders of
magnitude slower than what one would get from the popular frameworks out there.

At this point, there are probably more man-hours invested in TensorFlow (as
an example) than in MLlib, so I think we need to be realistic about what we
can expect to achieve inside Spark. Unlike BLAS for linear algebra, there
is no agreed-upon interface for deep learning, and each of the XOnSpark
flavors explores a slightly different design. It will be interesting to see
what works well in practice. In the meantime, though, there are plenty of
things that we could do to help developers of other libraries to have a
great experience with Spark. Matei alluded to that in his Spark Summit
keynote when he mentioned better integration with low-level libraries.

Tim


On Thu, Feb 23, 2017 at 5:32 AM, Nick Pentreath 
wrote:

> Sorry for being late to the discussion. I think Joseph, Sean and others
> have covered the issues well.
>
> Overall I like the proposed cleaned up roadmap & process (thanks Joseph!).
> As for the actual critical roadmap items mentioned on SPARK-18813, I think
> it makes sense and will comment a bit further on that JIRA.
>
> I would like to encourage votes & watching for issues to give a sense of
> what the community wants (I guess Vote is more explicit yet passive, while
> actually Watching an issue is more informative as it may indicate a real
> use case dependent on the issue?!).
>
> I think if used well this is valuable information for contributors. Of
> course not everything on that list can get done. But if I look through the
> top votes or watch list, while not all of those are likely to go in, a
> great many of the issues are fairly non-contentious in terms of being good
> additions to the project.
>
> Things like these are good examples IMO (I just sample a few of them, not
> exhaustive):
> - sample weights for RF / DT
> - multi-model and/or parallel model selection
> - make sharedParams public?
> - multi-column support for various transformers
> - incremental model training
> - tree algorithm enhancements
>
> Now, whether these can be prioritised in terms of bandwidth available to
> reviewers and committers is a totally different thing. But as Sean mentions
> there is some process there for trying to find the balance of the issue
> being a "good thing to add", a shepherd with bandwidth & interest in the
> issue to review, and the maintenance burden imposed.
>
> Let's take Deep Learning / NN for example. Here's a good example of
> something that has a lot of votes/watchers and as Sean mentions it is
> something that "everyone wants someone else to implement". In this case,
> much of the interest may in fact be "stale" - 2 years ago it would have
> been very interesting to have a strong DL impl in Spark. Now, because there
> are a plethora of very good DL libraries out there, how many of those Votes
> would be "deleted"? Granted few are well integrated with Spark but that can
> and is changing (DL4J, BigDL, the "XonSpark" flavours etc).
>
> So this is something that I dare say will not be in Spark any time in the
> foreseeable future or perhaps ever given the current status. Perhaps it's
> worth seriously thinking about just closing these kind of issues?
>
>
>
> On Fri, 27 Jan 2017 at 05:53 Joseph Bradley  wrote:
>
>> Sean has given a great explanation.  A few more comments:
>>
>> Roadmap: I have been creating roadmap JIRAs, but the goal really is to
>> have all committers working on MLlib help to set that roadmap, based on
>> either their knowledge of current maintenance/internal needs of the project
>> or the feedback given from the rest of the community.
>> @Committers - I see people actively shepherding PRs for MLlib, but I
>> don't see many major initiatives linked to the roadmap.  If there are ones
>> large enough to merit adding to the roadmap, please do.
>>
>> In general, there are many process improvements we could make.  A few in
>> my mind are:
>> * Visibility: Let the community know what committers are focusing on.
>> This was the primary purpose of the "MLlib roadmap proposal."
>> * Community initiatives: This is currently very organic.  Some of the
>> organic process could be improved, such as encouraging Votes/Watchers
>> (though I agree with Sean about these being one-sided metrics).  Cody's SIP
>> work is a great step towards adding more clarity and structure for major
>> initiatives.
>> * JIRA hygiene: Always a challenge, and always requires some manual
>> prodding.  But it's great to push for efforts on this.
>>

Re: Design document - MLlib's statistical package for DataFrames

2017-02-17 Thread Tim Hunter
Hi Brad,

this task is focusing on moving the existing algorithms, so that we
are not held up by parity issues.

Do you have some paper suggestions for cardinality? I do not think
there is a feature request on JIRA either.

Tim

On Thu, Feb 16, 2017 at 2:21 PM, bradc  wrote:
> Hi,
>
> While it is also missing in spark.mllib, I'd suggest adding cardinality as
> part of the Simple descriptive statistics for both spark.ml and spark.mllib.
> This is useful even for data in double precision FP to understand the
> "uniqueness" of the feature data.
>
> Cheers,
> Brad
>
>
>
>



Design document - MLlib's statistical package for DataFrames

2017-02-16 Thread Tim Hunter
Hello all,

I have been looking at some of the missing items for complete feature
parity between spark.ml and spark.mllib. Here is a proposal for
porting mllib.stats, the descriptive statistics package:

https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit?usp=sharing

The umbrella ticket for this task is:
https://issues.apache.org/jira/browse/SPARK-4591

Please comment on the document. Also, if you want to work on one of
the algorithms, the design doc and the umbrella ticket have subtasks
that you can assign yourself to.

The cutoff deadline for Spark 2.2 is rapidly approaching, and it would
be great if we could claim parity for this release!
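
To illustrate the kind of parity gap in question, here is a sketch contrasting
the existing RDD-based call with the DataFrame-based style the design document
targets, assuming an active SparkSession (spark) and SparkContext (sc); the
spark.ml call shown is the Correlation API that eventually landed in Spark 2.2
and may differ in detail from what the document proposes:

from pyspark.mllib.linalg import Vectors as OldVectors
from pyspark.mllib.stat import Statistics
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

# RDD-based API (spark.mllib): correlation matrix over an RDD of vectors.
rdd = sc.parallelize([OldVectors.dense(1.0, 2.0), OldVectors.dense(2.0, 1.0)])
print(Statistics.corr(rdd, method="pearson"))

# DataFrame-based API (spark.ml): the same computation over a DataFrame column.
df = spark.createDataFrame([(Vectors.dense(1.0, 2.0),),
                            (Vectors.dense(2.0, 1.0),)], ["features"])
print(Correlation.corr(df, "features", "pearson").head()[0])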

Cheers

Tim




Re: Spark Improvement Proposals

2017-01-05 Thread Tim Hunter
Hi Cody,
thank you for bringing up this topic, I agree it is very important to keep
a cohesive community around some common, fluid goals. Here are a few
comments about the current document:

1. name: it should not overlap with an existing one such as SIP. Can you
imagine someone trying to discuss a scala spore proposal for spark?
"[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
sounds great.

2. roles: at a high level, SPIPs are meant to reach consensus for technical
decisions with a lasting impact. As such, the template should emphasize the
role of the various parties during this process:

 - the SPIP author is responsible for building consensus. She is the
champion driving the process forward and is responsible for ensuring that
the SPIP follows the general guidelines. The author should be identified in
the SPIP. The authorship of a SPIP can be transferred if the current author
is not interested and someone else wants to move the SPIP forward. There
should probably be 2-3 authors at most for each SPIP.

 - someone with voting power should probably shepherd the SPIP (and be
recorded as such): ensuring that the final decision over the SPIP is
recorded (rejected, accepted, etc.), and advising about the technical
quality of the SPIP: this person need not be a champion for the SPIP or
contribute to it, but rather makes sure it stands a chance of being
approved when the vote happens. Also, if the author cannot find anyone who
would want to take this role, this proposal is likely to be rejected anyway.

 - users, committers, contributors have the roles already outlined in the
document

3. timeline: ideally, once a SPIP has been offered for voting, it should
move swiftly into either being accepted or rejected, so that we do not end
up with a distracting long tail of half-hearted proposals.

These rules are meant to be flexible, but the current document should be
clear about who is in charge of a SPIP, and the state it is currently in.

We have had long discussions over some very important questions such as
approval. I do not have an opinion on these, but why not make a pick and
reevaluate this decision later? This is not a binding process at this point.

Tim


On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger  wrote:

> I don't have a concern about voting vs consensus.
>
> I have a concern that whatever the decision making process is, it is
> explicitly announced on the ticket for the given proposal, with an explicit
> deadline, and an explicit outcome.
>
>
> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid  wrote:
>
>> I'm also in favor of this.  Thanks for your persistence Cody.
>>
>> My take on the specific issues Joseph mentioned:
>>
>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
>> earlier for consensus:
>>
>> > Majority vs consensus: My rationale is that I don't think we want to
>> consider a proposal approved if it had objections serious enough that
>> committers down-voted (or PMC depending on who gets a vote). If these
>> proposals are like PEPs, then they represent a significant amount of
>> community effort and I wouldn't want to move forward if up to half of the
>> community thinks it's an untenable idea.
>>
>> 2) Design doc template -- agree this would be useful, but also seems
>> totally orthogonal to moving forward on the SIP proposal.
>>
>> 3) agree w/ Joseph's proposal for updating the template.
>>
>> One small addition:
>>
>> 4) Deciding on a name -- minor, but I think it's worth disambiguating from
>> Scala's SIPs, and the best proposal I've heard is "SPIP".   At least, no
>> one has objected.  (don't care enough that I'd object to anything else,
>> though.)
>>
>>
>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley 
>> wrote:
>>
>>> Hi Cody,
>>>
>>> Thanks for being persistent about this.  I too would like to see this
>>> happen.  Reviewing the thread, it sounds like the main things remaining are:
>>> * Decide about a few issues
>>> * Finalize the doc(s)
>>> * Vote on this proposal
>>>
>>> Issues & TODOs:
>>>
>>> (1) The main issue I see above is voting vs. consensus.  I have little
>>> preference here.  It sounds like something which could be tailored based on
>>> whether we see too many or too few SIPs being approved.
>>>
>>> (2) Design doc template  (This would be great to have for Spark
>>> regardless of this SIP discussion.)
>>> * Reynold, are you still putting this together?
>>>
>>> (3) Template cleanups.  Listing some items mentioned above + a new one
>>> w.r.t. Reynold's draft:
>>> * Reinstate the "Where" section with links to current and past SIPs
>>> * Add field for stating explicit deadlines for approval
>>> * Add field for stating Author & Committer shepherd
>>>
>>> Thanks all!
>>> Joseph
>>>
>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger 
>>> wrote:
>>>
 

GraphFrames 0.2.0 released

2016-08-16 Thread Tim Hunter
Hello all,
I have released version 0.2.0 of the GraphFrames package. Apart from a few
bug fixes, it is the first release published for Spark 2.0 and both Scala
2.10 and 2.11. Please let us know if you have any comments or questions.

It is available as a Spark package:
https://spark-packages.org/package/graphframes/graphframes

The source code is available as always at
https://github.com/graphframes/graphframes


What is GraphFrames?

GraphFrames is a DataFrame-based graph engine for Spark. In addition to the
algorithms available in GraphX, users can write highly expressive queries
by leveraging the DataFrame API, combined with a new API for motif finding.
The user also benefits from DataFrame performance optimizations within the
Spark SQL engine.
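
As a small taste of the motif-finding API mentioned above, here is a PySpark
sketch, assuming an active SparkSession and the graphframes package on the
classpath (the exact --packages coordinates depend on your Spark and Scala
versions):

from graphframes import GraphFrame

# Vertices and edges are plain DataFrames with "id" and "src"/"dst" columns.
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
                          ["id", "name"])
e = spark.createDataFrame([("a", "b", "follows"), ("b", "a", "follows"),
                           ("b", "c", "follows")],
                          ["src", "dst", "relationship"])
g = GraphFrame(v, e)

# Motif finding: pairs of vertices that follow each other both ways.
g.find("(x)-[e1]->(y); (y)-[e2]->(x)").show()

# GraphX-style algorithms are also exposed, e.g. PageRank.
g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()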

Cheers

Tim


Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-22 Thread Tim Hunter
+1 This release passes all tests on the graphframes and tensorframes
packages.

On Wed, Jun 22, 2016 at 7:19 AM, Cody Koeninger  wrote:

> If we're considering backporting changes for the 0.8 kafka
> integration, I am sure there are people who would like to get
>
> https://issues.apache.org/jira/browse/SPARK-10963
>
> into 1.6.x as well
>
> On Wed, Jun 22, 2016 at 7:41 AM, Sean Owen  wrote:
> > Good call, probably worth back-porting, I'll try to do that. I don't
> > think it blocks a release, but would be good to get into a next RC if
> > any.
> >
> > On Wed, Jun 22, 2016 at 11:38 AM, Pete Robbins 
> wrote:
> >> This has failed on our 1.6 stream builds regularly.
> >> (https://issues.apache.org/jira/browse/SPARK-6005) looks fixed in 2.0?
> >>
> >> On Wed, 22 Jun 2016 at 11:15 Sean Owen  wrote:
> >>>
> >>> Oops, one more in the "does anybody else see this" department:
> >>>
> >>> - offset recovery *** FAILED ***
> >>>   recoveredOffsetRanges.forall(((or: (org.apache.spark.streaming.Time,
> >>> Array[org.apache.spark.streaming.kafka.OffsetRange])) =>
> >>>
> >>>
> earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time,
> >>>
> >>>
> scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1,
> >>>
> >>>
> scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.OffsetRange](or._2).toSet[org.apache.spark.streaming.kafka.OffsetRange]
> >>> was false Recovered ranges are not the same as the ones generated
> >>> (DirectKafkaStreamSuite.scala:301)
> >>>
> >>> This actually fails consistently for me too in the Kafka integration
> >>> code. Not timezone related, I think.
> >
>


Request for comments: Tensorframes, an integration library between TensorFlow and Spark DataFrames

2016-03-19 Thread Tim Hunter
Hello all,

I would like to bring your attention to a small project to integrate
TensorFlow with Apache Spark, called TensorFrames. With this library, you
can map, reduce or aggregate numerical data stored in Spark dataframes
using TensorFlow computation graphs. It is published as a Spark package and
available in this github repository:

https://github.com/tjhunter/tensorframes

More detailed examples can be found in the user guide:

https://github.com/tjhunter/tensorframes/wiki/TensorFrames-user-guide
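
For a quick flavor of the map use case, here is a minimal sketch in the style
of the user guide, assuming the tensorframes package is installed alongside
TensorFlow and that a sqlContext is available (as in the Spark shell); please
treat the exact helper names (tfs.block, tfs.map_blocks) as assumptions to be
checked against the guide linked above:

import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(x=float(i)) for i in range(10)])

with tf.Graph().as_default():
    # A TensorFlow placeholder corresponding to the DataFrame column 'x';
    # its shape is inferred from the DataFrame.
    x = tfs.block(df, "x")
    # The computation graph: add 3 to every element of the column.
    z = tf.add(x, 3, name="z")
    # Run the graph over the DataFrame, producing a new column 'z'.
    df2 = tfs.map_blocks(z, df)

# The transform is lazy, like most DataFrame operations; this triggers it.
df2.show()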

This is a technical preview at this point. I am looking forward to some
feedback about the current python API if some adventurous users want to try
it out. Of course, contributions are most welcome, for example to fix bugs
or to add support for platforms other than linux-x86_64. It should support
all the most common inputs in dataframes (dense tensors of rank 0, 1, 2 of
ints, longs, floats and doubles).

Please note that this is not an endorsement by Databricks of TensorFlow, or
any other deep learning framework for that matter. If users want to use
deep learning in production, some other more robust solutions are
available: SparkNet, CaffeOnSpark, DeepLearning4J.

Best regards


Tim Hunter


Introducing spark-sklearn, a scikit-learn integration package for Spark

2016-02-10 Thread Tim Hunter
Hello community,
Joseph and I would like to introduce a new Spark package that should
be useful for Python users who depend on scikit-learn.

Among other tools:
 - train and evaluate multiple scikit-learn models in parallel.
 - convert Spark's DataFrames seamlessly into numpy arrays
 - (experimental) distribute Scipy's sparse matrices as a dataset of
sparse vectors.

Spark-sklearn focuses on problems that have a small amount of data and
that can be run in parallel. Note this package distributes simple
tasks like grid-search cross-validation. It does not distribute
individual learning algorithms (unlike Spark MLlib).
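
As a minimal sketch of the grid-search use case, assuming the package is
installed (e.g. via pip) and a SparkContext sc is available; the drop-in
GridSearchCV wrapper that takes the SparkContext as its first argument is the
pattern described in the blog post linked below:

from sklearn import datasets, svm
from spark_sklearn import GridSearchCV

iris = datasets.load_iris()
param_grid = {"kernel": ("linear", "rbf"), "C": [1, 10]}

# Same interface as scikit-learn's grid search, but the candidate models are
# trained and evaluated in parallel across the Spark cluster.
clf = GridSearchCV(sc, svm.SVC(), param_grid)
clf.fit(iris.data, iris.target)
print(clf.best_params_)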

If you want to use it, see instructions on the package page:
https://github.com/databricks/spark-sklearn

This blog post contains more details:
https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html

Let us know if you have any questions. Also, documentation or code
contributions are much welcome (Apache 2.0 license).

Cheers

Tim
