Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
Indeed, though the possible advantage is that, in theory, you can have a
different release cycle than the main repo (I am not sure if that's
feasible in practice, or if that was the intention).

I guess it all depends on how we envision the future of the annotations
(including, but not limited to, how conservative we want to be in the
future), which is probably something that should be discussed here.

On 8/4/20 11:06 PM, Felix Cheung wrote:
> So IMO maintaining outside in a separate repo is going to be harder.
> That was why I asked.
>
>
>  
> ----
> *From:* Maciej Szymkiewicz 
> *Sent:* Tuesday, August 4, 2020 12:59 PM
> *To:* Sean Owen
> *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau;
> Spark Dev List
> *Subject:* Re: [PySpark] Revisiting PySpark type annotations
>  
>
> On 8/4/20 9:35 PM, Sean Owen wrote:
> > Yes, but the general argument you make here is: if you tie this
> > project to the main project, it will _have_ to be maintained by
> > everyone. That's good, but also exactly I think the downside we want
> > to avoid at this stage (I thought?) I understand for some
> > undertakings, it's just not feasible to start outside the main
> > project, but is there no proof of concept even possible before taking
> > this step -- which more or less implies it's going to be owned and
> > merged and have to be maintained in the main project.
>
>
> I think we have a somewhat different understanding here ‒ I believe we
> have reached the conclusion that maintaining annotations within the
> project is OK; we only differ when it comes to the specific form it
> should take.
>
> As for the POC ‒ we have stubs, which have been maintained for over
> three years now and cover versions from 2.3 (though these are fairly
> limited) up to, with some lag, the current master. There is some
> evidence they are used in the wild
> (https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
> there are a few contributors
> (https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
> least some use cases (https://stackoverflow.com/q/40163106/). So,
> subjectively speaking, it seems we're already beyond the POC stage.
>
> -- 
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>
>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC





Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz

On 8/4/20 9:35 PM, Sean Owen wrote:
> Yes, but the general argument you make here is: if you tie this
> project to the main project, it will _have_ to be maintained by
> everyone. That's good, but also exactly I think the downside we want
> to avoid at this stage (I thought?) I understand for some
> undertakings, it's just not feasible to start outside the main
> project, but is there no proof of concept even possible before taking
> this step -- which more or less implies it's going to be owned and
> merged and have to be maintained in the main project.


I think we have a somewhat different understanding here ‒ I believe we
have reached the conclusion that maintaining annotations within the
project is OK; we only differ when it comes to the specific form it
should take.

As for the POC ‒ we have stubs, which have been maintained for over
three years now and cover versions from 2.3 (though these are fairly
limited) up to, with some lag, the current master. There is some
evidence they are used in the wild
(https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
there are a few contributors
(https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
least some use cases (https://stackoverflow.com/q/40163106/). So,
subjectively speaking, it seems we're already beyond the POC stage.

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC






Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
*First of all why ASF ownership? *

For a project of this size, maintaining high-quality annotations
independently of the actual codebase is far from trivial (it is not hard
to use stubgen or monkeytype, but the resulting annotations are rather
simplistic). For starters, changes that are mostly transparent to the
end user (like the pyspark.ml changes in 3.0 / 3.1) might require
significant changes in the annotations. Additionally, some signature
changes are rather hard to track, and such separation can easily lead to
divergence.

Additionally, annotations are as much about describing facts as about
showing intended usage (the simplest use case is documenting argument
dependencies). This makes the annotation process rather subjective and
requires a good understanding of the author's intention.
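
For illustration, a minimal, hypothetical stub (.pyi) sketch of what
"documenting argument dependencies" can look like (the class and
parameters below are made up, not actual PySpark signatures):

    from typing import List, overload

    class SomeClassifier:
        # The overloads record that threshold and thresholds are mutually
        # exclusive, something the runtime signature alone cannot express.
        @overload
        def setParams(self, *, threshold: float) -> "SomeClassifier": ...
        @overload
        def setParams(self, *, thresholds: List[float]) -> "SomeClassifier": ...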

Finally, annotation-friendly signatures require conscious decisions (see
for example https://github.com/python/mypy/issues/5621).

Overall, ASF ownership is probably the best way to ensure long-term
sustainability and quality of annotations.

*Now, why separate repo?*

Based on the discussion so far it is clear that there is no consensus
about using inline annotations. There are three other options:

  * Stub files packaged alongside actual code.
  * Separate project within root, packaged separately.
  * Separate repository, packaged separately.

As already pointed out here and in the comments to
https://github.com/apache/spark/pull/29180, annotations are still
somewhat unstable. The ecosystem evolves quickly, and new features keep
appearing, some with the potential to fundamentally change the way we
annotate code.

Therefore, it might be beneficial to maintain a subproject (for lack of
a better word) that can evolve faster than the code being annotated.

While I have no strong opinion about this part, it is definitely a
relatively unobtrusive way of bringing code and annotations closer
together.

On 8/4/20 7:44 PM, Sean Owen wrote:

> Maybe more specifically, why an ASF repo?
>
> On Tue, Aug 4, 2020 at 11:45 AM Felix Cheung  
> wrote:
>> What would be the reason for separate git repo?
>>
>> 
>> From: Hyukjin Kwon 
>> Sent: Monday, August 3, 2020 1:58:55 AM
>> To: Maciej Szymkiewicz 
>> Cc: Driesprong, Fokko ; Holden Karau 
>> ; Spark Dev List 
>> Subject: Re: [PySpark] Revisiting PySpark type annotations
>>
>> Okay, seems like we can create a separate repo as apache/spark? e.g.) 
>> https://issues.apache.org/jira/browse/INFRA-20470
>> We can also think about porting the files as are.
>> I will try to have a short sync with the author Maciej, and share what we 
>> discussed offline.
>>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC





Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On Wednesday, July 22, 2020, Driesprong, Fokko wrote:

> That's probably one-time overhead so it is not a big issue.  In my
> opinion, a bigger one is possible complexity. Annotations tend to introduce
> a lot of cyclic dependencies in Spark codebase. This can be addressed, but
> don't look great.
>
>
> This is not true (anymore). With Python 3.6 you can add string annotations
> -> 'DenseVector', and in the future with Python 3.7 this is fixed by having
> postponed evaluation: https://www.python.org/dev/peps/pep-0563/
>

As far as I recall, the linked PEP addresses forward references, not
cyclic dependencies, which weren't a big issue in the first place.

What I mean is the actually cyclic stuff ‒ for example, pyspark.context
depends on pyspark.rdd and the other way around. These dependencies are
not explicit at the moment.
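
For illustration only, a sketch (not the actual sources) of how such a
cycle can be made explicit for the type checker without introducing a
runtime import cycle, using a TYPE_CHECKING guard plus a string
(forward) annotation:

    # pyspark/context.py (sketch)
    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        # seen by the type checker only, so no import cycle at runtime
        from pyspark.rdd import RDD

    class SparkContext:
        def range(self, end: int) -> "RDD[int]":  # string annotation
            ...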



> Merging stubs into project structure from the other hand has almost no
> overhead.
>
>
> This feels awkward to me, this is like having the docstring in a separate
> file. In my opinion you want to have the signatures and the functions
> together for transparency and maintainability.
>
>
I guess that's a matter of preference. From a maintainability
perspective it is actually much easier to have separate objects.

For example, there are different types of objects that are required for
meaningful checking but don't really exist in the actual code
(protocols, aliases, code-generated signatures for complex overloads),
as well as some monkey-patched entities.

Additionally, it is often easier to spot inconsistencies when the typing
is kept separate.
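
For example (a rough sketch; the names are close to, but not necessarily
identical with, what the stubs actually define):

    from typing import Protocol, Union  # Protocol: typing_extensions on Python < 3.8
    from pyspark.sql import Column

    # alias that exists only for checking: many SQL functions accept a
    # Column or its name
    ColumnOrName = Union[Column, str]

    class SupportsOpen(Protocol):
        # structural ("duck") type, roughly matching foreach() writer objects
        def open(self, partition_id: int, epoch_id: int) -> bool: ...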

However, I am not implying that this should be a persistent state.

In general, I see two non-breaking paths here:

- Merge pyspark-stubs as a separate subproject within the main Spark
repo, keep it in sync there with a common CI pipeline, and transfer
ownership of the PyPI package to the ASF.
- Move the stubs directly into python/pyspark and then apply individual
stubs to the modules of choice.

Of course, the first proposal could be an initial step for the latter
one.


>
> I think DBT is a very nice project where they use annotations very well:
> https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py
>
> Also, they left out the types in the docstring, since they are available
> in the annotations itself.
>
>

> In practice, the biggest advantage is actually support for completion, not
> type checking (which works in simple cases).
>
>
> Agreed.
>
> Would you be interested in writing up the Outreachy proposal for work on
> this?
>
>
> I would be, and also happy to mentor. But, I think we first need to agree
> as a Spark community if we want to add the annotations to the code, and in
> which extend.
>





> At some point (in general when things are heavy in generics, which is the
> case here), annotations become somewhat painful to write.
>
>
> That's true, but that might also be a pointer that it is time to refactor
> the function/code :)
>

That might be the case, but it is more often a matter of capturing
useful properties, combined with the requirement to keep things in sync
with the Scala counterparts.



> For now, I tend to think adding type hints to the codes make it difficult
> to backport or revert and more difficult to discuss about typing only
> especially considering typing is arguably premature yet.
>
>
> This feels a bit weird to me, since you want to keep this in sync right?
> Do you provide different stubs for different versions of Python? I had to
> look up the literals: https://www.python.org/dev/peps/pep-0586/
>

I think it is more about portability between Spark versions

>
>
> Cheers, Fokko
>

> On Wed, 22 Jul 2020 at 09:40, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:
>
>>
>> On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
>> > For now, I tend to think adding type hints to the codes make it
>> > difficult to backport or revert and
>> > more difficult to discuss about typing only especially considering
>> > typing is arguably premature yet.
>>
>> About being premature ‒ since the typing ecosystem evolves much faster
>> than Spark, it might be preferable to keep annotations as a separate
>> project (preferably under the ASF / Spark umbrella). It allows for
>> faster iterations and supporting new features (for example, Literals
>> proved to be very useful) without waiting for the next Spark release.
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> Keybase: https://keybase.io/zero323
>> Gigs: https://www.codementor.io/@zero323
>> PGP: A30CEF0C31A501EC
>>
>>
>>

-- 

Best regards,
Maciej Szymkiewicz


Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
> For now, I tend to think adding type hints to the codes make it
> difficult to backport or revert and
> more difficult to discuss about typing only especially considering
> typing is arguably premature yet.

About being premature ‒ since the typing ecosystem evolves much faster
than Spark, it might be preferable to keep annotations as a separate
project (preferably under the ASF / Spark umbrella). It allows for
faster iterations and supporting new features (for example, Literals
proved to be very useful) without waiting for the next Spark release.

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC






Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

On 7/21/20 9:40 PM, Holden Karau wrote:
> Yeah I think this could be a great project now that we're only Python
> 3.5+. One potential is making this an Outreachy project to get more
> folks from different backgrounds involved in Spark.

I am honestly not sure if that's really the case.

At the moment I maintain an almost complete set of annotations for the
project. These could be ported in a single step with relatively little
effort.

As for further maintenance ‒ this will have to be done alongside the
codebase changes to keep things in sync, so if outreach means
low-hanging fruit, it is unlikely to serve this purpose.

Additionally, there are at least two considerations:

  * At some point (in general, when things are heavy in generics, which
is the case here), annotations become somewhat painful to write.
  * In the ideal case, API design has to be linked (to a reasonable
extent) with annotation design ‒ not every signature can be annotated in
a meaningful way, which is already a problem with some chunks of the
Spark code.

>
> On Tue, Jul 21, 2020 at 12:33 PM Driesprong, Fokko
>  wrote:
>
> Since we've recently dropped support for Python <=3.5
> <https://github.com/apache/spark/pull/28957>, I think it would be
> nice to add support for type annotations. Having this in the main
> repository allows us to do type checking using MyPy
> <http://mypy-lang.org/> in the CI itself.
>
> This is now handled by the Stub
> file: https://www.python.org/dev/peps/pep-0484/#stub-files However
> I think it is nicer to integrate the types with the code itself to
> keep everything in sync, and make it easier for the people who
> work on the codebase itself. A first step would be to move the
> stubs into the codebase. First step would be to cover the public
> API which is the most important one. Having the types with the
> code itself makes it much easier to understand. For example, if
> you can supply a str or column
> here: 
> https://github.com/apache/spark/pull/29122/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2486
>
> One of the implications would be that future PR's on Python should
> cover annotations on the public API's. Curious what the rest of
> the community thinks.
>
> Cheers, Fokko
>
>
>
>
>
>
>
>
>
> On Tue, 21 Jul 2020 at 20:04, zero323 <mszymkiew...@gmail.com> wrote:
>
> Given a discussion related to  SPARK-32320 PR
> <https://github.com/apache/spark/pull/29122>   I'd like to
> resurrect this
> thread. Is there any interest in migrating annotations to the main
> repository?
>
>
>
> --
> Sent from:
> http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> <mailto:dev-unsubscr...@spark.apache.org>
>
>
>
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark,
> etc.): https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC





Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
 it easier for
> the people who work on the codebase itself. A first step
> would be to move the stubs into the codebase. First step
> would be to cover the public API which is the most
> important one. Having the types with the code itself makes
> it much easier to understand. For example, if you can
> supply a str or column
> here: 
> https://github.com/apache/spark/pull/29122/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2486
>
> One of the implications would be that future PR's on
> Python should cover annotations on the public API's.
> Curious what the rest of the community thinks.
>
> Cheers, Fokko
>
>
>
>
>
>
>
>
>
> On Tue, 21 Jul 2020 at 20:04, zero323 <mszymkiew...@gmail.com> wrote:
>
> Given a discussion related to  SPARK-32320 PR
> <https://github.com/apache/spark/pull/29122>   I'd
> like to resurrect this
> thread. Is there any interest in migrating annotations
> to the main
> repository?
>
>
>
> --
> Sent from:
> http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> 
> -
> To unsubscribe e-mail:
>     dev-unsubscr...@spark.apache.org
> <mailto:dev-unsubscr...@spark.apache.org>
>
>
>
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark,
> etc.): https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC





Re: Scala vs PySpark Inconsistency: SQLContext/SparkSession access from DataFrame/DataSet

2020-03-18 Thread Maciej Szymkiewicz
Hi Ben,

Please note that `_sc` is not a SQLContext. It is a SparkContext, which
is used primarily for internal calls.

SQLContext is exposed through `sql_ctx`
(https://github.com/apache/spark/blob/8bfaa62f2fcc942dd99a63b20366167277bce2a1/python/pyspark/sql/dataframe.py#L80)
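
A quick illustration from the PySpark side (a sketch assuming Spark
2.x / 3.0 internals and an active SparkSession named `spark`):

    df = spark.range(1)
    df._sc                   # SparkContext, internal, hence the underscore
    df.sql_ctx               # SQLContext, the exposed wrapper
    df.sql_ctx.sparkSession  # and from there the owning SparkSession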

On 3/17/20 5:53 PM, Ben Roling wrote:
> I tried this on the users mailing list but didn't get traction.  It's
> probably more appropriate here anyway.
>
> I've noticed that DataSet.sqlContext is public in Scala but the
> equivalent (DataFrame._sc) in PySpark is named as if it should be
> treated as private.
>
> Is this intentional?  If so, what's the rationale?  If not, then it
> feels like a bug and DataFrame should have some form of public access
> back to the context/session.  I'm happy to log the bug but thought I
> would ask here first.  Thanks!

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: C095AA7F33E6123A






Re: Apache Spark Docker image repository

2020-02-06 Thread Maciej Szymkiewicz

On 2/6/20 2:53 AM, Jiaxin Shan wrote:
> I will vote for this. It's pretty helpful to have managed Spark
> images. Currently, user have to download Spark binaries and build
> their own. 
> With this supported, user journey will be simplified and we only need
> to build an application image on top of base image provided by community. 
>
> Do we have different OS or architecture support? If not, there will be
> Java, R, Python total 3 container images for every release.

Well, technically speaking, there are 3 non-deprecated Python versions
(4 if you count PyPy), 3 non-deprecated R versions, luckily only one
non-deprecated Scala version, and possible variations of the JDK. The
latest and greatest are not necessarily the most popular and useful.

That's on top of native dependencies like BLAS (possibly in different
flavors, and accounting for the break in netlib-java development),
libparquet and libarrow.

Not all of these must be generated, but complexity grows pretty fast,
especially when native dependencies are involved. It gets worse if you
actually want to support Spark builds and tests ‒ for example, to build
and fully test SparkR you need half of the universe, including some
awkward LaTeX style patches and such
(https://github.com/zero323/sparkr-build-sandbox).

And even without that, images tend to grow pretty large.

A few years back Elias <https://github.com/eliasah> and I experimented
with the idea of generating different sets of Dockerfiles ‒
https://github.com/spark-in-a-box/spark-in-a-box ‒ though the intended
use cases were rather different (mostly quick setup of testbeds). The
project has been inactive for a while, with some private patches to fit
this or that use case.

>
> On Wed, Feb 5, 2020 at 2:56 PM Sean Owen  <mailto:sro...@gmail.com>> wrote:
>
> What would the images have - just the image for a worker?
> We wouldn't want to publish N permutations of Python, R, OS, Java,
> etc.
> But if we don't then we make one or a few choices of that combo, and
> then I wonder how many people find the image useful.
> If the goal is just to support Spark testing, that seems fine and
> tractable, but does it need to be 'public' as in advertised as a
> convenience binary? vs just some image that's hosted somewhere for the
> benefit of project infra.
>
> On Wed, Feb 5, 2020 at 12:16 PM Dongjoon Hyun
> mailto:dongjoon.h...@gmail.com>> wrote:
> >
> > Hi, All.
> >
> > From 2020, shall we have an official Docker image repository as
> an additional distribution channel?
> >
> > I'm considering the following images.
> >
> >     - Public binary release (no snapshot image)
> >     - Public non-Spark base image (OS + R + Python)
> >       (This can be used in GitHub Action Jobs and Jenkins K8s
> Integration Tests to speed up jobs and to have more stabler
> environments)
> >
> > Bests,
> > Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> <mailto:dev-unsubscr...@spark.apache.org>
>
>
>
> -- 
> Best Regards!
> Jiaxin Shan
> Tel:  412-230-7670
> Address: 470 2nd Ave S, Kirkland, WA
>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: C095AA7F33E6123A





Re: [DISCUSS] PostgreSQL dialect

2019-11-26 Thread Maciej Szymkiewicz
I think it is important to distinguish between two different concepts:

  * Adherence to standards and their well established implementations.
  * Enabling migrations from some product X to Spark.

While these two problems are related, they are independent, and one can
be achieved without the other.

  * The former approach doesn't imply that all features of the SQL
standard (or of its specific implementation) are provided. It is
sufficient that the commonly used features that are implemented are
standard compliant. Therefore, if an end user applies some well-known
pattern, things will work as expected.

In my personal opinion that's something that is worth the required
development resources, and in general should happen within the project.

  * The latter one is more complicated. First of all, the premise that
one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
While both Spark and PostgreSQL evolve, and probably have more in
common today than a few years ago, they're not even close enough to
pretend that one can be a replacement for the other. In contrast,
existing compatibility layers between major vendors make sense,
because feature disparity (at least when it comes to core
functionality) is usually minimal. And that doesn't even touch the
problem that PostgreSQL provides extensively used extension points
that enable a broad and evolving ecosystem (what should we do about
continuous queries? Should Structured Streaming provide some
compatibility layer as well?).

More realistically, Spark could provide a compatibility layer with
some analytical tools that themselves provide some PostgreSQL
compatibility, but these are not always fully compatible with
upstream PostgreSQL, nor do they necessarily follow the latest
PostgreSQL development.

Furthermore, a compatibility layer can be, within certain limits
(i.e. the availability of the required primitives), maintained as a
separate project, without putting more strain on existing resources.
Effectively, what we care about here is whether we can translate a
certain SQL string into a logical or physical plan.


On 11/26/19 3:26 PM, Wenchen Fan wrote:
> Hi all,
>
> Recently we start an effort to achieve feature parity between Spark
> and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>
> This goes very well. We've added many missing features(parser rules,
> built-in functions, etc.) to Spark, and also corrected several
> inappropriate behaviors of Spark to follow SQL standard and
> PostgreSQL. Many thanks to all the people that contribute to it!
>
> There are several cases when adding a PostgreSQL feature:
> 1. Spark doesn't have this feature: just add it.
> 2. Spark has this feature, but the behavior is different:
>     2.1 Spark's behavior doesn't make sense: change it to follow SQL
> standard and PostgreSQL, with a legacy config to restore the behavior.
>     2.2 Spark's behavior makes sense but violates SQL standard: change
> the behavior to follow SQL standard and PostgreSQL, when the ansi mode
> is enabled (default false).
>     2.3 Spark's behavior makes sense and doesn't violate SQL standard:
> adds the PostgreSQL behavior under the PostgreSQL dialect (default is
> Spark native dialect).
>
> The PostgreSQL dialect itself is a good idea. It can help users to
> migrate PostgreSQL workloads to Spark. Other databases have this
> strategy too. For example, DB2 provides an oracle dialect
> .
>
> However, there are so many differences between Spark and PostgreSQL,
> including SQL parsing, type coercion, function/operator behavior, data
> types, etc. I'm afraid that we may spend a lot of effort on it, and
> make the Spark codebase pretty complicated, but still not able to
> provide a usable PostgreSQL dialect.
>
> Furthermore, it's not clear to me how many users have the requirement
> of migrating PostgreSQL workloads. I think it's much more important to
> make Spark ANSI-compliant first, which doesn't need that much of work.
>
> Recently I've seen multiple PRs adding PostgreSQL cast functions,
> while our own cast function is not ANSI-compliant yet. This makes me
> think that, we should do something to properly prioritize ANSI mode
> over other dialects.
>
> Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
> from the codebase before it's too late. Curently we only have 3
> features under PostgreSQL dialect:
> 1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are
> also allowed as true string.
> 2. `date - date`  returns interval in Spark (SQL standard behavior),
> but return int in PostgreSQL
> 3. `int / int` returns double in Spark, but returns int in PostgreSQL.
> (there is no standard)
>
> We should still add PostgreSQL features that Spark doesn't have, or
> Spark's behavior violates SQL standard. But for others, let's just
> update the 

Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-30 Thread Maciej Szymkiewicz
Could we upgrade to PyPy3.6 v7.2.0?

On 10/30/19 9:45 PM, Shane Knapp wrote:
> one quick thing:  we currently test against python2.7, 3.6 *and*
> pypy2.5.1 (python2.7).
>
> what are our plans for pypy?
>
>
> On Wed, Oct 30, 2019 at 12:26 PM Dongjoon Hyun
> mailto:dongjoon.h...@gmail.com>> wrote:
>
> Thank you all. I made a PR for that.
>
> https://github.com/apache/spark/pull/26326
>
> On Tue, Oct 29, 2019 at 5:45 AM Takeshi Yamamuro
> mailto:linguin@gmail.com>> wrote:
>
> +1, too.
>
> On Tue, Oct 29, 2019 at 4:16 PM Holden Karau
> mailto:hol...@pigscanfly.ca>> wrote:
>
> +1 to deprecating but not yet removing support for 3.6
>
> On Tue, Oct 29, 2019 at 3:47 AM Shane Knapp
> mailto:skn...@berkeley.edu>> wrote:
>
> +1 to testing the absolute minimum number of python
> variants as possible.  ;)
>
> On Mon, Oct 28, 2019 at 7:46 PM Hyukjin Kwon
> mailto:gurwls...@gmail.com>> wrote:
>
> +1 from me as well.
>
> On Tue, Oct 29, 2019 at 5:34 AM, Xiangrui Meng <m...@databricks.com> wrote:
>
> +1. And we should start testing 3.7 and maybe
> 3.8 in Jenkins.
>
> On Thu, Oct 24, 2019 at 9:34 AM Dongjoon Hyun
>  <mailto:dongjoon.h...@gmail.com>> wrote:
>
> Thank you for starting the thread.
>
> In addition to that, we currently are
> testing Python 3.6 only in Apache Spark
> Jenkins environment.
>
> Given that Python 3.8 is already out and
> Apache Spark 3.0.0 RC1 will start next January
> (https://spark.apache.org/versioning-policy.html),
> I'm +1 for the deprecation (Python < 3.6)
> at Apache Spark 3.0.0.
>
> It's just a deprecation to prepare the
>     next-step development cycle.
> Bests,
> Dongjoon.
>
>
> On Thu, Oct 24, 2019 at 1:10 AM Maciej
> Szymkiewicz  <mailto:mszymkiew...@gmail.com>> wrote:
>
> Hi everyone,
>
> While deprecation of Python 2 in 3.0.0
> has been announced
> 
> <https://spark.apache.org/news/plan-for-dropping-python-2-support.html>,
> there is no clear statement about
> specific continuing support of
> different Python 3 version.
>
> Specifically:
>
>   * Python 3.4 has been retired this year.
>   * Python 3.5 is already in the
> "security fixes only" mode and
> should be retired in the middle of
> 2020.
>
> Continued support of these two blocks
> adoption of many new Python features
> (PEP 468)  and it is hard to justify
> beyond 2020.
>
> Should these two be deprecated in
> 3.0.0 as well?
>
> -- 
> Best regards,
> Maciej
>
>
>
> -- 
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark,
> etc.): https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
>
> -- 
> ---
> Takeshi Yamamuro
>
>
>
> -- 
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu

-- 
Best regards,
Maciej



[DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-24 Thread Maciej Szymkiewicz
Hi everyone,

While the deprecation of Python 2 in 3.0.0 has been announced,
there is no clear statement about continuing support of specific
Python 3 versions.

Specifically:

  * Python 3.4 has been retired this year.
  * Python 3.5 is already in the "security fixes only" mode and should
be retired in the middle of 2020.

Continued support of these two blocks the adoption of many new Python
features (PEP 468), and it is hard to justify beyond 2020.

Should these two be deprecated in 3.0.0 as well?

-- 
Best regards,
Maciej



Is SPARK-9961 is still relevant?

2019-10-05 Thread Maciej Szymkiewicz
Hi everyone,

I just encountered SPARK-9961
 which seems to be
largely outdated today.

In the latest releases, the majority of models compute different
evaluation metrics, exposed later through the corresponding summaries.
At the same time, such a defaultEvaluator has little potential for being
integrated with the Spark ML tuning tools, which depend on input and
output columns, not on an input estimator.

Assuming that the answer is negative, and this ticket can be closed,
should the DeveloperApi annotations be removed? If I understand this
ticket correctly, the planned defaultEvaluator was the primary reason to
use such an annotation there.

-- 
Best regards,
Maciej



Re: Introduce FORMAT clause to CAST with SQL:2016 datetime patterns

2019-03-20 Thread Maciej Szymkiewicz
One concern here is the introduction of a second formatting convention.

This can not only cause confusion among users, but also result in some
hard-to-spot bugs when a wrong format, with a different meaning, is
used. This is already a problem for Python and R users, with week-year
and month/minute mix-ups popping up from time to time.
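
A plain Python illustration of the kind of mix-up meant here (on the JVM
side the analogue is SimpleDateFormat's "YYYY" week year vs "yyyy"
calendar year); %G requires Python 3.6+:

    from datetime import datetime

    d = datetime(2018, 12, 31, 10, 5)
    d.strftime("%Y-%m-%d")  # '2018-12-31', calendar year
    d.strftime("%G-%m-%d")  # '2019-12-31', ISO week-based year
    d.strftime("%H:%M")     # '10:05', %M is minutes
    d.strftime("%H:%m")     # '10:12', %m is the month, a classic typo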

On Wed, 20 Mar 2019 at 10:53, Gabor Kaszab  wrote:

> Hey Hive and Spark communities,
> [dev@impala in cc]
>
> I'm working on an Impala improvement to introduce the FORMAT clause within
> CAST() operator and to implement ISO SQL:2016 datetime pattern support for
> this new FORMAT clause:
> https://issues.apache.org/jira/browse/IMPALA-4018
>
> One example of the new format:
> SELECT(CAST("2018-01-02 09:15" as timestamp FORMAT "-MM-DD HH12:MI"));
>
> I have put together a document for my proposal of how to do this in Impala
> and what patterns we plan to support to cover the SQL standard and what
> additional patterns we propose to support on top of the standard's
> recommendation.
>
> https://docs.google.com/document/d/1V7k6-lrPGW7_uhqM-FhKl3QsxwCRy69v2KIxPsGjc1k/
>
> The reason I share this with the Hive and Spark communities because I feel
> it would be nice that these systems were in line with the Impala
> implementation. So I'd like to involve these communities to the planning
> phase of this task so that everyone can share their opinion about whether
> this make sense in the proposed form.
> Eventually I feel that each of these systems should have the SQL:2016
> datetime format and I think it would be nice to have it with a newly
> introduced CAST(..FORMAT..) clause.
>
> I would like to ask members from both Hive and Spark to take a look at my
> proposal and share their opinion from their own component's perspective. If
> we get on the same page I'll eventually open Jiras to cover this
> improvement for each mentioned systems.
>
> Cheers,
> Gabor
>
>
>
>

-- 

Regards,
Maciej


Re: Feature request: split dataset based on condition

2019-02-03 Thread Maciej Szymkiewicz
If the goal is to split the output, then `DataFrameWriter.partitionBy`
should do what you need, and no additional methods are required. If not,
you can also check Silex's muxPartitions implementation (see
https://stackoverflow.com/a/37956034), but the applications are rather
limited due to high resource usage.
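
A minimal sketch of the partitionBy route (the column name and path are
made up, and an active `spark` session is assumed):

    # one subdirectory per distinct value of the split column
    df.write.partitionBy("label").parquet("/tmp/split_output")
    # each split can then be read back on its own, with partition pruning
    spark.read.parquet("/tmp/split_output").where("label = 'a'")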

On Sun, 3 Feb 2019 at 15:41, Sean Owen  wrote:

> I don't think Spark supports this model, where N inputs depending on
> parent are computed once at the same time. You can of course cache the
> parent and filter N times and do the same amount of work. One problem is,
> where would the N inputs live? they'd have to be stored if not used
> immediately, and presumably in any use case, only one of them would be used
> immediately. If you have a job that needs to split records of a parent into
> N subsets, and then all N subsets are used, you can do that -- you are just
> transforming the parent to one child that has rows with those N splits of
> each input row, and then consume that. See randomSplit() for maybe the best
> case, where it still produce N Datasets but can do so efficiently because
> it's just a random sample.
>
> On Sun, Feb 3, 2019 at 12:20 AM Moein Hosseini  wrote:
>
>> I don't consider it as method to apply filtering multiple time, instead
>> use it as semi-action not just transformation. Let's think that we have
>> something like map-partition which accept multiple lambda that each one
>> collect their ROW for their dataset (or something like it). Is it possible?
>>
>> On Sat, Feb 2, 2019 at 5:59 PM Sean Owen  wrote:
>>
>>> I think the problem is that can't produce multiple Datasets from one
>>> source in one operation - consider that reproducing one of them would mean
>>> reproducing all of them. You can write a method that would do the filtering
>>> multiple times but it wouldn't be faster. What do you have in mind that's
>>> different?
>>>
>>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini 
>>> wrote:
>>>
 I've seen many application need to split dataset to multiple datasets
 based on some conditions. As there is no method to do it in one place,
 developers use *filter *method multiple times. I think it can be
 useful to have method to split dataset based on condition in one iteration,
 something like *partition* method of scala (of-course scala partition
 just split list into two list, but something more general can be more
 useful).
 If you think it can be helpful, I can create Jira issue and work on it
 to send PR.

 Best Regards
 Moein

 --

 Moein Hosseini
 Data Engineer
 mobile: +98 912 468 1859 <+98+912+468+1859>
 site: www.moein.xyz
 email: moein...@gmail.com
 [image: linkedin] 
 [image: twitter] 


>>
>> --
>>
>> Moein Hosseini
>> Data Engineer
>> mobile: +98 912 468 1859 <+98+912+468+1859>
>> site: www.moein.xyz
>> email: moein...@gmail.com
>> [image: linkedin] 
>> [image: twitter] 
>>
>>

-- 

Regards,
Maciej


[PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Maciej Szymkiewicz
Hello everyone,

I'd like to revisit the topic of adding PySpark type annotations in 3.0. It
has been discussed before (
http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html
and
http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)
and is tracked by SPARK-17333 (
https://issues.apache.org/jira/browse/SPARK-17333). Is there any consensus
here?

In the spirit of full disclosure I am trying to decide if, and if yes to
what extent, migrate my stub package (
https://github.com/zero323/pyspark-stubs) to 3.0 and beyond. Maintaining
such package is relatively time consuming (not being active PySpark user
anymore, it is the least priority for me at the moment) and if there any
official plans to make it obsolete, it would be a valuable information for
me.

If there are no plans to add native annotations to PySpark, I'd like to use
this opportunity to ask PySpark commiters, to drop by and open issue (
https://github.com/zero323/pyspark-stubs/issues)  when new methods are
introduced, or there are changes in the existing API (PR's are of course
welcomed as well). Thanks in advance.

-- 
Best,
Maciej


Re: Documentation of boolean column operators missing?

2018-10-23 Thread Maciej Szymkiewicz
Even if these were documented, Sphinx doesn't include dunder methods by
default (with the exception of __init__). There is the :special-members:
option, which could be passed to, for example, autoclass.
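
For example, in conf.py (a sketch assuming Sphinx >= 1.8, where
autodoc_default_options is available; older versions would use
autodoc_default_flags instead):

    # conf.py
    autodoc_default_options = {
        "members": True,
        # whitelist the dunder methods we actually want rendered
        "special-members": "__and__, __or__, __invert__, __rand__, __ror__",
    }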

On Tue, 23 Oct 2018 at 21:32, Sean Owen  wrote:

> (& and | are both logical and bitwise operators in Java and Scala, FWIW)
>
> I don't see them in the python docs; they are defined in column.py but
> they don't turn up in the docs. Then again, they're not documented:
>
> ...
> __and__ = _bin_op('and')
> __or__ = _bin_op('or')
> __invert__ = _func_op('not')
> __rand__ = _bin_op("and")
> __ror__ = _bin_op("or")
> ...
>
> I don't know if there's a good reason for it, but go ahead and doc
> them if they can be.
> While I suspect their meaning is obvious once it's clear they aren't
> the bitwise operators, that part isn't obvious/ While it matches
> Java/Scala/Scala-Spark syntax, and that's probably most important, it
> isn't typical for python.
>
> The comments say that it is not possible to overload 'and' and 'or',
> which would have been more natural.
>
> On Tue, Oct 23, 2018 at 2:20 PM Nicholas Chammas
>  wrote:
> >
> > Also, to clarify something for folks who don't work with PySpark: The
> boolean column operators in PySpark are completely different from those in
> Scala, and non-obvious to boot (since they overload Python's _bitwise_
> operators). So their apparent absence from the docs is surprising.
> >
> > On Tue, Oct 23, 2018 at 3:02 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> >>
> >> So it appears then that the equivalent operators for PySpark are
> completely missing from the docs, right? That’s surprising. And if there
> are column function equivalents for |, &, and ~, then I can’t find those
> either for PySpark. Indeed, I don’t think such a thing is possible in
> PySpark. (e.g. (col('age') > 0).and(...))
> >>
> >> I can file a ticket about this, but I’m just making sure I’m not
> missing something obvious.
> >>
> >>
> >> On Tue, Oct 23, 2018 at 2:50 PM Sean Owen  wrote:
> >>>
> >>> Those should all be Column functions, really, and I see them at
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
> >>>
> >>> On Tue, Oct 23, 2018, 12:27 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> 
>  I can’t seem to find any documentation of the &, |, and ~ operators
> for PySpark DataFrame columns. I assume that should be in our docs
> somewhere.
> 
>  Was it always missing? Am I just missing something obvious?
> 
>  Nick
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
For the reference I raised question of Python 2 support before -
http://apache-spark-developers-list.1001551.n3.nabble.com/Future-of-the-Python-2-support-td20094.html



On Sat, 15 Sep 2018 at 15:14, Alexander Shorin  wrote:

> What's the release due for Apache Spark 3.0? Will it be tomorrow or
> somewhere at the middle of 2019 year?
>
> I think we shouldn't care much about Python 2.x today, since quite
> soon it support turns into pumpkin. For today's projects I hope nobody
> takes into account support of 2.7 unless there is some legacy still to
> carry on, but do we want to take that baggage into Apache Spark 3.x
> era? The next time you may drop it would be only 4.0 release because
> of breaking change.
>
> --
> ,,,^..^,,,
> On Sat, Sep 15, 2018 at 2:21 PM Maciej Szymkiewicz
>  wrote:
> >
> > There is no need to ditch Python 2. There are basically two options
> >
> > Use stub files and limit yourself to support only Python 3 support.
> Python 3 users benefit from type hints, Python 2 users don't, but no core
> functionality is affected. This is the approach I've used with
> https://github.com/zero323/pyspark-stubs/.
> > Use comment based inline syntax or stub files and don't use backward
> incompatible features (primarily typing module -
> https://docs.python.org/3/library/typing.html). Both Python 2 and 3 is
> supported, but more advanced components are not. Small win for Python 2
> users, moderate loss for Python 3 users.
> >
> >
> >
> > On Sat, 15 Sep 2018 at 02:38, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> >>
> >> Do we need to ditch Python 2 support to provide type hints? I don’t
> think so.
> >>
> >> Python lets you specify typing stubs that provide the same benefit
> without forcing Python 3.
> >>
> >> On Fri, Sep 14, 2018 at 8:01 PM, Holden Karau wrote:
> >>>
> >>>
> >>>
> >>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson 
> wrote:
> >>>>
> >>>> To be clear, is this about "python-friendly API" or "friendly python
> API" ?
> >>>
> >>> Well what would you consider to be different between those two
> statements? I think it would be good to be a bit more explicit, but I don't
> think we should necessarily limit ourselves.
> >>>>
> >>>>
> >>>> On the python side, it might be nice to take advantage of static
> typing. Requires python 3.6 but with python 2 going EOL, a spark-3.0 might
> be a good opportunity to jump the python-3-only train.
> >>>
> >>> I think we can make types sort of work without ditching 2 (the types
> only would work in 3 but it would still function in 2). Ditching 2 entirely
> would be a big thing to consider, I honestly hadn't been considering that
> but it could be from just spending so much time maintaining a 2/3 code
> base. I'd suggest reaching out to to user@ before making that kind of
> change.
> >>>>
> >>>>
> >>>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
> wrote:
> >>>>>
> >>>>> Since we're talking about Spark 3.0 in the near future (and since
> some recent conversation on a proposed change reminded me) I wanted to open
> up the floor and see if folks have any ideas on how we could make a more
> Python friendly API for 3.0? I'm planning on taking some time to look at
> other systems in the solution space and see what we might want to learn
> from them but I'd love to hear what other folks are thinking too.
> >>>>>
> >>>>> --
> >>>>> Twitter: https://twitter.com/holdenkarau
> >>>>> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> >>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >>>>
> >>>>
> >
> >
>


Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
There is no need to ditch Python 2. There are basically two options (a
brief sketch of both follows below):

   - Use stub files and limit yourself to Python 3 support only. Python
   3 users benefit from type hints, Python 2 users don't, but no core
   functionality is affected. This is the approach I've used with
   https://github.com/zero323/pyspark-stubs/.
   - Use comment-based inline syntax or stub files and don't use
   backward-incompatible features (primarily the typing module -
   https://docs.python.org/3/library/typing.html). Both Python 2 and 3
   are supported, but more advanced components are not. A small win for
   Python 2 users, a moderate loss for Python 3 users.
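
A brief sketch of both options, using a made-up helper function:

    from pyspark.sql import DataFrame
    from pyspark.sql.functions import lit

    def add_flag(df, name):
        # type: (DataFrame, str) -> DataFrame
        # Option 2: comment-based (PEP 484) syntax, valid on Python 2 and 3.
        return df.withColumn(name, lit(0))

    # Option 1 keeps the .py file untouched and puts the signature in a stub
    # file (module.pyi), visible to Python 3 type checkers only:
    #     def add_flag(df: DataFrame, name: str) -> DataFrame: ...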



On Sat, 15 Sep 2018 at 02:38, Nicholas Chammas 
wrote:

> Do we need to ditch Python 2 support to provide type hints? I don’t think
> so.
>
> Python lets you specify typing stubs that provide the same benefit without
> forcing Python 3.
>
> On Fri, Sep 14, 2018 at 8:01 PM, Holden Karau wrote:
>
>>
>>
>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:
>>
>>> To be clear, is this about "python-friendly API" or "friendly python
>>> API" ?
>>>
>> Well what would you consider to be different between those two
>> statements? I think it would be good to be a bit more explicit, but I don't
>> think we should necessarily limit ourselves.
>>
>>>
>>> On the python side, it might be nice to take advantage of static typing.
>>> Requires python 3.6 but with python 2 going EOL, a spark-3.0 might be a
>>> good opportunity to jump the python-3-only train.
>>>
>> I think we can make types sort of work without ditching 2 (the types only
>> would work in 3 but it would still function in 2). Ditching 2 entirely
>> would be a big thing to consider, I honestly hadn't been considering that
>> but it could be from just spending so much time maintaining a 2/3 code
>> base. I'd suggest reaching out to to user@ before making that kind of
>> change.
>>
>>>
>>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau 
>>> wrote:
>>>
 Since we're talking about Spark 3.0 in the near future (and since some
 recent conversation on a proposed change reminded me) I wanted to open up
 the floor and see if folks have any ideas on how we could make a more
 Python friendly API for 3.0? I'm planning on taking some time to look at
 other systems in the solution space and see what we might want to learn
 from them but I'd love to hear what other folks are thinking too.

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>>


Re: [DISCUSS] move away from python doctests

2018-08-29 Thread Maciej Szymkiewicz
Hi Imran,

On Wed, 29 Aug 2018 at 22:26, Imran Rashid 
wrote:

> Hi Li,
>
> yes that makes perfect sense.  That more-or-less is the same as my view,
> though I framed it differently.  I guess in that case, I'm really asking:
>
> Can pyspark changes please be accompanied by more unit tests, and not
> assume we're getting coverage from doctests?
>

I don't think such assumptions are made, or at least I haven't seen any
evidence of that.

However, we often assume that particular components are already tested
in the Scala API (SQL, ML), and intentionally don't repeat these tests.


>
> Imran
>
> On Wed, Aug 29, 2018 at 2:02 PM Li Jin  wrote:
>
>> Hi Imran,
>>
>> My understanding is that doctests and unittests are orthogonal - doctests
>> are used to make sure docstring examples are correct and are not meant to
>> replace unittests.
>> Functionalities are covered by unit tests to ensure correctness and
>> doctests are used to test the docstring, not the functionalities itself.
>>
>> There are issues with doctests, for example, we cannot test arrow related
>> functions in doctest because of pyarrow is optional dependency, but I think
>> that's a separate issue.
>>
>> Does this make sense?
>>
>> Li
>>
>> On Wed, Aug 29, 2018 at 6:35 PM Imran Rashid 
>> wrote:
>>
>>> Hi,
>>>
>>> I'd like to propose that we move away from such heavy reliance on
>>> doctests in python, and move towards more traditional unit tests.  The main
>>> reason is that its hard to share test code in doc tests.  For example, I
>>> was just looking at
>>>
>>> https://github.com/apache/spark/commit/82c18c240a6913a917df3b55cc5e22649561c4dd
>>>  and wondering if we had any tests for some of the pyspark changes.
>>> SparkSession.createDataFrame has doctests, but those are just run with one
>>> standard spark configuration, which does not enable arrow.  Its hard to
>>> easily reuse that test, just with another spark context with a different
>>> conf.  Similarly I've wondered about reusing test cases but with
>>> local-cluster instead of local mode.  I feel like they also discourage
>>> writing a test which tries to get more exhaustive coverage on corner cases.
>>>
>>> I'm not saying we should stop using doctests -- I see why they're nice.
>>> I just think they should really only be when you want that code snippet in
>>> the doc anyway, so you might as well test it.
>>>
>>> Admittedly, I'm not really a python-developer, so I could be totally
>>> wrong about the right way to author doctests -- pushback welcome!
>>>
>>> Thoughts?
>>>
>>> thanks,
>>> Imran
>>>
>>


Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Maciej Szymkiewicz
Given the popularity of related SO questions:


   - https://stackoverflow.com/q/41670103/1560062
   - https://stackoverflow.com/q/42465568/1560062
   - https://stackoverflow.com/q/41670103/1560062

it is probably more a case of "nobody thought about asking" than of "it
is not used often".

On Wed, 22 Aug 2018 at 00:07, Reynold Xin  wrote:

> Probably just because it is not used that often and nobody has submitted a
> patch for it. I've used pivot probably on average once a week (primarily in
> spreadsheets), but I've never used unpivot ...
>
>
> On Tue, Aug 21, 2018 at 3:06 PM Ivan Gozali  wrote:
>
>> Hi there,
>>
>> I was looking into why the UNPIVOT feature isn't implemented, given that
>> Spark already has PIVOT implemented natively in the DataFrame/Dataset API.
>>
>> Came across this JIRA  
>> which
>> talks about implementing PIVOT in Spark 1.6, but no mention whatsoever
>> regarding UNPIVOT, even though the JIRA curiously references a blog post
>> that talks about both PIVOT and UNPIVOT :)
>>
>> Is this because UNPIVOT is just simply generating multiple slim tables by
>> selecting each column, and making a union out of all of them?
>>
>> Thank you!
>>
>> --
>> Regards,
>>
>>
>> Ivan Gozali
>> Lecida
>> Email: i...@lecida.com
>>
>


Re: Increase Timeout or optimize Spark UT?

2017-08-24 Thread Maciej Szymkiewicz
It won't be used by PySpark and SparkR, will it?

On 23 August 2017 at 23:40, Michael Armbrust <mich...@databricks.com> wrote:

> I think we already set the number of partitions to 5 in tests
> <https://github.com/apache/spark/blob/6942aeeb0a0095a1ba85a817eb9e0edc410e5624/sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala#L60-L61>
> ?
>
> On Tue, Aug 22, 2017 at 3:25 PM, Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
>> Hi,
>>
>> From my experience it is possible to cut quite a lot by reducing
>> spark.sql.shuffle.partitions to some reasonable value (let's say
>> comparable to the number of cores). 200 is a serious overkill for most of
>> the test cases anyway.
>>
>>
>> Best,
>> Maciej
>>
>>
>>
>> On 21 August 2017 at 03:00, Dong Joon Hyun <dh...@hortonworks.com> wrote:
>>
>>> +1 for any efforts to recover Jenkins!
>>>
>>>
>>>
>>> Thank you for the direction.
>>>
>>>
>>>
>>> Bests,
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>> *From: *Reynold Xin <r...@databricks.com>
>>> *Date: *Sunday, August 20, 2017 at 5:53 PM
>>> *To: *Dong Joon Hyun <dh...@hortonworks.com>
>>> *Cc: *"dev@spark.apache.org" <dev@spark.apache.org>
>>> *Subject: *Re: Increase Timeout or optimize Spark UT?
>>>
>>>
>>>
>>> It seems like it's time to look into how to cut down some of the test
>>> runtimes. Test runtimes will slowly go up given the way development
>>> happens. 3 hr is already a very long time for tests to run.
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Aug 20, 2017 at 5:45 PM, Dong Joon Hyun <dh...@hortonworks.com>
>>> wrote:
>>>
>>> Hi, All.
>>>
>>>
>>>
>>> Recently, Apache Spark master branch test (SBT with hadoop-2.7 / 2.6)
>>> has been hitting the build timeout.
>>>
>>>
>>>
>>> Please see the build time trend.
>>>
>>>
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Tes
>>> t%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/buildTimeTrend
>>>
>>>
>>>
>>> All recent 22 builds fail due to timeout directly/indirectly. The last
>>> success (SBT with Hadoop-2.7) is 15th August.
>>>
>>>
>>>
>>> We may do the followings.
>>>
>>>
>>>
>>>1. Increase Build Timeout (3 hr 30 min)
>>>2. Optimize UTs (Scala/Java/Python/UT)
>>>
>>>
>>>
>>> But, Option 1 will be the immediate solution for now . Could you update
>>> the Jenkins setup?
>>>
>>>
>>>
>>> Bests,
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>
>>
>


-- 

Best regards,
Maciej Szymkiewicz


Re: Increase Timeout or optimize Spark UT?

2017-08-22 Thread Maciej Szymkiewicz
Hi,

From my experience, it is possible to cut quite a lot by reducing
spark.sql.shuffle.partitions to some reasonable value (let's say
comparable to the number of cores). 200 is serious overkill for most of
the test cases anyway.
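
For example, a test session can be created with a much smaller shuffle
fan-out (the exact value below is only illustrative):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[4]")
        .config("spark.sql.shuffle.partitions", "4")  # instead of the default 200
        .getOrCreate()
    )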


Best,
Maciej



On 21 August 2017 at 03:00, Dong Joon Hyun  wrote:

> +1 for any efforts to recover Jenkins!
>
>
>
> Thank you for the direction.
>
>
>
> Bests,
>
> Dongjoon.
>
>
>
> *From: *Reynold Xin 
> *Date: *Sunday, August 20, 2017 at 5:53 PM
> *To: *Dong Joon Hyun 
> *Cc: *"dev@spark.apache.org" 
> *Subject: *Re: Increase Timeout or optimize Spark UT?
>
>
>
> It seems like it's time to look into how to cut down some of the test
> runtimes. Test runtimes will slowly go up given the way development
> happens. 3 hr is already a very long time for tests to run.
>
>
>
>
>
> On Sun, Aug 20, 2017 at 5:45 PM, Dong Joon Hyun 
> wrote:
>
> Hi, All.
>
>
>
> Recently, Apache Spark master branch test (SBT with hadoop-2.7 / 2.6) has
> been hitting the build timeout.
>
>
>
> Please see the build time trend.
>
>
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%
> 20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/buildTimeTrend
>
>
>
> All recent 22 builds fail due to timeout directly/indirectly. The last
> success (SBT with Hadoop-2.7) is 15th August.
>
>
>
> We may do the followings.
>
>
>
>1. Increase Build Timeout (3 hr 30 min)
>2. Optimize UTs (Scala/Java/Python/UT)
>
>
>
> But, Option 1 will be the immediate solution for now . Could you update
> the Jenkins setup?
>
>
>
> Bests,
>
> Dongjoon.
>
>
>


Re: Possible bug: inconsistent timestamp behavior

2017-08-15 Thread Maciej Szymkiewicz
These two are just not equivalent.

Spark SQL interprets a long as seconds when casting between timestamps
and numerics, therefore
lit(148550335L).cast(org.apache.spark.sql.types.TimestampType)
represents 49043-09-23 21:26:400.0. This behavior is intended ‒ see for
example https://issues.apache.org/jira/browse/SPARK-11724

java.sql.Timestamp expects milliseconds as an argument, therefore
lit(new java.sql.Timestamp(148550335L)) represents 2017-01-27 08:49:10.
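
The same asymmetry can be reproduced from PySpark (a sketch assuming an
active `spark` session; the values are only illustrative):

    from pyspark.sql.functions import lit

    # a long cast to timestamp is interpreted as *seconds* since the epoch
    spark.range(1).select(lit(1485503350).cast("timestamp")).show()
    # so feeding it milliseconds lands tens of thousands of years ahead
    spark.range(1).select(lit(1485503350000).cast("timestamp")).show()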

On 15 August 2017 at 13:16, assaf.mendelson <assaf.mendel...@rsa.com> wrote:

> Hi all,
>
> I encountered weird behavior for timestamp. It seems that when using lit
> to add it to column, the timestamp goes from milliseconds representation to
> seconds representation:
>
>
>
>
>
> scala> spark.range(1).withColumn("a", lit(new java.sql.Timestamp(
> 148550335L)).cast("long")).show()
>
> +---+--+
>
> | id| a|
>
> +---+--+
>
> |  0|1485503350|
>
> +---+--+
>
>
>
>
>
> scala> spark.range(1).withColumn("a", lit(148550335L).cast(org.
> apache.spark.sql.types.TimestampType).cast(org.apache.spark.sql.types.
> LongType)).show()
>
> +---+-+
>
> | id|a|
>
> +---+-+
>
> |  0|148550335|
>
> +---+-+
>
>
>
>
>
> Is this a bug or am I missing something here?
>
>
>
> Thanks,
>
> Assaf
>
>
>
> --
> View this message in context: Possible bug: inconsistent timestamp
> behavior
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-inconsistent-timestamp-behavior-tp22144.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.
>



-- 

Best regards,
Maciej Szymkiewicz


Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Maciej Szymkiewicz
Since 2.2 there is Imputer:

https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py

which should at least partially address the problem.
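
A minimal sketch (assuming an active `spark` session). Note that Imputer
works on numeric columns rather than on already assembled vector
columns, which is why it only partially addresses the problem:

    from pyspark.ml.feature import Imputer

    df = spark.createDataFrame(
        [(1.0, float("nan")), (2.0, 3.0), (float("nan"), 4.0)],
        ["a", "b"],
    )
    imputer = Imputer(inputCols=["a", "b"], outputCols=["a_filled", "b_filled"])
    imputer.fit(df).transform(df).show()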

On 06/22/2017 03:03 AM, Franklyn D'souza wrote:
> I just wanted to highlight some of the rough edges around using
> vectors in columns in dataframes. 
>
> If there is a null in a dataframe column containing vectors pyspark ml
> models like logistic regression will completely fail. 
>
> However from what i've read there is no good way to fill in these
> nulls with empty vectors. 
>
> It's not possible to create a literal vector column expression and
> coalesce it with the column from pyspark.
>  
> so we're left with writing a python udf which does this coalesce; this
> is really inefficient on large datasets and becomes a bottleneck for
> ml pipelines working with real world data.
>
> I'd like to know how other users are dealing with this and what plans
> there are to extend vector support for dataframes.
>
> Thanks!,
>
> Franklyn

-- 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: spark messing up handling of native dependency code?

2017-06-02 Thread Maciej Szymkiewicz
Maybe not related, but in general GeoTools is not thread safe, so using it
from workers is most likely a gamble.

On 06/03/2017 01:26 AM, Georg Heiler wrote:
> Hi,
>
> There is a weird problem with spark when handling native dependency code:
> I want to use a library (JAI) with spark to parse some spatial raster
> files. Unfortunately, there are some strange issues. JAI only works
> when running via the build tool i.e. `sbt run` when executed in spark.
>
> When executed via spark-submit the error is:
>
> java.lang.IllegalArgumentException: The input argument(s) may not
> be null.
> at
> javax.media.jai.ParameterBlockJAI.getDefaultMode(ParameterBlockJAI.java:136)
> at
> javax.media.jai.ParameterBlockJAI.(ParameterBlockJAI.java:157)
> at
> javax.media.jai.ParameterBlockJAI.(ParameterBlockJAI.java:178)
> at
> org.geotools.process.raster.PolygonExtractionProcess.execute(PolygonExtractionProcess.java:171)
>
> Which looks like some native dependency (I think GEOS is running in
> the background) is not there correctly.
>
> Assuming something is wrong with the class path I tried to run a plain
> java/scala function. but this one works just fine.
>
> Is spark messing with the class paths?
>
> I created a minimal example here:
> https://github.com/geoHeil/jai-packaging-problem
>
>
> Hope someone can shed some light on this problem,
> Regards,
> Georg 



Re: [PYTHON] PySpark typing hints

2017-05-23 Thread Maciej Szymkiewicz


On 05/23/2017 02:45 PM, Mendelson, Assaf wrote:
>
> You are correct,
>
> I actually did not look too deeply into it until now as I noticed you
> mentioned it is compatible with python 3 only and I saw in the github
> that mypy or pytype is required.
>
>  
>
> Because of that I made my suggestions with the thought of python 2.
>
>  
>
> Looking into it more deeply, I am wondering what is not supported? Are
> you talking about limitation for testing?
>

Since type checkers (unlike annotations) are not standardized, this
varies between projects and versions. For MyPy quite a lot changed since
I started annotating Spark.

A few months ago I wouldn't even bother looking at the list of issues;
today (as mentioned in the other message) we could remove metaclasses
and pass both Python 2 and Python 3 checks.

The other part is the typing module itself, as well as function annotations
(outside docstrings). But this is not a problem with stub files.
>
>  
>
> If I understand correctly then one can use this without any issues for
> pycharm (and other IDEs supporting the type hinting) even when
> developing for python 2.
>

This strictly depends on the type checker. I didn't follow the development,
but I got the impression that a lot changed, for example between PyCharm
2016.3 and 2017.1. I think that the important point is that lack of
support doesn't break anything.
>
> In addition, the tests can test the existing pyspark, they just have
> to be run with a compatible packaging (e.g. mypy).
>
> Meaning that porting for python 2 would provide a very small advantage
> over the immediate advantages (IDE usage and testing for most cases).
>
>  
>
> Am I missing something?
>
>  
>
> Thanks,
>
>   Assaf.
>
>  
>
> *From:*Maciej Szymkiewicz [mailto:mszymkiew...@gmail.com]
> *Sent:* Tuesday, May 23, 2017 3:27 PM
> *To:* Mendelson, Assaf
> *Subject:* Re: [PYTHON] PySpark typing hints
>
>  
>
>  
>
>  
>
> On 05/23/2017 01:12 PM, assaf.mendelson wrote:
>
> That said, If we make a decision on the way to handle it then I
> believe it would be a good idea to start even with the bare
> minimum and continue to add to it (and therefore make it so many
> people can contribute). The code I added in github were basically
> the things I needed.
>
I already have almost full coverage of the API, excluding some exotic
parts of the legacy streaming, so starting with the bare minimum is not
really required.
>
> The advantage of the first is that it is part of the code, which means
> it is easier to keep it updated. The main issue with this is that
> supporting auto generated code (as is the case for most functions) can
> be a little awkward and actually relates to a separate issue, as it
> means PyCharm marks most of the functions as an error (i.e.
> pyspark.sql.functions.XXX is marked as not there…)
>
>
> Comment based annotations are not suitable for complex signatures with
> multiversion support.
>
> Also there is no support for overloading, therefore it is not possible
> to capture the relationship between arguments, or between arguments and
> the return type.
>

-- 
Maciej Szymkiewicz



signature.asc
Description: OpenPGP digital signature


Re: [PYTHON] PySpark typing hints

2017-05-23 Thread Maciej Szymkiewicz
It doesn't break anything at all. You can take stub files as-is, put
these into PySpark root, and as long as users are not interested in type
checking, it won't have any runtime impact.

Surprisingly, the current MyPy build (mypy==0.511) reports only one
incompatibility with Python 2 (dynamic metaclasses), which could be
resolved without significant loss of functionality.

On 05/23/2017 12:08 PM, Reynold Xin wrote:
> Seems useful to do. Is there a way to do this so it doesn't break
> Python 2.x?
>
>
> On Sun, May 14, 2017 at 11:44 PM, Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> Hi everyone,
>
> For the last few months I've been working on static type
> annotations for PySpark. For those of you, who are not familiar
> with the idea, typing hints have been introduced by PEP 484
> (https://www.python.org/dev/peps/pep-0484/
> <https://www.python.org/dev/peps/pep-0484/>) and further extended
> with PEP 526 (https://www.python.org/dev/peps/pep-0526/
> <https://www.python.org/dev/peps/pep-0526/>) with the main goal of
> providing information required for static analysis. Right now
> there are a few tools which support typing hints, including Mypy
> (https://github.com/python/mypy <https://github.com/python/mypy>)
> and PyCharm
> 
> (https://www.jetbrains.com/help/pycharm/2017.1/type-hinting-in-pycharm.html
> 
> <https://www.jetbrains.com/help/pycharm/2017.1/type-hinting-in-pycharm.html>).
>  
> Type hints can be added using function annotations
> (https://www.python.org/dev/peps/pep-3107/
> <https://www.python.org/dev/peps/pep-3107/>, Python 3 only),
> docstrings, or source independent stub files
> (https://www.python.org/dev/peps/pep-0484/#stub-files
> <https://www.python.org/dev/peps/pep-0484/#stub-files>). Typing is
> optional, gradual and has no runtime impact.
>
> At this moment I've annotated the majority of the API, including
> most of pyspark.sql and pyspark.ml <http://pyspark.ml>. The
> project is still rough around the edges, and may
> result in both false positives and false negatives, but I think it
> has become mature enough to be useful in practice.
>
> The current version is compatible only with Python 3, but it is
> possible, with some limitations, to backport it to Python 2
> (though it is not on my todo list).
>
> There are a number of possible benefits for PySpark users and
> developers:
>
>   * Static analysis can detect a number of common mistakes to
> prevent runtime failures. Generic self is still fairly
> limited, so it is more useful with DataFrames, SS and ML than
> RDD, DStreams or RDD.
>   * Annotations can be used for documenting complex signatures
> (https://git.io/v95JN) including dependencies on arguments and
> value (https://git.io/v95JA).
>   * Detecting possible bugs in Spark (SPARK-20631) .
>   * Showing API inconsistencies.
>
> Roadmap
>
>   * Update the project to reflect Spark 2.2.
>   * Refine existing annotations.
>
> If there is enough interest I am happy to contribute this
> back to Spark or submit to Typeshed
> (https://github.com/python/typeshed
> <https://github.com/python/typeshed> -  this would require a
> formal ASF approval, and since Typeshed doesn't provide
> versioning, is probably not the best option in our case).
>
> Further information:
>
>   * https://github.com/zero323/pyspark-stubs
> <https://github.com/zero323/pyspark-stubs> - GitHub repository
>
>   * 
> https://speakerdeck.com/marcobonzanini/static-type-analysis-for-robust-data-products-at-pydata-london-2017
> 
> <https://speakerdeck.com/marcobonzanini/static-type-analysis-for-robust-data-products-at-pydata-london-2017>
> - interesting presentation by Marco Bonzanini
>
> -- 
> Best,
> Maciej
>
>

-- 
Maciej Szymkiewicz



signature.asc
Description: OpenPGP digital signature


[PYTHON] PySpark typing hints

2017-05-14 Thread Maciej Szymkiewicz
Hi everyone,

For the last few months I've been working on static type annotations for
PySpark. For those of you, who are not familiar with the idea, typing
hints have been introduced by PEP 484
(https://www.python.org/dev/peps/pep-0484/) and further extended with
PEP 526 (https://www.python.org/dev/peps/pep-0526/) with the main goal
of providing information required for static analysis. Right now there are
a few tools which support typing hints, including Mypy
(https://github.com/python/mypy) and PyCharm
(https://www.jetbrains.com/help/pycharm/2017.1/type-hinting-in-pycharm.html). 
Type hints can be added using function annotations
(https://www.python.org/dev/peps/pep-3107/, Python 3 only), docstrings,
or source independent stub files
(https://www.python.org/dev/peps/pep-0484/#stub-files). Typing is
optional, gradual and has no runtime impact.
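
For readers who haven't seen stub files before, here is a tiny, simplified
sketch of what such an annotation-only module can look like (an illustrative
fragment I made up, not the actual contents of the stubs):

# dataframe.pyi - stubs are never executed, they only feed type checkers
from typing import List, Optional, Union

class Column: ...
class Row: ...

class DataFrame:
    columns: List[str]
    def filter(self, condition: Union[Column, str]) -> "DataFrame": ...
    def limit(self, num: int) -> "DataFrame": ...
    def take(self, num: int) -> List[Row]: ...
    def first(self) -> Optional[Row]: ...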

At this moment I've annotated the majority of the API, including most of
pyspark.sql and pyspark.ml. The project is still rough around
the edges, and may result in both false positives and false negatives,
but I think it has become mature enough to be useful in practice.

The current version is compatible only with Python 3, but it is
possible, with some limitations, to backport it to Python 2 (though it
is not on my todo list).

There are a number of possible benefits for PySpark users and developers:

  * Static analysis can detect a number of common mistakes to prevent
runtime failures. Generic self is still fairly limited, so it is
more useful with DataFrames, SS and ML than with RDDs or DStreams.
  * Annotations can be used for documenting complex signatures
(https://git.io/v95JN), including dependencies on arguments and values
(https://git.io/v95JA).
  * Detecting possible bugs in Spark (SPARK-20631).
  * Showing API inconsistencies.

Roadmap

  * Update the project to reflect Spark 2.2.
  * Refine existing annotations.

If there is enough interest I am happy to contribute this back to
Spark or submit it to Typeshed (https://github.com/python/typeshed - this
would require a formal ASF approval, and since Typeshed doesn't provide
versioning, it is probably not the best option in our case).

Further information:

  * https://github.com/zero323/pyspark-stubs - GitHub repository

  * 
https://speakerdeck.com/marcobonzanini/static-type-analysis-for-robust-data-products-at-pydata-london-2017
- interesting presentation by Marco Bonzanini

-- 
Best,
Maciej



signature.asc
Description: OpenPGP digital signature


Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-29 Thread Maciej Szymkiewicz
I am not sure if it is relevant but explode_outer and posexplode_outer
seem to be broken: SPARK-20534



On 04/28/2017 12:49 AM, Sean Owen wrote:
> By the way the RC looks good. Sigs and license are OK, tests pass with
> -Phive -Pyarn -Phadoop-2.7. +1 from me.
>
> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust
> > wrote:
>
> Please vote on releasing the following candidate as Apache Spark
> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00
> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc1
>  
> (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>
> List of JIRA tickets resolved can be found with this filter
> 
> .
>
> The release files, including signatures, digests, etc. can be
> found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
> 
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1235/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
> 
> 
>
>
> *FAQ*
>
> *How can I help test this release?*
> *
> *
> If you are a Spark user, you can help us test this release by
> taking an existing Spark workload and running on this release
> candidate, then reporting any regressions.
> *
> *
> *What should happen to JIRA tickets still targeting 2.2.0?*
> *
> *
> Committers should look at those and triage. Extremely important
> bug fixes, documentation, and API tweaks that impact compatibility
> should be worked on immediately. Everything else please retarget
> to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from 2.1.1.
>



[SQL] Unresolved reference with chained window functions.

2017-03-24 Thread Maciej Szymkiewicz
Forwarded from SO (http://stackoverflow.com/q/43007433). Looks like a
regression compared to 2.0.2.

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val win_spec_max =
Window.partitionBy("x").orderBy("AmtPaid").rowsBetween(Window.unboundedPreceding,
0)
win_spec_max: org.apache.spark.sql.expressions.WindowSpec =
org.apache.spark.sql.expressions.WindowSpec@3433e418

scala> val df = Seq((1, 2.0), (1, 3.0), (1, 1.0), (1, -2.0), (1,
-1.0)).toDF("x", "AmtPaid")
df: org.apache.spark.sql.DataFrame = [x: int, AmtPaid: double]

scala> val df_with_sum = df.withColumn("AmtPaidCumSum",
sum(col("AmtPaid")).over(win_spec_max))
df_with_sum: org.apache.spark.sql.DataFrame = [x: int, AmtPaid: double
... 1 more field]

scala> val df_with_max = df_with_sum.withColumn("AmtPaidCumSumMax",
max(col("AmtPaidCumSum")).over(win_spec_max))
df_with_max: org.apache.spark.sql.DataFrame = [x: int, AmtPaid: double
... 2 more fields]

scala> df_with_max.explain
== Physical Plan ==
!Window [sum(AmtPaid#361) windowspecdefinition(x#360, AmtPaid#361 ASC
NULLS FIRST, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS
AmtPaidCumSum#366, max(AmtPaidCumSum#366) windowspecdefinition(x#360,
AmtPaid#361 ASC NULLS FIRST, ROWS BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW) AS AmtPaidCumSumMax#372], [x#360], [AmtPaid#361 ASC NULLS
FIRST]
+- *Sort [x#360 ASC NULLS FIRST, AmtPaid#361 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(x#360, 200)
  +- LocalTableScan [x#360, AmtPaid#361]

scala> df_with_max.printSchema
root
 |-- x: integer (nullable = false)
 |-- AmtPaid: double (nullable = false)
 |-- AmtPaidCumSum: double (nullable = true)
 |-- AmtPaidCumSumMax: double (nullable = true)

scala> df_with_max.show
17/03/24 21:22:32 ERROR Executor: Exception in task 0.0 in stage 19.0
(TID 234)
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding
attribute, tree: AmtPaidCumSum#366
at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
   ...
Caused by: java.lang.RuntimeException: Couldn't find AmtPaidCumSum#366
in [sum#385,max#386,x#360,AmtPaid#361]
   ...

Is it a known issue or do we need a JIRA?

-- 
Best,
Maciej Szymkiewicz


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[ML][PYTHON] Collecting data in a class extending SparkSessionTestCase causes AttributeError:

2017-03-06 Thread Maciej Szymkiewicz
Hi everyone,

It is either too late or too early for me to think straight, so please
forgive me if it is something trivial. I am trying to add a test case
extending SparkSessionTestCase to pyspark.ml.tests (example patch
attached). If the test collects data, and there is another TestCase
extending SparkSessionTestCase executed before it, I get an
AttributeError due to _jsc being None:

==

ERROR: test_foo (pyspark.ml.tests.FooTest)

--

Traceback (most recent call last):

  File "/home/spark/python/pyspark/ml/tests.py", line 1258, in test_foo

  File "/home/spark/python/pyspark/sql/dataframe.py", line 389, in collect

with SCCallSiteSync(self._sc) as css:

  File "/home/spark/python/pyspark/traceback_utils.py", line 72, in __enter__

self._context._jsc.setCallSite(self._call_site)

AttributeError: 'NoneType' object has no attribute 'setCallSite'

--

If TestCase is executed alone it seems to work just fine.


Can anyone reproduce this? Is there something obvious I miss here?

-- 
Best,
Maciej

diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py
index 3524160557..cc6e49d6cf 100755
--- a/python/pyspark/ml/tests.py
+++ b/python/pyspark/ml/tests.py
@@ -1245,6 +1245,17 @@ class ALSTest(SparkSessionTestCase):
         self.assertEqual(als.getFinalStorageLevel(), "DISK_ONLY")
         self.assertEqual(als._java_obj.getFinalStorageLevel(), "DISK_ONLY")
 
+        als.fit(df).userFactors.collect()
+
+
+class FooTest(SparkSessionTestCase):
+    def test_foo(self):
+        df = self.spark.createDataFrame(
+            [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 2, 5.0)],
+            ["user", "item", "rating"])
+        als = ALS().setMaxIter(1).setRank(1)
+        als.fit(df).userFactors.collect()
+
 
 class DefaultValuesTests(PySparkTestCase):
     """


signature.asc
Description: OpenPGP digital signature


Re: [PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-14 Thread Maciej Szymkiewicz
I don't have any strong views, so just to highlight possible issues:

  * Based on different issues I've seen there is a substantial number of
users who depend on system-wide Python installations. As far as I
am aware neither Py4j nor cloudpickle are present in the standard
system repositories in Debian or Red Hat derivatives.
  * Assuming that Spark is committed to supporting Python 2 beyond its
end of life we have to be sure that any external dependency has the
same policy.
  * Py4j is missing from the default Anaconda channel. Not a big issue, just
a small annoyance.
  * External dependencies with pinned versions add some overhead to the
development across versions (effectively we may need a separate env
for each major Spark release). I've seen small inconsistencies in
PySpark behavior with different Py4j versions so it is not
completely hypothetical.
  * Adding possible version conflicts. It is probably not a big risk but
something to consider (for example in combination Blaze + Dask +
PySpark).
  * Adding another party the user has to trust.


On 02/14/2017 12:22 AM, Holden Karau wrote:
> It's a good question. Py4J seems to have been updated 5 times in 2016
> and is a bit involved (from a review point of view verifying the zip
> file contents is somewhat tedious).
>
> cloudpickle is a bit difficult to tell since we can have changes to
> cloudpickle which aren't correctly tagged as backporting changes from
> the fork (and this can take awhile to review since we don't always
> catch them right away as being backports).
>
> Another difficulty with looking at backports is that since our review
> process for PySpark has historically been on the slow side, changes
> benefiting systems like dask or IPython parallel were not backported
> to Spark unless they caused serious errors.
>
> I think the key benefits are better test coverage of the forked
> version of cloudpickle, using a more standardized packaging of
> dependencies, simpler updates of dependencies reduces friction to
> gaining benefits from other related projects work - Python
> serialization really isn't our secret sauce.
>
> If I'm missing any substantial benefits or costs I'd love to know :)
>
> On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin <r...@databricks.com
> <mailto:r...@databricks.com>> wrote:
>
> With any dependency update (or refactoring of existing code), I
> always ask this question: what's the benefit? In this case it
> looks like the benefit is to reduce efforts in backports. Do you
> know how often we needed to do those?
>
>
> On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau
> <hol...@pigscanfly.ca <mailto:hol...@pigscanfly.ca>> wrote:
>
> Hi PySpark Developers,
>
> Cloudpickle is a core part of PySpark, and is originally
> copied from (and improved from) picloud. Since then other
> projects have found cloudpickle useful and a fork of
> cloudpickle <https://github.com/cloudpipe/cloudpickle> was
> created and is now maintained as its own library
> <https://pypi.python.org/pypi/cloudpickle> (with better test
> coverage and resulting bug fixes I understand). We've had a
> few PRs backporting fixes from the cloudpickle project into
> Spark's local copy of cloudpickle - how would people feel
> about moving to taking an explicit (pinned) dependency on
> cloudpickle?
>
> We could add cloudpickle to the setup.py and a
> requirements.txt file for users who prefer not to do a system
> installation of PySpark.
>
> Py4J is maybe even a simpler case, we currently have a zip of
> py4j in our repo but could instead have a pinned version
> required. While we do depend on a lot of py4j internal APIs,
> version pinning should be sufficient to ensure functionality
> (and simplify the update process).
>
> Cheers,
>
>     Holden :)
>
> -- 
> Twitter: https://twitter.com/holdenkarau
> <https://twitter.com/holdenkarau>
>
>
>
>
>
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau

-- 
Maciej Szymkiewicz



Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Maciej Szymkiewicz
Congratulations!


On 02/13/2017 08:16 PM, Reynold Xin wrote:
> Hi all,
>
> Takuya-san has recently been elected an Apache Spark committer. He's
> been active in the SQL area and writes very small, surgical patches
> that are high quality. Please join me in congratulating Takuya-san!
>


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-03 Thread Maciej Szymkiewicz
antly larger data sets. So I
>>>> guess the question I am really interested in is - what changed between
>>>> 1.6.3 and 2.x (this is more or less consistent across 2.0, 2.1 and
>>>> current master) to cause this and more important, is it a feature or is
>>>> it a bug? I admit, I choose a lazy path here, and didn't spend much time
>>>> (yet) trying to dig deeper.
>>>>
>>>> I can see a bit higher memory usage, a bit more intensive GC activity,
>>>> but nothing I would really blame for this behavior, and duration of
>>>> individual jobs is comparable with some favor of 2.x. Neither
>>>> StringIndexer nor OneHotEncoder changed much in 2.x. They used RDDs for
>>>> fitting in 1.6 and, as far as I can tell, they still do that in 2.x. And
>>>> the problem doesn't look that related to the data processing part in the
>>>> first place.
>>>>
>>>>
>>>> On 02/02/2017 07:22 AM, Liang-Chi Hsieh wrote:
>>>>> Hi Maciej,
>>>>>
>>>>> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>>>>>
>>>>>
>>>>> Liang-Chi Hsieh wrote
>>>>>> Hi Maciej,
>>>>>>
>>>>>> Basically the fitting algorithm in Pipeline is an iterative operation.
>>>>>> Running iterative algorithm on Dataset would have RDD lineages and
>>>>>> query
>>>>>> plans that grow fast. Without cache and checkpoint, it gets slower
>>>>>> when
>>>>>> the iteration number increases.
>>>>>>
>>>>>> I think it is why when you run a Pipeline with long stages, it gets
>>>>>> much
>>>>>> longer time to finish. As I think it is not uncommon to have long
>>>>>> stages
>>>>>> in a Pipeline, we should improve this. I will submit a PR for this.
>>>>>> zero323 wrote
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> While experimenting with ML pipelines I experience a significant
>>>>>>> performance regression when switching from 1.6.x to 2.x.
>>>>>>>
>>>>>>> import org.apache.spark.ml.{Pipeline, PipelineStage}
>>>>>>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
>>>>>>> VectorAssembler}
>>>>>>>
>>>>>>> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
>>>>>>> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
>>>>>>> val indexers = df.columns.tail.map(c => new StringIndexer()
>>>>>>>   .setInputCol(c)
>>>>>>>   .setOutputCol(s"${c}_indexed")
>>>>>>>   .setHandleInvalid("skip"))
>>>>>>>
>>>>>>> val encoders = indexers.map(indexer => new OneHotEncoder()
>>>>>>>   .setInputCol(indexer.getOutputCol)
>>>>>>>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>>>>>>>   .setDropLast(true))
>>>>>>>
>>>>>>> val assembler = new
>>>>>>> VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
>>>>>>> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>>>>>>>
>>>>>>> new Pipeline().setStages(stages).fit(df).transform(df).show
>>>>>>>
>>>>>>> Task execution time is comparable and executors are most of the time
>>>>>>> idle so it looks like it is a problem with the optimizer. Is it a
>>>>>>> known
>>>>>>> issue? Are there any changes I've missed, that could lead to this
>>>>>>> behavior?
>>>>>>>
>>>>>>> -- 
>>>>>>> Best,
>>>>>>> Maciej
>>>>>>>
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe e-mail: 
>>>>>>> dev-unsubscribe@.apache
>>>>>
>>>>>
>>>>>
>>>>> -
>>>>> Liang-Chi Hsieh | @viirya 
>>>>> Spark Technology Center 
>>>>> http://www.spark.tc/ 
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20822.html
>>>>> Sent from the Apache Spark Developers List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: 
>>>> dev-unsubscribe@.apache
>>>> -- 
>>>> Maciej Szymkiewicz
>>>>
>>>>
>>>>
>>>> nM15AWH.png (19K)
>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/20827/0/nM15AWH.png;
>>>> KHZa7hL.png (26K)
>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/20827/1/KHZa7hL.png;
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya 
> Spark Technology Center 
> http://www.spark.tc/ 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20837.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Maciej Szymkiewicz
Hi Liang-Chi,

Thank you for your answer and PR, but I think I wasn't specific
enough. In hindsight I should have illustrated this better. What really
troubles me here is a pattern of growing delays. Difference between
1.6.3 (roughly 20s runtime since the first job):


1.6 timeline

vs 2.1.0 (45 minutes or so in a bad case):

2.1.0 timeline

The code is just an example and it is intentionally dumb. You can easily
mask this with caching, or by using significantly larger data sets. So I
guess the question I am really interested in is - what changed between
1.6.3 and 2.x (this is more or less consistent across 2.0, 2.1 and
current master) to cause this and, more important, is it a feature or is
it a bug? I admit, I chose a lazy path here, and didn't spend much time
(yet) trying to dig deeper.

I can see a bit higher memory usage and a bit more intensive GC activity,
but nothing I would really blame for this behavior, and the duration of
individual jobs is comparable, with some favor for 2.x. Neither
StringIndexer nor OneHotEncoder changed much in 2.x. They used RDDs for
fitting in 1.6 and, as far as I can tell, they still do that in 2.x. And
the problem doesn't look all that related to the data processing part in the
first place.


On 02/02/2017 07:22 AM, Liang-Chi Hsieh wrote:
> Hi Maciej,
>
> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>
>
> Liang-Chi Hsieh wrote
>> Hi Maciej,
>>
>> Basically the fitting algorithm in Pipeline is an iterative operation.
>> Running iterative algorithm on Dataset would have RDD lineages and query
>> plans that grow fast. Without cache and checkpoint, it gets slower when
>> the iteration number increases.
>>
>> I think it is why when you run a Pipeline with long stages, it gets much
>> longer time to finish. As I think it is not uncommon to have long stages
>> in a Pipeline, we should improve this. I will submit a PR for this.
>> zero323 wrote
>>> Hi everyone,
>>>
>>> While experimenting with ML pipelines I experience a significant
>>> performance regression when switching from 1.6.x to 2.x.
>>>
>>> import org.apache.spark.ml.{Pipeline, PipelineStage}
>>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
>>> VectorAssembler}
>>>
>>> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
>>> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
>>> val indexers = df.columns.tail.map(c => new StringIndexer()
>>>   .setInputCol(c)
>>>   .setOutputCol(s"${c}_indexed")
>>>   .setHandleInvalid("skip"))
>>>
>>> val encoders = indexers.map(indexer => new OneHotEncoder()
>>>   .setInputCol(indexer.getOutputCol)
>>>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>>>   .setDropLast(true))
>>>
>>> val assembler = new
>>> VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
>>> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>>>
>>> new Pipeline().setStages(stages).fit(df).transform(df).show
>>>
>>> Task execution time is comparable and executors are most of the time
>>> idle so it looks like it is a problem with the optimizer. Is it a known
>>> issue? Are there any changes I've missed, that could lead to this
>>> behavior?
>>>
>>> -- 
>>> Best,
>>> Maciej
>>>
>>>
>>> -
>>> To unsubscribe e-mail: 
>>> dev-unsubscribe@.apache
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya 
> Spark Technology Center 
> http://www.spark.tc/ 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20822.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-- 
Maciej Szymkiewicz



[SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-01-31 Thread Maciej Szymkiewicz
Hi everyone,

While experimenting with ML pipelines I experience a significant
performance regression when switching from 1.6.x to 2.x.

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
VectorAssembler}

val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
"baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
val indexers = df.columns.tail.map(c => new StringIndexer()
  .setInputCol(c)
  .setOutputCol(s"${c}_indexed")
  .setHandleInvalid("skip"))

val encoders = indexers.map(indexer => new OneHotEncoder()
  .setInputCol(indexer.getOutputCol)
  .setOutputCol(s"${indexer.getOutputCol}_encoded")
  .setDropLast(true))

val assembler = new
VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler

new Pipeline().setStages(stages).fit(df).transform(df).show

Task execution time is comparable and executors are most of the time
idle so it looks like it is a problem with the optimizer. Is it a known
issue? Are there any changes I've missed, that could lead to this behavior?

-- 
Best,
Maciej


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Maciej Szymkiewicz
Thanks for the response Burak,

As any sane person, I try to steer away from objects which have both
"calendar" and "unsafe" in their fully qualified names, but if there is no
bigger picture I missed here I would go with 1 as well. And of course
fix the error message. I understand this has been introduced with
structured streaming in mind, but it is a useful feature in general,
not only at high precision scales. To be honest I would love to see some
generalized version which could be used (I mean without hacking) with an
arbitrary numeric sequence. It could address at least some scenarios in
which people try to use window functions without a PARTITION BY clause and
fail miserably.

Regarding ambiguity... Sticking with days doesn't really resolve the
problem, does it? If one were to nitpick, it doesn't look like this
implementation even touches all the subtleties of DST or leap seconds.



On 01/18/2017 05:52 PM, Burak Yavuz wrote:
> Hi Maciej,
>
> I believe it would be useful to either fix the documentation or fix
> the implementation. I'll leave it to the community to comment on. The
> code right now disallows intervals provided in months and years,
> because they are not a "consistently" fixed amount of time. A month
> can be 28, 29, 30, or 31 days. A year is 12 months for sure, but is it
> 360 days (sometimes used in finance), 365 days or 366 days? 
>
> Therefore we could either:
>   1) Allow windowing when intervals are given in days and less, even
> though it could be 365 days, and fix the documentation.
>   2) Explicitly disallow it as there may be a lot of data for a given
> window, but partial aggregations should help with that.
>
> My thoughts are to go with 1. What do you think?
>
> Best,
> Burak
>
> On Wed, Jan 18, 2017 at 10:18 AM, Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> Hi,
>
> Can I ask for some clarifications regarding intended behavior of
> window / TimeWindow?
>
> PySpark documentation states that "Windows in the order of months
> are not supported". This is further confirmed by the checks in
> TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).
>
> Surprisingly enough we can pass interval much larger than a month
> by expressing interval in days or another unit of a higher
> precision. So this fails:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))
>
> while following is accepted:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))
>
> with results which look sensible at first glance.
>
> Is it a matter of a faulty validation logic (months will be
> assigned only if there is a match against years or months
> https://git.io/vMPdi) or expected behavior and I simply
> misunderstood the intentions?
>
> -- 
> Best,
> Maciej
>
>



[SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Maciej Szymkiewicz
Hi,

Can I ask for some clarifications regarding intended behavior of window
/ TimeWindow?

PySpark documentation states that "Windows in the order of months are
not supported". This is further confirmed by the checks in
TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).

Surprisingly enough, we can pass an interval much larger than a month by
expressing the interval in days or another unit of higher precision. So
this fails:

Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))

while following is accepted:

Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))

with results which look sensible at first glance.

Is it a matter of faulty validation logic (months will be assigned
only if there is a match against years or months, https://git.io/vMPdi)
or expected behavior that I simply misunderstood?

-- 
Best,
Maciej



Re: [PYSPARK] Python tests organization

2017-01-12 Thread Maciej Szymkiewicz
Thanks Holden. If you have some spare time would you take a look at
https://github.com/apache/spark/pull/16534?

It is somewhat related to
https://issues.apache.org/jira/browse/SPARK-18777 (Return UDF objects
when registering from Python).


On 01/12/2017 07:34 PM, Holden Karau wrote:
> I'd be happy to help with reviewing Python test improvements. Maybe
> make an umbrella JIRA and do one sub components at a time?
>
> On Thu, Jan 12, 2017 at 12:20 PM Saikat Kanjilal <sxk1...@hotmail.com
> <mailto:sxk1...@hotmail.com>> wrote:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Following up, any thoughts on next steps for this?
>
>
>
>
>
>
>
>
>
>
>
>     
>
>
> *From:* Maciej Szymkiewicz <mszymkiew...@gmail.com
> <mailto:mszymkiew...@gmail.com>>
>
>
>
> *Sent:* Wednesday, January 11, 2017 10:14 AM
>
>
> *To:* Saikat Kanjilal
>
>
> *Subject:* Re: [PYSPARK] Python tests organization
>
>
> Not yet, I want to see if there is any consensus about it. It is a
> lot of tedious work and it would be a shame if someone started
> working on this just to get it dropped.
>
>
>
>
>
>
>
> On 01/11/2017 06:44 PM, Saikat Kanjilal wrote:
>
>
>
>
>>
>>
>>
>>
>> Hello Maciej,
>>
>>
>>
>> If there's a jira available for this I'd like to help get this
>> moving, let me know next steps.
>>
>>
>>
>> Thanks in advance.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 
>>
>>
>> *From:* Maciej Szymkiewicz
>>
>> <mszymkiew...@gmail.com> <mailto:mszymkiew...@gmail.com>
>>
>>
>> *Sent:* Wednesday, January 11, 2017 4:18 AM
>>
>>
>> *To:*
>>
>> dev@spark.apache.org <mailto:dev@spark.apache.org>
>>
>>
>> *Subject:* [PYSPARK] Python tests organization
>>
>>  
>>
>>
>>
>>
>>
>>
>>
>>
>> Hi,
>>
>>
>>
>>
>>
>> I can't help but wonder if there is any practical reason for keeping
>>
>>
>> monolithic test modules. These things are already pretty large
>> (1500 -
>>
>>
>> 2200 LOCs) and can only grow. Development aside, I assume that many
>>
>>
>> users use tests the same way as me, to check the intended
>> behavior, and
>>
>>
>> largish loosely coupled modules make it harder than it should be.
>>
>>
>>
>>
>>
>> If there's no rationale for that it could be a good time start
>> thinking
>>
>>
>> about moving tests to packages and separating into modules reflecting
>>
>>
>> project structure.
>>
>>
>>
>>
>>
>> -- 
>>
>>
>> Best,
>>
>>
>> Maciej
>>
>>
>>
>>
>>
>>
>>
>>
>> -
>>
>>
>> To unsubscribe e-mail:
>>
>> dev-unsubscr...@spark.apache.org
>> <mailto:dev-unsubscr...@spark.apache.org>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
>
>
> -- 
> Maciej Szymkiewicz
>
>
>

-- 
Maciej Szymkiewicz



Re: [PYSPARK] Python tests organization

2017-01-12 Thread Maciej Szymkiewicz
Sounds good, but it looks like JIRA is still down.


Personally I can look at sql.tests and see what can be done there.
Depending on the resolution of
https://issues.apache.org/jira/browse/SPARK-19160 I may have to adjust
some tests anyway.


On 01/12/2017 07:36 PM, Saikat Kanjilal wrote:
>
> Maciej? LGTM, what do you think?  I can create a JIRA and drive this.
>
>
>
> 
> *From:* Holden Karau <hol...@pigscanfly.ca>
> *Sent:* Thursday, January 12, 2017 10:34 AM
> *To:* Saikat Kanjilal; dev@spark.apache.org
> *Subject:* Re: [PYSPARK] Python tests organization
>  
> I'd be happy to help with reviewing Python test improvements. Maybe
> make an umbrella JIRA and do one sub components at a time?
>
> On Thu, Jan 12, 2017 at 12:20 PM Saikat Kanjilal <sxk1...@hotmail.com
> <mailto:sxk1...@hotmail.com>> wrote:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Following up, any thoughts on next steps for this?
>
>
>
>
>
>
>
>
>
>
>
> 
>
>
> *From:* Maciej Szymkiewicz <mszymkiew...@gmail.com
> <mailto:mszymkiew...@gmail.com>>
>
>
>
> *Sent:* Wednesday, January 11, 2017 10:14 AM
>
>
> *To:* Saikat Kanjilal
>
>
> *Subject:* Re: [PYSPARK] Python tests organization
>
>
> Not yet, I want to see if there is any consensus about it. It is a
> lot of tedious work and it would be a shame if someone started
> working on this just to get it dropped.
>
>
>
>
>
>
>
> On 01/11/2017 06:44 PM, Saikat Kanjilal wrote:
>
>
>
>
>>
>>
>>
>>
>> Hello Maciej,
>>
>>
>>
>> If there's a jira available for this I'd like to help get this
>> moving, let me know next steps.
>>
>>
>>
>> Thanks in advance.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 
>>
>>
>> *From:* Maciej Szymkiewicz
>>
>> <mszymkiew...@gmail.com> <mailto:mszymkiew...@gmail.com>
>>
>>
>> *Sent:* Wednesday, January 11, 2017 4:18 AM
>>
>>
>> *To:*
>>
>> dev@spark.apache.org <mailto:dev@spark.apache.org>
>>
>>
>> *Subject:* [PYSPARK] Python tests organization
>>
>>  
>>
>>
>>
>>
>>
>>
>>
>>
>> Hi,
>>
>>
>>
>>
>>
>> I can't help but wonder if there is any practical reason for keeping
>>
>>
>> monolithic test modules. These things are already pretty large
>> (1500 -
>>
>>
>> 2200 LOCs) and can only grow. Development aside, I assume that many
>>
>>
>> users use tests the same way as me, to check the intended
>> behavior, and
>>
>>
>> largish loosely coupled modules make it harder than it should be.
>>
>>
>>
>>
>>
>> If there's no rationale for that it could be a good time start
>> thinking
>>
>>
>> about moving tests to packages and separating into modules reflecting
>>
>>
>> project structure.
>>
>>
>>
>>
>>
>> -- 
>>
>>
>> Best,
>>
>>
>> Maciej
>>
>>
>>
>>
>>
>>
>>
>>
>> -
>>
>>
>> To unsubscribe e-mail:
>>
>> dev-unsubscr...@spark.apache.org
>> <mailto:dev-unsubscr...@spark.apache.org>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
>
>
> -- 
> Maciej Szymkiewicz
>
>
>

-- 
Maciej Szymkiewicz



[PYSPARK] Python tests organization

2017-01-11 Thread Maciej Szymkiewicz
Hi,

I can't help but wonder if there is any practical reason for keeping
monolithic test modules. These things are already pretty large (1500 -
2200 LOCs) and can only grow. Development aside, I assume that many
users use tests the same way as me, to check the intended behavior, and
largish loosely coupled modules make it harder than it should be.

If there's no rationale for that it could be a good time to start thinking
about moving tests to packages and separating into modules reflecting
project structure.
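
Just to make the idea concrete, a possible (purely illustrative) layout
could mirror the package structure, e.g.

pyspark/tests/test_rdd.py
pyspark/tests/test_broadcast.py
pyspark/sql/tests/test_functions.py
pyspark/sql/tests/test_udf.py
pyspark/ml/tests/test_param.py

instead of the current monolithic pyspark/tests.py, pyspark/sql/tests.py
and so on.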

-- 
Best,
Maciej


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [SQL][PYTHON] UDF improvements.

2017-01-10 Thread Maciej Szymkiewicz
Thanks for your response Ryan. Here you are
https://issues.apache.org/jira/browse/SPARK-19159


On 01/09/2017 07:30 PM, Ryan Blue wrote:
> Maciej, this looks great.
>
> Could you open a JIRA issue for improving the @udf decorator and
> possibly sub-tasks for the specific features from the gist? Thanks!
>
> rb
>
> On Sat, Jan 7, 2017 at 12:39 PM, Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> Hi,
>
> I've been looking at the PySpark UserDefinedFunction and I have a
> couple of suggestions how it could be improved including:
>
>   * Full featured decorator syntax.
>   * Docstring handling improvements.
>   * Lazy initialization.
>
> I summarized all suggestions with links to possible solutions in
> gist
> (https://gist.github.com/zero323/88953975361dbb6afd639b35368a97b4
> <https://gist.github.com/zero323/88953975361dbb6afd639b35368a97b4>)
> and I'll be happy to open a JIRA and submit a PR if there is any
> interest in that.
>
> -- 
> Best,
> Maciej
>
>
>
>
> -- 
> Ryan Blue
> Software Engineer
> Netflix

-- 
Best,
Maciej



[SQL][PYTHON] UDF improvements.

2017-01-07 Thread Maciej Szymkiewicz
Hi,

I've been looking at the PySpark UserDefinedFunction and I have a couple
of suggestions on how it could be improved, including:

  * Full featured decorator syntax.
  * Docstring handling improvements.
  * Lazy initialization.

I summarized all suggestions, with links to possible solutions, in a gist
(https://gist.github.com/zero323/88953975361dbb6afd639b35368a97b4) and
I'll be happy to open a JIRA and submit a PR if there is any interest in
that.
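
For illustration, a rough sketch of the decorator syntax I have in mind
(this is the proposed usage, not the API as it exists today, and the
function itself is just a toy):

from pyspark.sql import functions as F, types as T

@F.udf(returnType=T.IntegerType())
def add_one(x):
    """Adds one to the input value; the docstring should be preserved."""
    return x + 1 if x is not None else None

# intended usage on some DataFrame df:
# df.select(add_one("some_column"))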

-- 
Best,
Maciej



Re: shapeless in spark 2.1.0

2016-12-29 Thread Maciej Szymkiewicz
Breeze 0.13 (RC-1 right now) bumps shapeless to 2.2.0 and 2.2.5 for
Scala 2.10 and 2.11 respectively:

https://github.com/scalanlp/breeze/pull/509

On 12/29/2016 07:13 PM, Ryan Williams wrote:
>
> Other option would presumably be for someone to make a release of
> breeze with old-shapeless shaded... unless shapeless classes are
> exposed in breeze's public API, in which case you'd have to copy the
> relevant shapeless classes into breeze and then publish that?
>
>
> On Thu, Dec 29, 2016, 1:05 PM Sean Owen <so...@cloudera.com
> <mailto:so...@cloudera.com>> wrote:
>
> It is breeze, but, what's the option? It can't be excluded. I
> think this falls in the category of things an app would need to
> shade in this situation. 
>
> On Thu, Dec 29, 2016, 16:49 Koert Kuipers <ko...@tresata.com
> <mailto:ko...@tresata.com>> wrote:
>
> i just noticed that spark 2.1.0 bring in a new transitive
> dependency on shapeless 2.0.0
>
> shapeless is a popular library for scala users, and shapeless
> 2.0.0 is old (2014) and not compatible with more current versions.
>
> so this means a spark user that uses shapeless in his own
> development cannot upgrade safely from 2.0.0 to 2.1.0, i think.
>
> wish i had noticed this sooner
>

-- 
Maciej Szymkiewicz



Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread Maciej Szymkiewicz
Iterative union like this creates a deeply nested recursive structure, in
a manner similar to the one described here: http://stackoverflow.com/q/34461804

You can try something like this: http://stackoverflow.com/a/37612978 but
there is of course an overhead of conversion between Dataset and RDD.
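
For reference, a PySpark sketch of the same idea (mine, not taken verbatim
from the linked answer): a single SparkContext.union keeps the plan flat, at
the cost of a Dataset -> RDD -> DataFrame round trip. dfs stands in for the
real list of DataFrames:

from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
dfs = [spark.range(1000) for _ in range(200)]

# naive: one nested union node per DataFrame, the plan grows with every step
slow = reduce(DataFrame.union, dfs)

# workaround: one RDD-level union, flat lineage
fast = spark.createDataFrame(
    spark.sparkContext.union([df.rdd for df in dfs]), dfs[0].schema)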


On 12/29/2016 06:21 PM, assaf.mendelson wrote:
>
> Hi,
>
>  
>
> I have been playing around with doing union between a large number of
> dataframes and saw that the performance of the actual union (not the
> action) is worse than O(N^2). Since a union basically defines a
> lineage (i.e. current + union with the other as a child) this should be
> almost instantaneous, however in practice this can be very costly.
>
>  
>
> I was wondering why this is and if there is a way to fix this.
>
>  
>
> A sample test:
>
> def testUnion(n: Int): Long = {
>   val dataframes = for {
>     x <- 0 until n
>   } yield spark.range(1000)
>
>   val t0 = System.currentTimeMillis()
>   val allDF = dataframes.reduceLeft(_.union(_))
>   val t1 = System.currentTimeMillis()
>   val totalTime = t1 - t0
>   println(s"$totalTime miliseconds")
>   totalTime
> }
>
>  
>
> scala> testUnion(100)
>
> 193 miliseconds
>
> res5: Long = 193
>
>  
>
> scala> testUnion(200)
>
> 759 miliseconds
>
> res1: Long = 759
>
>  
>
> scala> testUnion(500)
>
> 4438 miliseconds
>
> res2: Long = 4438
>
>  
>
> scala> testUnion(1000)
>
> 18441 miliseconds
>
> res6: Long = 18441
>
>  
>
> scala> testUnion(2000)
>
> 88498 miliseconds
>
> res7: Long = 88498
>
>  
>
> scala> testUnion(5000)
>
> 822305 miliseconds
>
> res8: Long = 822305
>
>  
>
>  
>
>
> 
> View this message in context: repeated unioning of dataframes take
> worse than O(N^2) time
> <http://apache-spark-developers-list.1001551.n3.nabble.com/repeated-unioning-of-dataframes-take-worse-than-O-N-2-time-tp20394.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.

-- 
Maciej Szymkiewicz



Re: [MLLIB] RankingMetrics.precisionAt

2016-12-06 Thread Maciej Szymkiewicz
This sounds much better.

Follow up question is if we should provide MAP@k, which I believe is
wider used metric.


On 12/06/2016 09:52 PM, Sean Owen wrote:
> As I understand, this might best be called "mean precision@k", not
> "mean average precision, up to k".
>
> On Tue, Dec 6, 2016 at 9:43 PM Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> Thank you Sean.
>
> Maybe I am just confused about the language. When I read that it
> returns "the average precision at the first k ranking positions" I
> somehow expect there will ap@k there and a the final output would
> be MAP@k not average precision at the k-th position.
>
> I guess it is not enough sleep.
>
> On 12/06/2016 02:45 AM, Sean Owen wrote:
>> I read it again and that looks like it implements mean
>> precision@k as I would expect. What is the issue?
>>
>> On Tue, Dec 6, 2016, 07:30 Maciej Szymkiewicz
>> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>>
>> Hi,
>>
> Could I ask for a fresh pair of eyes on this piece of code:
>>
>> 
>> https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L59-L80
>>
>>   @Since("1.2.0")
>>   def precisionAt(k: Int): Double = {
>> require(k > 0, "ranking position k should be positive")
>> predictionAndLabels.map { case (pred, lab) =>
>>   val labSet = lab.toSet
>>
>>   if (labSet.nonEmpty) {
>> val n = math.min(pred.length, k)
>> var i = 0
>> var cnt = 0
>> while (i < n) {
>>   if (labSet.contains(pred(i))) {
>> cnt += 1
>>   }
>>   i += 1
>> }
>> cnt.toDouble / k
>>   } else {
>> logWarning("Empty ground truth set, check input data")
>> 0.0
>>   }
>> }.mean()
>>   }
>>
>>
>> Am I the only one who thinks this doesn't do what it claims?
>>     Just for reference:
>>
>>   * 
>> https://web.archive.org/web/20120415101144/http://sas.uwaterloo.ca/stats_navigation/techreports/04WorkingPapers/2004-09.pdf
>>   * 
>> https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
>>
>> -- 
>> Best,
>> Maciej
>>
>
> -- 
> Maciej Szymkiewicz
>

-- 
Maciej Szymkiewicz



Re: [MLLIB] RankingMetrics.precisionAt

2016-12-06 Thread Maciej Szymkiewicz
Thank you Sean.

Maybe I am just confused about the language. When I read that it returns
"the average precision at the first k ranking positions" I somehow
expect that there will be ap@k there and that the final output would be
MAP@k, not average precision at the k-th position.

I guess it is just not enough sleep.

On 12/06/2016 02:45 AM, Sean Owen wrote:
> I read it again and that looks like it implements mean precision@k as
> I would expect. What is the issue?
>
> On Tue, Dec 6, 2016, 07:30 Maciej Szymkiewicz <mszymkiew...@gmail.com
> <mailto:mszymkiew...@gmail.com>> wrote:
>
> Hi,
>
> Could I ask for a fresh pair of eyes on this piece of code:
>
> 
> https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L59-L80
>
>   @Since("1.2.0")
>   def precisionAt(k: Int): Double = {
> require(k > 0, "ranking position k should be positive")
> predictionAndLabels.map { case (pred, lab) =>
>   val labSet = lab.toSet
>
>   if (labSet.nonEmpty) {
> val n = math.min(pred.length, k)
> var i = 0
> var cnt = 0
> while (i < n) {
>   if (labSet.contains(pred(i))) {
> cnt += 1
>   }
>   i += 1
> }
> cnt.toDouble / k
>   } else {
> logWarning("Empty ground truth set, check input data")
> 0.0
>   }
> }.mean()
>   }
>
>
> Am I the only one who thinks this doesn't do what it claims? Just
> for reference:
>
>   * 
> https://web.archive.org/web/20120415101144/http://sas.uwaterloo.ca/stats_navigation/techreports/04WorkingPapers/2004-09.pdf
>   * 
> https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
>
> -- 
> Best,
> Maciej
>

-- 
Maciej Szymkiewicz



[MLLIB] RankingMetrics.precisionAt

2016-12-05 Thread Maciej Szymkiewicz
Hi,

Could I ask for a fresh pair of eyes on this piece of code:

https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L59-L80

  @Since("1.2.0")
  def precisionAt(k: Int): Double = {
    require(k > 0, "ranking position k should be positive")
    predictionAndLabels.map { case (pred, lab) =>
      val labSet = lab.toSet

      if (labSet.nonEmpty) {
        val n = math.min(pred.length, k)
        var i = 0
        var cnt = 0
        while (i < n) {
          if (labSet.contains(pred(i))) {
            cnt += 1
          }
          i += 1
        }
        cnt.toDouble / k
      } else {
        logWarning("Empty ground truth set, check input data")
        0.0
      }
    }.mean()
  }


Am I the only one who thinks this doesn't do what it claims? Just for
reference:

  * 
https://web.archive.org/web/20120415101144/http://sas.uwaterloo.ca/stats_navigation/techreports/04WorkingPapers/2004-09.pdf
  * 
https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py

-- 
Best,
Maciej



Re: Future of the Python 2 support.

2016-12-05 Thread Maciej Szymkiewicz
Fair enough. I have to admit I am a bit disappointed, but that's life :)


On 12/04/2016 07:28 PM, Reynold Xin wrote:
> Echoing Nick. I don't see any strong reason to drop Python 2 support.
> We typically drop support for X when it is rarely used and support for
> X is long past EOL. Python 2 is still very popular, and depending on
> the statistics it might be more popular than Python 3.
>
> On Sun, Dec 4, 2016 at 9:29 AM Nicholas Chammas
> <nicholas.cham...@gmail.com <mailto:nicholas.cham...@gmail.com>> wrote:
>
> I don't think it makes sense to deprecate or drop support for
> Python 2.7 until at least 2020, when 2.7 itself will be EOLed. (As
> of Spark 2.0, Python 2.6 support is deprecated and will be removed
> by Spark 2.2. Python 2.7 is only version of Python 2 that's still
> fully supported.)
>
> Given the widespread industry use of Python 2.7, and the fact that
> it is supported upstream by the Python core developers until 2020,
> I don't see why Spark should even consider dropping support for it
> before then. There is, of course, additional ongoing work to
> support Python 2.7, but it seems more than justified by its level
> of use and popularity in the broader community. And I say that as
> someone who almost exclusively develops in Python 3.5+ these days.
>
> Perhaps by 2018 the industry usage of Python 2 will drop
> precipitously and merit a discussion about dropping support, but I
> think at this point it's premature to discuss that and we should
> just wait and see.
>
> Nick
>
>
> On Sun, Dec 4, 2016 at 10:59 AM Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> Hi,
>
> I am aware there was a previous discussion about dropping
> support for different platforms
> 
> (http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html)
> but somehow it has been dominated by Scala and JVM and never
> touched the subject of Python 2.
>
> Some facts:
>
>   * Python 2 End Of Life is scheduled for 2020
> (http://legacy.python.org/dev/peps/pep-0373/) without with
> "no guarantee that bugfix releases will be made on a
> regular basis" until then.
>   * Almost all commonly used libraries already support Python
> 3 (https://python3wos.appspot.com/). A single exception
> that can be important for Spark is thrift (Python 3
> support is already present on the master) and transitively
> PyHive and Blaze.
>   * Supporting both Python 2 and Python 3 introduces
> significant technical debt. In practice Python 3 is a
> different language with backward incompatible syntax and
> growing number of features which won't be backported to 2.x.
>
> Suggestions:
>
>   * We need a public discussion about possible date for
> dropping Python 2 support.
>   * Early 2018 should give enough time for a graceful transition.
>
> -- 
> Best,
> Maciej
>

-- 
Maciej Szymkiewicz



Future of the Python 2 support.

2016-12-04 Thread Maciej Szymkiewicz
Hi,

I am aware there was a previous discussion about dropping support for
different platforms
(http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html)
but somehow it has been dominated by Scala and the JVM and never touched the
subject of Python 2.

Some facts:

  * Python 2 End Of Life is scheduled for 2020
(http://legacy.python.org/dev/peps/pep-0373/) with "no
guarantee that bugfix releases will be made on a regular basis"
until then.
  * Almost all commonly used libraries already support Python 3
(https://python3wos.appspot.com/). A single exception that can be
important for Spark is thrift (Python 3 support is already present
on the master) and transitively PyHive and Blaze.
  * Supporting both Python 2 and Python 3 introduces significant
technical debt. In practice Python 3 is a different language with
backward incompatible syntax and growing number of features which
won't be backported to 2.x.

Suggestions:

  * We need a public discussion about a possible date for dropping Python
2 support.
  * Early 2018 should give enough time for a graceful transition.

-- 
Best,
Maciej



Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-02 Thread Maciej Szymkiewicz
Sure, here you are: https://issues.apache.org/jira/browse/SPARK-18690

To be fair I am not fully convinced it is worth it.


On 12/02/2016 12:51 AM, Reynold Xin wrote:
> Can you submit a pull request with test cases based on that change?
>
>
> On Dec 1, 2016, 9:39 AM -0800, Maciej Szymkiewicz
> <mszymkiew...@gmail.com>, wrote:
>>
>> This doesn't affect that. The only concern is what we consider to
>> UNBOUNDED on Python side.
>>
>>
>> On 12/01/2016 07:56 AM, assaf.mendelson wrote:
>>>
>>> I may be mistaken but if I remember correctly spark behaves
>>> differently when it is bounded in the past and when it is not.
>>> Specifically I seem to recall a fix which made sure that when there
>>> is no lower bound then the aggregation is done one by one instead of
>>> doing the whole range for each window. So I believe it should be
>>> configured exactly the same as in scala/java so the optimization
>>> would take place.
>>>
>>> Assaf.
>>>
>>>  
>>>
>>> *From:* rxin [via Apache Spark Developers List]
>>> [mailto:ml-node+[hidden email]
>>> ]
>>> *Sent:* Wednesday, November 30, 2016 8:35 PM
>>> *To:* Mendelson, Assaf
>>> *Subject:* Re: [SPARK-17845] [SQL][PYTHON] More self-evident window
>>> function frame boundary API
>>>
>>>  
>>>
>>> Yes I'd define unboundedPreceding to -sys.maxsize, but also any
>>> value less than min(-sys.maxsize, _JAVA_MIN_LONG) are considered
>>> unboundedPreceding too. We need to be careful with long overflow
>>> when transferring data over to Java.
>>>
>>>  
>>>
>>>  
>>>
>>> On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <[hidden email]
>>> > wrote:
>>>
>>> It is platform specific so theoretically can be larger, but 2**63 -
>>> 1 is a standard on 64 bit platform and 2**31 - 1 on 32bit platform.
>>> I can submit a patch but I am not sure how to proceed. Personally I
>>> would set
>>>
>>> unboundedPreceding = -sys.maxsize
>>> unboundedFollowing = sys.maxsize
>>>
>>> to keep backwards compatibility.
>>>
>>> On 11/30/2016 06:52 PM, Reynold Xin wrote:
>>>
>>> Ah ok for some reason when I did the pull request sys.maxsize
>>> was much larger than 2^63. Do you want to submit a patch to fix
>>> this?
>>>
>>>  
>>>
>>>  
>>>
>>> On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <[hidden
>>> email] > wrote:
>>>
>>> The problem is that -(1 << 63) is -(sys.maxsize + 1) so the code
>>> which used to work before is off by one.
>>>
>>> On 11/30/2016 06:43 PM, Reynold Xin wrote:
>>>
>>> Can you give a repro? Anything less than -(1 << 63) is
>>> considered negative infinity (i.e. unbounded preceding).
>>>
>>>  
>>>
>>> On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <[hidden
>>> email] > wrote:
>>>
>>> Hi,
>>>
>>> I've been looking at the SPARK-17845 and I am curious if
>>> there is any
>>> reason to make it a breaking change. In Spark 2.0 and below
>>> we could use:
>>>
>>>
>>> Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
>>> sys.maxsize))
>>>
>>> In 2.1.0 this code will silently produce incorrect results
>>> (ROWS BETWEEN
>>> -1 PRECEDING AND UNBOUNDED FOLLOWING) Couldn't we use
>>> Window.unboundedPreceding equal -sys.maxsize to ensure backward
>>> compatibility?
>>>
>>> --
>>>
>>> Maciej Szymkiewicz
>>>
>>>
>>> 
>>> -
>>> To unsubscribe e-mail: [hidden email]
>>> 
>>>
>>>  
>>>
>>>  
>>>
>>> -- 
>>>
>>> Maciej Szymkiewicz
>>>
>>>  
>>>
>>>  
>>>
>>> -- 
>>> Maciej Szymkiewicz
>>>
>>>  
>>>
>>>  
>>>

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-01 Thread Maciej Szymkiewicz
This doesn't affect that. The only concern is what we consider to
UNBOUNDED on Python side.


On 12/01/2016 07:56 AM, assaf.mendelson wrote:
>
> I may be mistaken but if I remember correctly spark behaves
> differently when it is bounded in the past and when it is not.
> Specifically I seem to recall a fix which made sure that when there is
> no lower bound then the aggregation is done one by one instead of
> doing the whole range for each window. So I believe it should be
> configured exactly the same as in scala/java so the optimization would
> take place.
>
> Assaf.
>
>  
>
> *From:*rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden
> email] ]
> *Sent:* Wednesday, November 30, 2016 8:35 PM
> *To:* Mendelson, Assaf
> *Subject:* Re: [SPARK-17845] [SQL][PYTHON] More self-evident window
> function frame boundary API
>
>  
>
> Yes I'd define unboundedPreceding to -sys.maxsize, but also any value
> less than min(-sys.maxsize, _JAVA_MIN_LONG) are considered
> unboundedPreceding too. We need to be careful with long overflow when
> transferring data over to Java.
>
>  
>
>  
>
> On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz <[hidden email]
> > wrote:
>
> It is platform specific so theoretically can be larger, but 2**63 - 1
> is a standard on 64 bit platform and 2**31 - 1 on 32bit platform. I
> can submit a patch but I am not sure how to proceed. Personally I
> would set
>
> unboundedPreceding = -sys.maxsize
> unboundedFollowing = sys.maxsize
>
> to keep backwards compatibility.
>
> On 11/30/2016 06:52 PM, Reynold Xin wrote:
>
> Ah ok for some reason when I did the pull request sys.maxsize was
> much larger than 2^63. Do you want to submit a patch to fix this?
>
>  
>
>  
>
> On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz <[hidden
> email] > wrote:
>
> The problem is that -(1 << 63) is -(sys.maxsize + 1) so the code
> which used to work before is off by one.
>
> On 11/30/2016 06:43 PM, Reynold Xin wrote:
>
> Can you give a repro? Anything less than -(1 << 63) is
> considered negative infinity (i.e. unbounded preceding).
>
>  
>
> On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz <[hidden
> email] > wrote:
>
> Hi,
>
> I've been looking at the SPARK-17845 and I am curious if there
> is any
> reason to make it a breaking change. In Spark 2.0 and below we
> could use:
>
>
> Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
> sys.maxsize))
>
>     In 2.1.0 this code will silently produce incorrect results
> (ROWS BETWEEN
> -1 PRECEDING AND UNBOUNDED FOLLOWING) Couldn't we use
> Window.unboundedPreceding equal -sys.maxsize to ensure backward
>     compatibility?
>
> --
>
> Maciej Szymkiewicz
>
>
> -
> To unsubscribe e-mail: [hidden email]
> 
>
>  
>
>  
>
> -- 
>
> Maciej Szymkiewicz
>
>  
>
>  
>
> -- 
> Maciej Szymkiewicz
>
>  
>
>  
>

-- 
Maciej Szymkiewicz



Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-01 Thread Maciej Szymkiewicz
It could be something like this
https://github.com/zero323/spark/commit/b1f4d8218629b56b0982ee58f5b93a40305985e0
but I am not fully satisfied.

On 11/30/2016 07:34 PM, Reynold Xin wrote:
> Yes I'd define unboundedPreceding to -sys.maxsize, but also any value
> less than min(-sys.maxsize, _JAVA_MIN_LONG) are considered
> unboundedPreceding too. We need to be careful with long overflow when
> transferring data over to Java.
>
>
> On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> It is platform specific so theoretically can be larger, but 2**63
> - 1 is a standard on 64 bit platform and 2**31 - 1 on 32bit
> platform. I can submit a patch but I am not sure how to proceed.
> Personally I would set
>
> unboundedPreceding = -sys.maxsize
>
> unboundedFollowing = sys.maxsize
>
> to keep backwards compatibility.
>
> On 11/30/2016 06:52 PM, Reynold Xin wrote:
>> Ah ok for some reason when I did the pull request sys.maxsize was
>> much larger than 2^63. Do you want to submit a patch to fix this?
>>
>>
>> On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz
>> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>>
>> The problem is that -(1 << 63) is -(sys.maxsize + 1) so the
>> code which used to work before is off by one.
>>
>> On 11/30/2016 06:43 PM, Reynold Xin wrote:
>>> Can you give a repro? Anything less than -(1 << 63) is
>>> considered negative infinity (i.e. unbounded preceding).
>>>
>>> On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz
>>> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>>>
>>> Hi,
>>>
>>> I've been looking at the SPARK-17845 and I am curious if
>>> there is any
>>> reason to make it a breaking change. In Spark 2.0 and
>>> below we could use:
>>>
>>>
>>> 
>>> Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
>>> sys.maxsize))
>>>
>>> In 2.1.0 this code will silently produce incorrect
>>> results (ROWS BETWEEN
>>> -1 PRECEDING AND UNBOUNDED FOLLOWING) Couldn't we use
>>> Window.unboundedPreceding equal -sys.maxsize to ensure
>>> backward
>>> compatibility?
>>>
>>> --
>>>
>>> Maciej Szymkiewicz
>>>
>>>
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> <mailto:dev-unsubscr...@spark.apache.org>
>>>
>>>
>>
>> -- 
>> Maciej Szymkiewicz
>>
>>
>
> -- 
> Maciej Szymkiewicz
>
>

-- 
Maciej Szymkiewicz



Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Maciej Szymkiewicz
It is platform specific, so theoretically it can be larger, but 2**63 - 1
is the standard on a 64-bit platform and 2**31 - 1 on a 32-bit platform. I
can submit a patch but I am not sure how to proceed. Personally I would set

unboundedPreceding = -sys.maxsize

unboundedFollowing = sys.maxsize

to keep backwards compatibility.
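
For the record, a rough sketch of what such a backward-compatible conversion
could look like on the Python side (a hypothetical helper, not the actual
patch; the _JAVA_* names are assumptions for the JVM Long bounds):

import sys

_JAVA_MIN_LONG = -(1 << 63)     # Long.MIN_VALUE on the JVM
_JAVA_MAX_LONG = (1 << 63) - 1  # Long.MAX_VALUE on the JVM

def _to_java_frame_boundary(value):
    # Clamp anything at or beyond +/- sys.maxsize to the JVM sentinels, so
    # both the pre-2.1 idiom (+/- sys.maxsize) and the new constants map to
    # unbounded frames without overflowing a Java long.
    if value <= -sys.maxsize:
        return _JAVA_MIN_LONG
    if value >= sys.maxsize:
        return _JAVA_MAX_LONG
    return int(value)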

On 11/30/2016 06:52 PM, Reynold Xin wrote:
> Ah ok for some reason when I did the pull request sys.maxsize was much
> larger than 2^63. Do you want to submit a patch to fix this?
>
>
> On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> The problem is that -(1 << 63) is -(sys.maxsize + 1) so the code
> which used to work before is off by one.
>
> On 11/30/2016 06:43 PM, Reynold Xin wrote:
>> Can you give a repro? Anything less than -(1 << 63) is considered
>> negative infinity (i.e. unbounded preceding).
>>
>> On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz
>> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>>
>> Hi,
>>
>> I've been looking at the SPARK-17845 and I am curious if
>> there is any
>> reason to make it a breaking change. In Spark 2.0 and below
>> we could use:
>>
>>
>> Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
>> sys.maxsize))
>>
>> In 2.1.0 this code will silently produce incorrect results
>> (ROWS BETWEEN
>> -1 PRECEDING AND UNBOUNDED FOLLOWING) Couldn't we use
>> Window.unboundedPreceding equal -sys.maxsize to ensure backward
>> compatibility?
>>
>> --
>>
>> Maciej Szymkiewicz
>>
>>
>>     -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> <mailto:dev-unsubscr...@spark.apache.org>
>>
>>
>
> -- 
> Maciej Szymkiewicz
>
>

-- 
Maciej Szymkiewicz



Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Maciej Szymkiewicz
The problem is that -(1 << 63) is -(sys.maxsize + 1) so the code which
used to work before is off by one.
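
To make the off-by-one concrete, here is the arithmetic on a 64-bit CPython
build (standard library only; nothing Spark-specific assumed):

import sys

# On 64-bit CPython sys.maxsize == 2**63 - 1, so the old idiom
# rowsBetween(-sys.maxsize, ...) stops one short of -(1 << 63),
# the value treated as unbounded preceding after SPARK-17845.
assert sys.maxsize == (1 << 63) - 1
assert -(1 << 63) == -(sys.maxsize + 1)
assert -sys.maxsize > -(1 << 63)  # so it is no longer recognised as unbounded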

On 11/30/2016 06:43 PM, Reynold Xin wrote:
> Can you give a repro? Anything less than -(1 << 63) is considered
> negative infinity (i.e. unbounded preceding).
>
> On Wed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> Hi,
>
> I've been looking at the SPARK-17845 and I am curious if there is any
> reason to make it a breaking change. In Spark 2.0 and below we
> could use:
>
>
> Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
> sys.maxsize))
>
> In 2.1.0 this code will silently produce incorrect results (ROWS
> BETWEEN
> -1 PRECEDING AND UNBOUNDED FOLLOWING) Couldn't we use
> Window.unboundedPreceding equal -sys.maxsize to ensure backward
> compatibility?
>
> --
>
> Maciej Szymkiewicz
>
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> <mailto:dev-unsubscr...@spark.apache.org>
>
>

-- 
Maciej Szymkiewicz



[SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Maciej Szymkiewicz
Hi,

I've been looking at SPARK-17845 and I am curious if there is any
reason to make it a breaking change. In Spark 2.0 and below we could use:

Window().partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize,
sys.maxsize)

In 2.1.0 this code will silently produce incorrect results (ROWS BETWEEN
-1 PRECEDING AND UNBOUNDED FOLLOWING). Couldn't we make
Window.unboundedPreceding equal to -sys.maxsize to ensure backward
compatibility?
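
One quick way to see which frame actually gets used is to print the plan
(just a sketch; it assumes an active session and a DataFrame df with
columns foo and bar):

import sys
from pyspark.sql import Window, functions as F

w = Window.partitionBy("foo").orderBy("bar").rowsBetween(-sys.maxsize, sys.maxsize)
# The window frame (ROWS BETWEEN ... AND ...) should show up in the plan output.
df.select(F.count("*").over(w).alias("cnt")).explain()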

-- 

Maciej Szymkiewicz


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-30 Thread Maciej Szymkiewicz
Sorry :) BTW There is another related issue here
https://issues.apache.org/jira/browse/SPARK-17756


On 11/30/2016 05:12 PM, Nicholas Chammas wrote:
> > -1 (non binding) https://issues.apache.org/jira/browse/SPARK-16589
> No matter how useless in practice this shouldn't go to another major
> release.
>
> I agree that that issue is a major one since it relates to
> correctness, but since it's not a regression it technically does not
> merit a -1 vote on the release.
>
> Nick
>
> On Wed, Nov 30, 2016 at 11:00 AM Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> -1 (non binding) https://issues.apache.org/jira/browse/SPARK-16589
> No matter how useless in practice this shouldn't go to another
> major release.
>
>
>
> On 11/30/2016 10:34 AM, Sean Owen wrote:
>> FWIW I am seeing several test failures, each more than once, but,
>> none are necessarily repeatable. These are likely just flaky
>> tests but I thought I'd flag these unless anyone else sees
>> similar failures:
>>
>>
>> - SELECT a.i, b.i FROM oneToTen a JOIN oneToTen b ON a.i = b.i +
>> 1 *** FAILED ***
>>   org.apache.spark.SparkException: Job aborted due to stage
>> failure: Task 1 in stage 9.0 failed 1 times, most recent failure:
>> Lost task 1.0 in stage 9.0 (TID 19, localhost, executor driver):
>> java.lang.NullPointerException
>> at
>> 
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.(Unknown
>> Source)
>> at
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass.generate(Unknown
>> Source)
>>   ...
>>
>>
>> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed:
>> 0.302 sec  <<< ERROR!
>> java.lang.NoSuchMethodError:
>> 
>> org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
>> at
>> test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
>>
>>
>>
>> - SPARK-18360: default table path of tables in default database
>> should depend on the location of default database *** FAILED ***
>>   Timeout of './bin/spark-submit' '--class'
>> 'org.apache.spark.sql.hive.SPARK_18360' '--name' 'SPARK-18360'
>> '--master' 'local-cluster[2,1,1024]' '--conf'
>> 'spark.ui.enabled=false' '--conf'
>> 'spark.master.rest.enabled=false' '--driver-java-options'
>> '-Dderby.system.durability=test'
>> 
>> 'file:/home/srowen/spark-2.1.0/sql/hive/target/tmp/spark-dc9f43f2-ded4-4bcf-947e-d5af6f0e1561/testJar-1480440084611.jar'
>> See the log4j logs for more detail.
>> ...
>>
>>
>> - should clone and clean line object in ClosureCleaner *** FAILED ***
>>   isContain was true Interpreter output contained 'Exception':
>>   java.lang.IllegalStateException: Cannot call methods on a
>> stopped SparkContext.
>>   This stopped SparkContext was created at:
>>   
>>
>>
>> On Tue, Nov 29, 2016 at 5:31 PM Marcelo Vanzin
>> <van...@cloudera.com <mailto:van...@cloudera.com>> wrote:
>>
>> I'll send a -1 because of SPARK-18546. Haven't looked at
>> anything else yet.
>>
>> On Mon, Nov 28, 2016 at 5:25 PM, Reynold Xin
>> <r...@databricks.com <mailto:r...@databricks.com>> wrote:
>> > Please vote on releasing the following candidate as Apache
>> Spark version
>> > 2.1.0. The vote is open until Thursday, December 1, 2016 at
>> 18:00 UTC and
>> > passes if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.1.0
>> > [ ] -1 Do not release this package because ...
>> >
>> >
>> > To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.1.0-rc1
>> > (80aabc0bd33dc5661a90133156247e7a8c1bf7f5)
>> >
>> > The release files, including signatures, digests, etc. can
>> be found at:
>> >
>> 
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-bin/
>> 
>> <http://people.apache.org/%7Epwendell/spark-releases/spark-2.1.0-rc1-bin/>
>> >
>> > Release artifac

Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-30 Thread Maciej Szymkiewicz
-1 (non binding) https://issues.apache.org/jira/browse/SPARK-16589 No
matter how useless in practice this shouldn't go to another major release.


On 11/30/2016 10:34 AM, Sean Owen wrote:
> FWIW I am seeing several test failures, each more than once, but, none
> are necessarily repeatable. These are likely just flaky tests but I
> thought I'd flag these unless anyone else sees similar failures:
>
>
> - SELECT a.i, b.i FROM oneToTen a JOIN oneToTen b ON a.i = b.i + 1 ***
> FAILED ***
>   org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 1 in stage 9.0 failed 1 times, most recent failure: Lost task 1.0
> in stage 9.0 (TID 19, localhost, executor driver):
> java.lang.NullPointerException
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass.generate(Unknown
> Source)
>   ...
>
>
> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.302
> sec  <<< ERROR!
> java.lang.NoSuchMethodError:
> org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
> at test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
>
>
>
> - SPARK-18360: default table path of tables in default database should
> depend on the location of default database *** FAILED ***
>   Timeout of './bin/spark-submit' '--class'
> 'org.apache.spark.sql.hive.SPARK_18360' '--name' 'SPARK-18360'
> '--master' 'local-cluster[2,1,1024]' '--conf' 'spark.ui.enabled=false'
> '--conf' 'spark.master.rest.enabled=false' '--driver-java-options'
> '-Dderby.system.durability=test'
> 'file:/home/srowen/spark-2.1.0/sql/hive/target/tmp/spark-dc9f43f2-ded4-4bcf-947e-d5af6f0e1561/testJar-1480440084611.jar'
> See the log4j logs for more detail.
> ...
>
>
> - should clone and clean line object in ClosureCleaner *** FAILED ***
>   isContain was true Interpreter output contained 'Exception':
>   java.lang.IllegalStateException: Cannot call methods on a stopped
> SparkContext.
>   This stopped SparkContext was created at:
>   
>
>
> On Tue, Nov 29, 2016 at 5:31 PM Marcelo Vanzin <van...@cloudera.com
> <mailto:van...@cloudera.com>> wrote:
>
> I'll send a -1 because of SPARK-18546. Haven't looked at anything
> else yet.
>
> On Mon, Nov 28, 2016 at 5:25 PM, Reynold Xin <r...@databricks.com
> <mailto:r...@databricks.com>> wrote:
> > Please vote on releasing the following candidate as Apache Spark
> version
> > 2.1.0. The vote is open until Thursday, December 1, 2016 at
> 18:00 UTC and
> > passes if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 2.1.0
> > [ ] -1 Do not release this package because ...
> >
> >
> > To learn more about Apache Spark, please see
> http://spark.apache.org/
> >
> > The tag to be voted on is v2.1.0-rc1
> > (80aabc0bd33dc5661a90133156247e7a8c1bf7f5)
> >
> > The release files, including signatures, digests, etc. can be
> found at:
> >
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-bin/
> <http://people.apache.org/%7Epwendell/spark-releases/spark-2.1.0-rc1-bin/>
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> >
> https://repository.apache.org/content/repositories/orgapachespark-1216/
> >
> > The documentation corresponding to this release can be found at:
> >
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/
> 
> <http://people.apache.org/%7Epwendell/spark-releases/spark-2.1.0-rc1-docs/>
> >
> >
> > ===
> > How can I help test this release?
> > ===
> > If you are a Spark user, you can help us test this release by
> taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.1.0?
> > ===
> > Committers should look at those and triage. Extremely important
> bug fixes,
> > documentation, and API tweaks that impact compatibility should
> be worked on
> > immediately. Everything else please retarget to 2.1.1 or 2.2.0.
> >
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> <mailto:dev-unsubscr...@spark.apache.org>
>

-- 
Maciej Szymkiewicz



Re: [SQL][JDBC] Possible regression in JDBC reader

2016-11-25 Thread Maciej Szymkiewicz
Thank you.


On 11/25/2016 02:02 PM, Takeshi Yamamuro wrote:
> Hi,
>
> Seems we forget to pass `parts:Array[Partition]` into `JDBCRelation`.
> This was removed in this
> commit: 
> https://github.com/apache/spark/commit/b3130c7b6a1ab4975023f08c3ab02ee8d2c7e995#diff-f70bda59304588cc3abfa3a9840653f4L237
>
> // maropu
>
> On Fri, Nov 25, 2016 at 9:50 PM, Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> Hi,
>
> I've been reviewing my notes to https://git.io/v1UVC using Spark
> built from 51b1c1551d3a7147403b9e821fcc7c8f57b4824c and it looks
> like JDBC ignores both:
>
>   * (columnName, lowerBound, upperBound, numPartitions)
>   * predicates
>
> and loads everything into a single partition. Can anyone confirm
> that? It works just fine on 2.0.2 and before.
>
> -- 
> Best,
> Maciej
>
>
>
>
> -- 
> ---
> Takeshi Yamamuro

-- 
Best,
Maciej



[SQL][JDBC] Possible regression in JDBC reader

2016-11-25 Thread Maciej Szymkiewicz
Hi,

I've been reviewing my notes to https://git.io/v1UVC using Spark built
from 51b1c1551d3a7147403b9e821fcc7c8f57b4824c and it looks like JDBC
ignores both:

  * (columnName, lowerBound, upperBound, numPartitions)
  * predicates

and loads everything into a single partition. Can anyone confirm that?
It works just fine on 2.0.2 and before.

-- 
Best,
Maciej



Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

2016-11-22 Thread Maciej Szymkiewicz
On 11/22/2016 12:11 PM, nirandap wrote:

> Hi Maciej, 
>
> Thank you for your reply. 
>
> I have 2 queries.
> 1. I can understand your explanation. But in my experience, when I
> check the final RDBMS table, I see that the results follow the
> expected order, without an issue. Is this just a coincidence?
Not exactly a coincidence. This is typically a result of the physical
location on disk. If writes and reads are sequential (this is usually
the case), you'll see things in the expected order, but you have to
remember that the location on disk is not stable. For example, if you
perform some updates and deletes followed by VACUUM FULL (PostgreSQL),
the physical location on disk will change, and with it the order you see.

There are of course more advanced mechanisms out there. For example,
modern columnar RDBMSs like HANA use techniques like dimension sorting
and differential stores, so even the initial order may differ. There are
probably other solutions which choose different strategies (maybe some
time-series-oriented projects?) that I am not aware of.

>
> 2. I was further looking into this. So, say I run this query
> "select value, count(*) from table1 group by value order by value"
>
> and I call df.collect() in the resultant dataframe. From my
> experience, I see that the given values follow the expected order. May
> I know how spark manages to retain the order of the results in a
> collect operation?
Once you execute an ordered operation, each partition is sorted and the
order of partitions defines the global ordering. All collect does is
preserve this order by creating an array of results for each partition
and flattening it.
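
For illustration, a minimal PySpark sketch of this (assuming a running
SparkSession bound to spark; the column name and partition count are
arbitrary):

# After orderBy, each output partition is internally sorted and partition i
# precedes partition i + 1, so collect() just concatenates them in order.
df = spark.range(0, 1000, numPartitions=8).orderBy("id")

per_partition = df.rdd.glom().collect()   # one list of Rows per partition
flattened = [row.id for part in per_partition for row in part]
assert flattened == sorted(flattened)     # global order is preserved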
>
> Best 
>
>
> On Mon, Nov 21, 2016 at 3:02 PM, Maciej Szymkiewicz [via Apache Spark
> Developers List] <[hidden email]
> > wrote:
>
> In commonly used RDBM systems relations have no fixed order and
> physical location of the records can change during routine
> maintenance operations. Unless you explicitly order data during
> retrieval order you see is incidental and not guaranteed. 
>
> Conclusion: order of inserts just doesn't matter.
>
> On 11/21/2016 10:03 AM, Niranda Perera wrote:
>> Hi, 
>>
>> Say, I have a table with 1 column and 1000 rows. I want to save
>> the result in a RDBMS table using the jdbc relation provider. So
>> I run the following query, 
>>
>> "insert into table table2 select value, count(*) from table1
>> group by value order by value"
>>
>> While debugging, I found that the resultant df from select value,
>> count(*) from table1 group by value order by value would have
>> around 200+ partitions and say I have 4 executors attached to my
>> driver. So, I would have 200+ writing tasks assigned to 4
>> executors. I want to understand, how these executors are able to
>> write the data to the underlying RDBMS table of table2 without
>> messing up the order. 
>>
>> I checked the jdbc insertable relation and in jdbcUtils [1] it
>> does the following
>>
>> df.foreachPartition { iterator =>
>>   savePartition(getConnection, table, iterator, rddSchema,
>> nullTypes, batchSize, dialect)
>> }
>>
>> So, my understanding is, all of my 4 executors will parallely run
>> the savePartition function (or closure) where they do not know
>> which one should write data before the other! 
>>
>> In the savePartition method, in the comment, it says 
>> "Saves a partition of a DataFrame to the JDBC database.  This is
>> done in
>>* a single database transaction in order to avoid repeatedly
>> inserting
>>* data as much as possible."
>>
>> I want to understand, how these parallel executors save the
>> partition without harming the order of the results? Is it by
>> locking the database resource, from each executor (i.e. ex0 would
>> first obtain a lock for the table and write the partition0, while
>> ex1 ... ex3 would wait till the lock is released )? 
>>
>> In my experience, there is no harm done to the order of the
>> results at the end of the day! 
>>
>> Would like to hear from you guys! :-) 
>>
>> [1] 
>> https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277
>> 
>> <https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277>
>>
>> -- 
>> Niranda Perera
>> @n1r44 <https://twitter.

Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

2016-11-21 Thread Maciej Szymkiewicz
In commonly used RDBMS systems relations have no fixed order, and the
physical location of the records can change during routine maintenance
operations. Unless you explicitly order data during retrieval, the order
you see is incidental and not guaranteed.

Conclusion: order of inserts just doesn't matter.

On 11/21/2016 10:03 AM, Niranda Perera wrote:
> Hi, 
>
> Say, I have a table with 1 column and 1000 rows. I want to save the
> result in a RDBMS table using the jdbc relation provider. So I run the
> following query, 
>
> "insert into table table2 select value, count(*) from table1 group by
> value order by value"
>
> While debugging, I found that the resultant df from select value,
> count(*) from table1 group by value order by value would have around
> 200+ partitions and say I have 4 executors attached to my driver. So,
> I would have 200+ writing tasks assigned to 4 executors. I want to
> understand, how these executors are able to write the data to the
> underlying RDBMS table of table2 without messing up the order. 
>
> I checked the jdbc insertable relation and in jdbcUtils [1] it does
> the following
>
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema,
> nullTypes, batchSize, dialect)
> }
>
> So, my understanding is, all of my 4 executors will parallely run the
> savePartition function (or closure) where they do not know which one
> should write data before the other! 
>
> In the savePartition method, in the comment, it says 
> "Saves a partition of a DataFrame to the JDBC database.  This is done in
>* a single database transaction in order to avoid repeatedly inserting
>* data as much as possible."
>
> I want to understand, how these parallel executors save the partition
> without harming the order of the results? Is it by locking the
> database resource, from each executor (i.e. ex0 would first obtain a
> lock for the table and write the partition0, while ex1 ... ex3 would
> wait till the lock is released )? 
>
> In my experience, there is no harm done to the order of the results at
> the end of the day! 
>
> Would like to hear from you guys! :-) 
>
> [1] 
> https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277
>
> -- 
> Niranda Perera
> @n1r44 <https://twitter.com/N1R44>
> +94 71 554 8430
> https://www.linkedin.com/in/niranda
> https://pythagoreanscript.wordpress.com/

-- 
Best regards,
Maciej Szymkiewicz



Re: Handling questions in the mailing lists

2016-11-09 Thread Maciej Szymkiewicz
* Re: Handling questions in the mailing lists
>
>  
>
> To help track and get the verbiage for the Spark community page and
> welcome email jump started, here's a working document for us to work
> with: 
> https://docs.google.com/document/d/1N0pKatcM15cqBPqFWCqIy6jdgNzIoacZlYDCjufBh2s/edit#
> <https://docs.google.com/document/d/1N0pKatcM15cqBPqFWCqIy6jdgNzIoacZlYDCjufBh2s/edit>
>
>  
>
> Hope this will help us collaborate on this stuff a little faster.  
>
>  
>
> On Mon, Nov 7, 2016 at 2:25 PM Maciej Szymkiewicz <[hidden email]
> > wrote:
>
> Just a couple of random thoughts regarding Stack Overflow...
>
>   * If we are thinking about shifting focus towards SO all
> attempts of micromanaging should be discarded right in the
> beginning. Especially things like meta tags, which are
> discouraged and "burninated"
> (https://meta.stackoverflow.com/tags/burninate-request/info) ,
> or thread bumping. Depending on a context these won't be
> manageable, go against community guidelines or simply obsolete. 
>   * Lack of expertise is unlikely an issue. Even now there is a
> number of advanced Spark users on SO. Of course the more the
> merrier.
>
> Things that can be easily improved:
>
>   * Identifying, improving and promoting canonical questions and
> answers. It means closing duplicate, suggesting edits to
> improve existing answers, providing alternative solutions.
> This can be also used to identify gaps in the documentation.
>   * Providing a set of clear posting guidelines to reduce effort
> required to identify the problem (think about
> http://stackoverflow.com/q/5963269 a.k.a How to make a great R
> reproducible example?)
>   * Helping users decide if question is a good fit for SO (see
> below). API questions are great fit, debugging problems like
> "my cluster is slow" are not.
>   * Actively cleaning (closing, deleting) off-topic and low
> quality questions. The less junk to sieve through the better
> chance of good questions being answered.
>   * Repurposing and actively moderating SO docs
> (https://stackoverflow.com/documentation/apache-spark/topics).
> Right now most of the stuff that goes there is useless,
> duplicated or plagiarized, or border case SPAM.
>   * Encouraging community to monitor featured
> 
> (https://stackoverflow.com/questions/tagged/apache-spark?sort=featured)
> and active & upvoted & unanswered
> (https://stackoverflow.com/unanswered/tagged/apache-spark)
> questions.
>   * Implementing some procedure to identify questions which are
> likely to be bugs or a material for feature requests.
> Personally I am quite often tempted to simply send a link to
> dev list, but I don't think it is really acceptable.
>   * Animating Spark related chat room. I tried this a couple of
> times but to no avail. Without a certain critical mass of
> users it just won't work.
>
>  
>
>  
>
> On 11/07/2016 07:32 AM, Reynold Xin wrote:
>
> This is an excellent point. If we do go ahead and feature SO
> as a way for users to ask questions more prominently, as
> someone who knows SO very well, would you be willing to help
> write a short guideline (ideally the shorter the better, which
> makes it hard) to direct what goes to user@ and what goes to SO?
>
>  
>
> Sure, I'll be happy to help if I can.
>
>
>
>
>  
>
>  
>
> On Sun, Nov 6, 2016 at 9:54 PM, Maciej Szymkiewicz <[hidden email]
> > wrote:
>
> Damn, I always thought that mailing list is only for nice and
> welcoming people and there is nothing to do for me here >:)
>
> To be serious though, there are many questions on the users list
> which would fit just fine on SO but it is not true in general.
> There are dozens of questions which are to broad, opinion based,
> ask for external resources and so on. If you want to direct users
> to SO you have to help them to decide if it is the right channel.
> Otherwise it will just create a really bad experience for both
> seeking help and active answerers. Former ones will be downvoted
> and bashed, latter ones will have to deal with handling all the
> junk and the number of active Spark users with moderation
> privileges is really low (with only Massg and me being able to
> directly close duplicates).
>
> Believe me, I've se

Re: Handling questions in the mailing lists

2016-11-07 Thread Maciej Szymkiewicz
Just a couple of random thoughts regarding Stack Overflow...

  * If we are thinking about shifting focus towards SO, all attempts at
micromanaging should be discarded right at the beginning, especially
things like meta tags, which are discouraged and "burninated"
(https://meta.stackoverflow.com/tags/burninate-request/info), or
thread bumping. Depending on the context these won't be manageable,
will go against community guidelines, or will simply become obsolete.
  * Lack of expertise is unlikely to be an issue. Even now there is a
number of advanced Spark users on SO. Of course, the more the merrier.

Things that can be easily improved:

  * Identifying, improving and promoting canonical questions and
answers. It means closing duplicates, suggesting edits to improve
existing answers, and providing alternative solutions. This can also
be used to identify gaps in the documentation.
  * Providing a set of clear posting guidelines to reduce effort
required to identify the problem (think about
http://stackoverflow.com/q/5963269 a.k.a How to make a great R
reproducible example?)
  * Helping users decide if a question is a good fit for SO (see below).
API questions are a great fit; debugging problems like "my cluster is
slow" are not.
  * Actively cleaning (closing, deleting) off-topic and low quality
questions. The less junk to sieve through the better chance of good
questions being answered.
  * Repurposing and actively moderating SO docs
(https://stackoverflow.com/documentation/apache-spark/topics). Right
now most of the stuff that goes there is useless, duplicated or
plagiarized, or borderline spam.
  * Encouraging community to monitor featured
(https://stackoverflow.com/questions/tagged/apache-spark?sort=featured)
and active & upvoted & unanswered
(https://stackoverflow.com/unanswered/tagged/apache-spark) questions.
  * Implementing some procedure to identify questions which are likely
to be bugs or material for feature requests. Personally I am quite
often tempted to simply send a link to the dev list, but I don't
think it is really acceptable.
  * Animating the Spark-related chat room. I tried this a couple of
times but to no avail. Without a certain critical mass of users it
just won't work.



On 11/07/2016 07:32 AM, Reynold Xin wrote:
> This is an excellent point. If we do go ahead and feature SO as a way
> for users to ask questions more prominently, as someone who knows SO
> very well, would you be willing to help write a short guideline
> (ideally the shorter the better, which makes it hard) to direct what
> goes to user@ and what goes to SO?

Sure, I'll be happy to help if I can.

>
>
> On Sun, Nov 6, 2016 at 9:54 PM, Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> Damn, I always thought that mailing list is only for nice and
> welcoming people and there is nothing to do for me here >:)
>
> To be serious though, there are many questions on the users list
> which would fit just fine on SO but it is not true in general.
> There are dozens of questions which are to broad, opinion based,
> ask for external resources and so on. If you want to direct users
> to SO you have to help them to decide if it is the right channel.
> Otherwise it will just create a really bad experience for both
> seeking help and active answerers. Former ones will be downvoted
> and bashed, latter ones will have to deal with handling all the
> junk and the number of active Spark users with moderation
> privileges is really low (with only Massg and me being able to
> directly close duplicates).
>
> Believe me, I've seen this before.
>
> On 11/07/2016 05:08 AM, Reynold Xin wrote:
>> You have substantially underestimated how opinionated people can
>> be on mailing lists too :)
>>
>> On Sunday, November 6, 2016, Maciej Szymkiewicz
>> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>>
>> You have to remember that Stack Overflow crowd (like me) is
>> highly opinionated, so many questions, which could be just
>> fine on the mailing list, will be quickly downvoted and / or
>> closed as off-topic. Just saying...
>>
>> -- 
>> Best, 
>> Maciej
>>
>>
>> On 11/07/2016 04:03 AM, Reynold Xin wrote:
>>> OK I've checked on the ASF member list (which is private so
>>> there is no public archive).
>>>
>>> It is not against any ASF rule to recommend StackOverflow as
>>> a place for users to ask questions. I don't think we can or
>>> should 

Re: Handling questions in the mailing lists

2016-11-06 Thread Maciej Szymkiewicz
Damn, I always thought that the mailing list is only for nice and
welcoming people and there is nothing to do for me here >:)

To be serious though, there are many questions on the users list which
would fit just fine on SO, but that is not true in general. There are
dozens of questions which are too broad, opinion based, ask for external
resources and so on. If you want to direct users to SO you have to help
them decide if it is the right channel. Otherwise it will just create a
really bad experience for both those seeking help and active answerers.
The former will be downvoted and bashed; the latter will have to deal
with all the junk, and the number of active Spark users with moderation
privileges is really low (with only Massg and me being able to directly
close duplicates).

Believe me, I've seen this before.

On 11/07/2016 05:08 AM, Reynold Xin wrote:
> You have substantially underestimated how opinionated people can be on
> mailing lists too :)
>
> On Sunday, November 6, 2016, Maciej Szymkiewicz
> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>> wrote:
>
> You have to remember that Stack Overflow crowd (like me) is highly
> opinionated, so many questions, which could be just fine on the
> mailing list, will be quickly downvoted and / or closed as
> off-topic. Just saying...
>
> -- 
> Best, 
> Maciej
>
>
> On 11/07/2016 04:03 AM, Reynold Xin wrote:
>> OK I've checked on the ASF member list (which is private so there
>> is no public archive).
>>
>> It is not against any ASF rule to recommend StackOverflow as a
>> place for users to ask questions. I don't think we can or should
>> delete the existing user@spark list either, but we can certainly
>> make SO more visible than it is.
>>
>>
>>
>> On Wed, Nov 2, 2016 at 10:21 AM, Reynold Xin <r...@databricks.com
>> <javascript:_e(%7B%7D,'cvml','r...@databricks.com');>> wrote:
>>
>> Actually after talking with more ASF members, I believe the
>> only policy is that development decisions have to be made and
>> announced on ASF properties (dev list or jira), but user
>> questions don't have to. 
>>
>> I'm going to double check this. If it is true, I would
>> actually recommend us moving entirely over the Q part of
>> the user list to stackoverflow, or at least make that the
>> recommended way rather than the existing user list which is
>> not very scalable. 
>>
>>
>> On Wednesday, November 2, 2016, Nicholas Chammas
>> <nicholas.cham...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','nicholas.cham...@gmail.com');>>
>> wrote:
>>
>> We’ve discussed several times upgrading our communication
>> tools, as far back as 2014 and maybe even before that
>> too. The bottom line is that we can’t due to ASF rules
>> requiring the use of ASF-managed mailing lists.
>>
>> For some history, see this discussion:
>>
>>   * 
>> https://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAOhmDzfL2COdysV8r5hZN8f=NqXM=f=oy5no2dhwj_kveop...@mail.gmail.com%3E
>> 
>> <https://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAOhmDzfL2COdysV8r5hZN8f=NqXM=f=oy5no2dhwj_kveop...@mail.gmail.com%3E>
>>   * 
>> https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAOhmDzec1JdsXQq3dDwAv7eLnzRidSkrsKKG0xKw=tktxy_...@mail.gmail.com%3E
>> 
>> <https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAOhmDzec1JdsXQq3dDwAv7eLnzRidSkrsKKG0xKw=tktxy_...@mail.gmail.com%3E>
>>
>> (It’s ironic that it’s difficult to follow the past
>> discussion on why we can’t change our official
>> communication tools due to those very tools…)
>>
>> Nick
>>
>> ​
>>
>> On Wed, Nov 2, 2016 at 12:24 PM Ricardo Almeida
>> <ricardo.alme...@actnowib.com> wrote:
>>
>> I fell Assaf point is quite relevant if we want to
>> move this project forward from the Spark user
>> perspective (as I do). In fact, we're still using
>> 20th century tools (mailing lists) with some add-ons
>> (like Stack Overflow).
>>
>> As usually, Sean and Cody's contributions are very to
>> the point.
>> I fell it 

Re: Handling questions in the mailing lists

2016-11-06 Thread Maciej Szymkiewicz
You have to remember that the Stack Overflow crowd (like me) is highly
opinionated, so many questions which could be just fine on the mailing
list will be quickly downvoted and/or closed as off-topic. Just
saying...

-- 
Best, 
Maciej


On 11/07/2016 04:03 AM, Reynold Xin wrote:
> OK I've checked on the ASF member list (which is private so there is
> no public archive).
>
> It is not against any ASF rule to recommend StackOverflow as a place
> for users to ask questions. I don't think we can or should delete the
> existing user@spark list either, but we can certainly make SO more
> visible than it is.
>
>
>
> On Wed, Nov 2, 2016 at 10:21 AM, Reynold Xin  > wrote:
>
> Actually after talking with more ASF members, I believe the only
> policy is that development decisions have to be made and announced
> on ASF properties (dev list or jira), but user questions don't
> have to. 
>
> I'm going to double check this. If it is true, I would actually
> recommend us moving entirely over the Q part of the user list to
> stackoverflow, or at least make that the recommended way rather
> than the existing user list which is not very scalable. 
>
>
> On Wednesday, November 2, 2016, Nicholas Chammas
> >
> wrote:
>
> We’ve discussed several times upgrading our communication
> tools, as far back as 2014 and maybe even before that too. The
> bottom line is that we can’t due to ASF rules requiring the
> use of ASF-managed mailing lists.
>
> For some history, see this discussion:
>
>   * 
> https://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAOhmDzfL2COdysV8r5hZN8f=NqXM=f=oy5no2dhwj_kveop...@mail.gmail.com%3E
> 
> 
>   * 
> https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAOhmDzec1JdsXQq3dDwAv7eLnzRidSkrsKKG0xKw=tktxy_...@mail.gmail.com%3E
> 
> 
>
> (It’s ironic that it’s difficult to follow the past discussion
> on why we can’t change our official communication tools due to
> those very tools…)
>
> Nick
>
> ​
>
> On Wed, Nov 2, 2016 at 12:24 PM Ricardo Almeida
>  wrote:
>
> I fell Assaf point is quite relevant if we want to move
> this project forward from the Spark user perspective (as I
> do). In fact, we're still using 20th century tools
> (mailing lists) with some add-ons (like Stack Overflow).
>
> As usually, Sean and Cody's contributions are very to the
> point.
> I fell it is indeed a matter of of culture (hard to
> enforce) and tools (much easier). Isn't it?
>
> On 2 November 2016 at 16:36, Cody Koeninger
>  wrote:
>
> So concrete things people could do
>
> - users could tag subject lines appropriately to the
> component they're
> asking about
>
> - contributors could monitor user@ for tags relating
> to components
> they've worked on.
> I'd be surprised if my miss rate for any mailing list
> questions
> well-labeled as Kafka was higher than 5%
>
> - committers could be more aggressive about soliciting
> and merging PRs
> to improve documentation.
> It's a lot easier to answer even poorly-asked
> questions with a link to
> relevant docs.
>
> On Wed, Nov 2, 2016 at 7:39 AM, Sean Owen
>  wrote:
> > There's already reviews@ and issues@. dev@ is for
> project development itself
> > and I think is OK. You're suggesting splitting up
> user@ and I sympathize
> > with the motivation. Experience tells me that we'll
> have a beginner@ that's
> > then totally ignored, and people will quickly learn
> to post to advanced@ to
> > get attention, and we'll be back where we started.
> Putting it in JIRA
> > doesn't help. I don't think this a problem that is
> merely down to lack of
> > process. It actually requires cultivating a culture
> change on the community
> > list.
> >
> > On Wed, Nov 2, 2016 at 

Re: java.util.NoSuchElementException when serializing Map with default value

2016-09-30 Thread Maciej Szymkiewicz
Thanks guys.

This is not a big issue in general. It is more of an annoyance, and it can
be rather confusing when encountered for the first time.


On 09/29/2016 02:05 AM, Jakob Odersky wrote:
> I agree with Sean's answer, you can check out the relevant serializer
> here 
> https://github.com/twitter/chill/blob/develop/chill-scala/src/main/scala/com/twitter/chill/Traversable.scala
>
> On Wed, Sep 28, 2016 at 3:11 AM, Sean Owen <so...@cloudera.com> wrote:
>> My guess is that Kryo specially handles Maps generically or relies on
>> some mechanism that does, and it happens to iterate over all
>> key/values as part of that and of course there aren't actually any
>> key/values in the map. The Java serialization is a much more literal
>> (expensive) field-by-field serialization which works here because
>> there's no special treatment. I think you could register a custom
>> serializer that handles this case. Or work around it in your client
>> code. I know there have been other issues with Kryo and Map because,
>> for example, sometimes a Map in an application is actually some
>> non-serializable wrapper view.
>>
>> On Wed, Sep 28, 2016 at 3:18 AM, Maciej Szymkiewicz
>> <mszymkiew...@gmail.com> wrote:
>>> Hi everyone,
>>>
>>> I suspect there is no point in submitting a JIRA to fix this (not a Spark
>>> issue?) but I would like to know if this problem is documented anywhere.
>>> Somehow Kryo is loosing default value during serialization:
>>>
>>> scala> import org.apache.spark.{SparkContext, SparkConf}
>>> import org.apache.spark.{SparkContext, SparkConf}
>>>
>>> scala> val aMap = Map[String, Long]().withDefaultValue(0L)
>>> aMap: scala.collection.immutable.Map[String,Long] = Map()
>>>
>>> scala> aMap("a")
>>> res6: Long = 0
>>>
>>> scala> val sc = new SparkContext(new
>>> SparkConf().setAppName("bar").set("spark.serializer",
>>> "org.apache.spark.serializer.KryoSerializer"))
>>>
>>> scala> sc.parallelize(Seq(aMap)).map(_("a")).first
>>> 16/09/28 09:13:47 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
>>> java.util.NoSuchElementException: key not found: a
>>>
>>> while Java serializer works just fine:
>>>
>>> scala> val sc = new SparkContext(new
>>> SparkConf().setAppName("bar").set("spark.serializer",
>>> "org.apache.spark.serializer.JavaSerializer"))
>>>
>>> scala> sc.parallelize(Seq(aMap)).map(_("a")).first
>>> res9: Long = 0
>>>
>>> --
>>> Best regards,
>>> Maciej
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-- 
Best regards,
Maciej



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Maciej Szymkiewicz
Hi everyone,

I suspect there is no point in submitting a JIRA to fix this (not a
Spark issue?) but I would like to know if this problem is documented
anywhere. Somehow Kryo is losing the default value during serialization:

scala> import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.{SparkContext, SparkConf}

scala> val aMap = Map[String, Long]().withDefaultValue(0L)
aMap: scala.collection.immutable.Map[String,Long] = Map()

scala> aMap("a")
res6: Long = 0

scala> val sc = new SparkContext(new
SparkConf().setAppName("bar").set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer"))

scala> sc.parallelize(Seq(aMap)).map(_("a")).first
16/09/28 09:13:47 ERROR Executor: Exception in task 2.0 in stage 2.0
(TID 7)
java.util.NoSuchElementException: key not found: a

while Java serializer works just fine:

scala> val sc = new SparkContext(new
SparkConf().setAppName("bar").set("spark.serializer",
"org.apache.spark.serializer.JavaSerializer"))

scala> sc.parallelize(Seq(aMap)).map(_("a")).first
res9: Long = 0

-- 
Best regards,
Maciej



Re: What happens in Dataset limit followed by rdd

2016-08-03 Thread Maciej Szymkiewicz
Pushing limits down across mapping functions would be great. If you're
used to SQL or work frequently with lazy collections, this is a behavior
you learn to expect.

On 08/02/2016 02:12 PM, Sun Rui wrote:
> Spark does optimise subsequent limits, for example:
>
> scala> df1.limit(3).limit(1).explain
> == Physical Plan ==
> CollectLimit 1
> +- *SerializeFromObject [assertnotnull(input[0,
> $line14.$read$$iw$$iw$my, true], top level non-flat input
> object).x AS x#2]
>+- Scan ExternalRDDScan[obj#1]
>
> However, limit can not be simply pushes down across mapping functions,
> because the number of rows may change across functions. for example,
> flatMap()
>
> It seems that limit can be pushed across map() which won’t change the
> number of rows. Maybe this is a room for Spark optimisation.
>
>> On Aug 2, 2016, at 18:51, Maciej Szymkiewicz <mszymkiew...@gmail.com
>> <mailto:mszymkiew...@gmail.com>> wrote:
>>
>> Thank you for your prompt response and great examples Sun Rui but I am
>> still confused about one thing. Do you see any particular reason to not
>> to merge subsequent limits? Following case
>>
>>(limit n (map f (limit m ds)))
>>
>> could be optimized to:
>>
>>(map f (limit n (limit m ds)))
>>
>> and further to
>>
>>(map f (limit (min n m) ds))
>>
>> couldn't it?
>>
>>
>> On 08/02/2016 11:57 AM, Sun Rui wrote:
>>> Based on your code, here is simpler test case on Spark 2.0
>>>
>>>case class my (x: Int)
>>>val rdd = sc.parallelize(0.until(1), 1000).map { x => my(x) }
>>>val df1 = spark.createDataFrame(rdd)
>>>val df2 = df1.limit(1)
>>>df1.map { r => r.getAs[Int](0) }.first
>>>df2.map { r => r.getAs[Int](0) }.first // Much slower than the
>>>previous line
>>>
>>> Actually, Dataset.first is equivalent to Dataset.limit(1).collect, so
>>> check the physical plan of the two cases:
>>>
>>>scala> df1.map { r => r.getAs[Int](0) }.limit(1).explain
>>>== Physical Plan ==
>>>CollectLimit 1
>>>+- *SerializeFromObject [input[0, int, true] AS value#124]
>>>   +- *MapElements , obj#123: int
>>>  +- *DeserializeToObject createexternalrow(x#74,
>>>StructField(x,IntegerType,false)), obj#122: org.apache.spark.sql.Row
>>> +- Scan ExistingRDD[x#74]
>>>
>>>scala> df2.map { r => r.getAs[Int](0) }.limit(1).explain
>>>== Physical Plan ==
>>>CollectLimit 1
>>>+- *SerializeFromObject [input[0, int, true] AS value#131]
>>>   +- *MapElements , obj#130: int
>>>  +- *DeserializeToObject createexternalrow(x#74,
>>>StructField(x,IntegerType,false)), obj#129: org.apache.spark.sql.Row
>>> +- *GlobalLimit 1
>>>+- Exchange SinglePartition
>>>   +- *LocalLimit 1
>>>  +- Scan ExistingRDD[x#74]
>>>
>>>
>>> For the first case, it is related to an optimisation in
>>> the CollectLimitExec physical operator. That is, it will first fetch
>>> the first partition to get limit number of row, 1 in this case, if not
>>> satisfied, then fetch more partitions, until the desired limit is
>>> reached. So generally, if the first partition is not empty, only the
>>> first partition will be calculated and fetched. Other partitions will
>>> even not be computed.
>>>
>>> However, in the second case, the optimisation in the CollectLimitExec
>>> does not help, because the previous limit operation involves a shuffle
>>> operation. All partitions will be computed, and running LocalLimit(1)
>>> on each partition to get 1 row, and then all partitions are shuffled
>>> into a single partition. CollectLimitExec will fetch 1 row from the
>>> resulted single partition.
>>>
>>>
>>>> On Aug 2, 2016, at 09:08, Maciej Szymkiewicz
>>>> <mszymkiew...@gmail.com <mailto:mszymkiew...@gmail.com>
>>>> <mailto:mszymkiew...@gmail.com>> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> This doesn't look like something expected, does it?
>>>>
>>>> http://stackoverflow.com/q/38710018/1560062
>>>>
>>>> Quick glance at the UI suggest that there is a shuffle involved and
>>>> input for first is ShuffledRowRDD.
>>>> -- 
>>>> Best regards,
>>>> Maciej Szymkiewicz
>>>
>>
>> -- 
>> Maciej Szymkiewicz
>

-- 
Maciej Szymkiewicz



Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Maciej Szymkiewicz
Thank you for your prompt response and great examples, Sun Rui, but I am
still confused about one thing. Do you see any particular reason not to
merge subsequent limits? The following case

(limit n (map f (limit m ds)))

could be optimized to:

(map f (limit n (limit m ds)))

and further to

(map f (limit (min n m) ds))

couldn't it?
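
For what it's worth, a small PySpark illustration of the intended
equivalence (names are made up; f is modelled as a 1:1 projection, which is
the case where the rewrite is safe, and only row counts are compared since
limit does not promise which rows are taken):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("limit-merge-sketch").getOrCreate()
ds = spark.range(0, 100000)
n, m = 1, 3

# (limit n (map f (limit m ds)))
a = ds.limit(m).select((F.col("id") + 1).alias("x")).limit(n)
# (map f (limit (min n m) ds)) -- the proposed rewrite
b = ds.limit(min(n, m)).select((F.col("id") + 1).alias("x"))

assert a.count() == b.count() == min(n, m)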


On 08/02/2016 11:57 AM, Sun Rui wrote:
> Based on your code, here is simpler test case on Spark 2.0
>
> case class my (x: Int)
> val rdd = sc.parallelize(0.until(1), 1000).map { x => my(x) }
> val df1 = spark.createDataFrame(rdd)
> val df2 = df1.limit(1)
> df1.map { r => r.getAs[Int](0) }.first
> df2.map { r => r.getAs[Int](0) }.first // Much slower than the
> previous line
>
> Actually, Dataset.first is equivalent to Dataset.limit(1).collect, so
> check the physical plan of the two cases:
>
> scala> df1.map { r => r.getAs[Int](0) }.limit(1).explain
> == Physical Plan ==
> CollectLimit 1
> +- *SerializeFromObject [input[0, int, true] AS value#124]
>+- *MapElements , obj#123: int
>   +- *DeserializeToObject createexternalrow(x#74,
> StructField(x,IntegerType,false)), obj#122: org.apache.spark.sql.Row
>  +- Scan ExistingRDD[x#74]
>
> scala> df2.map { r => r.getAs[Int](0) }.limit(1).explain
> == Physical Plan ==
> CollectLimit 1
> +- *SerializeFromObject [input[0, int, true] AS value#131]
>+- *MapElements , obj#130: int
>   +- *DeserializeToObject createexternalrow(x#74,
> StructField(x,IntegerType,false)), obj#129: org.apache.spark.sql.Row
>  +- *GlobalLimit 1
> +- Exchange SinglePartition
>+- *LocalLimit 1
>   +- Scan ExistingRDD[x#74]
>
>
> For the first case, the behaviour comes from an optimisation in
> the CollectLimitExec physical operator: it first fetches only the first
> partition to try to collect the requested number of rows (1 in this
> case), and only if that is not enough does it fetch further partitions,
> until the desired limit is reached. So in general, if the first
> partition is not empty, only the first partition is computed and
> fetched; the other partitions are not computed at all.
>
> In the second case, however, the optimisation in CollectLimitExec does
> not help, because the preceding limit operation involves a shuffle. All
> partitions are computed, LocalLimit(1) is run on each partition to take
> 1 row, and then all partitions are shuffled into a single partition.
> CollectLimitExec then fetches 1 row from the resulting single partition.
>
>
>> On Aug 2, 2016, at 09:08, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:
>>
>> Hi everyone,
>>
>> This doesn't look like something expected, does it?
>>
>> http://stackoverflow.com/q/38710018/1560062
>>
>> A quick glance at the UI suggests that there is a shuffle involved and
>> that the input for first is a ShuffledRowRDD.
>> -- 
>> Best regards,
>> Maciej Szymkiewicz
>
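
Coming back to the first case in the quoted explanation: below is a
simplified sketch of the incremental fetch that CollectLimitExec performs.
This is not Spark's actual code; the method name and the scale-up factor
are illustrative assumptions.

import scala.collection.mutable.ArrayBuffer

// Scan the first partition alone; if the limit is not satisfied, scan
// progressively larger batches of the remaining partitions until enough
// rows have been collected.
def takeIncrementally[T](partitions: Seq[Iterator[T]], limit: Int): Seq[T] = {
  val buf = ArrayBuffer.empty[T]
  var scanned = 0
  var batch = 1
  while (buf.size < limit && scanned < partitions.size) {
    partitions.slice(scanned, scanned + batch).foreach { part =>
      buf ++= part.take(limit - buf.size)
    }
    scanned += batch
    batch *= 4  // assumed growth factor; the real operator scales up similarly
  }
  buf.toSeq
}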

-- 
Maciej Szymkiewicz





What happens in Dataset limit followed by rdd

2016-08-01 Thread Maciej Szymkiewicz
Hi everyone,

This doesn't look like something expected, does it?

http://stackoverflow.com/q/38710018/1560062

A quick glance at the UI suggests that there is a shuffle involved and
that the input for first is a ShuffledRowRDD.
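
For reference, a minimal way to reproduce this locally (a sketch, assuming
a Spark 2.x session named spark; the sizes are arbitrary):

val df = spark.range(0, 100000, 1, 100).toDF("x")

// The plain scan needs no shuffle:
df.rdd.first()

// After limit(1) the physical plan contains Exchange SinglePartition, and
// the lineage of .rdd contains a ShuffledRowRDD:
df.limit(1).explain()
println(df.limit(1).rdd.toDebugString)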

-- 
Best regards,
Maciej Szymkiewicz



Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-27 Thread Maciej Szymkiewicz
Hi Jacek,

In this context, don't you think it would be useful if at least some
traits from org.apache.spark.ml.param.shared.sharedParams were made
public? HasInputCol(s) and HasOutputCol, for example. These are useful
pretty much every time you create a custom Transformer.
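
For illustration, this is roughly the boilerplate a custom Transformer has
to re-declare today because the shared param traits are private[ml]. This
is only a sketch against the Spark 2.x API; the class and the column logic
are made up for the example:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.{StringType, StructType}

// A trivial Transformer that upper-cases a string column. The inputCol and
// outputCol params below duplicate what HasInputCol / HasOutputCol already
// provide internally.
class UpperCaser(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("upperCaser"))

  final val inputCol  = new Param[String](this, "inputCol", "input column name")
  final val outputCol = new Param[String](this, "outputCol", "output column name")

  def setInputCol(value: String): this.type  = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn($(outputCol), upper(col($(inputCol))))

  override def transformSchema(schema: StructType): StructType =
    schema.add($(outputCol), StringType, nullable = true)

  override def copy(extra: ParamMap): UpperCaser = defaultCopy(extra)
}

Such a Transformer can then be chained in a Pipeline that contains no ML at
all, which is exactly the case where public shared traits would cut the
boilerplate.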

-- 
Pozdrawiam,
Maciej Szymkiewicz


On 03/26/2016 10:26 AM, Jacek Laskowski wrote:
> Hi Joseph,
>
> Thanks for the response. I'm one who doesn't understand all the
> hype/need for Machine Learning...yet and through Spark ML(lib) glasses
> I'm looking at ML space. In the meantime I've got few assignments (in
> a project with Spark and Scala) that have required quite extensive
> dataset manipulation.
>
> It was when I sank into using DataFrame/Dataset for data
> manipulation instead of RDD (I remember talking to Brian about how RDD
> is an "assembly" language compared to the higher-level concept of
> DataFrames with Catalyst and other optimizations). After a few days
> with DataFrame I learnt he was so right! (sorry Brian, it took me
> longer to understand your point).
>
> I started using DataFrames in far too many places than one could ever
> accept :-) I was so...carried away with DataFrames (esp. show vs
> foreach(println) and UDFs via udf() function)
>
> And then, when I moved to Pipeline API and discovered Transformers.
> And PipelineStage that can create pipelines of DataFrame manipulation.
> They read so well that I'm pretty sure people would love using them
> more often, but...they belong to MLlib so they are part of ML space
> (not many devs tackled yet). I applied the approach to using
> withColumn to have better debugging experience (if I ever need it). I
> learnt it after having watched your presentation about Pipeline API.
> It was so helpful in my RDD/DataFrame space.
>
> So, to promote a more extensive use of Pipelines, PipelineStages, and
> Transformers, I was thinking about moving that part to SQL/DataFrame
> API where they really belong. If not, I think people might miss the
> beauty of the very fine and so helpful Transformers.
>
> Transformers are *not* a ML thing -- they are DataFrame thing and
> should be where they really belong (for their greater adoption).
>
> What do you think?
>
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley <jos...@databricks.com> wrote:
>> There have been some comments about using Pipelines outside of ML, but I
>> have not yet seen a real need for it.  If a user does want to use Pipelines
>> for non-ML tasks, they still can use Transformers + PipelineModels.  Will
>> that work?
>>
>> On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>>> Hi,
>>>
>>> After a few weeks with spark.ml now, I have come to the conclusion that
>>> the Transformer concept from the Pipeline API (spark.ml/MLlib) should be
>>> part of DataFrame (SQL), where it fits better. Are there any plans to
>>> migrate the Transformer API (ML) to DataFrame (SQL)?
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>






ML ALS API

2016-03-07 Thread Maciej Szymkiewicz
Can I ask for a clarification regarding ml.recommendation.ALS:

- Is the train method
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L598)
intended to be public?
- The Rating class
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L436)
uses Float instead of Double, which its MLlib counterpart uses. Is this
going to be the default encoding in 2.0+?
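
For context, this is the difference I mean, as I read the linked sources
(a sketch, assuming the current public API):

import org.apache.spark.ml.recommendation.ALS.{Rating => MLRating}
import org.apache.spark.mllib.recommendation.{Rating => MLlibRating}

// ml.recommendation.ALS.Rating is generic in the ID type and stores the
// rating as a Float, while the mllib counterpart fixes Int ids and a Double.
val a = MLRating(0, 1, 4.5f)    // Rating[Int](user, item, rating: Float)
val b = MLlibRating(0, 1, 4.5)  // Rating(user, product, rating: Double)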

-- 
Best,
Maciej Szymkiewicz






Re: DataFrame API and Ordering

2016-02-19 Thread Maciej Szymkiewicz
I am not sure. The Spark SQL, DataFrames and Datasets Guide already has a
section about NaN semantics. That could be a good place to add at least a
basic description.

For the rest, InterpretedOrdering could be a good choice.
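
A quick probe of the behaviours in question (a sketch, assuming a Spark
shell with a SparkSession named spark; the exact placement of NULL and NaN
is precisely what the documentation should spell out):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, max}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Doubles including NaN and NULL, built with an explicit schema.
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1.0), Row(Double.NaN), Row(0.0), Row(null))),
  StructType(Seq(StructField("x", DoubleType, nullable = true))))

// Where do NULL and NaN land relative to ordinary values?
df.orderBy(col("x").asc).show()
df.orderBy(col("x").desc).show()

// functions.max accepts any orderable type (strings here), while
// GroupedData.max accepts only numeric columns.
val s = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("b"), Row("a"))),
  StructType(Seq(StructField("s", StringType, nullable = true))))
s.select(max(col("s"))).show()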

On 02/19/2016 12:35 AM, Reynold Xin wrote:
> You are correct and we should document that.
>
> Any suggestions on where we should document this? In DoubleType and
> FloatType?
>
> On Tuesday, February 16, 2016, Maciej Szymkiewicz
> <mszymkiew...@gmail.com> wrote:
>
> I am not sure if I've missed something obvious, but as far as I can tell
> the DataFrame API doesn't provide clearly defined ordering rules beyond
> NaN handling. Methods like DataFrame.sort or sql.functions such as min /
> max provide only a general description. The discrepancy between
> functions.max (min) and GroupedData.max, where the latter supports only
> numeric columns, makes the current situation even more confusing. With a
> growing number of orderable types I believe the documentation should
> clearly define the ordering rules, including:
>
> - NULL behavior
> - collation
> - behavior on complex types (structs, arrays)
>
> While this information can be extracted from the source, it is not
> easily accessible, and without an explicit specification it is not clear
> whether the current behavior is contractual. It can also be confusing if
> a user expects an ordering that depends on the current locale (as in R).
>
> Best,
> Maciej
>





DataFrame API and Ordering

2016-02-16 Thread Maciej Szymkiewicz
I am not sure if I've missed something obvious, but as far as I can tell
the DataFrame API doesn't provide clearly defined ordering rules beyond
NaN handling. Methods like DataFrame.sort or sql.functions such as min /
max provide only a general description. The discrepancy between
functions.max (min) and GroupedData.max, where the latter supports only
numeric columns, makes the current situation even more confusing. With a
growing number of orderable types I believe the documentation should
clearly define the ordering rules, including:

- NULL behavior
- collation
- behavior on complex types (structs, arrays)

While this information can be extracted from the source, it is not easily
accessible, and without an explicit specification it is not clear whether
the current behavior is contractual. It can also be confusing if a user
expects an ordering that depends on the current locale (as in R).

Best,
Maciej


