Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
/pyspark-stubs/graphs/contributors) and at least some use cases (https://stackoverflow.com/q/40163106/). So, subjectively speaking, it seems we're already beyond POC. -- Best regards, Maciej Szymkiewicz Web: https://zero323.net Keybase: https://keybase.io/zero323 Gigs: https://www.

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
sync right? > Do you provide different stubs for different versions of Python? I had to > look up the literals: https://www.python.org/dev/peps/pep-0586/ > I think it is more about portability between Spark versions > > > Cheers, Fokko > > Op wo 22 jul. 2020 om 09:40 schr

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz

Re: Scala vs PySpark Inconsistency: SQLContext/SparkSession access from DataFrame/DataSet

2020-03-18 Thread Maciej Szymkiewicz
; treated as private. > > Is this intentional?  If so, what's the rationale?  If not, then it > feels like a bug and DataFrame should have some form of public access > back to the context/session.  I'm happy to log the bug but thought I > would ask here first.  Thanks! --

Re: Apache Spark Docker image repository

2020-02-06 Thread Maciej Szymkiewicz
> (This can be used in GitHub Action Jobs and Jenkins K8s Integration Tests to speed up jobs and to have more stable environments) > Bests, > Dongjoon.

Re: [DISCUSS] PostgreSQL dialect

2019-11-26 Thread Maciej Szymkiewicz
I think it is important to distinguish between two different concepts: * Adherence to standards and their well established implementations. * Enabling migrations from some product X to Spark. While these two problems are related, they are independent and one can be achieved without the other

Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-30 Thread Maciej Szymkiewicz
rt next January > (https://spark.apache.org/versioning-policy.html), > I'm +1 for the deprecation (Python < 3.6) > at Apache Spark 3.0.0. > > It's just a deprec

[DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-24 Thread Maciej Szymkiewicz
Hi everyone, While deprecation of Python 2 in 3.0.0 has been announced, there is no clear statement about continued support for specific Python 3 versions. Specifically: * Python 3.4 has been retired this year.

Is SPARK-9961 still relevant?

2019-10-05 Thread Maciej Szymkiewicz
Hi everyone, I just encountered SPARK-9961, which seems to be largely outdated today. In the latest releases the majority of models compute different evaluation metrics exposed later through corresponding summaries.  At the same time such defaultEval

Re: Introduce FORMAT clause to CAST with SQL:2016 datetime patterns

2019-03-20 Thread Maciej Szymkiewicz
One concern here is the introduction of a second formatting convention. This can not only cause confusion among users, but also result in some hard-to-spot bugs when a wrong format, with a different meaning, is used. This is already a problem for Python and R users, with week year and months / minutes mixups
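The week-year trap mentioned above can be sketched with Python's own format codes. This is an analogy only: Spark's datetime patterns follow the JVM style, not `strftime`, but the same two conventions collide at year boundaries.

```python
from datetime import date

d = date(2018, 12, 31)  # a Monday, which falls in ISO week 1 of 2019

calendar_year = d.strftime("%Y")  # calendar year
week_year = d.strftime("%G")      # ISO 8601 week-based year

# The two conventions disagree at year boundaries, which is exactly
# the kind of hard-to-spot bug a second formatting convention invites.
print(calendar_year, week_year)  # 2018 2019
```

The same one-character slip exists between months and minutes (`%m` vs `%M` in strftime terms), which is the other mixup the message refers to.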

Re: Feature request: split dataset based on condition

2019-02-03 Thread Maciej Szymkiewicz
If the goal is to split the output, then `DataFrameWriter.partitionBy` should do what you need, and no additional methods are required. If not, you can also check Silex's muxPartitions implementation (see https://stackoverflow.com/a/37956034), but the applications are rather limited, due to high res
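The requested feature amounts to routing each row to one of several outputs by a condition. A plain-Python sketch of those semantics (not a Spark API; on Spark the usual workaround is two `filter()` calls or `partitionBy` on write):

```python
def split_by(rows, predicate):
    """Route each row into one of two buckets based on a predicate,
    in a single pass over the data."""
    matched, rest = [], []
    for row in rows:
        (matched if predicate(row) else rest).append(row)
    return matched, rest

evens, odds = split_by(range(6), lambda x: x % 2 == 0)
# evens == [0, 2, 4], odds == [1, 3, 5]
```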

[PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Maciej Szymkiewicz
Hello everyone, I'd like to revisit the topic of adding PySpark type annotations in 3.0. It has been discussed before ( http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html and http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySp

Re: Documentation of boolean column operators missing?

2018-10-23 Thread Maciej Szymkiewicz
Even if these were documented, Sphinx doesn't include dunder methods by default (with the exception of __init__). There is the :special-members: option which could be passed to, for example, autoclass. On Tue, 23 Oct 2018 at 21:32, Sean Owen wrote: > (& and | are both logical and bitwise operators in Jav
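For a project-wide default rather than per-directive options, the same effect can be sketched in `conf.py` (assuming the docs use `sphinx.ext.autodoc`; the listed dunder names are illustrative, chosen because they back the `&`, `|`, `~` column operators):

```python
# conf.py fragment (Sphinx >= 1.8 syntax for autodoc defaults)
extensions = ["sphinx.ext.autodoc"]

autodoc_default_options = {
    # document the operator dunders instead of skipping them
    "special-members": "__and__, __or__, __invert__",
}
```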

Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
arry on, but do we want to take that baggage into Apache Spark 3.x > era? The next time you may drop it would be only 4.0 release because > of breaking change. > > -- > ,,,^..^,,, > On Sat, Sep 15, 2018 at 2:21 PM Maciej Szymkiewicz > wrote: > > > > There is no need to d

Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
There is no need to ditch Python 2. There are basically two options - Use stub files and limit yourself to Python 3 support only. Python 3 users benefit from type hints, Python 2 users don't, but no core functionality is affected. This is the approach I've used with https://git
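The stub-file approach keeps annotations entirely out of the runtime code: a `.pyi` file ships the types, and type checkers pick it up while Python 2 interpreters never see it. A hypothetical fragment (names mirror the public RDD API for illustration; this is not the actual pyspark-stubs content):

```python
# rdd.pyi -- illustrative stub fragment, not executed at runtime
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
U = TypeVar("U")

class RDD(Generic[T]):
    def map(self, f: Callable[[T], U]) -> "RDD[U]": ...
    def filter(self, f: Callable[[T], bool]) -> "RDD[T]": ...
    def collect(self) -> list: ...
```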

Re: [DISCUSS] move away from python doctests

2018-08-29 Thread Maciej Szymkiewicz
Hi Imran, On Wed, 29 Aug 2018 at 22:26, Imran Rashid wrote: > Hi Li, > > yes that makes perfect sense. That more-or-less is the same as my view, > though I framed it differently. I guess in that case, I'm really asking: > > Can pyspark changes please be accompanied by more unit tests, and not

Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Maciej Szymkiewicz
Given popularity of related SO questions: - https://stackoverflow.com/q/41670103/1560062 - https://stackoverflow.com/q/42465568/1560062 - https://stackoverflow.com/q/41670103/1560062 it is probably more "nobody thought about asking" than "it is not used often". On Wed, 22 Aug 2018 at
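UNPIVOT (melt) turns one wide row into one long row per value column. A plain-Python sketch of the semantics (on Spark the usual workaround at the time was the SQL `stack` expression or an explode over a struct array):

```python
def melt(rows, id_cols, value_cols, key_name="key", value_name="value"):
    """Melt wide records (dicts) into long form: one output record per
    (input record, value column) pair."""
    out = []
    for row in rows:
        for col in value_cols:
            rec = {c: row[c] for c in id_cols}
            rec[key_name] = col
            rec[value_name] = row[col]
            out.append(rec)
    return out

wide = [{"id": 1, "a": 10, "b": 20}]
long_rows = melt(wide, ["id"], ["a", "b"])
# → [{'id': 1, 'key': 'a', 'value': 10}, {'id': 1, 'key': 'b', 'value': 20}]
```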

Re: Increase Timeout or optimize Spark UT?

2017-08-24 Thread Maciej Szymkiewicz

Re: Increase Timeout or optimize Spark UT?

2017-08-22 Thread Maciej Szymkiewicz
Hi, From my experience it is possible to cut quite a lot by reducing spark.sql.shuffle.partitions to some reasonable value (let's say comparable to the number of cores). 200 is a serious overkill for most of the test cases anyway. Best, Maciej On 21 August 2017 at 03:00, Dong Joon Hyun wrot
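A sketch of what such a test-session setup could look like, assuming pyspark is importable; the value 4 is an arbitrary small number meant to be comparable to local core count, not a recommendation:

```python
# Hypothetical test fixture: a local session with far fewer shuffle
# partitions than the default of 200, to speed up small test queries.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[4]")
    .config("spark.sql.shuffle.partitions", "4")
    .getOrCreate()
)
```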

Re: Possible bug: inconsistent timestamp behavior

2017-08-15 Thread Maciej Szymkiewicz

Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Maciej Szymkiewicz
Since 2.2 there is Imputer: https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py which should at least partially address the problem. On 06/22/2017 03:03 AM, Franklyn D'souza wrote: > I just wanted to highlight some of the rough edges around using > vect
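What `Imputer` with `strategy="mean"` does can be sketched in plain Python, over a list where `None` marks a missing value (an illustration of the semantics, not the Spark API):

```python
def impute_mean(values):
    """Replace missing values (None) with the mean of observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

impute_mean([1.0, None, 3.0])  # → [1.0, 2.0, 3.0]
```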

Re: spark messing up handling of native dependency code?

2017-06-02 Thread Maciej Szymkiewicz
Maybe not related, but in general geotools are not thread safe, so using them from workers is most likely a gamble. On 06/03/2017 01:26 AM, Georg Heiler wrote: > Hi, > > There is a weird problem with spark when handling native dependency code: > I want to use a library (JAI) with spark to parse some spa

Re: [PYTHON] PySpark typing hints

2017-05-23 Thread Maciej Szymkiewicz
st the existing pyspark, they just have > to be run with a compatible packaging (e.g. mypy). > > Meaning that porting for python 2 would provide a very small advantage > over the immediate advantages (IDE usage and testing for most cases). > > > > Am I missing something? >

Re: [PYTHON] PySpark typing hints

2017-05-23 Thread Maciej Szymkiewicz
2 (dynamic metaclasses), which could be resolved without significant loss of functionality. On 05/23/2017 12:08 PM, Reynold Xin wrote: > Seems useful to do. Is there a way to do this so it doesn't break > Python 2.x? > > > On Sun, May 14, 2017 at 11:44 PM, Maciej Szymkiewicz > mail

[PYTHON] PySpark typing hints

2017-05-14 Thread Maciej Szymkiewicz
Hi everyone, For the last few months I've been working on static type annotations for PySpark. For those of you, who are not familiar with the idea, typing hints have been introduced by PEP 484 (https://www.python.org/dev/peps/pep-0484/) and further extended with PEP 526 (https://www.python.org/de
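A minimal example of what PEP 484 annotations look like in ordinary code (a generic illustration, not PySpark's API): the annotations are inert at runtime but let a checker such as mypy flag wrong argument or return types.

```python
from typing import List, Optional

def closest(values: List[float], target: float) -> Optional[float]:
    """Return the element nearest to target, or None for empty input."""
    if not values:
        return None
    return min(values, key=lambda v: abs(v - target))

closest([1.0, 5.0, 9.0], 6.0)  # → 5.0
```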

Re: Why did spark switch from AKKA to net / ...

2017-05-07 Thread Maciej Szymkiewicz
https://issues.apache.org/jira/browse/SPARK-5293 On 05/07/2017 08:59 PM, geoHeil wrote: > Hi, > > I am curious why spark (with 2.0 completely) removed any akka dependencies > for RPC and switched entirely to (as far as I know netty) > > regards, > Georg > > > > -- > View this message in context:

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-29 Thread Maciej Szymkiewicz
I am not sure if it is relevant but explode_outer and posexplode_outer seem to be broken: SPARK-20534 On 04/28/2017 12:49 AM, Sean Owen wrote: > By the way the RC looks good. Sigs and license are OK, tests pass with > -Phive -Pyarn -Phadoop-2.7.

[SQL] Unresolved reference with chained window functions.

2017-03-24 Thread Maciej Szymkiewicz
errors.package$.attachTree(package.scala:56) ... Caused by: java.lang.RuntimeException: Couldn't find AmtPaidCumSum#366 in [sum#385,max#386,x#360,AmtPaid#361] ... Is it a known issue or do we need a JIRA? -- Best, Maciej Szymkiewicz

[ML][PYTHON] Collecting data in a class extending SparkSessionTestCase causes AttributeError:

2017-03-06 Thread Maciej Szymkiewicz
Hi everyone, It is either too late or too early for me to think straight so please forgive me if it is something trivial. I am trying to add a test case extending SparkSessionTestCase to pyspark.ml.tests (example patch attached). If a test collects data, and there is another TestCase extending exten

Re: [PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-14 Thread Maciej Szymkiewicz
> py4j in our repo but could instead have a pinned version > required. While we do depend on a lot of py4j internal APIs, > version pinning should be sufficient to ensure functionality > (and simplify the update process). > > Cheers, > > Holden :)

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Maciej Szymkiewicz
Congratulations! On 02/13/2017 08:16 PM, Reynold Xin wrote: > Hi all, > > Takuya-san has recently been elected an Apache Spark committer. He's > been active in the SQL area and writes very small, surgical patches > that are high quality. Please join me in congratulating Takuya-san! > --

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-03 Thread Maciej Szymkiewicz

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Maciej Szymkiewicz

[SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-01-31 Thread Maciej Szymkiewicz
Hi everyone, While experimenting with ML pipelines I experience a significant performance regression when switching from 1.6.x to 2.x. import org.apache.spark.ml.{Pipeline, PipelineStage} import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler} val df = (1 to 40).foldLe

Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Maciej Szymkiewicz
given in days and less, even > though it could be 365 days, and fix the documentation. > 2) Explicitly disallow it as there may be a lot of data for a given > window, but partial aggregations should help with that. > > My thoughts are to go with 1. What do you think? &

[SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Maciej Szymkiewicz
Hi, Can I ask for some clarifications regarding intended behavior of window / TimeWindow? PySpark documentation states that "Windows in the order of months are not supported". This is further confirmed by the checks in TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l). Surprisingly eno

Re: [PYSPARK] Python tests organization

2017-01-12 Thread Maciej Szymkiewicz

Re: [PYSPARK] Python tests organization

2017-01-12 Thread Maciej Szymkiewicz

[PYSPARK] Python tests organization

2017-01-11 Thread Maciej Szymkiewicz
Hi, I can't help but wonder if there is any practical reason for keeping monolithic test modules. These things are already pretty large (1500 - 2200 LOCs) and can only grow. Development aside, I assume that many users use tests the same way as me, to check the intended behavior, and largish loosel

Re: [SQL][PYTHON] UDF improvements.

2017-01-10 Thread Maciej Szymkiewicz

[SQL][PYTHON] UDF improvements.

2017-01-07 Thread Maciej Szymkiewicz
Hi, I've been looking at the PySpark UserDefinedFunction and I have a couple of suggestions how it could be improved including: * Full featured decorator syntax. * Docstring handling improvements. * Lazy initialization. I summarized all suggestions with links to possible solutions in gist
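The three suggestions can be sketched together in plain Python. This is a hypothetical stand-in for the real UserDefinedFunction, not its actual implementation: a decorator factory that preserves the docstring via functools.wraps and defers expensive setup until the first call.

```python
import functools

def udf(return_type):
    """Hypothetical UDF decorator: @udf("string") usage, docstring
    preservation, and lazy initialization on first invocation."""
    def decorator(f):
        state = {"initialized": False}

        @functools.wraps(f)  # keeps __name__ and __doc__ of f
        def wrapper(*args, **kwargs):
            if not state["initialized"]:
                state["initialized"] = True  # stand-in for deferred setup
            return f(*args, **kwargs)

        wrapper.returnType = return_type
        return wrapper
    return decorator

@udf("string")
def upper(s):
    """Uppercase a string."""
    return s.upper()
```

With this shape, `help(upper)` shows the original docstring and no work happens until the function is first applied.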

Re: shapeless in spark 2.1.0

2016-12-29 Thread Maciej Szymkiewicz
ns a spark user that uses shapeless in his own > development cannot upgrade safely from 2.0.0 to 2.1.0, i think. > > wish i had noticed this sooner > -- Maciej Szymkiewicz

Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread Maciej Szymkiewicz
> scala> testUnion(5000) > > 822305 milliseconds > > res8: Long = 822305 > > View this message in context: repeated unioning of dataframes take > worse than O(N^2) t

Re: [MLLIB] RankingMetrics.precisionAt

2016-12-06 Thread Maciej Szymkiewicz
; On Tue, Dec 6, 2016 at 9:43 PM Maciej Szymkiewicz > mailto:mszymkiew...@gmail.com>> wrote: > > Thank you Sean. > > Maybe I am just confused about the language. When I read that it > returns "the average precision at the first k ranking positions" I

Re: [MLLIB] RankingMetrics.precisionAt

2016-12-06 Thread Maciej Szymkiewicz
not enough sleep. On 12/06/2016 02:45 AM, Sean Owen wrote: > I read it again and that looks like it implements mean precision@k as > I would expect. What is the issue? > > On Tue, Dec 6, 2016, 07:30 Maciej Szymkiewicz <mailto:mszymkiew...@gmail.com>> wrote: > > Hi, > &

[MLLIB] RankingMetrics.precisionAt

2016-12-05 Thread Maciej Szymkiewicz
Hi, Could I ask for a fresh pair of eyes on this piece of code: https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L59-L80 @Since("1.2.0") def precisionAt(k: Int): Double = { require(k >
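A pure-Python rendering of how that code reads (my interpretation, which is exactly what the thread is asking about): relevant items are counted among the first min(len(pred), k) predictions, but the denominator is always k, so short prediction lists are penalized.

```python
def mean_precision_at(prediction_and_labels, k):
    """Mean precision@k over (predicted_ranking, relevant_items) pairs,
    mirroring one reading of RankingMetrics.precisionAt."""
    scores = []
    for pred, labels in prediction_and_labels:
        lab_set = set(labels)
        n = min(len(pred), k)
        hits = sum(1 for p in pred[:n] if p in lab_set)
        scores.append(hits / k)  # always / k, not / n
    return sum(scores) / len(scores)

mean_precision_at([([1, 2, 3], [1, 3])], 2)  # → 0.5 (one hit in top 2)
```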

Re: Future of the Python 2 support.

2016-12-05 Thread Maciej Szymkiewicz
rit a discussion about dropping support, but I > think at this point it's premature to discuss that and we should > just wait and see. > > Nick > > > On Sun, Dec 4, 2016 at 10:59 AM Maciej Szymkiewicz > mailto:mszymkiew...@gmail.com>&

Future of the Python 2 support.

2016-12-04 Thread Maciej Szymkiewicz
Hi, I am aware there was a previous discussion about dropping support for different platforms (http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html) but somehow it has been dominated by Scala and JVM and never touched the sub

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-02 Thread Maciej Szymkiewicz
Sure, here you are: https://issues.apache.org/jira/browse/SPARK-18690 To be fair I am not fully convinced it is worth it. On 12/02/2016 12:51 AM, Reynold Xin wrote: > Can you submit a pull request with test cases based on that change? > > > On Dec 1, 2016, 9:39 AM -0800, Maciej

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-01 Thread Maciej Szymkiewicz
me boundary API > > > > Yes I'd define unboundedPreceding to -sys.maxsize, but also any value > less than min(-sys.maxsize, _JAVA_MIN_LONG) are considered > unboundedPreceding too. We need to be careful with long overflow when > transferring data over to Java. > > >

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-12-01 Thread Maciej Szymkiewicz
sys.maxsize, _JAVA_MIN_LONG) are considered > unboundedPreceding too. We need to be careful with long overflow when > transferring data over to Java. > > > On Wed, Nov 30, 2016 at 10:04 AM, Maciej Szymkiewicz > mailto:mszymkiew...@gmail.com>> wrote: > > It is platform speci

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Maciej Szymkiewicz
backwards compatibility. On 11/30/2016 06:52 PM, Reynold Xin wrote: > Ah ok for some reason when I did the pull request sys.maxsize was much > larger than 2^63. Do you want to submit a patch to fix this? > > > On Wed, Nov 30, 2016 at 9:48 AM, Maciej Szymkiewicz > mailto:mszymkiew...

Re: [SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Maciej Szymkiewicz
ed, Nov 30, 2016 at 8:27 AM, Maciej Szymkiewicz > mailto:mszymkiew...@gmail.com>> wrote: > > Hi, > > I've been looking at the SPARK-17845 and I am curious if there is any > reason to make it a breaking change. In Spark 2.0 and below we > could use: &g

[SPARK-17845] [SQL][PYTHON] More self-evident window function frame boundary API

2016-11-30 Thread Maciej Szymkiewicz
produce incorrect results (ROWS BETWEEN -1 PRECEDING AND UNBOUNDED FOLLOWING) Couldn't we use Window.unboundedPreceding equal to -sys.maxsize to ensure backward compatibility? -- Maciej Szymkiewicz
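The overflow concern behind this proposal can be sketched in plain Python. On a 64-bit CPython, -sys.maxsize is -(2**63 - 1), one above the JVM's Long.MIN_VALUE, so it still fits in a Java long; the normalization below is a hypothetical illustration, not the actual PySpark code.

```python
import sys

JAVA_MIN_LONG = -(1 << 63)  # Long.MIN_VALUE on the JVM

def to_java_boundary(value):
    """Hypothetical normalization: treat anything at or below
    -sys.maxsize as UNBOUNDED PRECEDING when sent to the JVM."""
    return JAVA_MIN_LONG if value <= -sys.maxsize else value

to_java_boundary(-sys.maxsize)  # → -9223372036854775808
```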

Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-30 Thread Maciej Szymkiewicz
another major > release. > > I agree that that issue is a major one since it relates to > correctness, but since it's not a regression it technically does not > merit a -1 vote on the release. > > Nick > > On Wed, Nov 30, 2016 at 11:00 AM Maciej Szymkiewicz > mailt

Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-30 Thread Maciej Szymkiewicz

Re: [SQL][JDBC] Possible regression in JDBC reader

2016-11-25 Thread Maciej Szymkiewicz
02ee8d2c7e995#diff-f70bda59304588cc3abfa3a9840653f4L237 > > // maropu > > On Fri, Nov 25, 2016 at 9:50 PM, Maciej Szymkiewicz > mailto:mszymkiew...@gmail.com>> wrote: > > Hi, > > I've been reviewing my notes to https://git.io/v1UVC using Spark > built from 51b1c1551d3a7147403b9e8

[SQL][JDBC] Possible regression in JDBC reader

2016-11-25 Thread Maciej Szymkiewicz
Hi, I've been reviewing my notes to https://git.io/v1UVC using Spark built from 51b1c1551d3a7147403b9e821fcc7c8f57b4824c and it looks like JDBC ignores both: * (columnName, lowerBound, upperBound, numPartitions) * predicates and loads everything into a single partition. Can anyone confirm th

Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

2016-11-22 Thread Maciej Szymkiewicz
ach partition is sorted and the order of partitions defines the global ordering. All collect does is preserve this order by creating an array of results for each partition and flattening it. > > Best > > > On Mon, Nov 21, 2016 at 3:02 PM, Maciej Szymkiewicz [via Apache Spa

Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

2016-11-21 Thread Maciej Szymkiewicz
ain/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277

Re: Handling questions in the mailing lists

2016-11-09 Thread Maciej Szymkiewicz
ck and get the verbiage for the Spark community page and > welcome email jump started, here's a working document for us to work > with: > https://docs.google.com/document/d/1N0pKatcM15cqBPqFWCqIy6jdgNzIoacZlYDCjufBh2s/edit#

Re: Handling questions in the mailing lists

2016-11-07 Thread Maciej Szymkiewicz
er@ and what goes to SO? Sure, I'll be happy to help if I can. > > > On Sun, Nov 6, 2016 at 9:54 PM, Maciej Szymkiewicz > mailto:mszymkiew...@gmail.com>> wrote: > > Damn, I always thought that mailing list is only for nice and > welcoming people and there

Re: Handling questions in the mailing lists

2016-11-06 Thread Maciej Szymkiewicz
bstantially underestimated how opinionated people can be on > mailing lists too :) > > On Sunday, November 6, 2016, Maciej Szymkiewicz > mailto:mszymkiew...@gmail.com>> wrote: > > You have to remember that Stack Overflow crowd (like me) is highly > opinionated, so many

Re: Handling questions in the mailing lists

2016-11-06 Thread Maciej Szymkiewicz
You have to remember that Stack Overflow crowd (like me) is highly opinionated, so many questions, which could be just fine on the mailing list, will be quickly downvoted and / or closed as off-topic. Just saying... -- Best, Maciej On 11/07/2016 04:03 AM, Reynold Xin wrote: > OK I've checked o

Re: java.util.NoSuchElementException when serializing Map with default value

2016-09-30 Thread Maciej Szymkiewicz
r a custom >> serializer that handles this case. Or work around it in your client >> code. I know there have been other issues with Kryo and Map because, >> for example, sometimes a Map in an application is actually some >> non-serializable wrapper view. >> >> O

java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Maciej Szymkiewicz
Hi everyone, I suspect there is no point in submitting a JIRA to fix this (not a Spark issue?) but I would like to know if this problem is documented anywhere. Somehow Kryo is losing the default value during serialization: scala> import org.apache.spark.{SparkContext, SparkConf} import org.a

Re: What happens in Dataset limit followed by rdd

2016-08-03 Thread Maciej Szymkiewicz
mply pushes down across mapping functions, > because the number of rows may change across functions. for example, > flatMap() > > It seems that limit can be pushed across map() which won’t change the > number of rows. Maybe this is a room for Spark optimisation. > >> On Aug 2, 20

Re: What happens in Dataset limit followed by rdd

2016-08-02 Thread Maciej Szymkiewicz
owever, in the second case, the optimisation in the CollectLimitExec > does not help, because the previous limit operation involves a shuffle > operation. All partitions will be computed, and running LocalLimit(1) > on each partition to get 1 row, and then all partitions are shuffled >

What happens in Dataset limit followed by rdd

2016-08-01 Thread Maciej Szymkiewicz
Hi everyone, This doesn't look like something expected, does it? http://stackoverflow.com/q/38710018/1560062 Quick glance at the UI suggest that there is a shuffle involved and input for first is ShuffledRowRDD. -- Best regards, Maciej Szymkiewicz

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-27 Thread Maciej Szymkiewicz
Hi Jacek, In this context, don't you think it would be useful if at least some traits from org.apache.spark.ml.param.shared.sharedParams were public? HasInputCol(s) and HasOutputCol for example. These are useful pretty much every time you create a custom Transformer. -- Regards, M

ML ALS API

2016-03-07 Thread Maciej Szymkiewicz
/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L436) is using float instead of double like its MLLib counterpart. Is it going to be the default encoding in 2.0+? -- Best, Maciej Szymkiewicz

Re: DataFrame API and Ordering

2016-02-19 Thread Maciej Szymkiewicz
we should document that. > > Any suggestions on where we should document this? In DoubleType and > FloatType? > > On Tuesday, February 16, 2016, Maciej Szymkiewicz > mailto:mszymkiew...@gmail.com>> wrote: > > I am not sure if I've missed something obvious but as f

DataFrame API and Ordering

2016-02-16 Thread Maciej Szymkiewicz
I am not sure if I've missed something obvious but as far as I can tell the DataFrame API doesn't provide clearly defined ordering rules, excluding NaN handling. Methods like DataFrame.sort or sql.functions like min / max provide only a general description. Discrepancy between functions.max (min) and Gr
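Why NaN handling needs to be spelled out at all can be seen in plain Python, where NaN compares false against everything, including itself, so any ordering involving it is implementation-defined unless the API documents its semantics:

```python
nan = float("nan")

# Every comparison with NaN is False, so sort order, min, and max
# give no naturally "correct" placement for it.
print(nan > 1.0, nan < 1.0, nan == nan)  # False False False
```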