Re: structured streaming join of streaming dataframe with static dataframe performance

2022-08-04 Thread Koert Kuipers
you might want to wait for more > authoritative response > > Sent from my iPhone > > On Jul 17, 2022, at 5:38 PM, Koert Kuipers wrote: > > i was surprised to find out that if a streaming dataframe is joined with a > static dataframe, that the static dataframe i

structured streaming join of streaming dataframe with static dataframe performance

2022-07-17 Thread Koert Kuipers
i was surprised to find out that if a streaming dataframe is joined with a static dataframe, that the static dataframe is re-shuffled for every microbatch, which adds considerable overhead. wouldn't it make more sense to re-use the shuffle files? or if that is not possible then load the static
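to illustrate the pattern being discussed, here is a minimal sketch of a stream-static join (paths, schema and join key are hypothetical); in this shape the static side participates in the shuffle of every microbatch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

// static dimension table, read once up front
val static = spark.read.parquet("/data/dim")

// streaming file source needs an explicit schema
val eventSchema = new StructType().add("key", StringType).add("value", LongType)
val stream = spark.readStream.schema(eventSchema).parquet("/data/events")

// stream-static equi-join; the static side is re-shuffled per microbatch today
val joined = stream.join(static, Seq("key"))

joined.writeStream
  .format("parquet")
  .option("path", "/data/out")
  .option("checkpointLocation", "/data/checkpoint")
  .start()
```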

Re: [DISCUSS] SPIP: Spark Connect - A client and server interface for Apache Spark.

2022-06-03 Thread Koert Kuipers
how would scala udfs be supported in this? On Fri, Jun 3, 2022 at 1:52 PM Martin Grund wrote: > Hi Everyone, > > We would like to start a discussion on the "Spark Connect" proposal. > Please find the links below: > > *JIRA* - https://issues.apache.org/jira/browse/SPARK-39375 > *SPIP Document* -

Re: The draft of the Spark 3.3.0 release notes

2022-06-03 Thread Koert Kuipers
i thought SPARK-36837 didnt make it in? i see it in notes On Fri, Jun 3, 2022 at 4:31 AM Maxim Gekk wrote: > Hi All, > > I am preparing the release notes of Spark 3.3.0. Here is a draft document: > >

Re: Scala 2.13 actual class used for Seq

2020-10-19 Thread Koert Kuipers
Array > anymore (I think that doesn't exist anymore in 2.13? ArraySeq now?) so user > apps may need to change for 2.13, but, there are N things that any 2.13 app > would have to change. > > On Mon, Oct 19, 2020 at 12:29 AM Koert Kuipers wrote: > >> i have gotten used to spark always re

Scala 2.13 actual class used for Seq

2020-10-18 Thread Koert Kuipers
i have gotten used to spark always returning a WrappedArray for Seq. at some point i think i even read this was guaranteed to be the case. not sure if it still is... in spark 3.0.1 with scala 2.12 i get a WrappedArray as expected: scala> val x = Seq((1,2),(1,3)).toDF x:
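a small sketch of the kind of check being discussed (column names are illustrative); the concrete runtime class returned for an array column is what differs between Scala versions:

```scala
import spark.implicits._

val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("id", "xs")
val first = df.select("xs").head().getAs[Seq[Int]](0)

// on a Scala 2.12 build this has printed a WrappedArray subclass;
// on Scala 2.13 builds the concrete Seq implementation may differ (e.g. ArraySeq)
println(first.getClass.getName)
```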

Re: Scala 3 support approach

2020-10-18 Thread Koert Kuipers
i think scala 3.0 will be able to use libraries built with Scala 2.13 (as long as they dont use macros) see: https://www.scala-lang.org/2019/12/18/road-to-scala-3.html On Sun, Oct 18, 2020 at 9:54 AM Sean Owen wrote: > Spark depends on a number of Scala libraries, so needs them all to support

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-07 Thread Koert Kuipers
it seems to me with SPARK-20202 we are no longer planning to support hadoop2 + hive 1.2. is that correct? so basically spark 3.1 will no longer run on say CDH 5.x or HDP2.x with hive? my use case is building spark 3.1 and launching on these existing clusters that are not managed by me. e.g. i do

Re: [FYI] Removing `spark-3.1.0-bin-hadoop2.7-hive1.2.tgz` from Apache Spark 3.1 distribution

2020-10-07 Thread Koert Kuipers
i am a little confused about this. i assumed spark would no longer make a distribution with hive 1.x, but the hive-1.2 profile remains. yet i see the hive-1.2 profile has been removed from pom.xml? On Wed, Sep 23, 2020 at 6:58 PM Dongjoon Hyun wrote: > Hi, All. > > Since Apache Spark 3.0.0,

Re: AQE effectiveness

2020-09-29 Thread Koert Kuipers
and should not justify turning off AQE for SQL cache. > > On Thu, Aug 20, 2020 at 10:54 PM Koert Kuipers wrote: > >> in our inhouse spark version i changed this without trouble and it didnt >> even break any tests >> just some minor changes in CacheManager it seems >>

Re: AQE effectiveness

2020-08-20 Thread Koert Kuipers
> On Thu, Aug 20, 2020 at 11:12 AM Koert Kuipers wrote: > >> we tend to have spark.sql.shuffle.partitions set very high by default >> simply because some jobs need it to be high and it's easier to then just >> set the default high instead of having people tune it manually per

AQE effectiveness

2020-08-20 Thread Koert Kuipers
we tend to have spark.sql.shuffle.partitions set very high by default simply because some jobs need it to be high and it's easier to then just set the default high instead of having people tune it manually per job. the main downside is lots of part files which leads to pressure on the driver, and
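for reference, a minimal configuration sketch of the setup being described: a high static shuffle-partition default with adaptive execution coalescing small post-shuffle partitions (values are illustrative, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("high-default-with-aqe")
  .config("spark.sql.shuffle.partitions", "4096")                  // high static default
  .config("spark.sql.adaptive.enabled", "true")                    // enable AQE
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true") // merge small shuffle partitions
  .getOrCreate()
```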

Re: 回复: [DISCUSS] Apache Spark 3.0.1 Release

2020-08-15 Thread Koert Kuipers
i noticed commit today that seems to prepare for 3.0.1-rc1: commit 05144a5c10cd37ebdbb55fde37d677def49af11f Author: Ruifeng Zheng Date: Sat Aug 15 01:37:47 2020 + Preparing Spark release v3.0.1-rc1 so i tried to build spark on that commit and i get failure in sql: 09:36:57.371 ERROR

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-22 Thread Koert Kuipers
i would like to point out that SPARK-27194 is a fault tolerance bug that causes jobs to fail when any single task is retried. for us this is a major headache because we have to keep restarting jobs (and explain that spark is really fault tolerant generally, just not here).

Re: spark lacks fault tolerance with dynamic partition overwrite

2020-04-07 Thread Koert Kuipers
ah ok i was not aware of that jira issue. i will follow the progress there. thanks for letting me know On Tue, Apr 7, 2020 at 11:20 AM wuyi wrote: > Hi, Koert, > > The community is back to this issue recently and there's already a fix > https://github.com/apache/spark/pull/26339 for it. >

spark lacks fault tolerance with dynamic partition overwrite

2020-04-02 Thread Koert Kuipers
i wanted to highlight here the issue we are facing with dynamic partition overwrite. it seems that any task that writes to disk using this feature and that needs to be retried fails upon retry, leading to a failure for the entire job. we have seen this issue show up with preemption (task gets
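for context, a minimal sketch of the write pattern in question (path and partition column are hypothetical):

```scala
// dynamic mode: only the partitions present in the dataframe are overwritten
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.write
  .mode("overwrite")
  .partitionBy("date")
  .parquet("/data/table")
```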

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-04 Thread Koert Kuipers
>> I second this. If we need to keep a Hadoop 2.x profile around, why not >> make it Hadoop 2.8 or something newer? >> > > go for 2.9 > >> >> Koert Kuipers wrote: >> >>> given that latest hdp 2.x is still hadoop 2.7 bumping hadoop

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-02 Thread Koert Kuipers
You can get the bug fixes. For > Hadoop 2.7, Apache Spark 3.0 will have the profile and the binary release. > Making Hadoop 3.2 profile as a default is irrelevant to that. > > Bests, > Dongjoon. > > > On Sat, Nov 2, 2019 at 09:35 Koert Kuipers wrote: > >> i don

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-02 Thread Koert Kuipers
i dont see how we can be close to the point where we dont need to support hadoop 2.x. this does not agree with the reality from my perspective, which is that all our clients are on hadoop 2.x. not a single one is on hadoop 3.x currently. this includes deployments of cloudera distros, hortonworks

Re: concurrent writes with dynamic partition overwrite mode

2019-09-01 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-28945 On Sun, Sep 1, 2019 at 2:53 PM Koert Kuipers wrote: > hi, > i am struggling to understand if concurrent writes to same basedir but > different partitions are safe with file sources such as parquet. > > i tested this in spark 2.4

concurrent writes with dynamic partition overwrite mode

2019-09-01 Thread Koert Kuipers
hi, i am struggling to understand if concurrent writes to same basedir but different partitions are safe with file sources such as parquet. i tested this in spark 2.4 and spark 3.0.0-SNAPSHOT with real concurrent jobs on hdfs and it seemed to work fine. admittedly this was a rather limited test.

Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

2019-05-20 Thread Koert Kuipers
That one may be a real issue if it's > considered provided but isn't used that way. > > > On Mon, May 20, 2019 at 4:15 PM Koert Kuipers wrote: > > > > we run it without issues on hadoop 2.6 - 2.8 on top of my head. > > > > we however do some post-processing on the

Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

2019-05-20 Thread Koert Kuipers
we run it without issues on hadoop 2.6 - 2.8 on top of my head. we however do some post-processing on the tarball: 1) we fix the ownership of the files inside the tar.gz file (should be uid/gid 0/0, otherwise untarring by root can lead to ownership by unknown user). 2) add avro-1.8.2.jar and

Re: Is there value in publishing nightly snapshots?

2019-04-16 Thread Koert Kuipers
we have used it at times to detect any breaking changes, since it allows us to run our internal unit tests against spark snapshot binaries but we can also build these snapshots in-house if you want to turn it off On Tue, Apr 16, 2019 at 9:29 AM Sean Owen wrote: > I noticed recently ... > >

Re: Do you use single-quote syntax for the DataFrame API?

2019-03-31 Thread Koert Kuipers
i don't care much about the symbol class but i find 'a much easier on the eye than $"a" or "a" and we use it extensively as such in many DSLs including spark. so its the syntax i would like to preserve not the class, which seems to be the opposite of what they are suggesting. On Sun, Mar 31,
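for reference, the three equivalent column-reference spellings being compared (column name is illustrative):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

df.select('a)        // symbol syntax, the style preferred above
df.select($"a")      // string-interpolator syntax
df.select(col("a"))  // explicit function call
```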

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-02-01 Thread Koert Kuipers
introducing hive serdes in sql core sounds a bit like a step back to me. how can you build spark without hive support if there are imports for org.apache.hadoop.hive.serde2 in sql core? are these imports very limited in scope (and not suck all of hive into it)? On Fri, Feb 1, 2019 at 3:03 PM

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-19 Thread Koert Kuipers
i deployed 2.4.0 RC3 on our dev cluster and ran into issue with spark shell and jline. there is probably a simple workaround so this is not a serious issue but just wanted to let you know. https://issues.apache.org/jira/browse/SPARK-25783 On Mon, Oct 15, 2018 at 4:59 PM Imran Rashid wrote: > I

Re: Coalesce behaviour

2018-10-15 Thread Koert Kuipers
a shuffling is needed. > > > On 15.10.2018, at 05:04, Koert Kuipers wrote: > > sure, i understand currently the workaround is to add a shuffle. but > that's just a workaround, not a satisfactory solution: we shouldn't have to > introduce another shuffle (an expensive operation) jus

Re: Coalesce behaviour

2018-10-14 Thread Koert Kuipers
shuffle = true)` does. > > On Sat, Oct 13, 2018 at 10:05 PM Koert Kuipers wrote: > >> we have a collection of programs in dataframe api that all do big >> shuffles for which we use 2048+ partitions. this works fine but it produces >> a lot of (small) output files, which put p

Re: Coalesce behaviour

2018-10-13 Thread Koert Kuipers
-> 20 reducers. The > sorting is done by 1000 tasks so no OOM. > > BTW have you tried DataFrame API? With Spark SQL, the memory management is > more precise, so even we only have 20 tasks to do the heavy sorting, the > system should just have more disk spills instead of OOM.

Re: Coalesce behaviour

2018-10-12 Thread Koert Kuipers
and empty output files from these empty partitions >> which can be beaten by means of LazyOutputFormat in case of RDDs. >> >> >> >> On Mon, Oct 8, 2018, 23:57 Koert Kuipers wrote: >>> although i personally would describe this as a bug the answer will

Re: Coalesce behaviour

2018-10-08 Thread Koert Kuipers
although i personally would describe this as a bug the answer will be that this is the intended behavior. the coalesce "infects" the shuffle before it, making a coalesce useless for reducing output files after a shuffle with many partitions by design. your only option left is a repartition for
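a small sketch of the difference being described (partition counts are illustrative): coalesce placed after a wide shuffle lowers the parallelism of that shuffle itself, while repartition adds an extra shuffle but leaves the upstream stage at full parallelism:

```scala
// wide shuffle, e.g. 2048 shuffle partitions
val aggregated = df.groupBy("key").count()

// coalesce makes the aggregation itself run with only 16 tasks
aggregated.coalesce(16).write.parquet("/data/out-coalesce")

// repartition keeps the aggregation at 2048 tasks and adds one more shuffle
aggregated.repartition(16).write.parquet("/data/out-repartition")
```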

Re: Naming policy for packages

2018-08-15 Thread Koert Kuipers
ok it doesnt sound so bad if the maven identifier can have spark in it. no big deal! otherwise i was going to suggest "kraps". like kraps-xml scala> "spark".reverse res0: String = kraps On Wed, Aug 15, 2018 at 2:43 PM, Sean Owen wrote: > I'd refer you again to the trademark policy. In the

Re: Naming policy for packages

2018-08-15 Thread Koert Kuipers
mhhh thats somewhat unfortunate? its helpful to me that something is called say spark-xml, it tells me its xml for spark! any other name would probably be less informative. or is this still allowed? On Wed, Aug 15, 2018 at 11:35 AM, Reynold Xin wrote: > Unfortunately that’s an Apache

Re: [DISCUSS][SQL] Control the number of output files

2018-08-10 Thread Koert Kuipers
SPARK-23889 before. You can find some > mentions in the design document inside 23889. > Best regards > Lukas > > 2018-08-06 18:34 GMT+02:00 Koert Kuipers : > >> i went through the jiras targeting 2.4.0 trying to find a feature where >> spark would coalesce/repartition by size (s

Re: [DISCUSS][SQL] Control the number of output files

2018-08-06 Thread Koert Kuipers
i went through the jiras targeting 2.4.0 trying to find a feature where spark would coalesce/repartition by size (so merge small files automatically), but didn't find it. can someone point me to it? thank you. best, koert On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers wrote: > lukas, >

Re: [DISCUSS][SQL] Control the number of output files

2018-08-05 Thread Koert Kuipers
lukas, what is the jira ticket for this? i would like to follow its activity. thanks! koert On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec wrote: > Hi, > Yes, This feature is planned - Spark should be soon able to repartition > output by size. > Lukas > > > On Wed, Jul 25, 2018 at 23:26,

Re: eager execution and debuggability

2018-05-08 Thread Koert Kuipers
yeah we run into this all the time with new hires. they will send emails explaining there is an error in the .write operation and they are debugging the writing to disk, focusing on that piece of code :) unrelated, but another frequent cause for confusion is cascading errors. like the

Re: Drop the Hadoop 2.6 profile?

2018-02-08 Thread Koert Kuipers
> This practice of different profile builds was pretty unnecessary after > 2.2; > > it's mostly vestigial now. > > > > On Thu, Feb 8, 2018 at 3:57 PM Koert Kuipers <ko...@tresata.com> wrote: > >> > >> CDH 5 is still based on hadoop 2.6

Re: Drop the Hadoop 2.6 profile?

2018-02-08 Thread Koert Kuipers
oh nevermind i am used to spark builds without hadoop included. but i realize that if hadoop is included it matters if its 2.6 or 2.7... On Thu, Feb 8, 2018 at 5:06 PM, Koert Kuipers <ko...@tresata.com> wrote: > wouldn't hadoop 2.7 profile means someone by introduces usage of some >

Re: Drop the Hadoop 2.6 profile?

2018-02-08 Thread Koert Kuipers
builds was pretty unnecessary > after 2.2; it's mostly vestigial now. > > On Thu, Feb 8, 2018 at 3:57 PM Koert Kuipers <ko...@tresata.com> wrote: > >> CDH 5 is still based on hadoop 2.6 >> >> On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen <so...@cloudera.com> w

Re: Drop the Hadoop 2.6 profile?

2018-02-08 Thread Koert Kuipers
CDH 5 is still based on hadoop 2.6 On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen wrote: > Mostly just shedding the extra build complexity, and builds. The primary > little annoyance is it's 2x the number of flaky build failures to examine. > I suppose it allows using a 2.7+-only

Re: no-reopen-closed?

2018-01-28 Thread Koert Kuipers
what would prevent a person who kept reopening a ticket from now creating a new ticket every time it is closed? On Sat, Jan 27, 2018 at 8:41 PM, Sean Owen wrote: > Yeah you'd have to create a new one. You could link the two. > > > On Sat, Jan 27, 2018, 7:07 PM Xiao Li

Re: Spark 3

2018-01-19 Thread Koert Kuipers
i was expecting to be able to move to scala 2.12 sometime this year if this cannot be done in spark 2.x then that could be a compelling reason to move spark 3 up to 2018 i think hadoop 3 sounds great but personally i have no use case for it yet On Fri, Jan 19, 2018 at 12:31 PM, Sean Owen

Re: Kryo 4 serialized form changes -- a problem?

2018-01-19 Thread Koert Kuipers
release. This one > looked potentially entirely internal. > I think if there are any doubts, we can leave it for Spark 3. There was a > bug report that needed a fix from Kryo 4, but it might be minor after all. > > > On Fri, Jan 19, 2018 at 11:05 AM Koert Kuipers <ko...@tresata.co

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-19 Thread Koert Kuipers
If a feature added recently breaks using kryo serializer with 2000+ partitions then how can it not be a regression? I mean I use kryo with more than 2000 partitions all the time, and it worked before. Or was I simply not hitting this bug because there are other conditions that also need to be

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Koert Kuipers
sounds like you are running into the fact that you cannot really put your classes before spark's on classpath? spark's switches to support this never really worked for me either. inability to control the classpath + inconsistent jars => trouble ? On Mon, May 1, 2017 at 2:36 PM, Frank Austin

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-28 Thread Koert Kuipers
we have been testing the 2.2.0 snapshots in the last few weeks for inhouse unit tests, integration tests and real workloads and we are very happy with it. the only issue i had so far (some encoders not being serializable anymore) has already been dealt with by wenchen. On Thu, Apr 27, 2017 at 6:49

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-28 Thread Koert Kuipers
this is column names containing dots that do not target fields inside structs? so not a.b as in field b inside struct a, but somehow a field called a.b? i didnt even know it is supported at all. its something i would never try because it sounds like a bad idea to go there... On Fri, Apr 28, 2017

Re: distributed computation of median

2017-04-17 Thread Koert Kuipers
Also q-tree is implemented in algebird, not hard to get it going in spark. That is another probabilistic data structure that is useful for this. On Apr 17, 2017 11:27, "Jason White" wrote: > Have you looked at t-digests? > > Calculating percentiles (including medians)

2.2 branch

2017-04-13 Thread Koert Kuipers
i see there is no 2.2 branch yet for spark. has this been pushed out until after 2.1.1 is done? thanks!

Re: internal unit tests failing against the latest spark master

2017-04-12 Thread Koert Kuipers
i confirmed that an Encoder[Array[Int]] is no longer serializable, and with my spark build from march 7 it was. i believe the issue is commit 295747e59739ee8a697ac3eba485d3439e4a04c3 and i sent wenchen an email about it. On Wed, Apr 12, 2017 at 4:31 PM, Koert Kuipers <ko...@tresata.com>

Re: internal unit tests failing against the latest spark master

2017-04-12 Thread Koert Kuipers
i believe the error is related to an org.apache.spark.sql.expressions.Aggregator where the buffer type (BUF) is Array[Int] On Wed, Apr 12, 2017 at 4:19 PM, Koert Kuipers <ko...@tresata.com> wrote: > hey all, > today i tried upgrading the spark version we use internally by crea

internal unit tests failing against the latest spark master

2017-04-12 Thread Koert Kuipers
hey all, today i tried upgrading the spark version we use internally by creating a new internal release from the spark master branch. last time i did this was march 7. with this updated spark i am seeing some serialization errors in the unit tests for our own libraries. looks like a scala

Re: Spark Improvement Proposals

2017-03-09 Thread Koert Kuipers
gonna end up with a stackoverflow on recursive votes here On Thu, Mar 9, 2017 at 1:17 PM, Mark Hamstra wrote: > -0 on voting on whether we need a vote. > > On Thu, Mar 9, 2017 at 9:00 AM, Reynold Xin wrote: > >> I'm fine without a vote. (are we

Re: Spark join over sorted columns of dataset.

2017-03-03 Thread Koert Kuipers
For RDD the shuffle is already skipped but the sort is not. In spark-sorted we track partitioning and sorting within partitions for key-value RDDs and can avoid the sort. See: https://github.com/tresata/spark-sorted For Dataset/DataFrame such optimizations are done automatically, however it's

Re: Straw poll: dropping support for things like Scala 2.10

2017-03-02 Thread Koert Kuipers
given the issues with scala 2.10 and java 8 i am in favor of dropping scala 2.10 in next release On Sat, Feb 25, 2017 at 2:10 PM, Sean Owen wrote: > I want to bring up the issue of Scala 2.10 support again, to see how > people feel about it. Key opinions from the previous

Re: Request for comments: Java 7 removal

2017-02-14 Thread Koert Kuipers
what about the conversation about dropping scala 2.10? On Fri, Feb 10, 2017 at 11:47 AM, Sean Owen wrote: > As you have seen, there's a WIP PR to implement removal of Java 7 support: > https://github.com/apache/spark/pull/16871 > > I have heard several +1s at

Re: benefits of code gen

2017-02-10 Thread Koert Kuipers
> On Fri, Feb 10, 2017 at 1:32 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> based on that i take it that math functions would be primary >> beneficiaries since they work on primitives. >> >> so if i take UnaryMathExpression as an ex

Re: benefits of code gen

2017-02-10 Thread Koert Kuipers
> With complex types it doesn't work as well, but for primitive types the > biggest benefit of whole stage codegen is that we don't even need to put > the intermediate data into rows or columns anymore. They are just variables > (stored in CPU registers). > >

benefits of code gen

2017-02-10 Thread Koert Kuipers
so i have been looking for a while now at all the catalyst expressions, and all the relatively complex codegen going on. so first off i get the benefit of codegen to turn a bunch of chained iterator transformations into a single codegen stage for spark. that makes sense to me, because it avoids a

when is doGenCode called?

2017-02-08 Thread Koert Kuipers
hello all, i am trying to add an Expression to catalyst. my Expression compiles fine and has: override def eval(input: InternalRow): Any = ... override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = ... it also seems to run fine. but i only ever see eval get called. how do i tell

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-01-22 Thread Koert Kuipers
could this be related to SPARK-18787? On Sun, Jan 22, 2017 at 1:45 PM, Reynold Xin wrote: > Are you using G1 GC? G1 sometimes uses a lot more memory than the size > allocated. > > > On Sun, Jan 22, 2017 at 12:58 AM StanZhai wrote: > >> Hi all, >> >> >>

Re: shapeless in spark 2.1.0

2016-12-29 Thread Koert Kuipers
> > On Thu, Dec 29, 2016, 1:05 PM Sean Owen <so...@cloudera.com> wrote: > >> It is breeze, but, what's the option? It can't be excluded. I think this >> falls in the category of things an app would need to shade in this >> situation. >> >> On

shapeless in spark 2.1.0

2016-12-29 Thread Koert Kuipers
i just noticed that spark 2.1.0 bring in a new transitive dependency on shapeless 2.0.0 shapeless is a popular library for scala users, and shapeless 2.0.0 is old (2014) and not compatible with more current versions. so this means a spark user that uses shapeless in his own development cannot
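one possible workaround along the lines suggested in the reply above (shading in the application) is to rename the application's own shapeless with sbt-assembly so it cannot clash with the 2.0.0 that spark pulls in via breeze; an illustrative sbt snippet, not an official recommendation:

```scala
// in build.sbt, with the sbt-assembly plugin enabled
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("shapeless.**" -> "myapp.shaded.shapeless.@1").inAll
)
```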

Re: Aggregating over sorted data

2016-12-22 Thread Koert Kuipers
degradation. Also when you have more than one key in > a > partition, you will need to take care of that in your function applied to > each partition. > > > Koert Kuipers wrote > > it can also be done with repartition + sortWithinPartitions + > > mapPartitions. >

Re: Aggregating over sorted data

2016-12-21 Thread Koert Kuipers
it can also be done with repartition + sortWithinPartitions + mapPartitions. perhaps not as convenient but it does not rely on undocumented behavior. i used this approach in spark-sorted. see here:
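for illustration, a rough sketch of that approach (column names and the per-group logic are placeholders): repartition by key so each key lands in one partition, sort within the partition, then walk the sorted rows with mapPartitions:

```scala
import org.apache.spark.sql.Row
import spark.implicits._

val processed = df
  .repartition($"key")                     // all rows for a key end up in the same partition
  .sortWithinPartitions($"key", $"time")   // rows within a partition arrive grouped and ordered
  .mapPartitions { (rows: Iterator[Row]) =>
    // streaming per-group logic goes here; this just projects two columns
    rows.map(r => (r.getAs[String]("key"), r.getAs[Long]("time")))
  }
```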

Re: Aggregating over sorted data

2016-12-21 Thread Koert Kuipers
i think this works but it relies on groupBy and agg respecting the sorting. the api provides no such guarantee, so this could break in future versions. i would not rely on this i think... On Dec 20, 2016 18:58, "Liang-Chi Hsieh" wrote: Hi, Can you try the combination of

Re: Aggregating over sorted data

2016-12-19 Thread Koert Kuipers
take a look at: https://issues.apache.org/jira/browse/SPARK-15798 On Dec 19, 2016 00:17, "Robin East" wrote: This is also a feature we need for our time-series processing > On 19 Dec 2016, at 04:07, Liang-Chi Hsieh wrote: > > > Hi, > > As I know,

Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-12-04 Thread Koert Kuipers
okay with me. On Sun, Dec 4, 2016 at 11:16 PM, Koert Kuipers <ko...@tresata.com> wrote: > somewhere between rc1 and the current head of branch-2.1 i started seeing > an NPE in our in-house unit tests for Dataset + Aggregator. i created > SPARK-18711 <https://issues.apache.org

Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-12-04 Thread Koert Kuipers
somewhere between rc1 and the current head of branch-2.1 i started seeing an NPE in our in-house unit tests for Dataset + Aggregator. i created SPARK-18711 for this. On Mon, Nov 28, 2016 at

Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-30 Thread Koert Kuipers
after seeing Hyukjin Kwon's comment in SPARK-17583 i think its safe to say that what i am seeing with csv is not a bug or regression. it was unintended and/or unreliable behavior in spark 2.0.x On Wed, Nov 30, 2016 at 5:56 PM, Koert Kuipers <ko...@tresata.com> wrote: > running our inh

Re: [SparkStreaming] 1 SQL tab for each SparkStreaming batch in SparkUI

2016-11-22 Thread Koert Kuipers
you are creating a new hive context per microbatch? is that a good idea? On Tue, Nov 22, 2016 at 8:51 AM, Dirceu Semighini Filho < dirceu.semigh...@gmail.com> wrote: > Has anybody seen this behavior (see tha attached picture) in Spark > Streaming? > It started to happen here after I changed the

Re: getting encoder implicits to be more accurate

2016-11-14 Thread Koert Kuipers
, Koert Kuipers <ko...@tresata.com> wrote: > that makes sense. we have something like that inhouse as well, but not as > nice and not using shapeless (we simply relied on sbt-boilerplate to handle > all tuples and do not support case classes). > > however the frustrating p

Re: getting encoder implicits to be more accurate

2016-11-14 Thread Koert Kuipers
scala> val x = format.read(Row("a", Seq(Row("a", 5)))) x: (String, Seq[(String, Option[Int])]) = (a,List((a,Some(5)))) scala> format.write(x) res13: org.apache.spark.sql.Row = [a,List([a,5])] On Mon, Nov 14, 2016 at 3:10 PM, Koert Kuipers <ko...@tresata.com> wrote: > agreed on

Re: getting encoder implicits to be more accurate

2016-11-14 Thread Koert Kuipers
implementations and > provide your own, just like any type class. > > https://github.com/upio/spark-sql-formats > > > *From:* Michael Armbrust <mich...@databricks.com> > *Date:* October 26, 2016 at 12:50:23 PM PDT > *To:* Koert Kuipers <ko...@tresata.com> > *Cc:

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-28 Thread Koert Kuipers
thats correct in my experience: we have found a scala update to be straightforward and basically somewhat invisible to ops, but a java upgrade a pain because it is managed and "certified" by ops. On Fri, Oct 28, 2016 at 9:44 AM, Steve Loughran wrote: > Twitter just led

Re: encoders for more complex types

2016-10-27 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-18147 On Thu, Oct 27, 2016 at 4:55 PM, Koert Kuipers <ko...@tresata.com> wrote: > ok will do > > On Thu, Oct 27, 2016 at 4:51 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> I would categorize these as bug

Re: encoders for more complex types

2016-10-27 Thread Koert Kuipers
open JIRAs as you find problems. > > On Thu, Oct 27, 2016 at 1:05 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> well i was using Aggregators that returned sequences of structs, or >> structs with sequences inside etc. and got compilation errors on the >> codegen.

Re: encoders for more complex types

2016-10-27 Thread Koert Kuipers
, Herman van Hövell tot Westerflier < hvanhov...@databricks.com> wrote: > What kind of difficulties are you experiencing? > > On Thu, Oct 27, 2016 at 9:57 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> i have been pushing my luck a bit and started using ExpressionEnc

encoders for more complex types

2016-10-27 Thread Koert Kuipers
i have been pushing my luck a bit and started using ExpressionEncoder for more complex types like sequences of case classes etc. (where the case classes only had primitives and Strings). it all seems to work but i think the wheels come off in certain cases in the code generation. i guess this is

Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Koert Kuipers
+1 non binding compiled and unit tested in-house libraries against 2.0.2-rc1 successfully was able to build, deploy and launch on cdh 5.7 yarn cluster on a side note... these artifacts on staging repo having version 2.0.2 instead of 2.0.2-rc1 makes it somewhat dangerous to test against it in

Re: getting encoder implicits to be more accurate

2016-10-26 Thread Koert Kuipers
if kryo could transparently be used for subtrees without narrowing the implicit that would be great On Wed, Oct 26, 2016 at 5:10 PM, Koert Kuipers <ko...@tresata.com> wrote: > i use kryo for the whole thing currently > > it would be better to use it for the subtree > > On W

Re: getting encoder implicits to be more accurate

2016-10-26 Thread Koert Kuipers
> Also, I'm saying I like the idea of having a kryo fallback. I don't see > the point of narrowing the definition of the implicit. > > On Wed, Oct 26, 2016 at 1:07 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> for example (the log shows when it create

Re: getting encoder implicits to be more accurate

2016-10-26 Thread Koert Kuipers
using kryo encoder for scala.Option[Set[String]] res6: org.apache.spark.sql.Encoder[Option[Set[String]]] = class[value[0]: binary] On Wed, Oct 26, 2016 at 4:00 PM, Koert Kuipers <ko...@tresata.com> wrote: > why would generating implicits for ProductN where you also require

Re: getting encoder implicits to be more accurate

2016-10-26 Thread Koert Kuipers
nt to > make this the default though, as behavior would change with each release > that adds support for more types. I would be very supportive of making > this fallback a built-in option though. > > On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers <ko...@tresata.com> wrote: > >>

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-26 Thread Koert Kuipers
that sounds good to me On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin wrote: > We can do the following concrete proposal: > > 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0 (Mar/Apr > 2017). > > 2. In Spark 2.1.0 release, aggressively and explicitly announce

Re: getting encoder implicits to be more accurate

2016-10-26 Thread Koert Kuipers
Michael Armbrust <mich...@databricks.com> wrote: > >> Hmm, that is unfortunate. Maybe the best solution is to add support for >> sets? I don't think that would be super hard. >> >> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers <ko...@tresata.com>

getting encoder implicits to be more accurate

2016-10-25 Thread Koert Kuipers
i am trying to use encoders as a typeclass where if it fails to find an ExpressionEncoder it falls back to KryoEncoder. the issue seems to be that ExpressionEncoder claims a little more than it can handle here: implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
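for illustration, here is the fallback done by hand with the public api (the thread is about making this automatic through the implicit): an explicit kryo-backed encoder for a type the reflection-based encoders cannot handle:

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// kryo-backed encoder for a type ExpressionEncoder struggles with
val kryoEncoder: Encoder[Option[Set[String]]] = Encoders.kryo[Option[Set[String]]]

// pass it explicitly to avoid relying on implicit resolution
val ds: Dataset[Option[Set[String]]] =
  spark.createDataset(Seq(Option(Set("a", "b")), None))(kryoEncoder)
```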

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Koert Kuipers
it will take time before all libraries that spark depends on are available for scala 2.12, so we are not talking spark 2.1.x and probably also not 2.2.x for scala 2.12 it technically makes sense to drop java 7 and scala 2.10 around the same time as scala 2.12 is introduced we are still heavily

incorrect message that path appears to be local

2016-10-12 Thread Koert Kuipers
i see this warning when running jobs on cluster: 2016-10-12 14:46:47 WARN spark.SparkContext: Spark is not running in local mode, therefore the checkpoint directory must not be on the local filesystem. Directory '/tmp' appears to be on the local filesystem. however the checkpoint "directory"
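as a small illustration of the situation (paths are hypothetical): a bare path is what triggers the warning even though it may resolve against the cluster's default filesystem, while a fully qualified uri leaves no ambiguity:

```scala
sc.setCheckpointDir("/tmp/checkpoints")         // bare path: triggers the "appears to be local" warning
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // explicit scheme: unambiguous
```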

Re: Structured Streaming with Kafka sources/sinks

2016-08-27 Thread Koert Kuipers
thats great. is this effort happening anywhere that is publicly visible? github? On Tue, Aug 16, 2016 at 2:04 AM, Reynold Xin wrote: > We (the team at Databricks) are working on one currently. > > > On Mon, Aug 15, 2016 at 7:26 PM, Cody Koeninger >

Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Koert Kuipers
it's sufficient to note this somewhere in the documentation > (or simply assume that user understanding of transformations vs. actions > means they know this), or are there other implications that need to be > considered? > > On Fri, Aug 5, 2016 at 6:50 PM Koert Kuipers <ko...@tresata

Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Koert Kuipers
The tricky part is that the action needs to be inside the with block, not just the transformation that uses the persisted data. On Aug 5, 2016 1:44 PM, "Nicholas Chammas" wrote: Okie doke, I've filed a JIRA for this here: https://issues.apache.

Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2016-07-23 Thread Koert Kuipers
removal of support at our best guess about when admins are just *planning* to migrate isn't a very good idea. Second, I don't see the significant difficulty or harm in continuing to support Java 7 for a while longer. On Sat, Jul 23, 2016 at 2:54 PM, Koert Kuipers <ko...@tresata.com> wrote: > dro

drop java 7 support for spark 2.1.x or spark 2.2.x

2016-07-23 Thread Koert Kuipers
dropping java 7 support was considered for spark 2.0.x but we decided against it. ideally dropping support for a java version should be communicated far in advance to facilitate the transition. is this the right time to make that decision and start communicating it (mailing list, jira, etc.)?

Re: spark git commit: [SPARK-15204][SQL] improve nullability inference for Aggregator

2016-07-05 Thread Koert Kuipers
[SPARK-15204][SQL] improve nullability inference for Aggregator > > > > ## What changes were proposed in this pull request? > > > > TypedAggregateExpression sets nullable based on the schema of the > outputEncoder > > > > ## How was this patch tested? > > > > Add t

Dataset and Aggregator API pain points

2016-07-02 Thread Koert Kuipers
after working with the Dataset and Aggregator apis for a few weeks porting some fairly complex RDD algos (an overall pleasant experience) i wanted to summarize the pain points and some suggestions for improvement given my experience. all of these are already mentioned on mailing list or jira, but

Re: branch-2.0 is now 2.0.1-SNAPSHOT?

2016-07-02 Thread Koert Kuipers
if > that helps. > > On Sat, Jul 2, 2016 at 5:25 PM, Koert Kuipers <ko...@tresata.com> wrote: > > You do, snapshots for spark 2.0.0-SNAPSHOT are updated daily on the > apache > snapshot repo. I use them in our own unit tests to find regressions etc. > in >

Re: branch-2.0 is now 2.0.1-SNAPSHOT?

2016-07-02 Thread Koert Kuipers
whether it should be 2.0.1-SNAPSHOT > at this stage because 2.0.0 is not yet released. But I'm not sure we > publish snapshots anyway? > > On Sat, Jul 2, 2016 at 5:41 AM, Koert Kuipers <ko...@tresata.com> wrote: > > is that correct? > > where do i get the latest 2.0.0-SNAPSHOT? > > thanks, > > koert >
