Number of partitions in Dataset aggregations

2017-03-01 Thread Jakub Dubovsky
…Any thoughts or pointers to relevant design documents are appreciated... Thanks! Jakub Dubovsky
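
A minimal sketch, assuming (since the thread body is mostly elided here) that the question concerns the fixed default number of shuffle partitions a Dataset aggregation produces; `spark`, `ds`, and the `key` column are illustrative names:

    import org.apache.spark.sql.functions.col

    spark.conf.set("spark.sql.shuffle.partitions", "50")  // applies to subsequent shuffles
    val counts = ds.groupBy(col("key")).count()
    counts.rdd.getNumPartitions                           // 50 rather than the default 200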

Re: Dataset encoders for further types?

2016-12-16 Thread Jakub Dubovsky
…that is fixed would be for you to manually specify the kryo encoder <http://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/Encoders.html#kryo(scala.reflect.ClassTag)>. On Thu, Dec 15, 2016 at 8:18 AM, Jakub Dubovsky <spark.dubovsky.ja...@gmail.c…
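
A minimal sketch of the fallback suggested above: supply the Kryo encoder explicitly for a type the built-in encoders cannot handle. `Wrapper` and `spark` are illustrative names, not from the thread:

    import org.apache.spark.sql.{Encoder, Encoders}

    class Wrapper(val items: List[String]) extends Serializable

    implicit val wrapperEncoder: Encoder[Wrapper] = Encoders.kryo[Wrapper]
    val ds = spark.createDataset(Seq(new Wrapper(List("a", "b"))))

The trade-off is that Kryo-encoded values are stored as opaque binary, so Catalyst cannot optimize or filter on the fields inside.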

Dataset encoders for further types?

2016-12-15 Thread Jakub Dubovsky
…defined case classes containing scala.collection.immutable.List(s). This does not work now because these lists are converted to ArrayType (Seq). This then fails a constructor lookup because of a seq-is-not-a-list error... This means that for now we are stuck with using RDDs. Thanks for any insights!
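
A minimal reproduction of the reported failure, with illustrative names: the product encoder maps the List field to ArrayType and hands back a Seq on read, so the constructor lookup fails.

    import spark.implicits._

    case class Record(id: Int, tags: scala.collection.immutable.List[String])

    val ds = Seq(Record(1, List("a", "b"))).toDS()
    ds.collect()  // reportedly fails in the affected versions with the seq-is-not-a-list error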

Re: import sql.implicits._

2016-10-15 Thread Jakub Dubovsky
…SparkSession from somewhere. By importing the implicits from spark.implicits._ they have access to a SparkSession for operations like this. On Fri, Oct 14, 2016 at 4:42 PM, Jakub Dubovsky <spark.dubovsky.ja...@gmail.com> wrote: …

import sql.implicits._

2016-10-14 Thread Jakub Dubovsky
Hey community, I would like to *educate* myself about why all *sql implicits* (most notably the conversion to the Dataset API) are imported from an *instance* of SparkSession rather than via static imports. With this design one runs into problems like this…
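
A sketch of the instance-bound import the question refers to: the implicits are a member of a concrete SparkSession rather than a static object, so the session must exist before the import.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("implicits-demo")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._      // tied to this particular session instance

    val ds = Seq(1, 2, 3).toDS()  // toDS is one of the imported conversions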

Re: Why there is no top method in dataset api

2016-09-13 Thread Jakub Dubovsky
…computation of "top N" on a Dataset, so I don't think this is relevant. orderBy + take is already the way to accomplish "Dataset.top". It works on Datasets, and therefore DataFrames too, for the reason you give. I'm not sure what you're asking there…
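
A sketch of the orderBy + take idiom referred to above; `ds` and the `score` column are illustrative:

    import org.apache.spark.sql.functions.desc

    val top10 = ds.orderBy(desc("score")).take(10)

Catalyst rewrites a sort followed by a limit into a take-ordered plan, so this need not sort the full Dataset.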

Re: Why there is no top method in dataset api

2016-09-05 Thread Jakub Dubovsky
…DataFrame-like counterpart already that doesn't really need wrapping in a different API. On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky <spark.dubovsky.ja...@gmail.com> wrote: Hey all, in the RDD API there is a very useful method called top. It finds…

Why there is no top method in dataset api

2016-09-01 Thread Jakub Dubovsky
Hey all, in the RDD API there is a very useful method called top. It finds the top n records according to a certain ordering without sorting all records. Very useful! There is no top method nor similar functionality in the Dataset API. Does anybody have any clue why? Is there a specific reason for this? Any…
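
For reference, a sketch of the RDD method the question describes: top returns the n largest elements per the implicit ordering, using a bounded priority queue per partition instead of a full sort (`sc` is illustrative):

    val rdd = sc.parallelize(Seq(5, 1, 9, 3, 7))
    rdd.top(3)                         // Array(9, 7, 5)
    rdd.top(3)(Ordering[Int].reverse)  // the smallest three, via a custom ordering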

Re: Does a driver jvm houses some rdd partitions?

2016-09-01 Thread Jakub Dubovsky
…On 31 August 2016 at 14:53, Jakub Dubovsky <spark.dubovsky.ja...@gmail.com> wrote: …

Does a driver jvm houses some rdd partitions?

2016-08-31 Thread Jakub Dubovsky
Hey all, I have a conceptual question which I have a hard time finding an answer for. Is the JVM where the Spark driver is running also used to run computations over RDD partitions and persist them? The answer is obvious for local mode (yes). But when it runs on YARN/Mesos/standalone with many executors…

Re: Confusing argument of sql.functions.count

2016-06-22 Thread Jakub Dubovsky
…Hi, an argument for `functions.count` is needed for per-column counting: df.groupBy($"a").agg(count($"b")) // maropu. On Thu, Jun 23, 2016 at 1:27 AM, Ted Yu <yuzhih...@gmail.com> wrote: See the first example in …
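
A runnable version of the snippet above, with illustrative data: count(e: Column) counts only the non-null values of that column within each group, which is why the function takes a column at all.

    import org.apache.spark.sql.functions.count
    import spark.implicits._

    val df = Seq(("x", Some(1)), ("x", None), ("y", Some(2))).toDF("a", "b")
    df.groupBy($"a").agg(count($"b")).show()
    // a = "x" -> count(b) = 1 (the null is skipped); a = "y" -> count(b) = 1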

Re: Confusing argument of sql.functions.count

2016-06-22 Thread Jakub Dubovsky
…Are you referring to the following method in sql/core/src/main/scala/org/apache/spark/sql/functions.scala: def count(e: Column): Column = withAggregateFunction { … Did you notice this method? def count(columnName: String): TypedColumn[Any, Long] = … On…

Confusing argument of sql.functions.count

2016-06-22 Thread Jakub Dubovsky
Hey sparkers, the aggregate function *count* in the *org.apache.spark.sql.functions* package takes a *column* as an argument. Is this needed for something? I find it confusing that I need to supply a column there. It feels like it might be a distinct count or something. This can be seen in the latest…

Re: RDD of ImmutableList

2015-10-07 Thread Jakub Dubovsky
I did not realize that Scala's and Java's immutable collections use different APIs, which causes this. Thank you for the reminder. This makes some sense now... -- Original message -- From: Jonathan Coveney <jcove...@gmail.com> To: Jakub Dubovsky <spark.dubovsky.ja...@seznam.…

RDD of ImmutableList

2015-10-05 Thread Jakub Dubovsky
…But I cannot think of a workaround, and I do not believe that using ImmutableList with an RDD is impossible. How is this solved? Thank you in advance! Jakub Dubovsky
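
One possible workaround, sketched here as an assumption rather than the thread's conclusion: convert each Guava ImmutableList to a plain Scala List at the RDD boundary, since Scala's immutable List serializes without trouble (`sc` is illustrative):

    import scala.collection.JavaConverters._
    import com.google.common.collect.ImmutableList

    val guavaLists: Seq[ImmutableList[String]] =
      Seq(ImmutableList.of("a", "b"), ImmutableList.of("c"))

    val rdd = sc.parallelize(guavaLists.map(_.asScala.toList))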

Re: RDD of ImmutableList

2015-10-05 Thread Jakub Dubovsky
…which would translate the data during (de)serialization? Thanks! Jakub Dubovsky -- Original message -- From: Igor Berman <igor.ber...@gmail.com> To: Jakub Dubovsky <spark.dubovsky.ja...@seznam.cz> Date: 5. 10. 2015 20:11:35 Subject: Re: RDD of ImmutableList…
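
A hedged sketch of the custom-serializer route asked about above: teach Kryo how to (de)serialize Guava's ImmutableList. The class and registrator names are illustrative, not from the thread.

    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import com.google.common.collect.ImmutableList
    import org.apache.spark.serializer.KryoRegistrator

    class ImmutableListSerializer extends Serializer[ImmutableList[Any]] {
      override def write(kryo: Kryo, output: Output, list: ImmutableList[Any]): Unit = {
        output.writeInt(list.size, true)  // element count, varint-encoded
        val it = list.iterator()
        while (it.hasNext) kryo.writeClassAndObject(output, it.next())
      }
      override def read(kryo: Kryo, input: Input,
                        cls: Class[ImmutableList[Any]]): ImmutableList[Any] = {
        val size = input.readInt(true)
        val builder = ImmutableList.builder[Any]()
        var i = 0
        while (i < size) { builder.add(kryo.readClassAndObject(input)); i += 1 }
        builder.build()
      }
    }

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // addDefaultSerializer also covers ImmutableList's package-private
        // subclasses, which a plain register() on the base class would miss
        kryo.addDefaultSerializer(classOf[ImmutableList[_]], classOf[ImmutableListSerializer])
      }
    }

Enable it with spark.serializer=org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator set to the registrator's fully qualified name.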

Re: Including data nucleus tools

2014-12-20 Thread Jakub Dubovsky
Hi DB, I cherry-picked the commit into branch-1.2 and it solved the problem. It does solve the problem, but it has some bits and pieces around it which were not finalized, so it was reverted, being late in the release process. Jakub -- Just out of curiosity: do you manually apply this patch and see if…