Hi Mich, I was wondering: what are the advantages of using helper methods instead of a single multiline SQL string? (I rarely, if ever, use helper methods, but maybe I'm missing something.)
Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski

On Thu, Aug 25, 2016, at 11:39, Mich Talebzadeh wrote:
> Hi Richard,
>
> Windowing/analytics for stats are pretty simple. Example:
>
> import org.apache.spark.sql.expressions.Window
>
> val wSpec = Window.partitionBy('transactiontype).orderBy(desc("transactiondate"))
> df.filter('transactiondescription.contains(HASHTAG))
>   .select('transactiondate, 'transactiondescription, rank().over(wSpec).as("rank"))
>   .filter($"rank" === 1).show(1)
>
> val wSpec5 = Window.partitionBy('hashtag).orderBy(substring('transactiondate, 1, 4))
> val newDF = df.where('transactiontype === "DEB" && 'transactiondescription.isNotNull)
>   .select(substring('transactiondate, 1, 4).as("Year"), 'hashtag.as("Retailer"),
>           round(sum('debitamount).over(wSpec5), 2).as("Spent"))
> newDF.distinct.orderBy('Year, 'Retailer).collect.foreach(println)
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 25 August 2016 at 08:24, Richard Siebeling <rsiebel...@gmail.com> wrote:
>> Hi Mich,
>>
>> thanks for the suggestion, I hadn't thought of that. We'll need to gather the statistics in two ways: incrementally when new data arrives, and over the complete set when aggregating or filtering (because I think it's difficult to gather statistics while aggregating or filtering).
>> The analytic functions could help when gathering the statistics over the whole set.
>>
>> kind regards,
>> Richard
>>
>> On Wed, Aug 24, 2016 at 10:54 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> Hi Richard,
>>>
>>> Can you use analytic functions for this purpose on the DataFrame?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> On 24 August 2016 at 21:37, Richard Siebeling <rsiebel...@gmail.com> wrote:
>>>> Hi Mich,
>>>>
>>>> I'd like to gather several statistics per column in order to make analysing data easier. Those two statistics are just examples; others I'd like to gather are the variance, the median, several percentiles, etc. We are building a data analysis platform based on Spark.
>>>>
>>>> kind regards,
>>>> Richard
>>>>
>>>> On Wed, Aug 24, 2016 at 6:52 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>> Hi Richard,
>>>>>
>>>>> What is the business use case for such statistics?
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> On 24 August 2016 at 16:01, Bedrytski Aliaksandr <sp...@bedryt.ski> wrote:
>>>>>> Hi Richard,
>>>>>>
>>>>>> Should these intermediate statistics be calculated from the result of the calculation, or during the aggregation itself?
>>>>>> If they can be derived from the resulting DataFrame, why not cache (persist) that result just after the calculation? Then you can aggregate the statistics from the cached DataFrame. This way it won't hit performance too much.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Bedrytski Aliaksandr
>>>>>> sp...@bedryt.ski
>>>>>>
>>>>>> On Wed, Aug 24, 2016, at 16:42, Richard Siebeling wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> What is the best way to calculate intermediate column statistics, such as the number of empty values and the number of distinct values of each column in a dataset, when aggregating or filtering data, next to the actual result of the aggregation or the filtered data?
>>>>>>>
>>>>>>> We are developing an application in which the user can slice-and-dice through the data, and we would like to get, next to the actual resulting data, column statistics for each column in the resulting dataset. We prefer to calculate the column statistics in the same pass over the data as the actual aggregation or filtering; is that possible?
>>>>>>>
>>>>>>> We could sacrifice a little bit of performance (but not too much); that's why we prefer one pass...
>>>>>>>
>>>>>>> Is this possible in standard Spark, or would this mean modifying the source a little bit and recompiling? Is that feasible / wise to do?
>>>>>>>
>>>>>>> thanks in advance,
>>>>>>> Richard
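[Editor's note] The one-pass column statistics Richard asks about do not require modifying Spark: null counts and distinct counts for every column can be collected in a single `agg` over the (cached) result DataFrame, as Bedrytski suggests. A minimal sketch follows; the column names and sample data are hypothetical stand-ins for the filtered result, and it assumes a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ColumnStatsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("column-stats").getOrCreate()
    import spark.implicits._

    // Hypothetical data standing in for the cached filtered/aggregated result.
    val df = Seq(("a", Option(1)), ("b", None), ("a", Option(3)))
      .toDF("hashtag", "debitamount")

    // Build one aggregate expression pair per column, so a single agg
    // (one pass over the data) yields null and distinct counts for all columns.
    val statExprs = df.columns.flatMap { c =>
      Seq(
        sum(when(col(c).isNull, 1).otherwise(0)).as(s"${c}_nulls"),
        countDistinct(col(c)).as(s"${c}_distinct")
      )
    }
    df.agg(statExprs.head, statExprs.tail: _*).show()

    spark.stop()
  }
}
```

For large datasets, swapping `countDistinct` for an approximate variant trades exactness for a cheaper single-pass sketch; variance and percentiles can be added to the same `agg` call the same way.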
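[Editor's note] On Richard's point about gathering statistics incrementally as new data arrives: counts and null counts are mergeable by addition, and distinct values by set union (in practice a sketch such as HyperLogLog), so per-batch statistics can be combined without re-scanning old data. A plain-Scala sketch of the idea; the `ColStats` name and shape are hypothetical:

```scala
// Mergeable per-column statistics: two batches' stats combine into one
// without touching the underlying rows again.
final case class ColStats(rows: Long, nulls: Long, distinct: Set[String]) {
  def merge(other: ColStats): ColStats =
    ColStats(rows + other.rows, nulls + other.nulls, distinct ++ other.distinct)
}

object ColStats {
  // Compute statistics for one batch of optional values.
  def of(values: Seq[Option[String]]): ColStats =
    ColStats(values.size, values.count(_.isEmpty), values.flatten.toSet)
}

// Example: stats for two arriving batches, merged incrementally.
val batch1 = ColStats.of(Seq(Some("a"), None, Some("b")))
val batch2 = ColStats.of(Seq(Some("b"), Some("c")))
val total  = batch1.merge(batch2)
// total.rows == 5, total.nulls == 1, total.distinct == Set("a", "b", "c")
```

An exact distinct set grows with the data; a real implementation would replace it with a fixed-size sketch, at the cost of approximate counts.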