Hi Mich, I was wondering: what are the advantages of using helper methods instead of a single multiline SQL string? (I rarely, if ever, use helper methods, but maybe I'm missing something.)
Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski

On Thu, Aug 25, 2016, at 11:39, Mich Talebzadeh wrote:
> Hi Richard,
>
> Windowing/analytics for stats are pretty simple. Example:
>
> import org.apache.spark.sql.expressions.Window
>
> val wSpec = Window.partitionBy('transactiontype).orderBy(desc("transactiondate"))
> df.filter('transactiondescription.contains(HASHTAG))
>   .select('transactiondate, 'transactiondescription, rank().over(wSpec).as("rank"))
>   .filter($"rank" === 1).show(1)
>
> val wSpec5 = Window.partitionBy('hashtag).orderBy(substring('transactiondate, 1, 4))
> val newDF = df.where('transactiontype === "DEB" && 'transactiondescription.isNotNull)
>   .select(substring('transactiondate, 1, 4).as("Year"), 'hashtag.as("Retailer"),
>           round(sum('debitamount).over(wSpec5), 2).as("Spent"))
> newDF.distinct.orderBy('Year, 'Retailer).collect.foreach(println)
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 25 August 2016 at 08:24, Richard Siebeling <rsiebel...@gmail.com> wrote:
>> Hi Mich,
>>
>> thanks for the suggestion, I hadn't thought of that. We'll need to gather the statistics in two ways: incrementally when new data arrives, and over the complete set when aggregating or filtering (because I think it's difficult to gather statistics while aggregating or filtering).
>> The analytic functions could help when gathering the statistics over the whole set.
>>
>> kind regards,
>> Richard
>>
>> On Wed, Aug 24, 2016 at 10:54 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> Hi Richard,
>>>
>>> Can you use analytic functions for this purpose on the DataFrame?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> On 24 August 2016 at 21:37, Richard Siebeling <rsiebel...@gmail.com> wrote:
>>>> Hi Mich,
>>>>
>>>> I'd like to gather several statistics per column in order to make analysing data easier. Those two statistics are just examples; others I'd like to gather are the variance, the median, several percentiles, etc. We are building a data analysis platform based on Spark.
>>>>
>>>> kind regards,
>>>> Richard
>>>>
>>>> On Wed, Aug 24, 2016 at 6:52 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>> Hi Richard,
>>>>>
>>>>> What is the business use case for such statistics?
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> On 24 August 2016 at 16:01, Bedrytski Aliaksandr <sp...@bedryt.ski> wrote:
>>>>>> Hi Richard,
>>>>>>
>>>>>> Should these intermediate statistics be calculated from the result of the calculation, or during the aggregation itself?
>>>>>> If they can be derived from the resulting DataFrame, why not cache (persist) that result just after the calculation? Then you can aggregate the statistics from the cached DataFrame. This way it won't hit performance too much.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Bedrytski Aliaksandr
>>>>>> sp...@bedryt.ski
>>>>>>
>>>>>> On Wed, Aug 24, 2016, at 16:42, Richard Siebeling wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> What is the best way to calculate intermediate column statistics, such as the number of empty values and the number of distinct values of each column in a dataset, when aggregating or filtering data, next to the actual result of the aggregation or the filtered data?
>>>>>>>
>>>>>>> We are developing an application in which the user can slice-and-dice through the data, and we would like to get, next to the actual resulting data, column statistics for each column in the resulting dataset. We prefer to calculate the column statistics in the same pass over the data as the actual aggregation or filtering; is that possible?
>>>>>>>
>>>>>>> We could sacrifice a little bit of performance (but not too much); that's why we prefer one pass...
>>>>>>>
>>>>>>> Is this possible in standard Spark, or would this mean modifying the source a little bit and recompiling? Is that feasible / wise to do?
>>>>>>>
>>>>>>> thanks in advance,
>>>>>>> Richard
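[Editor's note] The one-pass column statistics Richard asks about do not require modifying Spark: null counts and distinct counts for every column can be collected in a single `agg` over the (cached) result DataFrame, as Bedrytski suggests. A minimal sketch follows; the column names and sample data are hypothetical stand-ins for the filtered result, and it assumes a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ColumnStatsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("column-stats").getOrCreate()
    import spark.implicits._

    // Hypothetical data standing in for the cached filtered/aggregated result.
    val df = Seq(("a", Option(1)), ("b", None), ("a", Option(3)))
      .toDF("hashtag", "debitamount")

    // Build one aggregate expression pair per column, so a single agg
    // (one pass over the data) yields null and distinct counts for all columns.
    val statExprs = df.columns.flatMap { c =>
      Seq(
        sum(when(col(c).isNull, 1).otherwise(0)).as(s"${c}_nulls"),
        countDistinct(col(c)).as(s"${c}_distinct")
      )
    }
    df.agg(statExprs.head, statExprs.tail: _*).show()

    spark.stop()
  }
}
```

For large datasets, swapping `countDistinct` for an approximate variant trades exactness for a cheaper single-pass sketch; variance and percentiles can be added to the same `agg` call the same way.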
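[Editor's note] On Richard's point about gathering statistics incrementally as new data arrives: counts and null counts are mergeable by addition, and distinct values by set union (in practice a sketch such as HyperLogLog), so per-batch statistics can be combined without re-scanning old data. A plain-Scala sketch of the idea; the `ColStats` name and shape are hypothetical:

```scala
// Mergeable per-column statistics: two batches' stats combine into one
// without touching the underlying rows again.
final case class ColStats(rows: Long, nulls: Long, distinct: Set[String]) {
  def merge(other: ColStats): ColStats =
    ColStats(rows + other.rows, nulls + other.nulls, distinct ++ other.distinct)
}

object ColStats {
  // Compute statistics for one batch of optional values.
  def of(values: Seq[Option[String]]): ColStats =
    ColStats(values.size, values.count(_.isEmpty), values.flatten.toSet)
}

// Example: stats for two arriving batches, merged incrementally.
val batch1 = ColStats.of(Seq(Some("a"), None, Some("b")))
val batch2 = ColStats.of(Seq(Some("b"), Some("c")))
val total  = batch1.merge(batch2)
// total.rows == 5, total.nulls == 1, total.distinct == Set("a", "b", "c")
```

An exact distinct set grows with the data; a real implementation would replace it with a fixed-size sketch, at the cost of approximate counts.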