Hi Bedrytski,

I assume you are referring to my code above.

The SQL alternative to the first piece of code (the one using rank) would be:

SELECT *
FROM (
      SELECT transactiondate, transactiondescription, debitamount
      , RANK() OVER (ORDER BY transactiondate DESC) AS rank
      FROM <hive_table> WHERE transactiondescription LIKE '%HASHTAG%'
     ) tmp
WHERE rank = 1;

It all depends on which one you are comfortable with. I can do either, but I
guess, coming from a data background, it is easier for me to do the SQL part.
In addition, with SQL you can run exactly the same code through the Spark
Thrift Server or in Hive.
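
For illustration, a minimal sketch of running that same SQL from Spark itself
(assuming a Spark 2.x SparkSession built with Hive support; "accounts" is just
a hypothetical table name standing in for <hive_table>):

import org.apache.spark.sql.SparkSession

// Hive support is needed so the query can see the Hive table
val spark = SparkSession.builder()
  .appName("HashtagTopTransaction")
  .enableHiveSupport()
  .getOrCreate()

val topTxn = spark.sql(
  """SELECT *
    |FROM (
    |      SELECT transactiondate, transactiondescription, debitamount
    |      , RANK() OVER (ORDER BY transactiondate DESC) AS rank
    |      FROM accounts WHERE transactiondescription LIKE '%HASHTAG%'
    |     ) tmp
    |WHERE rank = 1""".stripMargin)

topTxn.show()

On Spark 1.6 the same string goes through sqlContext.sql(...) with a
HiveContext, and it runs unchanged through beeline against the Spark Thrift
Server or in Hive.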

You can run both versions in Zeppelin, but only the SQL version in Tableau or
SQuirreL :)

HTH







Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 26 August 2016 at 15:18, Bedrytski Aliaksandr <sp...@bedryt.ski> wrote:

> Hi Mich,
>
> I was wondering: what are the advantages of using helper methods instead of
> a single multiline SQL string?
> (I rarely (if ever) use helper methods, but maybe I'm missing something)
>
> Regards
> --
>   Bedrytski Aliaksandr
>   sp...@bedryt.ski
>
>
>
> On Thu, Aug 25, 2016, at 11:39, Mich Talebzadeh wrote:
>
> Hi Richard,
>
> Windowing/Analytics for stats are pretty simple. Example
>
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
>
> val wSpec = Window.partitionBy('transactiontype).orderBy(desc("transactiondate"))
> df.filter('transactiondescription.contains(HASHTAG))
>   .select('transactiondate, 'transactiondescription, rank().over(wSpec).as("rank"))
>   .filter($"rank" === 1).show(1)
>
> val wSpec5 = Window.partitionBy('hashtag).orderBy(substring('transactiondate,1,4))
> val newDF = df.where('transactiontype === "DEB" && ('transactiondescription).isNotNull)
>   .select(substring('transactiondate,1,4).as("Year"), 'hashtag.as("Retailer"),
>           round(sum('debitamount).over(wSpec5),2).as("Spent"))
> newDF.distinct.orderBy('year, 'Retailer).collect.foreach(println)
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On 25 August 2016 at 08:24, Richard Siebeling <rsiebel...@gmail.com>
> wrote:
>
> Hi Mich,
>
> thanks for the suggestion, I hadn't thought of that. We'll need to gather
> the statistics in two ways: incrementally when new data arrives, and over the
> complete set when aggregating or filtering (because I think it's difficult
> to gather statistics while aggregating or filtering).
> The analytic functions could help when gathering the statistics over the
> whole set,
>
> kind regards,
> Richard
>
>
>
> On Wed, Aug 24, 2016 at 10:54 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> Hi Richard,
>
> Can you use analytic functions for this purpose on the DataFrame?
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On 24 August 2016 at 21:37, Richard Siebeling <rsiebel...@gmail.com>
> wrote:
>
> Hi Mich,
>
> I'd like to gather several statistics per column in order to make
> analysing data easier. These two statistics are some examples, other
> statistics I'd like to gather are the variance, the median, several
> percentiles, etc.  We are building a data analysis platform based on Spark,
>
> kind regards,
> Richard
>
> On Wed, Aug 24, 2016 at 6:52 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> Hi Richard,
>
> What is the business use case for such statistics?
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On 24 August 2016 at 16:01, Bedrytski Aliaksandr <sp...@bedryt.ski> wrote:
>
>
> Hi Richard,
>
> Should these intermediate statistics be calculated from the result of the
> calculation, or during the aggregation?
> If they can be derived from the resulting dataframe, why not cache
> (persist) that result just after the calculation?
> Then you can aggregate the statistics from the cached dataframe.
> This way it won't hit performance too much.
>
> Regards
> --
>   Bedrytski Aliaksandr
>   sp...@bedryt.ski
>
>
>
> On Wed, Aug 24, 2016, at 16:42, Richard Siebeling wrote:
>
> Hi,
>
> What is the best way to calculate intermediate column statistics, such as the
> number of empty values and the number of distinct values of each column in a
> dataset, when aggregating or filtering data, next to the actual result of the
> aggregation or the filtered data?
>
> We are developing an application in which the user can slice-and-dice
> through the data and we would like to, next to the actual resulting data,
> get column statistics of each column in the resulting dataset. We prefer to
> calculate the column statistics on the same pass over the data as the
> actual aggregation or filtering, is that possible?
>
> We could sacrifice a little bit of performance (but not too much), that's
> why we prefer one pass...
>
> Is this possible in the standard Spark or would this mean modifying the
> source a little bit and recompiling? Is that feasible / wise to do?
>
> thanks in advance,
> Richard
>
>
>
>
>
>
>
