Re: Missing stack function from SQL functions API

2021-06-15 Thread Khalid Mammadov
Hi David

If you need an alternative way to do it, you can use the snippet below
(expr comes from pyspark.sql.functions):

from pyspark.sql.functions import expr

df.select(expr("stack(2, 1, 2, 3)"))
Or
df.withColumn('stacked', expr("stack(2, 1, 2, 3)"))
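
If it helps, here is a minimal, self-contained sketch (the DataFrame and the
column names a, b, c are made up for illustration) showing how stack()
unpivots columns into rows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Hypothetical wide DataFrame: one row, three metric columns.
df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])

# stack(3, ...) takes three (name, value) pairs and emits three rows
# of two columns each.
df.select(expr("stack(3, 'a', a, 'b', b, 'c', c) as (metric, value)")).show()
# metric=a value=1; metric=b value=2; metric=c value=3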

Thanks
Khalid

On Mon, 14 Jun 2021, 10:14, wrote:

> I noticed that the stack SQL function is missing from the functions API.
> Could we add it?
>


Re: Spark-sql can replace Hive ?

2021-06-15 Thread Battula, Brahma Reddy
Currently I am using the Hive SQL engine for ad-hoc queries. As spark-sql
also supports this, I want to migrate from Hive.




From: Mich Talebzadeh 
Date: Thursday, 10 June 2021 at 8:12 PM
To: Battula, Brahma Reddy 
Cc: ayan guha , d...@spark.apache.org 
, user@spark.apache.org 
Subject: Re: Spark-sql can replace Hive ?
These are different things. Spark provides a computational layer and a
dialect of SQL based on Hive.

Hive is a DW (data warehouse) on top of HDFS. What are you trying to replace?

HTH





 
view my Linkedin profile



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 10 Jun 2021 at 12:09, Battula, Brahma Reddy  
wrote:
Thanks for the prompt reply.

I want to replace Hive with Spark.




From: ayan guha <guha.a...@gmail.com>
Date: Thursday, 10 June 2021 at 4:35 PM
To: Battula, Brahma Reddy 
Cc: d...@spark.apache.org, user@spark.apache.org
Subject: Re: Spark-sql can replace Hive ?
Would you mind expanding the ask? Spark SQL can use Hive by itself.

On Thu, 10 Jun 2021 at 8:58 pm, Battula, Brahma Reddy 
 wrote:
Hi

I would like to know of any references/docs on replacing Hive with spark-sql
completely, e.g. how to migrate the existing data in Hive.

thanks


--
Best Regards,
Ayan Guha


Re: Spark-sql can replace Hive ?

2021-06-15 Thread Mich Talebzadeh
OK you mean use spark.sql as opposed to HiveContext.sql?

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) // deprecated since Spark 2.0
HiveContext.sql("")

replace with

spark.sql("")
?
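
For context, a minimal PySpark sketch (assuming Spark 2.x or later, where a
SparkSession built with Hive support subsumes the old HiveContext):

from pyspark.sql import SparkSession

# A Hive-enabled session: Spark reads table metadata from the Hive
# metastore but executes the queries on its own engine.
spark = (SparkSession.builder
         .appName("hive-example")  # hypothetical app name
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW TABLES").show()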


view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 15 Jun 2021 at 18:00, Battula, Brahma Reddy 
wrote:

> Currently I am using the Hive SQL engine for ad-hoc queries. As spark-sql
> also supports this, I want to migrate from Hive.
>


Re: What happens if a random forest max bins is set too high?

2021-06-15 Thread Reed Villanueva
I *think* I solved the issue.
Will update with details after further testing / inspection.

On Mon, Jun 14, 2021 at 8:50 PM Reed Villanueva 
wrote:

> What happens if a random forest "max bins" hyperparameter is set too high?
>
> When training a sparkml random forest (
> https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
> ) with maxBins set roughly equal to the max number of distinct
> categorical values for any given feature I see OK performance metrics. But
> when I set it closer to 2x or 3x the number of distinct categorical values,
> performance is terrible (e.g. accuracy (in the case of a binary classifier)
> being no better than just the actual distribution of responses in the
> dataset) and the feature importances (
> https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassificationModel.html?highlight=featureimportances#pyspark.ml.classification.RandomForestClassificationModel.featureImportances
> ) being all zeros (as opposed to when using the lower initial maxBins value,
> where it at least does show *something* for the importances).
>
> I would not think that there would be such a huge difference just from a
> change in max bins like this (esp. the difference in seeing *something* vs
> absolutely nothing / all zeros for the feature importances).
>
> What could be happening under the hood of the algo that causes such
> different outcomes when this parameter is changed like this?
>
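
For concreteness, a minimal sketch (toy data; the column names are
hypothetical) of where maxBins is set and how the feature importances are
read back:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: one categorical feature with 3 distinct values.
df = spark.createDataFrame(
    [("a", 0.0), ("b", 1.0), ("c", 0.0), ("a", 1.0), ("b", 0.0), ("c", 1.0)],
    ["cat", "label"])

indexer = StringIndexer(inputCol="cat", outputCol="cat_idx")
assembler = VectorAssembler(inputCols=["cat_idx"], outputCol="features")
# maxBins must be >= the largest category count across all features.
rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            maxBins=32)

model = Pipeline(stages=[indexer, assembler, rf]).fit(df)
print(model.stages[-1].featureImportances)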


Why does sparkml random forest classifier not support maxBins < number of total categorical values?

2021-06-15 Thread Reed Villanueva
Why does sparkml's random forest classifier not support maxBins (M) < the
number of total distinct categorical values (K)?

My understanding of decision tree bins is that...

> Statistical data binning is basically a form of quantization where you map
> a set of numbers with continuous values into *smaller*, more manageable
> “bins.”

https://clevertap.com/blog/numerical-vs-categorical-variables-decision-trees/

...which makes it seem like you wouldn't ever really want to use M > K in
any case, yet the docs seem to imply that is not the case.

> Must be >=2 and >= number of categories for any categorical feature

Plus, when I use the random forest implementation in H2O, I do have the
option of using fewer bins than the total number of distinct categorical
values.

Could anyone explain the reason for this restriction in spark? Is there
some kind of particular data preprocessing / feature engineering users are
expected to have done beforehand? Am I misunderstanding something about
decision trees (e.g. is it that categorical features don't really ever *need*
to be binned in the first place and the setting is just for numerical values,
or something)?
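
For reference, a minimal sketch (toy data; the names are made up) that trips
the restriction: the fit is rejected when maxBins is smaller than the
cardinality of an indexed categorical feature:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: a categorical column with 5 distinct values (K = 5).
rows = [(c, float(i % 2)) for i, c in enumerate("abcde" * 4)]
df = spark.createDataFrame(rows, ["cat", "label"])

indexed = StringIndexer(inputCol="cat", outputCol="cat_idx").fit(df).transform(df)
data = VectorAssembler(inputCols=["cat_idx"], outputCol="features").transform(indexed)

# M = 4 < K = 5: fit() raises an error, since Spark's trees need one bin
# per category to be able to evaluate splits on a categorical feature.
rf = RandomForestClassifier(labelCol="label", featuresCol="features", maxBins=4)
try:
    rf.fit(data)
except Exception as e:
    print("rejected:", e)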