Re: Spark ml how to extract split points from trained decision tree model

2020-06-11 Thread AaronLee
@srowen. You are totally right, the model was not trained correctly. But it
is weird, as the dataset I used actually has 50M rows. It has a binary label
with 20% positive and one feature in the feature vector. I do not understand
why it does not train correctly.


```
scala> df2.count
res56: Long = 48174858

scala> df2.show
+----------------+-----+
|        features|label|
+----------------+-----+
|          [14.0]|  1.0|
|           [2.0]|  0.0|
|           [2.0]|  0.0|
|           [1.0]|  1.0|
|[0.970286102295]|  1.0|
|[1.960381469727]|  0.0|
|[0.990095367432]|  0.0|
|[11.73771118164]|  1.0|
|           [1.0]|  0.0|
|[0.980190734863]|  0.0|
|           [5.0]|  0.0|
| [5.94057220459]|  1.0|
|          [11.0]|  0.0|
|           [4.0]|  0.0|
|           [1.0]|  1.0|
|[1.970286102295]|  0.0|
| [6.98771118164]|  0.0|
|[0.970286102295]|  0.0|
|[0.970286102295]|  0.0|
|[0.990095367432]|  0.0|
+----------------+-----+
only showing top 20 rows


scala> df2.printSchema
root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)

scala> val dt = new DecisionTreeClassifier().setLabelCol("label").setFeaturesCol("features").setMaxBins(10)
dt: org.apache.spark.ml.classification.DecisionTreeClassifier = dtc_2b6b6e170840

scala> val dtm = dt.fit(df2)
dtm: org.apache.spark.ml.classification.DecisionTreeClassificationModel = DecisionTreeClassificationModel (uid=dtc_2b6b6e170840) of depth 0 with 1 nodes

scala> val df3 = dtm.transform(df2)
df3: org.apache.spark.sql.DataFrame = [features: vector, label: double ... 3 more fields]

scala> df3.show(100, false)
+----------------+-----+----------------------+----------------------------------------+----------+
|features        |label|rawPrediction         |probability                             |prediction|
+----------------+-----+----------------------+----------------------------------------+----------+
|[14.0]          |1.0  |[3.872715E7,9447708.0]|[0.8038871645454565,0.19611283545454353]|0.0       |
|[2.0]           |0.0  |[3.872715E7,9447708.0]|[0.8038871645454565,0.19611283545454353]|0.0       |
|[2.0]           |0.0  |[3.872715E7,9447708.0]|[0.8038871645454565,0.19611283545454353]|0.0       |
|[1.0]           |1.0  |[3.872715E7,9447708.0]|[0.8038871645454565,0.19611283545454353]|0.0       |
|[0.970286102295]|1.0  |[3.872715E7,9447708.0]|[0.8038871645454565,0.19611283545454353]|0.0       |
|[1.960381469727]|0.0  |[3.872715E7,9447708.0]|[0.8038871645454565,0.19611283545454353]|0.0       |
|[0.990095367432]|0.0  |[3.872715E7,9447708.0]|[0.8038871645454565,0.19611283545454353]|0.0       |

```
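
Not part of the original message -- a hedged diagnostic sketch one might try in the
same spark-shell session, assuming the depth-0 result comes from the binning or the
stopping criteria rather than from a broken dataset: first check whether the lone
feature separates the labels at all, then retrain with looser settings.

```
// Hedged diagnostic sketch (Spark 2.4.x spark-shell); not code from the thread.
// Hypothesis: with maxBins=10 the single feature may not yield any split with
// positive information gain, so the tree stops at the root.
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions._

// 1) Does the raw feature separate the classes at all?
val firstElem = udf((v: Vector) => v(0))   // pull the single feature out of the vector
df2.withColumn("x", firstElem(col("features")))
  .groupBy("label")
  .agg(count(lit(1)).as("n"), avg("x").as("mean_x"), stddev("x").as("sd_x"))
  .show()

// 2) Retrain with more candidate thresholds and looser stopping criteria.
val dtLoose = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxBins(128)              // more candidate split thresholds
  .setMaxDepth(10)
  .setMinInstancesPerNode(1)
  .setMinInfoGain(0.0)
val dtmLoose = dtLoose.fit(df2)
println(dtmLoose.toDebugString)  // depth > 0 here would point at the binning/stopping params
```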







Re: Spark ml how to extract split points from trained decision tree model

2020-06-11 Thread Sean Owen
Hm, the root is a leaf? It's possible, but that means there are no splits.
If it's a toy example, that could be the case.
This was just off the top of my head looking at the code, so I could be
missing something, but a non-trivial tree should start with an InternalNode.
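
A minimal sketch (not from the original reply) of checking the node type before
casting, which avoids the ClassCastException shown in the quoted session below:

```
import org.apache.spark.ml.tree.{InternalNode, LeafNode}

dtm.rootNode match {
  case n: InternalNode =>
    // a non-trivial tree: the root carries the first split
    println(s"root split on feature ${n.split.featureIndex}")
  case _: LeafNode =>
    // depth-0 tree: the model never split, so there are no split points to extract
    println("root is a leaf -- the tree has no splits")
}
```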

On Thu, Jun 11, 2020 at 11:01 PM AaronLee  wrote:

> Thanks srowen. I also checked
> https://www.programcreek.com/scala/org.apache.spark.ml.tree.InternalNode.
> Splits are available via the InternalNode ".split" attribute, but
> "dtm.rootNode" belongs to "LeafNode".
>
> ```
> scala> dtm.rootNode
> res9: org.apache.spark.ml.tree.Node = LeafNode(prediction = 0.0, impurity =
> 0.3153051824490453)
>
> scala> dftm.rootNode.
> impurity   prediction
>
> scala> dftm.rootNode.getClass.getSimpleName
> res13: String = LeafNode
>
> scala> import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}
> import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}
>
> scala> val intnode = dftm.rootNode.asInstanceOf[InternalNode]
> java.lang.ClassCastException: org.apache.spark.ml.tree.LeafNode cannot be
> cast to org.apache.spark.ml.tree.InternalNode
>   ... 51 elided
>
> ```
>
>
>


Re: Spark ml how to extract split points from trained decision tree model

2020-06-11 Thread AaronLee
Thanks srowen. I also checked
https://www.programcreek.com/scala/org.apache.spark.ml.tree.InternalNode.
Splits are available via the InternalNode ".split" attribute, but
"dtm.rootNode" belongs to "LeafNode".

```
scala> dtm.rootNode
res9: org.apache.spark.ml.tree.Node = LeafNode(prediction = 0.0, impurity =
0.3153051824490453)

scala> dftm.rootNode.
impurity   prediction

scala> dftm.rootNode.getClass.getSimpleName
res13: String = LeafNode

scala> import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}
import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

scala> val intnode = dftm.rootNode.asInstanceOf[InternalNode]
java.lang.ClassCastException: org.apache.spark.ml.tree.LeafNode cannot be
cast to org.apache.spark.ml.tree.InternalNode
  ... 51 elided

```






Re: Unsubscribe martha focker

2020-06-11 Thread hashbonduo
When these Matha Fockers don't even know how to unsubscribe,
what hope is there of them becoming data scientists?
I mean, first you have to train on some maths: algebra, statistics,
calculus, from people who have no idea of data science or machine
learning. Classifier algorithms, recommender systems, parallelism.
Deal with cross terminology like labels, features, rows, instances,
inputs, outputs, target, class, when they are just rows and columns in a
table.
Then Python, Scala, R. Then figure out the machine learning frameworks, like
TensorFlow, PyTorch, scikit-learn, blah blah.
You can't even unsubscribe. I am going to ask for more salary when I
grow up and become a data scientist.
Hopefully before the COVID-19 lockdown is over. With Trump in charge I am
confident I have enough training time.

On 11/06/2020 at 5:27 PM, "Angel Angel"  wrote:

Re: Spark ml how to extract split points from trained decision tree model

2020-06-11 Thread Sean Owen
You should be able to look at dtm.rootNode and, treating it as an
InternalNode, get the .split from it
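
As a hedged sketch of that idea (not code from the reply itself): walk the tree
from rootNode and collect the threshold of every continuous split.

```
import org.apache.spark.ml.tree.{ContinuousSplit, InternalNode, LeafNode, Node}

def collectThresholds(node: Node): Seq[(Int, Double)] = node match {
  case n: InternalNode =>
    val here = n.split match {
      case cs: ContinuousSplit => Seq((cs.featureIndex, cs.threshold))
      case _                   => Seq.empty   // categorical splits have no threshold
    }
    here ++ collectThresholds(n.leftChild) ++ collectThresholds(n.rightChild)
  case _: LeafNode => Seq.empty
}

// e.g. collectThresholds(dtm.rootNode).foreach { case (f, t) => println(s"feature $f <= $t") }
```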

On Thu, Jun 11, 2020 at 7:02 PM AaronLee  wrote:

> I am following the official Spark 2.4.3 tutorial
> <https://spark.apache.org/docs/2.4.3/ml-classification-regression.html#decision-tree-classifier>
> and trained a decision tree model. How do I extract split points from the
> trained model?
>
> // model
> val dt = new DecisionTreeClassifier()
>   .setLabelCol("indexedLabel")
>   .setFeaturesCol("indexedFeatures")
>   .setMaxBins(10)
>
> // Train model.  This also runs the indexers.
> val dtm = dt.fit(trainingData)
>
> // extract bin split points
> how to do it   <- ?
>
>
>


Spark ml how to extract split points from trained decision tree model

2020-06-11 Thread AaronLee
I am following the official Spark 2.4.3 tutorial
<https://spark.apache.org/docs/2.4.3/ml-classification-regression.html#decision-tree-classifier>
and trained a decision tree model. How do I extract split points from the
trained model?

// model
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxBins(10)

// Train model.  This also runs the indexers.
val dtm = dt.fit(trainingData)

// extract bin split points
how to do it   <- ?
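
A hedged aside, not from the thread: the quickest way to simply see the learned
split points is the fitted model's toDebugString, which prints each internal
node's test.

```
// Sketch: print the learned tree; each internal-node line shows the feature
// index and split threshold, e.g. "If (feature 0 <= 3.5)".
println(dtm.toDebugString)
```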






[External] Unsubscribe

2020-06-11 Thread Mishra, Dhiraj A.

Thanks,
Dhiraj





Unsubscribe

2020-06-11 Thread Angel Angel



Re: Broadcast join data reuse

2020-06-11 Thread Ankur Srivastava
Hi Tyson,

The broadcast variable should remain in the executors' memory and be reused
unless you unpersist it, destroy it, or it goes out of scope.

Hope this helps.

Thanks
Ankur
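
A hedged sketch, not from the reply: one way to check is to read the physical plan
itself. The table names below (fact1, fact2, dim) are illustrative.

```
import org.apache.spark.sql.functions.broadcast

// illustrative DataFrames; in practice these are the real fact/dimension tables
val fact1 = spark.range(0, 1000000).toDF("id")
val fact2 = spark.range(0, 1000000).toDF("id")
val dim   = spark.range(0, 100).toDF("id")

val joined = fact1.join(broadcast(dim), Seq("id"))
  .union(fact2.join(broadcast(dim), Seq("id")))

// In the physical plan, look for a single BroadcastExchange on `dim` and
// ReusedExchange nodes where it is consumed again; ReusedExchange means the
// already-broadcast data is picked up again rather than redistributed from scratch.
joined.explain()
```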

On Wed, Jun 10, 2020 at 5:28 PM  wrote:

> We have a case where the data is small enough to be broadcast and is joined
> with multiple tables in a single plan. Looking at the physical plan, I do
> not see anything that indicates whether the broadcast is done only once,
> i.e., whether the BroadcastExchange is being reused so that the data is not
> redistributed from scratch. Could someone with insight into the physical
> plan strategy for such a case confirm whether previously broadcast data is
> reused, or whether subsequent BroadcastExchange steps are done from scratch?
>
>
>
> Thanks and best regards,
>
> Tyson
>


Re: Arrow RecordBatches/Pandas Dataframes to (Arrow enabled) Spark Dataframe conversion in streaming fashion

2020-06-11 Thread Tanveer Ahmad - EWI
Hi Jorge,


Thank you. This union function is a better alternative for my work.


Regards,
Tanveer Ahmad



From: Jorge Machado 
Sent: Monday, May 25, 2020 3:56:04 PM
To: Tanveer Ahmad - EWI
Cc: Spark Group
Subject: Re: Arrow RecordBatches/Pandas Dataframes to (Arrow enabled) Spark 
Dataframe conversion in streaming fashion

Hey, from what I know you can try to Union them df.union(df2)

Not sure if this is what you need
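
A hedged sketch of that union pattern (shown in Scala, although the question is
about PySpark; `toDf`, `firstBatch` and `incomingBatches` are placeholders for
however the batches arrive and get converted to DataFrames):

```
// Keep a running DataFrame and union each new batch into it.
var accumulated = toDf(firstBatch)
for (batch <- incomingBatches) {
  accumulated = accumulated.union(toDf(batch))   // schemas must match
}
// Note: long chains of union() grow the query plan; periodic checkpointing or
// writing out intermediate results keeps that in check.
```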

On 25. May 2020, at 13:53, Tanveer Ahmad - EWI <t.ah...@tudelft.nl> wrote:

Hi all,

I need some help regarding Arrow RecordBatches/Pandas DataFrames to (Arrow-enabled)
Spark DataFrame conversions.
The example here [1] explains very well how to convert a single Pandas DataFrame to
a Spark DataFrame.

But in my case, some external applications are generating Arrow RecordBatches in my
PySpark application in streaming fashion. Each time I receive an Arrow RB, I want to
transfer/append it to a Spark DataFrame. So is it possible to create a Spark
DataFrame initially from one Arrow RecordBatch and then keep appending many other
incoming Arrow RecordBatches to that Spark DataFrame (in streaming fashion)? Thanks!

I saw another example [2] in which all the Arrow RBs are converted to a Spark
DataFrame, but my case is a little bit different.

[1] https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html

[2] 
https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5

---
Regards,
Tanveer Ahmad