How to change column values using several when conditions?

2023-04-30 Thread marc nicole
Hello to you Sparkling community :)

I want to change values of a column in a dataset according to a mapping
list that maps original values of that column to other new values. Each
element of the list (colMappingValues) is a string that separates the
original values from the new values using a ";".

So for a given column (in the following example colName), I do the
following processing to alter the column values as described:

for (i=0;i
> // colMappingValues[i] holds one original value of the column and its
> // target value, separated by ";"
> allValuesChanges = colMappingValues[i].toString().split(";", 2);
>
> dataset = dataset.withColumn(colName,
>     when(dataset.col(colName).equalTo(allValuesChanges[0]), allValuesChanges[1])
>         .otherwise(dataset.col(colName)));

}

This works, but I would like it to be more efficient and to avoid
unnecessary iterations: when the column does not contain a value from the
list, the call to withColumn() for that value should be skipped.
How can I do exactly that, more efficiently, using Spark in Java?

Thanks.
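
One possible approach, sketched here under assumptions rather than taken from the thread: withColumn() is lazy and only adds a projection to the query plan, so checking whether the column contains each value would itself require a Spark job and would likely cost more than it saves. Instead, the whole mapping list can be folded into one CASE expression and applied with a single withColumn() call. The class and method names below (ColumnRemapSketch, parseMappings, sqlLiteral, buildCaseExpression) are hypothetical, the string escaping assumes Spark's default SQL parser settings, and the code is plain Java so that it runs without Spark on the classpath. Unlike the loop, it maps each value only once, so chained mappings (a to b, then b to c) are not followed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns "original;new" entries into one Spark SQL CASE expression.
final class ColumnRemapSketch {

    // Split each entry once on the first ";". The first mapping seen for a value wins,
    // which matches what the sequential loop would do for duplicate originals.
    static Map<String, String> parseMappings(List<String> colMappingValues) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String entry : colMappingValues) {
            String[] parts = entry.split(";", 2);
            if (parts.length == 2) {
                mapping.putIfAbsent(parts[0], parts[1]);
            }
        }
        return mapping;
    }

    // Quote a value as a single-quoted SQL string literal.
    // Assumption: backslash escaping, as in Spark's default parser configuration.
    static String sqlLiteral(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    // Build "CASE `col` WHEN 'a' THEN 'x' ... ELSE `col` END".
    // With an empty mapping the column is returned unchanged.
    static String buildCaseExpression(String colName, Map<String, String> mapping) {
        String col = "`" + colName.replace("`", "``") + "`";
        if (mapping.isEmpty()) {
            return col;
        }
        StringBuilder sb = new StringBuilder("CASE ").append(col);
        for (Map.Entry<String, String> e : mapping.entrySet()) {
            sb.append(" WHEN ").append(sqlLiteral(e.getKey()))
              .append(" THEN ").append(sqlLiteral(e.getValue()));
        }
        return sb.append(" ELSE ").append(col).append(" END").toString();
    }
}
```

With Spark available, the result would be applied in one step as dataset.withColumn(colName, functions.expr(ColumnRemapSketch.buildCaseExpression(colName, mapping))), where functions is org.apache.spark.sql.functions. Spark then evaluates a single projection instead of one nested projection per mapping entry.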


Any experience with K8s Remote Shuffling Service at scale?

2023-04-30 Thread Andrey Gourine
Hi All, I am looking for people who have experience running an external
shuffling service at scale with Spark 3 on K8s.

I have already tried the internal shuffling service (available from Spark 3)
and am now trying to work with Uniffle (Incubating).
Are there any other options?
Thank you


Re: Tensorflow on Spark CPU

2023-04-30 Thread Sean Owen
There is a large overhead to distributing this type of workload. I imagine
that for a small problem, the overhead dominates. A problem of this size is
nowhere near needing distribution, so more workers are probably just worse.

On Sun, Apr 30, 2023 at 1:46 AM second_co...@yahoo.com <
second_co...@yahoo.com> wrote:

> I re-tested with the cifar10 example and the results are below. Can you
> advise why fewer slots are faster than more slots?
>
> num_slots=20: 231 seconds
> num_slots=5: 52 seconds
> num_slots=1: 34 seconds
>
> The code is at
> https://gist.github.com/cometta/240bbc549155e22f80f6ba670c9a2e32
>
> Do you have an example of tensorflow + a big dataset that I can test?
>
> On Saturday, April 29, 2023 at 08:44:04 PM GMT+8, Sean Owen <
> sro...@gmail.com> wrote:
>
>
> You don't want to use CPUs with Tensorflow.
> If it's not scaling, you may have a problem that is far too small to
> distribute.
>
> On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID
>  wrote:
>
> Has anyone successfully run native tensorflow on Spark? I tested the example at
> https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor
> on Kubernetes with CPUs, running it on multiple worker CPUs. I do not see
> any speed-up in training time when changing the number of slots from 1 to 10.
> The time taken to train stays the same. Has anyone tested tensorflow training
> on Spark distributed workers with CPUs? Can you share your working example?


How to read text files with GBK encoding in the spark core

2023-04-30 Thread lianyou1...@126.com
Hello all,

Is there any way to use the pyspark core to read text files with GBK
encoding?
The pyspark sql reader has an option to set the encoding, but these text
files are not in a structured format.
Any advice is appreciated.

Thank you
lianyou Li
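
A minimal sketch of the usual idea, not taken from the thread: read each line as raw bytes and decode it explicitly as GBK instead of letting the reader assume UTF-8. In PySpark this would correspond to reading with sc.textFile(path, use_unicode=False) and calling decode("gbk") on each record. The Java version below shows only the decoding step, so that it runs without Spark; the class and method names (GbkDecodeSketch, decodeLine) are hypothetical. Splitting on the newline byte before decoding should be safe here because GBK trail bytes never take the value 0x0A.

```java
import java.nio.charset.Charset;

// Hypothetical helper: decodes one raw text record as GBK.
final class GbkDecodeSketch {

    // GBK ships with standard JDK builds; a stripped-down runtime may lack it.
    static final Charset GBK = Charset.forName("GBK");

    // Decode only the first `length` bytes. Hadoop's Text reuses its backing
    // array, so the array can be longer than the current record.
    static String decodeLine(byte[] buffer, int length) {
        return new String(buffer, 0, length, GBK);
    }
}
```

With Spark and Hadoop on the classpath, this could be applied to records read through JavaSparkContext.hadoopFile(path, TextInputFormat.class, LongWritable.class, Text.class), mapping each pair to GbkDecodeSketch.decodeLine(pair._2().getBytes(), pair._2().getLength()).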


Re: Tensorflow on Spark CPU

2023-04-30 Thread second_co...@yahoo.com.INVALID
I re-tested with the cifar10 example and the results are below. Can you advise
why fewer slots are faster than more slots?
num_slots=20: 231 seconds
num_slots=5: 52 seconds
num_slots=1: 34 seconds

The code is at
https://gist.github.com/cometta/240bbc549155e22f80f6ba670c9a2e32
Do you have an example of tensorflow + a big dataset that I can test?

On Saturday, April 29, 2023 at 08:44:04 PM GMT+8, Sean Owen
wrote:

You don't want to use CPUs with Tensorflow. If it's not scaling, you may have a
problem that is far too small to distribute.
On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID 
 wrote:

Has anyone successfully run native tensorflow on Spark? I tested the example at
https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor
on Kubernetes with CPUs, running it on multiple worker CPUs. I do not see any
speed-up in training time when changing the number of slots from 1 to 10. The
time taken to train stays the same. Has anyone tested tensorflow training on
Spark distributed workers with CPUs? Can you share your working example?