read dataset from only one node in YARN cluster

2023-08-18 Thread marc nicole
Hi,

Using Spark 3.2 and Hadoop 3.2 in YARN cluster mode: if one wants to read a
dataset that exists on only one node of the cluster and not on the others, how
does one tell Spark that?

I expect this to work through DataFrameReader, using a path like
*IP:port/pathOnLocalNode*

PS: loading the dataset in HDFS is not an option.
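For illustration, a minimal sketch of what such a read could look like with a plain local path (the CSV format, options, path and the spark session variable are assumptions, not from the original question). Note that with a file:// URI, Spark only reads the data successfully if the same path is readable from every node where tasks run, which is exactly the constraint being asked about here:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical local-path read; works only if the path is visible to all executors.
Dataset<Row> ds = spark.read()
    .format("csv")
    .option("header", "true")
    .load("file:///path/on/local/node/data.csv");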

Thanks


Change column values using several when conditions

2023-05-01 Thread marc nicole
Hello

I want to change the values of a column in a dataset according to a mapping
list that maps original values of that column to new values. Each
element of the list (colMappingValues) is a string that separates the
original value from the new value with a ";".

So for a given column (in the following example colName), I do the
following processing to alter the column values as described:

for (int i = 0; i < colMappingValues.length; i++) {
> // below, colMappingValues[i] contains a distinct value of the column
> // and its target value, separated by ";"
> allValuesChanges = colMappingValues[i].toString().split(";", 2);
>
>  dataset = dataset.withColumn(colName,
> when(dataset.col(colName).equalTo(allValuesChanges[0]), allValuesChanges[1]).otherwise(dataset.col(colName)));

}

This works, but I want it to be more efficient and avoid unnecessary
iterations: when the column doesn't contain a value from the list, the call
to withColumn() should be skipped.
How to do exactly that in a more efficient way using Spark in Java?
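One way to avoid the repeated withColumn() calls is to fold all the mappings into a single chained when/otherwise expression and apply withColumn() once. A minimal sketch under the assumptions of the post (colMappingValues entries look like "old;new", colName is the target column; the helper name remapColumn is made up here):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

static Dataset<Row> remapColumn(Dataset<Row> dataset, String colName, String[] colMappingValues) {
    // Start from the original column and wrap one CASE WHEN around it per mapping.
    Column mapped = col(colName);
    for (String mapping : colMappingValues) {
        String[] parts = mapping.split(";", 2);   // parts[0] = original value, parts[1] = new value
        mapped = when(col(colName).equalTo(parts[0]), parts[1]).otherwise(mapped);
    }
    // A single withColumn call; values not present in the mapping are left unchanged.
    return dataset.withColumn(colName, mapped);
}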


How to change column values using several when conditions ?

2023-04-30 Thread marc nicole
Hello to you Sparkling community :)

I want to change values of a column in a dataset according to a mapping
list that maps original values of that column to other new values. Each
element of the list (colMappingValues) is a string that separates the
original values from the new values using a ";".

So for a given column (in the following example colName), I do the
following processing to alter the column values as described:

for (int i = 0; i < colMappingValues.length; i++) {
> // below, colMappingValues[i] contains a distinct value of the column
> // and its target value, separated by ";"
> allValuesChanges = colMappingValues[i].toString().split(";", 2);
>
>  dataset = dataset.withColumn(colName,
> when(dataset.col(colName).equalTo(allValuesChanges[0]), allValuesChanges[1]).otherwise(dataset.col(colName)));

}

which is working but I want it to be efficient to avoid unnecessary
iterations. Meaning that I want when the column doesn't contain the value
from the list, the call to withColumn() gets ignored.
How to do exactly that in a more efficient way using Spark in Java?

Thanks.


Re: input file size

2022-06-19 Thread marc nicole
Reasoning in terms of files (vs. datasets, as I first thought of this question), I
think this is more adequate in Spark:

> org.apache.spark.util.Utils.getFileLength(new File("filePath"), null);

It will yield the same result as

> new File("filePath").length();
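For completeness, a Java sketch of the same idea Enrico shows below in Scala (summing the on-disk sizes of the distinct input files behind a Dataset); it assumes the files live on a filesystem reachable from the driver under the same paths:

import static org.apache.spark.sql.functions.input_file_name;
import org.apache.spark.sql.Encoders;
import java.io.File;
import java.net.URI;

// Collect the distinct file names feeding the Dataset and sum their lengths on the driver.
long totalBytes = ds.select(input_file_name().as("filename"))
    .distinct()
    .as(Encoders.STRING())
    .collectAsList()
    .stream()
    .mapToLong(name -> new File(URI.create(name).getPath()).length())
    .sum();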


Le dim. 19 juin 2022 à 11:11, Enrico Minack  a
écrit :

> Maybe a
>
>   .as[String].mapPartitions(it => if (it.hasNext) Iterator(it.next) else 
> Iterator.empty)
>
> might be faster than the
>
>   .distinct.as[String]
>
>
> Enrico
>
>
> Am 19.06.22 um 08:59 schrieb Enrico Minack:
>
> Given you already know your input files (input_file_name), why not getting
> their size and summing this up?
>
> import java.io.File
> import java.net.URI
> import org.apache.spark.sql.functions.input_file_name
> ds.select(input_file_name.as("filename"))
>   .distinct.as[String]
>   .map(filename => new File(new URI(filename).getPath).length)
>   .select(sum($"value"))
>   .show()
>
>
> Enrico
>
>
> Am 19.06.22 um 03:16 schrieb Yong Walt:
>
> import java.io.File
> val someFile = new File("somefile.txt")
> val fileSize = someFile.length
>
> This one?
>
>
> On Sun, Jun 19, 2022 at 4:33 AM mbreuer  wrote:
>
>> Hello Community,
>>
>> I am working on optimizations for file sizes and number of files. In the
>> data frame there is a function input_file_name which returns the file
>> name. I miss a counterpart to get the size of the file. Just the size,
>> like "ls -l" returns. Is there something like that?
>>
>> Kind regards,
>> Markus
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>


Re: input file size

2022-06-18 Thread marc nicole
Hi,

I found this (
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html)
that may be helpful; I use Java:

> org.apache.spark.util.SizeEstimator.estimate(dataset);







Le sam. 18 juin 2022 à 22:33, mbreuer  a écrit :

> Hello Community,
>
> I am working on optimizations for file sizes and number of files. In the
> data frame there is a function input_file_name which returns the file
> name. I miss a counterpart to get the size of the file. Just the size,
> like "ls -l" returns. Is there something like that?
>
> Kind regards,
> Markus
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: how to properly filter a dataset by dates ?

2022-06-17 Thread marc nicole
I finally settled for:

dataset = dataset.where(to_date(dataset.col("Date"), "MM-dd-yyyy").geq(new
java.sql.Date(new
SimpleDateFormat("MM-dd-yyyy").parse("02-03-2012").getTime())));

which seems overly complicated; I was hoping for a simpler Spark solution.
Anyways, thanks guys!
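A shorter variant, along the lines of Stelios's suggestion quoted below, is to parse both sides with to_date so the comparison happens on DateType values rather than on strings (a sketch; the column name and format are those used in this thread):

import static org.apache.spark.sql.functions.*;

dataset = dataset.where(
    to_date(dataset.col("Date"), "MM-dd-yyyy")
        .geq(to_date(lit("02-03-2012"), "MM-dd-yyyy")));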

Le ven. 17 juin 2022 à 22:35, marc nicole  a écrit :

> String dateString = String.format("%d-%02d-%02d", 2012, 02, 03);
> Date sqlDate = java.sql.Date.valueOf(dateString);
> dataset=
> dataset.where(to_date(dataset.col("Date"),"MM-dd-yyyy").geq(sqlDate));
>
> Is the only way I found to make it work, I am sure there's better than
> this
>
> Le ven. 17 juin 2022 à 22:13, marc nicole  a écrit :
>
>> @Stelios : to_date requires column type
>> @Sean how to parse a literal to a date lit("02-03-2012").cast("date")?
>>
>> Le ven. 17 juin 2022 à 22:07, Stelios Philippou  a
>> écrit :
>>
>>> dataset.where(to_date(dataset.col("Date"),"MM-dd-yyyy").geq(to_date("02-03-2012",
>>> "MM-dd-yyyy"));
>>>
>>> On Fri, 17 Jun 2022, 22:51 marc nicole,  wrote:
>>>
>>>> dataset =
>>>> dataset.where(to_date(dataset.col("Date"),"MM-dd-").geq("02-03-2012").cast("date"));
>>>> ?
>>>> This is returning an empty dataset.
>>>>
>>>> Le ven. 17 juin 2022 à 21:34, Stelios Philippou  a
>>>> écrit :
>>>>
>>>>> You are already doing it once.
>>>>> to_date the second part and don't forget to cast it as well
>>>>>
>>>>> On Fri, 17 Jun 2022, 22:08 marc nicole,  wrote:
>>>>>
>>>>>> should i cast to date the target date then? for example maybe:
>>>>>>
>>>>>> dataset =
>>>>>>> dataset.where(to_date(dataset.col("Date"),"MM-dd-").geq("02-03-2012").cast("date"));
>>>>>>> ?
>>>>>>
>>>>>> How to to do that ? comparing with dates?
>>>>>>
>>>>>>
>>>>>> Le ven. 17 juin 2022 à 20:52, Sean Owen  a écrit :
>>>>>>
>>>>>>> Look at your query again. You are comparing dates to strings. The
>>>>>>> dates widen back to strings.
>>>>>>>
>>>>>>> On Fri, Jun 17, 2022, 1:39 PM marc nicole 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I also tried:
>>>>>>>>
>>>>>>>> dataset =
>>>>>>>>> dataset.where(to_date(dataset.col("Date"),"MM-dd-").geq("02-03-2012"));
>>>>>>>>
>>>>>>>>
>>>>>>>> But it returned an empty dataset.
>>>>>>>>
>>>>>>>> Le ven. 17 juin 2022 à 20:28, Sean Owen  a
>>>>>>>> écrit :
>>>>>>>>
>>>>>>>>> Same answer as last time - those are strings, not dates.
>>>>>>>>> 02-02-2015 as a string is before 02-03-2012.
>>>>>>>>> You apply date function to dates, not strings.
>>>>>>>>> You have to parse the dates properly, which was the problem in
>>>>>>>>> your last email.
>>>>>>>>>
>>>>>>>>> On Fri, Jun 17, 2022 at 12:58 PM marc nicole 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I have a dataset containing a column of dates, which I want to
>>>>>>>>>> use for filtering. Nothing, from what I have tried, seems to return 
>>>>>>>>>> the
>>>>>>>>>> exact right solution.
>>>>>>>>>> Here's my input:
>>>>>>>>>>
>>>>>>>>>> +   +
>>>>>>>>>> |Date|
>>>>>>>>>> ++
>>>>>>>>>> | 02-08-2019 |
>>>>>>>>>> ++
>>>>>>>>>> | 02-07-2019 |
>>>>>>>>>> ++
>>>>>>>>>> | 12-01-2019 |
>>>>>>>>>> ++
>>>>>>>>>> | 02-02-2015 |
>>>>>>>>>> ++
>>>>>>>>>> | 02-03-2012 |
>>>>>>>>>> ++
>>>>>>>>>> | 05-06-2018 |
>>>>>>>>>> ++
>>>>>>>>>> | 02-08-2022 |
>>>>>>>>>> ++
>>>>>>>>>>
>>>>>>>>>> The code that i have tried (always giving missing dates in the
>>>>>>>>>> result):
>>>>>>>>>>
>>>>>>>>>> dataset = dataset.filter(
>>>>>>>>>>> dataset.col("Date").geq("02-03-2012"));  // not showing the date of
>>>>>>>>>>> *02-02-2015*
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I tried to apply *date_trunc()* with the first parameter "day"
>>>>>>>>>> but nothing.
>>>>>>>>>>
>>>>>>>>>> I have also compared a converted column (using *to_date()*) with
>>>>>>>>>> a *literal *of the target date but always returning an empty
>>>>>>>>>> dataset.
>>>>>>>>>>
>>>>>>>>>> How to do that in Java ?
>>>>>>>>>>
>>>>>>>>>>


Re: how to properly filter a dataset by dates ?

2022-06-17 Thread marc nicole
String dateString = String.format("%d-%02d-%02d", 2012, 02, 03);
Date sqlDate = java.sql.Date.valueOf(dateString);
dataset=
dataset.where(to_date(dataset.col("Date"),"MM-dd-yyyy").geq(sqlDate));

Is the only way I found to make it work, I am sure there's better than
this

Le ven. 17 juin 2022 à 22:13, marc nicole  a écrit :

> @Stelios : to_date requires column type
> @Sean how to parse a literal to a date lit("02-03-2012").cast("date")?
>
> Le ven. 17 juin 2022 à 22:07, Stelios Philippou  a
> écrit :
>
>> dataset.where(to_date(dataset.col("Date"),"MM-dd-yyyy").geq(to_date("02-03-2012",
>> "MM-dd-"));
>>
>> On Fri, 17 Jun 2022, 22:51 marc nicole,  wrote:
>>
>>> dataset =
>>> dataset.where(to_date(dataset.col("Date"),"MM-dd-").geq("02-03-2012").cast("date"));
>>> ?
>>> This is returning an empty dataset.
>>>
>>> Le ven. 17 juin 2022 à 21:34, Stelios Philippou  a
>>> écrit :
>>>
>>>> You are already doing it once.
>>>> to_date the second part and don't forget to cast it as well
>>>>
>>>> On Fri, 17 Jun 2022, 22:08 marc nicole,  wrote:
>>>>
>>>>> should i cast to date the target date then? for example maybe:
>>>>>
>>>>> dataset =
>>>>>> dataset.where(to_date(dataset.col("Date"),"MM-dd-").geq("02-03-2012").cast("date"));
>>>>>> ?
>>>>>
>>>>> How to to do that ? comparing with dates?
>>>>>
>>>>>
>>>>> Le ven. 17 juin 2022 à 20:52, Sean Owen  a écrit :
>>>>>
>>>>>> Look at your query again. You are comparing dates to strings. The
>>>>>> dates widen back to strings.
>>>>>>
>>>>>> On Fri, Jun 17, 2022, 1:39 PM marc nicole 
>>>>>> wrote:
>>>>>>
>>>>>>> I also tried:
>>>>>>>
>>>>>>> dataset =
>>>>>>>> dataset.where(to_date(dataset.col("Date"),"MM-dd-").geq("02-03-2012"));
>>>>>>>
>>>>>>>
>>>>>>> But it returned an empty dataset.
>>>>>>>
>>>>>>> Le ven. 17 juin 2022 à 20:28, Sean Owen  a écrit :
>>>>>>>
>>>>>>>> Same answer as last time - those are strings, not dates. 02-02-2015
>>>>>>>> as a string is before 02-03-2012.
>>>>>>>> You apply date function to dates, not strings.
>>>>>>>> You have to parse the dates properly, which was the problem in your
>>>>>>>> last email.
>>>>>>>>
>>>>>>>> On Fri, Jun 17, 2022 at 12:58 PM marc nicole 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I have a dataset containing a column of dates, which I want to use
>>>>>>>>> for filtering. Nothing, from what I have tried, seems to return the 
>>>>>>>>> exact
>>>>>>>>> right solution.
>>>>>>>>> Here's my input:
>>>>>>>>>
>>>>>>>>> +   +
>>>>>>>>> |Date|
>>>>>>>>> ++
>>>>>>>>> | 02-08-2019 |
>>>>>>>>> ++
>>>>>>>>> | 02-07-2019 |
>>>>>>>>> ++
>>>>>>>>> | 12-01-2019 |
>>>>>>>>> ++
>>>>>>>>> | 02-02-2015 |
>>>>>>>>> ++
>>>>>>>>> | 02-03-2012 |
>>>>>>>>> ++
>>>>>>>>> | 05-06-2018 |
>>>>>>>>> ++
>>>>>>>>> | 02-08-2022 |
>>>>>>>>> ++
>>>>>>>>>
>>>>>>>>> The code that i have tried (always giving missing dates in the
>>>>>>>>> result):
>>>>>>>>>
>>>>>>>>> dataset = dataset.filter( dataset.col("Date").geq("02-03-2012"));
>>>>>>>>>> // not showing the date of *02-02-2015*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I tried to apply *date_trunc()* with the first parameter "day"
>>>>>>>>> but nothing.
>>>>>>>>>
>>>>>>>>> I have also compared a converted column (using *to_date()*) with
>>>>>>>>> a *literal *of the target date but always returning an empty
>>>>>>>>> dataset.
>>>>>>>>>
>>>>>>>>> How to do that in Java ?
>>>>>>>>>
>>>>>>>>>


Re: how to properly filter a dataset by dates ?

2022-06-17 Thread marc nicole
@Stelios : to_date requires column type
@Sean how to parse a literal to a date lit("02-03-2012").cast("date")?

Le ven. 17 juin 2022 à 22:07, Stelios Philippou  a
écrit :

> dataset.where(to_date(dataset.col("Date"),"MM-dd-yyyy").geq(to_date("02-03-2012",
> "MM-dd-yyyy"));
>
> On Fri, 17 Jun 2022, 22:51 marc nicole,  wrote:
>
>> dataset =
>> dataset.where(to_date(dataset.col("Date"),"MM-dd-").geq("02-03-2012").cast("date"));
>> ?
>> This is returning an empty dataset.
>>
>> Le ven. 17 juin 2022 à 21:34, Stelios Philippou  a
>> écrit :
>>
>>> You are already doing it once.
>>> to_date the second part and don't forget to cast it as well
>>>
>>> On Fri, 17 Jun 2022, 22:08 marc nicole,  wrote:
>>>
>>>> should i cast to date the target date then? for example maybe:
>>>>
>>>> dataset =
>>>>> dataset.where(to_date(dataset.col("Date"),"MM-dd-").geq("02-03-2012").cast("date"));
>>>>> ?
>>>>
>>>> How to to do that ? comparing with dates?
>>>>
>>>>
>>>> Le ven. 17 juin 2022 à 20:52, Sean Owen  a écrit :
>>>>
>>>>> Look at your query again. You are comparing dates to strings. The
>>>>> dates widen back to strings.
>>>>>
>>>>> On Fri, Jun 17, 2022, 1:39 PM marc nicole  wrote:
>>>>>
>>>>>> I also tried:
>>>>>>
>>>>>> dataset =
>>>>>>> dataset.where(to_date(dataset.col("Date"),"MM-dd-").geq("02-03-2012"));
>>>>>>
>>>>>>
>>>>>> But it returned an empty dataset.
>>>>>>
>>>>>> Le ven. 17 juin 2022 à 20:28, Sean Owen  a écrit :
>>>>>>
>>>>>>> Same answer as last time - those are strings, not dates. 02-02-2015
>>>>>>> as a string is before 02-03-2012.
>>>>>>> You apply date function to dates, not strings.
>>>>>>> You have to parse the dates properly, which was the problem in your
>>>>>>> last email.
>>>>>>>
>>>>>>> On Fri, Jun 17, 2022 at 12:58 PM marc nicole 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have a dataset containing a column of dates, which I want to use
>>>>>>>> for filtering. Nothing, from what I have tried, seems to return the 
>>>>>>>> exact
>>>>>>>> right solution.
>>>>>>>> Here's my input:
>>>>>>>>
>>>>>>>> +   +
>>>>>>>> |Date|
>>>>>>>> ++
>>>>>>>> | 02-08-2019 |
>>>>>>>> ++
>>>>>>>> | 02-07-2019 |
>>>>>>>> ++
>>>>>>>> | 12-01-2019 |
>>>>>>>> ++
>>>>>>>> | 02-02-2015 |
>>>>>>>> ++
>>>>>>>> | 02-03-2012 |
>>>>>>>> ++
>>>>>>>> | 05-06-2018 |
>>>>>>>> ++
>>>>>>>> | 02-08-2022 |
>>>>>>>> ++
>>>>>>>>
>>>>>>>> The code that i have tried (always giving missing dates in the
>>>>>>>> result):
>>>>>>>>
>>>>>>>> dataset = dataset.filter( dataset.col("Date").geq("02-03-2012"));
>>>>>>>>> // not showing the date of *02-02-2015*
>>>>>>>>
>>>>>>>>
>>>>>>>> I tried to apply *date_trunc()* with the first parameter "day" but
>>>>>>>> nothing.
>>>>>>>>
>>>>>>>> I have also compared a converted column (using *to_date()*) with a
>>>>>>>> *literal *of the target date but always returning an empty dataset.
>>>>>>>>
>>>>>>>> How to do that in Java ?
>>>>>>>>
>>>>>>>>


Re: how to properly filter a dataset by dates ?

2022-06-17 Thread marc nicole
dataset =
dataset.where(to_date(dataset.col("Date"),"MM-dd-yyyy").geq("02-03-2012").cast("date"));
?
This is returning an empty dataset.

Le ven. 17 juin 2022 à 21:34, Stelios Philippou  a
écrit :

> You are already doing it once.
> to_date the second part and don't forget to cast it as well
>
> On Fri, 17 Jun 2022, 22:08 marc nicole,  wrote:
>
>> should i cast to date the target date then? for example maybe:
>>
>> dataset =
>>> dataset.where(to_date(dataset.col("Date"),"MM-dd-").geq("02-03-2012").cast("date"));
>>> ?
>>
>> How to to do that ? comparing with dates?
>>
>>
>> Le ven. 17 juin 2022 à 20:52, Sean Owen  a écrit :
>>
>>> Look at your query again. You are comparing dates to strings. The dates
>>> widen back to strings.
>>>
>>> On Fri, Jun 17, 2022, 1:39 PM marc nicole  wrote:
>>>
>>>> I also tried:
>>>>
>>>> dataset =
>>>>> dataset.where(to_date(dataset.col("Date"),"MM-dd-").geq("02-03-2012"));
>>>>
>>>>
>>>> But it returned an empty dataset.
>>>>
>>>> Le ven. 17 juin 2022 à 20:28, Sean Owen  a écrit :
>>>>
>>>>> Same answer as last time - those are strings, not dates. 02-02-2015 as
>>>>> a string is before 02-03-2012.
>>>>> You apply date function to dates, not strings.
>>>>> You have to parse the dates properly, which was the problem in your
>>>>> last email.
>>>>>
>>>>> On Fri, Jun 17, 2022 at 12:58 PM marc nicole 
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a dataset containing a column of dates, which I want to use
>>>>>> for filtering. Nothing, from what I have tried, seems to return the exact
>>>>>> right solution.
>>>>>> Here's my input:
>>>>>>
>>>>>> +   +
>>>>>> |Date|
>>>>>> ++
>>>>>> | 02-08-2019 |
>>>>>> ++
>>>>>> | 02-07-2019 |
>>>>>> ++
>>>>>> | 12-01-2019 |
>>>>>> ++
>>>>>> | 02-02-2015 |
>>>>>> ++
>>>>>> | 02-03-2012 |
>>>>>> ++
>>>>>> | 05-06-2018 |
>>>>>> ++
>>>>>> | 02-08-2022 |
>>>>>> ++
>>>>>>
>>>>>> The code that i have tried (always giving missing dates in the
>>>>>> result):
>>>>>>
>>>>>> dataset = dataset.filter( dataset.col("Date").geq("02-03-2012"));  //
>>>>>>> not showing the date of *02-02-2015*
>>>>>>
>>>>>>
>>>>>> I tried to apply *date_trunc()* with the first parameter "day" but
>>>>>> nothing.
>>>>>>
>>>>>> I have also compared a converted column (using *to_date()*) with a
>>>>>> *literal *of the target date but always returning an empty dataset.
>>>>>>
>>>>>> How to do that in Java ?
>>>>>>
>>>>>>


Re: how to properly filter a dataset by dates ?

2022-06-17 Thread marc nicole
should i cast to date the target date then? for example maybe:

dataset =
> dataset.where(to_date(dataset.col("Date"),"MM-dd-yyyy").geq("02-03-2012").cast("date"));
> ?

How to to do that ? comparing with dates?


Le ven. 17 juin 2022 à 20:52, Sean Owen  a écrit :

> Look at your query again. You are comparing dates to strings. The dates
> widen back to strings.
>
> On Fri, Jun 17, 2022, 1:39 PM marc nicole  wrote:
>
>> I also tried:
>>
>> dataset =
>>> dataset.where(to_date(dataset.col("Date"),"MM-dd-").geq("02-03-2012"));
>>
>>
>> But it returned an empty dataset.
>>
>> Le ven. 17 juin 2022 à 20:28, Sean Owen  a écrit :
>>
>>> Same answer as last time - those are strings, not dates. 02-02-2015 as a
>>> string is before 02-03-2012.
>>> You apply date function to dates, not strings.
>>> You have to parse the dates properly, which was the problem in your last
>>> email.
>>>
>>> On Fri, Jun 17, 2022 at 12:58 PM marc nicole 
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a dataset containing a column of dates, which I want to use for
>>>> filtering. Nothing, from what I have tried, seems to return the exact right
>>>> solution.
>>>> Here's my input:
>>>>
>>>> +   +
>>>> |Date|
>>>> ++
>>>> | 02-08-2019 |
>>>> ++
>>>> | 02-07-2019 |
>>>> ++
>>>> | 12-01-2019 |
>>>> ++
>>>> | 02-02-2015 |
>>>> ++
>>>> | 02-03-2012 |
>>>> ++
>>>> | 05-06-2018 |
>>>> ++
>>>> | 02-08-2022 |
>>>> ++
>>>>
>>>> The code that i have tried (always giving missing dates in the result):
>>>>
>>>> dataset = dataset.filter( dataset.col("Date").geq("02-03-2012"));  //
>>>>> not showing the date of *02-02-2015*
>>>>
>>>>
>>>> I tried to apply *date_trunc()* with the first parameter "day" but
>>>> nothing.
>>>>
>>>> I have also compared a converted column (using *to_date()*) with a
>>>> *literal *of the target date but always returning an empty dataset.
>>>>
>>>> How to do that in Java ?
>>>>
>>>>


Re: how to properly filter a dataset by dates ?

2022-06-17 Thread marc nicole
I also tried:

dataset =
> dataset.where(to_date(dataset.col("Date"),"MM-dd-yyyy").geq("02-03-2012"));


But it returned an empty dataset.

Le ven. 17 juin 2022 à 20:28, Sean Owen  a écrit :

> Same answer as last time - those are strings, not dates. 02-02-2015 as a
> string is before 02-03-2012.
> You apply date function to dates, not strings.
> You have to parse the dates properly, which was the problem in your last
> email.
>
> On Fri, Jun 17, 2022 at 12:58 PM marc nicole  wrote:
>
>> Hello,
>>
>> I have a dataset containing a column of dates, which I want to use for
>> filtering. Nothing, from what I have tried, seems to return the exact right
>> solution.
>> Here's my input:
>>
>> +   +
>> |Date|
>> ++
>> | 02-08-2019 |
>> ++
>> | 02-07-2019 |
>> ++
>> | 12-01-2019 |
>> ++
>> | 02-02-2015 |
>> ++
>> | 02-03-2012 |
>> ++
>> | 05-06-2018 |
>> ++
>> | 02-08-2022 |
>> ++
>>
>> The code that i have tried (always giving missing dates in the result):
>>
>> dataset = dataset.filter( dataset.col("Date").geq("02-03-2012"));  // not
>>> showing the date of *02-02-2015*
>>
>>
>> I tried to apply *date_trunc()* with the first parameter "day" but
>> nothing.
>>
>> I have also compared a converted column (using *to_date()*) with a
>> *literal *of the target date but always returning an empty dataset.
>>
>> How to do that in Java ?
>>
>>


how to properly filter a dataset by dates ?

2022-06-17 Thread marc nicole
Hello,

I have a dataset containing a column of dates, which I want to use for
filtering. Nothing, from what I have tried, seems to return the exact right
solution.
Here's my input:

+------------+
|    Date    |
+------------+
| 02-08-2019 |
| 02-07-2019 |
| 12-01-2019 |
| 02-02-2015 |
| 02-03-2012 |
| 05-06-2018 |
| 02-08-2022 |
+------------+

The code that I have tried (always giving missing dates in the result):

dataset = dataset.filter( dataset.col("Date").geq("02-03-2012"));  // not
> showing the date of *02-02-2015*


I tried to apply *date_trunc()* with the first parameter "day" but nothing.

I have also compared a converted column (using *to_date()*) with a
*literal *of the target date but always returning an empty dataset.

How to do that in Java ?


Re: How to recognize and get the min of a date/string column in Java?

2022-06-15 Thread marc nicole
Finally solved with the MM-for-months format recommendation. Thanks!
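For reference, a minimal Java sketch of the working approach (using MM for months as Sean recommended; the pattern has to match the actual data layout, e.g. "MM-dd-yyyy" for strings like "06-14-2022"): parse the strings with to_date and take the min on the resulting DateType column:

import static org.apache.spark.sql.functions.*;

// colMin comes back as a java.sql.Date; convert to String if needed.
Object colMin = dataset
    .agg(min(to_date(dataset.col(colName), "MM-dd-yyyy")))
    .first()
    .get(0);
String minAsString = String.valueOf(colMin);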

Le mar. 14 juin 2022 à 23:02, marc nicole  a écrit :

> I changed the format to yyyy-mm-dd for the example
>
> Le mar. 14 juin 2022 à 22:52, Sean Owen  a écrit :
>
>> Look at your data - doesn't match date format you give
>>
>> On Tue, Jun 14, 2022, 3:41 PM marc nicole  wrote:
>>
>>> for the input  (I changed the format)  :
>>>
>>> +---+
>>> |Date|
>>> +---+
>>> | 2019-02-08 |
>>> ++
>>> | 2019-02-07 |
>>> ++
>>> | 2019-12-01 |
>>> ++
>>> | 2015-02-02 |
>>> ++
>>> | 2012-02-03 |
>>> ++
>>> | 2018-05-06 |
>>> ++
>>> | 2022-02-08 |
>>> ++
>>> the output was 2012-01-03
>>>
>>> To note that for my below code to work I cast to string the resulting
>>> min column.
>>>
>>> Le mar. 14 juin 2022 à 21:12, Sean Owen  a écrit :
>>>
>>>> You haven't shown your input or the result
>>>>
>>>> On Tue, Jun 14, 2022 at 1:40 PM marc nicole 
>>>> wrote:
>>>>
>>>>> Hi Sean,
>>>>>
>>>>> Even with MM for months it gives incorrect (but different this time)
>>>>> min value.
>>>>>
>>>>> Le mar. 14 juin 2022 à 20:18, Sean Owen  a écrit :
>>>>>
>>>>>> Yes that is right. It has to be parsed as a date to correctly reason
>>>>>> about ordering. Otherwise you are finding the minimum string
>>>>>> alphabetically.
>>>>>>
>>>>>> Small note, MM is month. mm is minute. You have to fix that for this
>>>>>> to work. These are Java format strings.
>>>>>>
>>>>>> On Tue, Jun 14, 2022, 12:32 PM marc nicole 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I want to identify a column of dates as such, the column has
>>>>>>> formatted strings in the likes of: "06-14-2022" (the format being
>>>>>>> mm-dd-) and get the minimum of those dates.
>>>>>>>
>>>>>>> I tried in Java as follows:
>>>>>>>
>>>>>>> if (dataset.filter(org.apache.spark.sql.functions.to_date(
>>>>>>>> dataset.col(colName), 
>>>>>>>> "mm-dd-").isNotNull()).select(colName).count() !=
>>>>>>>> 0) { 
>>>>>>>
>>>>>>>
>>>>>>> And to get the *min *of the column:
>>>>>>>
>>>>>>> Object colMin =
>>>>>>>> dataset.agg(org.apache.spark.sql.functions.min(org.apache.spark.sql.functions.to_date(dataset.col(colName),
>>>>>>>> "mm-dd-"))).first().get(0);
>>>>>>>
>>>>>>> // then I cast the *colMin *to string.
>>>>>>>
>>>>>>> To note that if i don't apply *to_date*() to the target column then
>>>>>>> the result will be erroneous (i think Spark will take the values as 
>>>>>>> string
>>>>>>> and will get the min as if it was applied on an alphabetical string).
>>>>>>>
>>>>>>> Any better approach to accomplish this?
>>>>>>> Thanks.
>>>>>>>
>>>>>>


Re: How to recognize and get the min of a date/string column in Java?

2022-06-14 Thread marc nicole
I changed the format to yyyy-mm-dd for the example

Le mar. 14 juin 2022 à 22:52, Sean Owen  a écrit :

> Look at your data - doesn't match date format you give
>
> On Tue, Jun 14, 2022, 3:41 PM marc nicole  wrote:
>
>> for the input  (I changed the format)  :
>>
>> +---+
>> |Date|
>> +---+
>> | 2019-02-08 |
>> ++
>> | 2019-02-07 |
>> ++
>> | 2019-12-01 |
>> ++
>> | 2015-02-02 |
>> ++
>> | 2012-02-03 |
>> ++
>> | 2018-05-06 |
>> ++
>> | 2022-02-08 |
>> ++
>> the output was 2012-01-03
>>
>> To note that for my below code to work I cast to string the resulting min
>> column.
>>
>> Le mar. 14 juin 2022 à 21:12, Sean Owen  a écrit :
>>
>>> You haven't shown your input or the result
>>>
>>> On Tue, Jun 14, 2022 at 1:40 PM marc nicole  wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> Even with MM for months it gives incorrect (but different this time)
>>>> min value.
>>>>
>>>> Le mar. 14 juin 2022 à 20:18, Sean Owen  a écrit :
>>>>
>>>>> Yes that is right. It has to be parsed as a date to correctly reason
>>>>> about ordering. Otherwise you are finding the minimum string
>>>>> alphabetically.
>>>>>
>>>>> Small note, MM is month. mm is minute. You have to fix that for this
>>>>> to work. These are Java format strings.
>>>>>
>>>>> On Tue, Jun 14, 2022, 12:32 PM marc nicole 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I want to identify a column of dates as such, the column has
>>>>>> formatted strings in the likes of: "06-14-2022" (the format being
>>>>>> mm-dd-) and get the minimum of those dates.
>>>>>>
>>>>>> I tried in Java as follows:
>>>>>>
>>>>>> if (dataset.filter(org.apache.spark.sql.functions.to_date(
>>>>>>> dataset.col(colName), 
>>>>>>> "mm-dd-").isNotNull()).select(colName).count() !=
>>>>>>> 0) { 
>>>>>>
>>>>>>
>>>>>> And to get the *min *of the column:
>>>>>>
>>>>>> Object colMin =
>>>>>>> dataset.agg(org.apache.spark.sql.functions.min(org.apache.spark.sql.functions.to_date(dataset.col(colName),
>>>>>>> "mm-dd-"))).first().get(0);
>>>>>>
>>>>>> // then I cast the *colMin *to string.
>>>>>>
>>>>>> To note that if i don't apply *to_date*() to the target column then
>>>>>> the result will be erroneous (i think Spark will take the values as 
>>>>>> string
>>>>>> and will get the min as if it was applied on an alphabetical string).
>>>>>>
>>>>>> Any better approach to accomplish this?
>>>>>> Thanks.
>>>>>>
>>>>>


Re: How to recognize and get the min of a date/string column in Java?

2022-06-14 Thread marc nicole
for the input  (I changed the format)  :

+------------+
|    Date    |
+------------+
| 2019-02-08 |
| 2019-02-07 |
| 2019-12-01 |
| 2015-02-02 |
| 2012-02-03 |
| 2018-05-06 |
| 2022-02-08 |
+------------+
the output was 2012-01-03

To note that for my below code to work I cast to string the resulting min
column.

Le mar. 14 juin 2022 à 21:12, Sean Owen  a écrit :

> You haven't shown your input or the result
>
> On Tue, Jun 14, 2022 at 1:40 PM marc nicole  wrote:
>
>> Hi Sean,
>>
>> Even with MM for months it gives incorrect (but different this time) min
>> value.
>>
>> Le mar. 14 juin 2022 à 20:18, Sean Owen  a écrit :
>>
>>> Yes that is right. It has to be parsed as a date to correctly reason
>>> about ordering. Otherwise you are finding the minimum string
>>> alphabetically.
>>>
>>> Small note, MM is month. mm is minute. You have to fix that for this to
>>> work. These are Java format strings.
>>>
>>> On Tue, Jun 14, 2022, 12:32 PM marc nicole  wrote:
>>>
>>>> Hi,
>>>>
>>>> I want to identify a column of dates as such, the column has formatted
>>>> strings in the likes of: "06-14-2022" (the format being mm-dd-) and get
>>>> the minimum of those dates.
>>>>
>>>> I tried in Java as follows:
>>>>
>>>> if (dataset.filter(org.apache.spark.sql.functions.to_date(
>>>>> dataset.col(colName), "mm-dd-").isNotNull()).select(colName).count() 
>>>>> !=
>>>>> 0) { 
>>>>
>>>>
>>>> And to get the *min *of the column:
>>>>
>>>> Object colMin =
>>>>> dataset.agg(org.apache.spark.sql.functions.min(org.apache.spark.sql.functions.to_date(dataset.col(colName),
>>>>> "mm-dd-"))).first().get(0);
>>>>
>>>> // then I cast the *colMin *to string.
>>>>
>>>> To note that if i don't apply *to_date*() to the target column then
>>>> the result will be erroneous (i think Spark will take the values as string
>>>> and will get the min as if it was applied on an alphabetical string).
>>>>
>>>> Any better approach to accomplish this?
>>>> Thanks.
>>>>
>>>


Re: How to recognize and get the min of a date/string column in Java?

2022-06-14 Thread marc nicole
Hi Sean,

Even with MM for months it gives incorrect (but different this time) min
value.

Le mar. 14 juin 2022 à 20:18, Sean Owen  a écrit :

> Yes that is right. It has to be parsed as a date to correctly reason about
> ordering. Otherwise you are finding the minimum string alphabetically.
>
> Small note, MM is month. mm is minute. You have to fix that for this to
> work. These are Java format strings.
>
> On Tue, Jun 14, 2022, 12:32 PM marc nicole  wrote:
>
>> Hi,
>>
>> I want to identify a column of dates as such, the column has formatted
>> strings in the likes of: "06-14-2022" (the format being mm-dd-) and get
>> the minimum of those dates.
>>
>> I tried in Java as follows:
>>
>> if (dataset.filter(org.apache.spark.sql.functions.to_date(
>>> dataset.col(colName), "mm-dd-").isNotNull()).select(colName).count() !=
>>> 0) { 
>>
>>
>> And to get the *min *of the column:
>>
>> Object colMin =
>>> dataset.agg(org.apache.spark.sql.functions.min(org.apache.spark.sql.functions.to_date(dataset.col(colName),
>>> "mm-dd-"))).first().get(0);
>>
>> // then I cast the *colMin *to string.
>>
>> To note that if i don't apply *to_date*() to the target column then the
>> result will be erroneous (i think Spark will take the values as string and
>> will get the min as if it was applied on an alphabetical string).
>>
>> Any better approach to accomplish this?
>> Thanks.
>>
>


How to recognize and get the min of a date/string column in Java?

2022-06-14 Thread marc nicole
Hi,

I want to identify a column of dates as such; the column has formatted
strings like "06-14-2022" (the format being mm-dd-yyyy), and I want to get
the minimum of those dates.

I tried in Java as follows:

if (dataset.filter(org.apache.spark.sql.functions.to_date(
> dataset.col(colName), "mm-dd-yyyy").isNotNull()).select(colName).count() !=
> 0) { 


And to get the *min *of the column:

Object colMin =
> dataset.agg(org.apache.spark.sql.functions.min(org.apache.spark.sql.functions.to_date(dataset.col(colName),
> "mm-dd-"))).first().get(0);

// then I cast the *colMin *to string.

To note that if I don't apply *to_date*() to the target column, then the
result will be erroneous (I think Spark will take the values as strings and
will get the min as if it were applied to alphabetical strings).

Any better approach to accomplish this?
Thanks.


Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread marc nicole
Yes, thanks Enrico, that was greatly helpful!
To note that I was looking for a similar option in the docs but couldn't
stumble on one.
Thanks.
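Since the rest of this thread is in Java, a Java equivalent of Enrico's suggestion below (the path is a placeholder): treating "+" as null at read time lets inferSchema pick the real column types:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> dataset = spark.read()
    .option("header", "true")
    .option("inferSchema", "true")
    .option("nullValue", "+")
    .csv("path");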

Le sam. 4 juin 2022 à 19:29, Enrico Minack  a
écrit :

> You could use .option("nullValue", "+") to tell the parser that '+' refers
> to "no value":
>
> spark.read
>  .option("inferSchema", "true")
>  .option("header", "true")
>  .option("nullvalue", "+")
>  .csv("path")
>
> Enrico
>
>
> Am 04.06.22 um 18:54 schrieb marc nicole:
>
> c1  | c2    | c3 | c4 | c5  | c6
> 1.2 | true  | A  | Z  | 120 | +
> 1.3 | false | B  | X  | 130 | F
> +   | true  | C  | Y  | 200 | G
> in the above table c1 has double values except on the last row so:
>
> Dataset dataset =
> spark.read().format("csv")..option("inferSchema","true").option("header","true").load("path");
> will yield StringType as a type for column c1 similarly for c6
> I want to return the true type of each column by first discarding the "+"
> I use Dataset after filtering the rows (removing "+") because i
> can re-read the new dataset using .csv() method.
> Any better idea to do that ?
>
> Le sam. 4 juin 2022 à 18:40, Enrico Minack  a
> écrit :
>
>> Can you provide an example string (row) and the expected inferred schema?
>>
>> Enrico
>>
>>
>> Am 04.06.22 um 18:36 schrieb marc nicole:
>>
>> How to do just that? i thought we only can inferSchema when we first read
>> the dataset, or am i wrong?
>>
>> Le sam. 4 juin 2022 à 18:10, Sean Owen  a écrit :
>>
>>> It sounds like you want to interpret the input as strings, do some
>>> processing, then infer the schema. That has nothing to do with construing
>>> the entire row as a string like "Row[foo=bar, baz=1]"
>>>
>>> On Sat, Jun 4, 2022 at 10:32 AM marc nicole  wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> Thanks, actually I have a dataset where I want to inferSchema after
>>>> discarding the specific String value of "+". I do this because the column
>>>> would be considered StringType while if i remove that "+" value it will be
>>>> considered DoubleType for example or something else. Basically I want to
>>>> remove "+" from all dataset rows and then inferschema.
>>>> Here my idea is to filter the rows not equal to "+" for the target
>>>> columns (potentially all of them) and then use spark.read().csv() to read
>>>> the new filtered dataset with the option inferSchema which would then yield
>>>> correct column types.
>>>> What do you think?
>>>>
>>>> Le sam. 4 juin 2022 à 15:56, Sean Owen  a écrit :
>>>>
>>>>> I don't think you want to do that. You get a string representation of
>>>>> structured data without the structure, at best. This is part of the reason
>>>>> it doesn't work directly this way.
>>>>> You can use a UDF to call .toString on the Row of course, but, again
>>>>> what are you really trying to do?
>>>>>
>>>>> On Sat, Jun 4, 2022 at 7:35 AM marc nicole 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>> How to convert a Dataset to a Dataset?
>>>>>> What i have tried is:
>>>>>>
>>>>>> List list = dataset.as(Encoders.STRING()).collectAsList();
>>>>>> Dataset datasetSt = spark.createDataset(list, Encoders.STRING());
>>>>>> // But this line raises a org.apache.spark.sql.AnalysisException: Try to
>>>>>> map struct... to Tuple1, but failed as the number of fields does not line
>>>>>> up
>>>>>>
>>>>>> Type of columns being String
>>>>>> How to solve this?
>>>>>>
>>>>>
>>
>


Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread marc nicole
c1  | c2    | c3 | c4 | c5  | c6
1.2 | true  | A  | Z  | 120 | +
1.3 | false | B  | X  | 130 | F
+   | true  | C  | Y  | 200 | G
In the above table, c1 has double values except in the last row, so:

Dataset<Row> dataset =
spark.read().format("csv").option("inferSchema","true").option("header","true").load("path");

will yield StringType as the type of column c1, and similarly for c6.
I want to recover the true type of each column by first discarding the "+".
I use Dataset<String> after filtering the rows (removing "+") because I can
re-read the new dataset using the .csv() method.
Any better idea to do that?

Le sam. 4 juin 2022 à 18:40, Enrico Minack  a
écrit :

> Can you provide an example string (row) and the expected inferred schema?
>
> Enrico
>
>
> Am 04.06.22 um 18:36 schrieb marc nicole:
>
> How to do just that? i thought we only can inferSchema when we first read
> the dataset, or am i wrong?
>
> Le sam. 4 juin 2022 à 18:10, Sean Owen  a écrit :
>
>> It sounds like you want to interpret the input as strings, do some
>> processing, then infer the schema. That has nothing to do with construing
>> the entire row as a string like "Row[foo=bar, baz=1]"
>>
>> On Sat, Jun 4, 2022 at 10:32 AM marc nicole  wrote:
>>
>>> Hi Sean,
>>>
>>> Thanks, actually I have a dataset where I want to inferSchema after
>>> discarding the specific String value of "+". I do this because the column
>>> would be considered StringType while if i remove that "+" value it will be
>>> considered DoubleType for example or something else. Basically I want to
>>> remove "+" from all dataset rows and then inferschema.
>>> Here my idea is to filter the rows not equal to "+" for the target
>>> columns (potentially all of them) and then use spark.read().csv() to read
>>> the new filtered dataset with the option inferSchema which would then yield
>>> correct column types.
>>> What do you think?
>>>
>>> Le sam. 4 juin 2022 à 15:56, Sean Owen  a écrit :
>>>
>>>> I don't think you want to do that. You get a string representation of
>>>> structured data without the structure, at best. This is part of the reason
>>>> it doesn't work directly this way.
>>>> You can use a UDF to call .toString on the Row of course, but, again
>>>> what are you really trying to do?
>>>>
>>>> On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:
>>>>
>>>>> Hi,
>>>>> How to convert a Dataset to a Dataset?
>>>>> What i have tried is:
>>>>>
>>>>> List list = dataset.as(Encoders.STRING()).collectAsList();
>>>>> Dataset datasetSt = spark.createDataset(list, Encoders.STRING());
>>>>> // But this line raises a org.apache.spark.sql.AnalysisException: Try to
>>>>> map struct... to Tuple1, but failed as the number of fields does not line
>>>>> up
>>>>>
>>>>> Type of columns being String
>>>>> How to solve this?
>>>>>
>>>>
>


Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread marc nicole
How to do just that? I thought we can only inferSchema when we first read
the dataset, or am I wrong?

Le sam. 4 juin 2022 à 18:10, Sean Owen  a écrit :

> It sounds like you want to interpret the input as strings, do some
> processing, then infer the schema. That has nothing to do with construing
> the entire row as a string like "Row[foo=bar, baz=1]"
>
> On Sat, Jun 4, 2022 at 10:32 AM marc nicole  wrote:
>
>> Hi Sean,
>>
>> Thanks, actually I have a dataset where I want to inferSchema after
>> discarding the specific String value of "+". I do this because the column
>> would be considered StringType while if i remove that "+" value it will be
>> considered DoubleType for example or something else. Basically I want to
>> remove "+" from all dataset rows and then inferschema.
>> Here my idea is to filter the rows not equal to "+" for the target
>> columns (potentially all of them) and then use spark.read().csv() to read
>> the new filtered dataset with the option inferSchema which would then yield
>> correct column types.
>> What do you think?
>>
>> Le sam. 4 juin 2022 à 15:56, Sean Owen  a écrit :
>>
>>> I don't think you want to do that. You get a string representation of
>>> structured data without the structure, at best. This is part of the reason
>>> it doesn't work directly this way.
>>> You can use a UDF to call .toString on the Row of course, but, again
>>> what are you really trying to do?
>>>
>>> On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:
>>>
>>>> Hi,
>>>> How to convert a Dataset to a Dataset?
>>>> What i have tried is:
>>>>
>>>> List list = dataset.as(Encoders.STRING()).collectAsList();
>>>> Dataset datasetSt = spark.createDataset(list, Encoders.STRING());
>>>> // But this line raises a org.apache.spark.sql.AnalysisException: Try to
>>>> map struct... to Tuple1, but failed as the number of fields does not line
>>>> up
>>>>
>>>> Type of columns being String
>>>> How to solve this?
>>>>
>>>


Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread marc nicole
Hi Sean,

Thanks, actually I have a dataset where I want to inferSchema after
discarding the specific String value of "+". I do this because the column
would be considered StringType, whereas if I remove that "+" value it would be
considered DoubleType, for example, or something else. Basically, I want to
remove "+" from all dataset rows and then infer the schema.
Here my idea is to filter out the rows equal to "+" for the target columns
(potentially all of them) and then use spark.read().csv() to read the new
filtered dataset with the option inferSchema, which would then yield correct
column types.
What do you think?

Le sam. 4 juin 2022 à 15:56, Sean Owen  a écrit :

> I don't think you want to do that. You get a string representation of
> structured data without the structure, at best. This is part of the reason
> it doesn't work directly this way.
> You can use a UDF to call .toString on the Row of course, but, again
> what are you really trying to do?
>
> On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:
>
>> Hi,
>> How to convert a Dataset to a Dataset?
>> What i have tried is:
>>
>> List list = dataset.as(Encoders.STRING()).collectAsList();
>> Dataset datasetSt = spark.createDataset(list, Encoders.STRING());
>> // But this line raises a org.apache.spark.sql.AnalysisException: Try to
>> map struct... to Tuple1, but failed as the number of fields does not line
>> up
>>
>> Type of columns being String
>> How to solve this?
>>
>


How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread marc nicole
Hi,
How to convert a Dataset<Row> to a Dataset<String>?
What I have tried is:

List<String> list = dataset.as(Encoders.STRING()).collectAsList();
Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
// But this line raises an org.apache.spark.sql.AnalysisException: Try to
map struct... to Tuple1, but failed as the number of fields does not line
up

The type of the columns is String.
How to solve this?
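If a plain string per row is really what's needed (see Sean's caveat earlier in this thread about losing the structure), one hedged sketch is to map each Row to its toString directly instead of going through as(Encoders.STRING()), which avoids the Tuple1 mismatch:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Each Row becomes its string representation, e.g. "[1.2,true,A,Z,120,+]".
Dataset<String> datasetSt = dataset.map(
    (MapFunction<Row, String>) Row::toString, Encoders.STRING());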


approx_count_distinct in spark always returns 1

2022-06-02 Thread marc nicole
I have a dataset where I want to count the distinct values of a column based on a
group of other columns. I do it like so:

processedDataset = processedDataset.withColumn("freq",
approx_count_distinct("col1").over(Window.partitionBy(groupCols.toArray(new
Column[groupCols.size()]))));


but even when I have duplicate column values I still get 1 in the "freq"
column.

Also, when I specify the rsd param to be 0, I get an ArrayIndexOutOfBounds
kind of error.
Why?
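If the goal is actually how many rows share each combination of the group columns (rather than how many distinct col1 values exist inside each group), a plain count over the same window gives that directly; this is the same construct used in the grouping posts further down. A sketch, reusing the names from the post above:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.expressions.Window;

// Frequency of each (groupCols) combination across the dataset.
processedDataset = processedDataset.withColumn("freq",
    count("*").over(Window.partitionBy(groupCols.toArray(new Column[0]))));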


Re: Unable to convert double values

2022-05-29 Thread marc nicole
So sorry, the matching pattern is rather '^\d*[.]\d*$'
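Purely as an illustration (the linked question is PySpark, but in Java terms), one way that pattern might be applied is with rlike, keeping only values that already look like decimals before casting; the column name is taken from the snippet quoted below, and the new column name is hypothetical:

import static org.apache.spark.sql.functions.*;

dataset = dataset.withColumn("annual_salary_double",
    when(col("annual_salary").rlike("^\\d*[.]\\d*$"),
         col("annual_salary").cast("double")));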

Le dim. 29 mai 2022 à 19:58, marc nicole  a écrit :

> Hi,
>
> I think this part of your first line of code*
> ...regexp_replace(col("annual_salary"), "\.", "") *is messing things up,
> so try to remove it.
> Also try to use this numerical matching pattern '^[0-9]*$' in your code
> instead
>
>
>
> Le dim. 29 mai 2022 à 19:24, Sid  a écrit :
>
>> Hi Team,
>>
>> I need help with the below problem:
>>
>>
>> https://stackoverflow.com/questions/72422872/unable-to-format-double-values-in-pyspark?noredirect=1#comment127940175_72422872
>>
>>
>> What am I doing wrong?
>>
>> Thanks,
>> Siddhesh
>>
>


Re: Unable to convert double values

2022-05-29 Thread marc nicole
Hi,

I think this part of your first line of code*
...regexp_replace(col("annual_salary"), "\.", "") *is messing things up, so
try to remove it.
Also try to use this numerical matching pattern '^[0-9]*$' in your code
instead



Le dim. 29 mai 2022 à 19:24, Sid  a écrit :

> Hi Team,
>
> I need help with the below problem:
>
>
> https://stackoverflow.com/questions/72422872/unable-to-format-double-values-in-pyspark?noredirect=1#comment127940175_72422872
>
>
> What am I doing wrong?
>
> Thanks,
> Siddhesh
>


k-anonymity with Spark in Java

2022-05-28 Thread marc nicole
Hi Spark devs,

Anybody willing to check my code implementing *k-anonymity*?


public static Dataset < Row > kAnonymizeBySuppression(SparkSession
sparksession, Dataset < Row > initDataset, List < String > qidAtts, Integer
k_anonymity_constant) {

Dataset < Row > anonymizedDF = sparksession.emptyDataFrame();

Dataset < Row > tmpDF = sparksession.emptyDataFrame();
List < Column > groupByQidAttributes = qidAtts.stream().map(functions::
col).collect(Collectors.toList());

// groupBy and count each occurence.
Dataset < Row > groupedRowsDF = initDataset.withColumn("qidsFreqs",
count("*").over(Window.partitionBy(groupByQidAttributes.toArray(new Column[
groupByQidAttributes.size()]))));
Dataset < Row > rowsDeleteDF =
groupedRowsDF.select(col("*")).where("qidsFreqs
<" + k_anonymity_constant).toDF();
tmpDF = groupedRowsDF.select(col("*")).where("qidsFreqs >=" +
k_anonymity_constant).toDF();


for (String qidAtt: qidAtts) {
Dataset < Row > groupedRowsProcDF = rowsDeleteDF.withColumn(
"attFreq", approx_count_distinct(qidAtt).over(Window.partitionBy(
groupByQidAttributes.toArray(new Column[groupByQidAttributes.size()]))));

Dataset < Row > rowsDeleteDFUpdate = groupedRowsProcDF.select(col(
"*")).where("attFreq <" + k_anonymity_constant).toDF();

if (anonymizedDF.count() == 0)
anonymizedDF = rowsDeleteDFUpdate;
if (rowsDeleteDF.count() != 0) {
anonymizedDF = anonymizedDF.drop("attFreq").withColumn(qidAtt,
lit("*"));


}
}


return tmpDF.drop("qidsFreqs").union(anonymizedDF.drop("qidsFreqs"));
}
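A hypothetical usage sketch of the method above (the column names and the k value are made up, just to show the intended call shape):

import java.util.Arrays;

Dataset<Row> anonymized = kAnonymizeBySuppression(
    spark, inputDataset, Arrays.asList("age", "zipcode", "gender"), 3);
anonymized.show();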



Thanks in advance for your improving comments.

Marc.


Grouping and counting occurrences of specific column rows

2022-04-22 Thread marc nicole
Hi all,
Sorry for posting this twice,

I need to know how to group by several column attributes (e.g., List<Column>
groupByAttributes) a dataset (dataset) and then count the occurrences of the
associated grouped rows. How do I achieve that?
I tried through the following code:

> Dataset<Row> groupedRows = dataset.withColumn("freqs",
> count("*").over(Window.partitionBy(groupByAttributes.toArray(new
> Column[groupByAttributes.size()]))));

Would that do it?
To note: I want the "grouped" rows to stay separate for the subsequent
transformations, so a groupBy is not adequate in this case.


Re: Grouping and counting occurrences of specific column rows

2022-04-19 Thread marc nicole
I don't want to groupBy since i want the rows separate for the subsequent
transformations. But i want to groupBy (i am using partitionBy here) using
many attributes while counting the frequency for each different group of
records (with respect to the the attributes first mentioned)

Le mar. 19 avr. 2022 à 14:06, Sean Owen  a écrit :

> Just .groupBy(...).count() ?
>
> On Tue, Apr 19, 2022 at 6:24 AM marc nicole  wrote:
>
>> Hello guys,
>>
>> I want to group by certain column attributes (e.g.,List
>> groupByQidAttributes) a dataset (initDataset) and then count the
>> occurrences of associated grouped rows, how do i achieve that neatly?
>> I tried through the following code:
>> Dataset<Row> groupedRowsDF = initDataset.withColumn("qidsFreqs", count(
>> "*").over(Window.partitionBy(groupByQidAttributes.toArray(new Column[
>> groupByQidAttributes.size()])))); Is that OK to use for the purpose?
>>
>>


Grouping and counting occurrences of specific column rows

2022-04-19 Thread marc nicole
Hello guys,

I want to group by certain column attributes (e.g., List<Column>
groupByQidAttributes) a dataset (initDataset) and then count the
occurrences of associated grouped rows. How do I achieve that neatly?
I tried it through the following code:

Dataset<Row> groupedRowsDF = initDataset.withColumn("qidsFreqs", count("*").
over(Window.partitionBy(groupByQidAttributes.toArray(new Column[
groupByQidAttributes.size()]))));

Is that OK to use for the purpose?


Please Review My Code

2022-04-16 Thread marc nicole
Hello Guys,

I want you to review my code available in this Github repo:
https://github.com/MNicole12/AlgorithmForReview/blob/main/codeReview.java
Thanks in advance for your improving comments.

Marc.