read dataset from only one node in YARN cluster

2023-08-18 Thread marc nicole
Hi, Spark 3.2, Hadoop 3.2, using YARN cluster mode: if one wants to read a dataset that resides on only one node of the cluster and not on the others, how can one tell Spark that? I expect through DataFrameReader, using a path like *IP:port/pathOnLocalNode*. PS: loading the dataset into HDFS is not an

Change column values using several when conditions

2023-05-01 Thread marc nicole
Hello, I want to change the values of a column in a dataset according to a mapping list that maps original values of that column to new values. Each element of the list (colMappingValues) is a string that separates the original value from the new value using a ";". So for a given column (in
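
Based on the description above, a minimal stand-alone sketch of the mapping step (the class and method names are hypothetical; only the "original;new" element format comes from the message). The Spark half, folding the map into a chain of when(...) calls, is shown as a comment since it needs a running Spark session:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Parses colMappingValues entries of the form "originalValue;newValue" into a
// map. In Spark the map would then be folded into a chain of when() calls,
// roughly (an assumption, not code from the thread):
//   Column c = col(name);
//   for (Map.Entry<String, String> e : mapping.entrySet())
//       c = functions.when(col(name).equalTo(e.getKey()), e.getValue()).otherwise(c);
//   dataset = dataset.withColumn(name, c);
public class ValueMapping {
    public static Map<String, String> parseMappings(List<String> colMappingValues) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String pair : colMappingValues) {
            String[] parts = pair.split(";", 2); // limit 2: the new value may itself contain ';'
            if (parts.length == 2) {
                mapping.put(parts[0], parts[1]);
            }
        }
        return mapping;
    }
}
```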

How to change column values using several when conditions?

2023-04-30 Thread marc nicole
Hello to you Sparkling community :) I want to change the values of a column in a dataset according to a mapping list that maps original values of that column to new values. Each element of the list (colMappingValues) is a string that separates the original value from the new value using a

Re: input file size

2022-06-19 Thread marc nicole
Thinking in terms of files (vs. datasets, as I first read this question), I think this is more adequate in Spark: > org.apache.spark.util.Utils.getFileLength(new File("filePath"), null); it will yield the same result as > new File("filePath").length(); On Sun, Jun 19, 2022 at 11:11, Enrico Minack

Re: input file size

2022-06-18 Thread marc nicole
Hi, I found this (https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html), which may be helpful; I use Java: > org.apache.spark.util.SizeEstimator.estimate(dataset); On Sat, Jun 18, 2022 at 22:33, mbreuer wrote: > Hello Community, > > I am working on
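
The two messages above distinguish two different sizes: SizeEstimator.estimate estimates the in-memory footprint of a JVM object (often much larger than the file), while plain java.io gives the on-disk size. A small stand-alone sketch of the on-disk half (SizeEstimator needs Spark on the classpath, so it is only referenced in the comment; the class name and file names are illustrative):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// On-disk size in bytes via plain java.io; returns 0 if the file does not
// exist. For the in-memory estimate of a loaded Dataset, the thread points to
// org.apache.spark.util.SizeEstimator.estimate(dataset) instead.
public class FileSize {
    public static long onDiskSize(String path) {
        return new File(path).length();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("demo", ".txt");
        try (FileWriter w = new FileWriter(tmp)) { w.write("hello"); }
        System.out.println(onDiskSize(tmp.getAbsolutePath())); // 5
        tmp.delete();
    }
}
```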

Re: how to properly filter a dataset by dates?

2022-06-17 Thread marc nicole
n. Anyways, thanks guys! On Fri, Jun 17, 2022 at 22:35, marc nicole wrote: > String dateString = String.format("%d-%02d-%02d", 2012, 02, 03); > Date sqlDate = java.sql.Date.valueOf(dateString); > dataset = > dataset.where(to_date(dataset.col("Date"), "MM-dd-yyyy"

Re: how to properly filter a dataset by dates?

2022-06-17 Thread marc nicole
this On Fri, Jun 17, 2022 at 22:13, marc nicole wrote: > @Stelios: to_date requires a column type > @Sean: how to parse a literal to a date, lit("02-03-2012").cast("date")? > > On Fri, Jun 17, 2022 at 22:07, Stelios Philippou > wrote: > >> datas

Re: how to properly filter a dataset by dates?

2022-06-17 Thread marc nicole
2012", > "MM-dd-yyyy")); > > On Fri, 17 Jun 2022, 22:51 marc nicole wrote: > >> dataset = >> dataset.where(to_date(dataset.col("Date"), "MM-dd-yyyy").geq("02-03-2012").cast("date")); >> ? >> This is return

Re: how to properly filter a dataset by dates?

2022-06-17 Thread marc nicole
econd part and don't forget to cast it as well > > On Fri, 17 Jun 2022, 22:08 marc nicole wrote: > >> should I cast the target date to date then? for example maybe: >> >> dataset = >>> dataset.where(to_date(dataset.col("Date"), "MM-dd-yyyy").geq(

Re: how to properly filter a dataset by dates?

2022-06-17 Thread marc nicole
en wrote: > Look at your query again. You are comparing dates to strings. The dates > widen back to strings. > > On Fri, Jun 17, 2022, 1:39 PM marc nicole wrote: > >> I also tried: >> >> dataset = >>> dataset.where(to_date(dataset.col("Date"),

Re: how to properly filter a dataset by dates?

2022-06-17 Thread marc nicole
5 as a > string is before 02-03-2012. > You apply date functions to dates, not strings. > You have to parse the dates properly, which was the problem in your last > email. > > On Fri, Jun 17, 2022 at 12:58 PM marc nicole wrote: > >> Hello, >> >> I have a dataset

how to properly filter a dataset by dates?

2022-06-17 Thread marc nicole
Hello, I have a dataset containing a column of dates, which I want to use for filtering. Nothing, from what I have tried, seems to return the exact right solution. Here's my input:

+------------+
|    Date    |
+------------+
| 02-08-2019 |
+------------+
| 02-07-2019 |
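
The thread above converges on one fix: parse both sides of the comparison with the same pattern before comparing, instead of comparing a date column to a raw string (which makes the dates "widen back to strings"). A stand-alone sketch of that idea using java.time; the class name is hypothetical, and the Spark equivalent in the comment is an assumption pieced together from the snippets:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Both sides are parsed with the same "MM-dd-yyyy" pattern, then compared as
// dates. The Spark equivalent would be roughly:
//   dataset.where(to_date(col("Date"), "MM-dd-yyyy")
//           .geq(to_date(lit("02-03-2012"), "MM-dd-yyyy")));
public class DateFilter {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("MM-dd-yyyy");

    public static boolean onOrAfter(String dateStr, String threshold) {
        return !LocalDate.parse(dateStr, FMT).isBefore(LocalDate.parse(threshold, FMT));
    }
}
```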

Re: How to recognize and get the min of a date/string column in Java?

2022-06-15 Thread marc nicole
Finally solved with the MM-for-months format recommendation, thanks. On Tue, Jun 14, 2022 at 23:02, marc nicole wrote: > I changed the format to yyyy-mm-dd for the example > > On Tue, Jun 14, 2022 at 22:52, Sean Owen wrote: > >> Look at your data - doesn't match da

Re: How to recognize and get the min of a date/string column in Java?

2022-06-14 Thread marc nicole
I changed the format to yyyy-mm-dd for the example. On Tue, Jun 14, 2022 at 22:52, Sean Owen wrote: > Look at your data - doesn't match the date format you give > > On Tue, Jun 14, 2022, 3:41 PM marc nicole wrote: > >> for the input (I changed the format): >> >>

Re: How to recognize and get the min of a date/string column in Java?

2022-06-14 Thread marc nicole
... | 2022-02-08 | ... The output was 2012-01-03. Note that for my code below to work, I cast the resulting min column to string. On Tue, Jun 14, 2022 at 21:12, Sean Owen wrote: > You haven't shown your input or the result > > On Tue, Jun 14, 2022 at 1:40 PM marc nico

Re: How to recognize and get the min of a date/string column in Java?

2022-06-14 Thread marc nicole
ically. > > Small note: MM is month, mm is minute. You have to fix that for this to > work. These are Java format strings. > > On Tue, Jun 14, 2022, 12:32 PM marc nicole wrote: > >> Hi, >> >> I want to identify a column of dates as such; the column has formatt

How to recognize and get the min of a date/string column in Java?

2022-06-14 Thread marc nicole
Hi, I want to identify a column of dates as such; the column has formatted strings like "06-14-2022" (the format being mm-dd-yyyy), and I want to get the minimum of those dates. I tried in Java as follows: if (dataset.filter(org.apache.spark.sql.functions.to_date( > dataset.col(colName),
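
Sean's correction in the replies above is that "MM" is month while "mm" is minute in Java format strings, so the pattern must be "MM-dd-yyyy". A stand-alone sketch (class name hypothetical) showing that once the strings are parsed as dates, the minimum is a date comparison rather than a lexical one:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Comparator;
import java.util.List;

// "12-01-2011" sorts AFTER "02-03-2012" as a string but is the earlier date,
// which is why parsing with the right pattern matters before taking the min.
public class MinDate {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("MM-dd-yyyy");

    public static String minDate(List<String> dates) {
        return dates.stream()
                .map(s -> LocalDate.parse(s, FMT))
                .min(Comparator.naturalOrder())
                .map(FMT::format)
                .get();
    }
}
```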

Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread marc nicole
' refers > to "no value": > > spark.read > .option("inferSchema", "true") > .option("header", "true") > .option("nullValue", "+") > .csv("path") > > Enrico > > > On 04

Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread marc nicole
.csv() method. Any better idea to do that? On Sat, Jun 4, 2022 at 18:40, Enrico Minack wrote: > Can you provide an example string (row) and the expected inferred schema? > > Enrico > > > On 04.06.22 at 18:36, marc nicole wrote: > > How to do just that? I thought we only can

Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread marc nicole
truing > the entire row as a string like "Row[foo=bar, baz=1]" > > On Sat, Jun 4, 2022 at 10:32 AM marc nicole wrote: > >> Hi Sean, >> >> Thanks, actually I have a dataset where I want to inferSchema after >> discarding the specific String value of

Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread marc nicole
y this way. > You can use a UDF to call .toString on the Row, of course, but again: > what are you really trying to do? > > On Sat, Jun 4, 2022 at 7:35 AM marc nicole wrote: > >> Hi, >> How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;? >> What I have tried is: >> >> List li

How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread marc nicole
Hi, how to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;? What I have tried is: List&lt;String&gt; list = dataset.as(Encoders.STRING()).collectAsList(); Dataset&lt;String&gt; datasetSt = spark.createDataset(list, Encoders.STRING()); // But this line raises an org.apache.spark.sql.AnalysisException: Try to map struct... to Tuple1, but
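
The AnalysisException arises because .as(Encoders.STRING()) only works when the Dataset has a single string-compatible column, not a multi-column struct. A stand-alone sketch of the flattening idea (the Spark call in the comment is an assumption based on the thread; the helper and its names are illustrative):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// For a multi-column Dataset<Row>, the usual route is an explicit map, e.g.:
//   Dataset<String> ds = dataset.map(
//       (MapFunction<Row, String>) row -> row.mkString(","), Encoders.STRING());
// The helper below performs the same flattening on a plain array of values.
public class RowToString {
    public static String mkString(Object[] values, String sep) {
        return Arrays.stream(values)
                .map(String::valueOf)
                .collect(Collectors.joining(sep));
    }
}
```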

approx_count_distinct in Spark always returns 1

2022-06-02 Thread marc nicole
I have a dataset where I want to count the distinct values of a column within groups of other columns. I do it like so: processedDataset = processedDataset.withColumn("freq", approx_count_distinct("col1").over(Window.partitionBy(groupCols.toArray(new Column[groupCols.size()])))); but even when I have
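
One reason the window version can legitimately return 1: the (approximate) distinct count is taken within each window partition, so if "col1" is constant inside every group the answer is 1 on every row. A stand-alone sketch (class and method names hypothetical) that mimics approx_count_distinct("col1").over(Window.partitionBy(groupCols)) for in-memory rows of the form {groupKey, col1}, using an exact count rather than the HyperLogLog approximation:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Groups rows by their key (element 0) and counts the distinct values of
// col1 (element 1) inside each group.
public class DistinctPerGroup {
    public static Map<String, Long> distinctCol1PerGroup(List<String[]> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> r[0],
                Collectors.mapping(r -> r[1],
                        Collectors.collectingAndThen(Collectors.toSet(),
                                s -> (long) s.size()))));
    }
}
```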

Re: Unable to convert double values

2022-05-29 Thread marc nicole
So sorry, the matching pattern is rather '^\d*[.]\d*$'. On Sun, May 29, 2022 at 19:58, marc nicole wrote: > Hi, > > I think this part of your first line of code, *...regexp_replace(col("annual_salary"), "\.", "")*, is messing things up, > so try t

Re: Unable to convert double values

2022-05-29 Thread marc nicole
Hi, I think this part of your first line of code, *...regexp_replace(col("annual_salary"), "\.", "")*, is messing things up, so try to remove it. Also try using this numerical matching pattern '^[0-9]*$' in your code instead. On Sun, May 29, 2022 at 19:24, Sid wrote: > Hi Team, > > I need
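
A small stand-alone check of the corrected pattern from the follow-up, '^\d*[.]\d*$' (class and method names hypothetical). Note the pattern requires a literal dot, so it matches "1234.56" and ".5" but not "125" or "1,234", which is also why stripping the dot first with regexp_replace breaks double parsing:

```java
// The Spark-side check could be col("annual_salary").rlike("^\\d*[.]\\d*$")
// (an assumption; the thread only gives the pattern itself).
public class DoublePattern {
    public static boolean looksLikeDouble(String s) {
        return s.matches("^\\d*[.]\\d*$");
    }
}
```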

k-anonymity with Spark in Java

2022-05-28 Thread marc nicole
Hi Spark devs, anybody willing to check my code implementing *k-anonymity*? public static Dataset&lt;Row&gt; kAnonymizeBySuppression(SparkSession sparksession, Dataset&lt;Row&gt; initDataset, List&lt;String&gt; qidAtts, Integer k_anonymity_constant) { Dataset&lt;Row&gt; anonymizedDF =

Grouping and counting occurrences of specific column rows

2022-04-22 Thread marc nicole
Hi all, sorry for posting this twice. I need to know how to group a dataset (dataset) by several column attributes (e.g., List&lt;String&gt; groupByAttributes) and then count the occurrences of the associated grouped rows. How do I achieve that? I tried the following code: > Dataset&lt;Row&gt; groupedRows =

Re: Grouping and counting occurences of specific column rows

2022-04-19 Thread marc nicole
. 2022 at 14:06, Sean Owen wrote: > Just .groupBy(...).count()? > > On Tue, Apr 19, 2022 at 6:24 AM marc nicole wrote: > >> Hello guys, >> >> I want to group a dataset (initDataset) by certain column attributes (e.g., List&lt;String&gt; >> groupByQidAttributes) and then co

Grouping and counting occurrences of specific column rows

2022-04-19 Thread marc nicole
Hello guys, I want to group a dataset (initDataset) by certain column attributes (e.g., List&lt;String&gt; groupByQidAttributes) and then count the occurrences of the associated grouped rows. How do I achieve that neatly? I tried the following code: Dataset groupedRowsDF =
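
Sean's reply in the other thread boils down to dataset.groupBy(cols).count(), which yields one row per distinct key combination plus a "count" column. A stand-alone sketch (class and method names hypothetical) doing the same for in-memory rows, keyed by the chosen column indices:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Groups rows by the values at the given column indices and counts each group,
// mimicking Spark's groupBy(cols).count() for plain lists.
public class GroupCount {
    public static Map<List<String>, Long> countPerGroup(List<List<String>> rows, int... keyIdx) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> {
                    List<String> key = new ArrayList<>();
                    for (int i : keyIdx) key.add(r.get(i));
                    return key;
                },
                Collectors.counting()));
    }
}
```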

Please Review My Code

2022-04-16 Thread marc nicole
Hello guys, I would like you to review my code, available in this GitHub repo: https://github.com/MNicole12/AlgorithmForReview/blob/main/codeReview.java Thanks in advance for your comments and improvements. Marc.