How to improve efficiency of this piece of code (returning distinct column values)

sam smith Fri, 10 Feb 2023 13:34:51 -0800

I want to collect the distinct values of each column into a List (is it
good practice to use a List here?) whose first element is the column name
and whose remaining elements are that column's distinct values, so that
for a whole dataset we get a list of lists. I do it this way (in my
opinion, not so fast):


import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

List<List<String>> finalList = new ArrayList<>();
Dataset<Row> df = spark.read().format("csv").option("header", "true").load("/pathToCSV");
String[] columnNames = df.columns();
for (String columnName : columnNames) {
    // First element is the column name, followed by its distinct values.
    List<String> columnList = new ArrayList<>();
    columnList.add(columnName);

    // One Spark job per column: drop nulls, deduplicate, collect to the driver.
    List<Row> columnValues = df
            .filter(org.apache.spark.sql.functions.col(columnName).isNotNull())
            .select(columnName)
            .distinct()
            .collectAsList();
    for (Row value : columnValues) {
        columnList.add(value.get(0).toString());
    }

    finalList.add(columnList);
}


How can I improve this?

Also, can I get the results in JSON format?
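
To make the two questions concrete, here is a minimal plain-Java sketch (no Spark involved) of one possible direction: visit the rows once, keep one set of distinct non-null values per column, and render the result as JSON. The class name DistinctColumns, its methods, and the sample rows are hypothetical and exist only for illustration. A Map from column name to Set is used here instead of a list of lists, which is an assumption about what is convenient, not a requirement. In Spark itself, a similar single-pass idea might be expressed with functions.collect_set per column inside one agg call, and Dataset.toJSON() could produce JSON rows, though neither is measured here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DistinctColumns {

    // Collects the distinct non-null values of every column in a single
    // pass over the rows, preserving column order and first-seen value order.
    public static Map<String, Set<String>> distinctByColumn(String[] columnNames,
                                                            List<String[]> rows) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (String name : columnNames) {
            result.put(name, new LinkedHashSet<>());
        }
        for (String[] row : rows) {
            for (int i = 0; i < columnNames.length; i++) {
                if (row[i] != null) {
                    result.get(columnNames[i]).add(row[i]);
                }
            }
        }
        return result;
    }

    // Renders the map as a JSON object: {"column":["value", ...], ...}.
    public static String toJson(Map<String, Set<String>> distinct) {
        StringBuilder sb = new StringBuilder("{");
        boolean firstColumn = true;
        for (Map.Entry<String, Set<String>> entry : distinct.entrySet()) {
            if (!firstColumn) {
                sb.append(",");
            }
            firstColumn = false;
            sb.append(quote(entry.getKey())).append(":[");
            boolean firstValue = true;
            for (String value : entry.getValue()) {
                if (!firstValue) {
                    sb.append(",");
                }
                firstValue = false;
                sb.append(quote(value));
            }
            sb.append("]");
        }
        return sb.append("}").toString();
    }

    // Minimal escaping (backslash and double quote only); control
    // characters are not handled, so a JSON library is safer for real data.
    public static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the CSV contents.
        String[] columns = {"city", "country"};
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Paris", "FR"});
        rows.add(new String[] {"Lyon", "FR"});
        rows.add(new String[] {"Paris", null});
        System.out.println(toJson(distinctByColumn(columns, rows)));
    }
}
```

A Set per column removes duplicates on insertion, so no separate distinct step is needed, and keying by column name avoids the convention that the first list element is special.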
