Why would CSV or a temp table change anything here? You don't need windowing for distinct values, either.
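Sean's point is that per-column distinct values take nothing more than one distinct pass per column, with no window functions involved. As a minimal sketch of that idea (a plain-Python stand-in rather than the actual Spark job, since no cluster is assumed here; the sample rows are made up for illustration):

```python
# Stand-in for the Spark job: distinct values per column need no
# windowing, just one distinct pass per column. Done here over plain
# Python rows so it runs anywhere; the sample data is hypothetical.
rows = [
    {"city": "Paris", "color": "red"},
    {"city": "Tokyo", "color": "red"},
    {"city": "Paris", "color": None},
]

def distinct_per_column(rows):
    """Map each column name to its sorted distinct non-null values."""
    columns = rows[0].keys()
    return {
        col: sorted({r[col] for r in rows if r[col] is not None})
        for col in columns
    }

print(distinct_per_column(rows))
# {'city': ['Paris', 'Tokyo'], 'color': ['red']}
```

In PySpark the same per-column loop would select one column, call `distinct()`, and collect, without any `Window` specification.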
On Fri, Feb 10, 2023, 6:01 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Off the top of my head: create a dataframe by reading the CSV file.
>
> This is Python:
>
>     listing_df = spark.read.format("com.databricks.spark.csv") \
>         .option("inferSchema", "true") \
>         .option("header", "true") \
>         .load(csv_file)
>     listing_df.printSchema()
>     listing_df.createOrReplaceTempView("temp")
>
>     ## get your distinct columns using windowing functions on the temp table with SQL
>
> HTH
>
> view my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> On Fri, 10 Feb 2023 at 21:59, sam smith <qustacksm2123...@gmail.com> wrote:
>
>> I am not sure I understand "Just need to do the cols one at a time" well. Plus, I think Apostolos is right: this needs a dataframe approach, not a list approach.
>>
>> On Fri, 10 Feb 2023 at 22:47, Sean Owen <sro...@gmail.com> wrote:
>>
>>> For each column, select only that column and get its distinct values, similar to what you do here. Just do the columns one at a time. Your current code doesn't do what you want.
>>>
>>> On Fri, Feb 10, 2023, 3:46 PM sam smith <qustacksm2123...@gmail.com> wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> "You need to select the distinct values of each col one at a time", how?
>>>>
>>>> On Fri, 10 Feb 2023 at 22:40, Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> That gives you all distinct tuples of those column values. You need to select the distinct values of each column one at a time. Sure, just collect() the result as you do here.
>>>>>
>>>>> On Fri, Feb 10, 2023, 3:34 PM sam smith <qustacksm2123...@gmail.com> wrote:
>>>>>
>>>>>> I want to get the distinct values of each column in a List (is it good practice to use List here?) that contains the column name as its first element and the column's distinct values as the remaining elements, so that for a dataset we get a list of lists. I do it this way (in my opinion not so fast):
>>>>>>
>>>>>>     List<List<String>> finalList = new ArrayList<List<String>>();
>>>>>>     Dataset<Row> df = spark.read().format("csv").option("header", "true").load("/pathToCSV");
>>>>>>     String[] columnNames = df.columns();
>>>>>>     for (int i = 0; i < columnNames.length; i++) {
>>>>>>         List<String> columnList = new ArrayList<String>();
>>>>>>         columnList.add(columnNames[i]);
>>>>>>         List<Row> columnValues = df
>>>>>>             .filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
>>>>>>             .select(columnNames[i])
>>>>>>             .distinct()
>>>>>>             .collectAsList();
>>>>>>         for (int j = 0; j < columnValues.size(); j++)
>>>>>>             columnList.add(columnValues.get(j).apply(0).toString());
>>>>>>         finalList.add(columnList);
>>>>>>     }
>>>>>>
>>>>>> How can I improve this?
>>>>>>
>>>>>> Also, can I get the results in JSON format?
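On the JSON question: a map of column name to distinct values serializes to JSON directly, unlike the list-of-lists layout where the first element is the column name. A minimal sketch, assuming the distinct values have already been collected (the values shown are hypothetical placeholders; in Java one would use a JSON library such as Jackson or Gson for the same step):

```python
import json

# Hypothetical per-column distinct-values result, shaped as a map of
# column name -> distinct values. This structure maps one-to-one onto
# a JSON object, so serialization is a single call.
distinct_values = {
    "city": ["Paris", "Tokyo"],
    "color": ["blue", "red"],
}

as_json = json.dumps(distinct_values, indent=2)
print(as_json)
```

Round-tripping with `json.loads(as_json)` recovers the original map, which is why the map shape is a better fit than nested lists for this output.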