Yes, thanks Enrico, that was greatly helpful! I'll note that I was looking for a similar option in the docs but couldn't stumble on one. Thanks.
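[Editorial note for the archive: below is a minimal plain-Python sketch, with no Spark involved, of what Enrico's `.option("nullValue", "+")` suggestion effectively does: the sentinel "+" is mapped to null before types are inferred, so the remaining values in c1 can be read as doubles. The sample data mirrors the table quoted in the thread; the per-column inference is a rough approximation for illustration, not Spark's actual inferSchema logic.]

```python
import csv
import io

# Sample CSV mirroring the table from the thread; "+" marks "no value".
raw = """c1,c2,c3,c4,c5,c6
1.2,true,A,Z,120,+
1.3,false,B,X,130,F
+,true,C,Y,200,G
"""

NULL_VALUE = "+"  # analogous to Spark's .option("nullValue", "+")

def infer_type(values):
    """Rough per-column type inference over the non-null string values."""
    non_null = [v for v in values if v is not None]
    for name, cast in (("int", int), ("double", float)):
        try:
            for v in non_null:
                cast(v)
            return name
        except ValueError:
            pass
    if all(v.lower() in ("true", "false") for v in non_null):
        return "boolean"
    return "string"

# Map the sentinel to None (null) first, then infer each column's type.
rows = [
    {k: (None if v == NULL_VALUE else v) for k, v in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)
# {'c1': 'double', 'c2': 'boolean', 'c3': 'string', 'c4': 'string', 'c5': 'int', 'c6': 'string'}
```

Without the null mapping, c1 would contain the literal string "+" and fall back to string, which is exactly the StringType problem described below.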
On Sat, Jun 4, 2022 at 19:29, Enrico Minack <i...@enrico.minack.dev> wrote:

> You could use .option("nullValue", "+") to tell the parser that '+' refers
> to "no value":
>
> spark.read
>     .option("inferSchema", "true")
>     .option("header", "true")
>     .option("nullValue", "+")
>     .csv("path")
>
> Enrico
>
> On 04.06.22 at 18:54, marc nicole wrote:
>
> > c1    c2     c3  c4  c5   c6
> > 1.2   true   A   Z   120  +
> > 1.3   false  B   X   130  F
> > +     true   C   Y   200  G
> >
> > In the above table, c1 has double values except in the last row, so:
> >
> > Dataset<Row> dataset = spark.read().format("csv")
> >     .option("inferSchema", "true").option("header", "true").load("path");
> >
> > will yield StringType as the type of column c1, and similarly for c6.
> > I want to return the true type of each column by first discarding the "+".
> > I use Dataset<String> after filtering the rows (removing "+") because I
> > can re-read the new dataset using the .csv() method.
> > Any better idea to do that?
>
> On Sat, Jun 4, 2022 at 18:40, Enrico Minack <i...@enrico.minack.dev> wrote:
>
>> Can you provide an example string (row) and the expected inferred schema?
>>
>> Enrico
>>
>> On 04.06.22 at 18:36, marc nicole wrote:
>>
>> How to do just that? I thought we could only inferSchema when we first
>> read the dataset, or am I wrong?
>>
>> On Sat, Jun 4, 2022 at 18:10, Sean Owen <sro...@gmail.com> wrote:
>>
>>> It sounds like you want to interpret the input as strings, do some
>>> processing, then infer the schema. That has nothing to do with construing
>>> the entire row as a string like "Row[foo=bar, baz=1]".
>>>
>>> On Sat, Jun 4, 2022 at 10:32 AM, marc nicole <mk1853...@gmail.com> wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> Thanks, actually I have a dataset where I want to inferSchema after
>>>> discarding the specific String value of "+".
>>>> I do this because the column would be considered StringType, while if I
>>>> remove that "+" value it would be considered DoubleType, for example, or
>>>> something else. Basically I want to remove "+" from all dataset rows and
>>>> then infer the schema.
>>>> My idea here is to filter out the rows equal to "+" in the target
>>>> columns (potentially all of them) and then use spark.read().csv() to read
>>>> the new filtered dataset with the inferSchema option, which would then
>>>> yield the correct column types.
>>>> What do you think?
>>>>
>>>> On Sat, Jun 4, 2022 at 15:56, Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> I don't think you want to do that. You get a string representation of
>>>>> structured data without the structure, at best. This is part of the
>>>>> reason it doesn't work directly this way.
>>>>> You can use a UDF to call .toString on the Row, of course, but again,
>>>>> what are you really trying to do?
>>>>>
>>>>> On Sat, Jun 4, 2022 at 7:35 AM, marc nicole <mk1853...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> How to convert a Dataset<Row> to a Dataset<String>?
>>>>>> What I have tried is:
>>>>>>
>>>>>> List<String> list = dataset.as(Encoders.STRING()).collectAsList();
>>>>>> Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
>>>>>> // But this line raises an org.apache.spark.sql.AnalysisException:
>>>>>> // "Try to map struct... to Tuple1, but failed as the number of fields
>>>>>> // does not line up"
>>>>>>
>>>>>> The type of the columns is String.
>>>>>> How to solve this?
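[Editorial note for the archive: the row-filtering approach proposed in the thread has a side effect worth noting: dropping every row that contains the "+" sentinel also discards the valid values in the other columns of those rows, whereas the nullValue option keeps them. A small plain-Python sketch, with no Spark and hypothetical sample data, illustrating the data loss:]

```python
import csv
import io

# Hypothetical sample: each row has "+" in at most one column.
raw = """c1,c2,c6
1.2,true,+
1.3,false,F
+,true,G
"""

reader = csv.reader(io.StringIO(raw))
header = next(reader)
data = list(reader)

# Filter out whole rows that contain the sentinel anywhere, as proposed:
kept = [row for row in data if "+" not in row]
print(len(data), "->", len(kept))
# Only one of the three rows survives, even though every row carried
# valid values in its other columns.
```

This is why mapping the sentinel to null (Enrico's suggestion) is preferable when the goal is only to fix the inferred column types.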