c1    c2     c3   c4   c5    c6
1.2   true   A    Z    120   +
1.3   false  B    X    130   F
+     true   C    Y    200   G

In the above table, c1 has double values except on the last row, so:

Dataset<Row> dataset = spark.read().format("csv").option("inferSchema", "true").option("header", "true").load("path");

will yield StringType as the type for column c1, and similarly for c6. I want
to recover the true type of each column by first discarding the "+". I use
Dataset<String> after filtering the rows (removing the "+") because I can then
re-read the new dataset using the .csv() method. Any better idea to do that?

On Sat, Jun 4, 2022 at 6:40 PM, Enrico Minack <i...@enrico.minack.dev> wrote:

> Can you provide an example string (row) and the expected inferred schema?
>
> Enrico
>
>
> On 04.06.22 at 18:36, marc nicole wrote:
>
> How to do just that? I thought we can only inferSchema when we first read
> the dataset, or am I wrong?
>
> On Sat, Jun 4, 2022 at 6:10 PM, Sean Owen <sro...@gmail.com> wrote:
>
>> It sounds like you want to interpret the input as strings, do some
>> processing, then infer the schema. That has nothing to do with construing
>> the entire row as a string like "Row[foo=bar, baz=1]".
>>
>> On Sat, Jun 4, 2022 at 10:32 AM marc nicole <mk1853...@gmail.com> wrote:
>>
>>> Hi Sean,
>>>
>>> Thanks. Actually I have a dataset where I want to inferSchema after
>>> discarding the specific String value of "+". I do this because the column
>>> would be considered StringType, while if I remove that "+" value it will
>>> be considered DoubleType, for example, or something else. Basically I
>>> want to remove "+" from all dataset rows and then infer the schema.
>>> My idea here is to filter out the rows where the value is "+" for the
>>> target columns (potentially all of them) and then use spark.read().csv()
>>> to read the new filtered dataset with the inferSchema option, which would
>>> then yield the correct column types.
>>> What do you think?
>>>
>>> On Sat, Jun 4, 2022 at 3:56 PM, Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> I don't think you want to do that. You get a string representation of
>>>> structured data without the structure, at best. This is part of the
>>>> reason it doesn't work directly this way.
>>>> You can use a UDF to call .toString on the Row of course, but, again,
>>>> what are you really trying to do?
>>>>
>>>> On Sat, Jun 4, 2022 at 7:35 AM marc nicole <mk1853...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> How to convert a Dataset<Row> to a Dataset<String>?
>>>>> What I have tried is:
>>>>>
>>>>> List<String> list = dataset.as(Encoders.STRING()).collectAsList();
>>>>> Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
>>>>> // But this raises an org.apache.spark.sql.AnalysisException: Try to
>>>>> // map struct... to Tuple1, but failed as the number of fields does
>>>>> // not line up
>>>>>
>>>>> The columns are of type String.
>>>>> How to solve this?
>>>>>
>>>>
>
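A minimal sketch of the idea discussed in this thread, with assumptions: a recent Spark version that has DataFrameReader.csv(Dataset<String>), a hypothetical input path, and values that contain no commas or quotes (the CSV line is rebuilt with a naive join). It reads everything as strings, maps each Row back to one CSV line with the "+" placeholder blanked out (this map with Encoders.STRING() is also the way to get a Dataset<String> from a Dataset<Row>), then re-reads the lines with inferSchema.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ReinferSchemaSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("reinfer-schema")
                .master("local[*]")
                .getOrCreate();

        // First read: no inferSchema, so every column is StringType.
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .csv("path/to/input.csv");   // hypothetical path

        // Turn each Row back into one CSV line, replacing the "+" placeholder
        // with an empty field (schema inference treats empty fields as null).
        // Naive join: assumes the values contain no commas or quotes.
        Dataset<String> csvLines = raw.map((MapFunction<Row, String>) row ->
                        IntStream.range(0, row.size())
                                .mapToObj(i -> {
                                    String v = row.isNullAt(i) ? "" : row.getString(i);
                                    return "+".equals(v) ? "" : v;
                                })
                                .collect(Collectors.joining(",")),
                Encoders.STRING());

        // Re-read the in-memory CSV lines with schema inference and restore
        // the original column names (the lines carry no header row).
        Dataset<Row> typed = spark.read()
                .option("inferSchema", "true")
                .csv(csvLines)
                .toDF(raw.columns());

        typed.printSchema();   // c1 should now come back as double
        spark.stop();
    }
}

Blanking the "+" per field keeps the other values in that row; a filter((FilterFunction<Row>) ...) that drops any row containing "+" would match the wording of the first message more literally, at the cost of discarding the rest of those rows.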