c1    c2     c3   c4   c5    c6
1.2   true   A    Z    120   +
1.3   false  B    X    130   F
+     true   C    Y    200   G
In the table above, c1 holds double values everywhere except in the last row,
where the cell is "+". So:

Dataset<Row> dataset = spark.read().format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load("path");

will yield StringType as the type of column c1 (and similarly for c6).
I want to recover the true type of each column by first discarding the "+"
values. I use a Dataset<String> after filtering the rows (removing "+"),
because I can then re-read the cleaned dataset using the .csv() method.
Any better idea for doing that?
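One way to sketch the idea above, without writing the cleaned data back to disk: read the file as plain text lines, blank out the standalone "+" cells so they parse as null, and hand the cleaned lines straight back to the CSV reader via DataFrameReader.csv(Dataset<String>), which runs schema inference on in-memory lines (available since Spark 2.2). The file path is hypothetical, and the regex assumes simple CSV with no quoted fields containing commas:

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class InferAfterCleaning {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("infer-after-cleaning").getOrCreate();

        // Read the raw file as text lines, keeping the header line intact.
        Dataset<String> lines = spark.read().textFile("path/to/file.csv");

        // Blank out cells that consist only of "+" so the CSV parser reads
        // them as null instead of forcing the column to StringType.
        Dataset<String> cleaned = lines.map(
                (MapFunction<String, String>) line ->
                        line.replaceAll("(^|,)\\+(?=,|$)", "$1"),
                Encoders.STRING());

        // csv(Dataset<String>) re-parses the in-memory lines with inference.
        Dataset<Row> parsed = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(cleaned);

        parsed.printSchema();  // c1 should now infer as double, not string
        spark.stop();
    }
}
```

The same approach works if only some columns should be cleaned; the map step can then split each line and rewrite only the targeted fields.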

On Sat, Jun 4, 2022 at 6:40 PM Enrico Minack <i...@enrico.minack.dev> wrote:

> Can you provide an example string (row) and the expected inferred schema?
>
> Enrico
>
>
> On 04.06.22 at 18:36, marc nicole wrote:
>
> How can I do just that? I thought we can only infer the schema when we
> first read the dataset, or am I wrong?
>
> On Sat, Jun 4, 2022 at 6:10 PM Sean Owen <sro...@gmail.com> wrote:
>
>> It sounds like you want to interpret the input as strings, do some
>> processing, then infer the schema. That has nothing to do with construing
>> the entire row as a string like "Row[foo=bar, baz=1]"
>>
>> On Sat, Jun 4, 2022 at 10:32 AM marc nicole <mk1853...@gmail.com> wrote:
>>
>>> Hi Sean,
>>>
>>> Thanks. Actually I have a dataset where I want to infer the schema after
>>> discarding the specific String value "+". I do this because the column
>>> would be considered StringType, while if I remove that "+" value it would
>>> be considered DoubleType, for example, or something else. Basically I want
>>> to remove "+" from all dataset rows and then infer the schema.
>>> My idea here is to filter out the rows equal to "+" for the target columns
>>> (potentially all of them) and then use spark.read().csv() to read the new
>>> filtered dataset with the inferSchema option, which would then yield the
>>> correct column types.
>>> What do you think?
>>>
>>> On Sat, Jun 4, 2022 at 3:56 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> I don't think you want to do that. You get a string representation of
>>>> structured data without the structure, at best. This is part of the reason
>>>> it doesn't work directly this way.
>>>> You can use a UDF to call .toString on the Row of course, but, again
>>>> what are you really trying to do?
>>>>
>>>> On Sat, Jun 4, 2022 at 7:35 AM marc nicole <mk1853...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> How can I convert a Dataset<Row> to a Dataset<String>?
>>>>> What I have tried is:
>>>>>
>>>>> List<String> list = dataset.as(Encoders.STRING()).collectAsList();
>>>>> Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
>>>>> // But the first line raises an org.apache.spark.sql.AnalysisException:
>>>>> // "Try to map struct... to Tuple1, but failed as the number of fields
>>>>> // does not line up"
>>>>>
>>>>> The columns are all of type String.
>>>>> How can I solve this?
>>>>>
>>>>
>
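On the original Dataset<Row>-to-Dataset<String> question at the bottom of the thread: as(Encoders.STRING()) only works when the dataset has exactly one column (hence the Tuple1 error), but mapping each Row to a string, or calling toJSON(), works for any number of columns and avoids collecting to the driver. A minimal local sketch; the tiny two-column dataset is made up for illustration:

```java
import java.util.Arrays;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RowToStringDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("row-to-string").getOrCreate();

        StructType schema = new StructType()
                .add("c1", DataTypes.DoubleType)
                .add("c2", DataTypes.BooleanType);
        Dataset<Row> dataset = spark.createDataFrame(
                Arrays.asList(RowFactory.create(1.2, true),
                              RowFactory.create(1.3, false)),
                schema);

        // Option 1: join each Row's values into one string per row.
        Dataset<String> lines = dataset.map(
                (MapFunction<Row, String>) row -> row.mkString(","),
                Encoders.STRING());
        lines.show(false);

        // Option 2: one JSON document per row; preserves column names.
        Dataset<String> json = dataset.toJSON();
        json.show(false);

        spark.stop();
    }
}
```

Either result is a distributed Dataset<String>, so there is no need for the collectAsList()/createDataset round trip shown in the quoted message.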
