Yes, thanks Enrico, that was greatly helpful! I'll note that I was looking for a similar option in the docs but couldn't stumble on one. Thanks.
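[Editorial note for the archive: below is a minimal plain-Python sketch, with no Spark involved, of what Enrico's `.option("nullValue", "+")` suggestion effectively does: the sentinel "+" is mapped to null before types are inferred, so the remaining values in c1 can be read as doubles. The sample data mirrors the table quoted in the thread; the per-column inference is a rough approximation for illustration, not Spark's actual inferSchema logic.]

```python
import csv
import io

# Sample CSV mirroring the table from the thread; "+" marks "no value".
raw = """c1,c2,c3,c4,c5,c6
1.2,true,A,Z,120,+
1.3,false,B,X,130,F
+,true,C,Y,200,G
"""

NULL_VALUE = "+"  # analogous to Spark's .option("nullValue", "+")

def infer_type(values):
    """Rough per-column type inference over the non-null string values."""
    non_null = [v for v in values if v is not None]
    for name, cast in (("int", int), ("double", float)):
        try:
            for v in non_null:
                cast(v)
            return name
        except ValueError:
            pass
    if all(v.lower() in ("true", "false") for v in non_null):
        return "boolean"
    return "string"

# Map the sentinel to None (null) first, then infer each column's type.
rows = [
    {k: (None if v == NULL_VALUE else v) for k, v in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)
# {'c1': 'double', 'c2': 'boolean', 'c3': 'string', 'c4': 'string', 'c5': 'int', 'c6': 'string'}
```

Without the null mapping, c1 would contain the literal string "+" and fall back to string, which is exactly the StringType problem described below.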
On Sat, Jun 4, 2022 at 19:29, Enrico Minack <i...@enrico.minack.dev> wrote:

> You could use .option("nullValue", "+") to tell the parser that '+' refers
> to "no value":
>
> spark.read
>     .option("inferSchema", "true")
>     .option("header", "true")
>     .option("nullValue", "+")
>     .csv("path")
>
> Enrico
>
> On 04.06.22 at 18:54, marc nicole wrote:
>
> > c1    c2     c3  c4  c5   c6
> > 1.2   true   A   Z   120  +
> > 1.3   false  B   X   130  F
> > +     true   C   Y   200  G
> >
> > In the above table, c1 has double values except in the last row, so:
> >
> > Dataset<Row> dataset = spark.read().format("csv")
> >     .option("inferSchema", "true").option("header", "true").load("path");
> >
> > will yield StringType as the type of column c1, and similarly for c6.
> > I want to return the true type of each column by first discarding the "+".
> > I use Dataset<String> after filtering the rows (removing "+") because I
> > can re-read the new dataset using the .csv() method.
> > Any better idea to do that?
>
> On Sat, Jun 4, 2022 at 18:40, Enrico Minack <i...@enrico.minack.dev> wrote:
>
>> Can you provide an example string (row) and the expected inferred schema?
>>
>> Enrico
>>
>> On 04.06.22 at 18:36, marc nicole wrote:
>>
>> How to do just that? I thought we could only inferSchema when we first
>> read the dataset, or am I wrong?
>>
>> On Sat, Jun 4, 2022 at 18:10, Sean Owen <sro...@gmail.com> wrote:
>>
>>> It sounds like you want to interpret the input as strings, do some
>>> processing, then infer the schema. That has nothing to do with construing
>>> the entire row as a string like "Row[foo=bar, baz=1]".
>>>
>>> On Sat, Jun 4, 2022 at 10:32 AM, marc nicole <mk1853...@gmail.com> wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> Thanks, actually I have a dataset where I want to inferSchema after
>>>> discarding the specific String value of "+".
>>>> I do this because the column would be considered StringType, while if I
>>>> remove that "+" value it would be considered DoubleType, for example, or
>>>> something else. Basically I want to remove "+" from all dataset rows and
>>>> then infer the schema.
>>>> My idea here is to filter out the rows equal to "+" in the target
>>>> columns (potentially all of them) and then use spark.read().csv() to read
>>>> the new filtered dataset with the inferSchema option, which would then
>>>> yield the correct column types.
>>>> What do you think?
>>>>
>>>> On Sat, Jun 4, 2022 at 15:56, Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> I don't think you want to do that. You get a string representation of
>>>>> structured data without the structure, at best. This is part of the
>>>>> reason it doesn't work directly this way.
>>>>> You can use a UDF to call .toString on the Row, of course, but again,
>>>>> what are you really trying to do?
>>>>>
>>>>> On Sat, Jun 4, 2022 at 7:35 AM, marc nicole <mk1853...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> How to convert a Dataset<Row> to a Dataset<String>?
>>>>>> What I have tried is:
>>>>>>
>>>>>> List<String> list = dataset.as(Encoders.STRING()).collectAsList();
>>>>>> Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
>>>>>> // But this line raises an org.apache.spark.sql.AnalysisException:
>>>>>> // "Try to map struct... to Tuple1, but failed as the number of fields
>>>>>> // does not line up"
>>>>>>
>>>>>> The type of the columns is String.
>>>>>> How to solve this?
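[Editorial note for the archive: the row-filtering approach proposed in the thread has a side effect worth noting: dropping every row that contains the "+" sentinel also discards the valid values in the other columns of those rows, whereas the nullValue option keeps them. A small plain-Python sketch, with no Spark and hypothetical sample data, illustrating the data loss:]

```python
import csv
import io

# Hypothetical sample: each row has "+" in at most one column.
raw = """c1,c2,c6
1.2,true,+
1.3,false,F
+,true,G
"""

reader = csv.reader(io.StringIO(raw))
header = next(reader)
data = list(reader)

# Filter out whole rows that contain the sentinel anywhere, as proposed:
kept = [row for row in data if "+" not in row]
print(len(data), "->", len(kept))
# Only one of the three rows survives, even though every row carried
# valid values in its other columns.
```

This is why mapping the sentinel to null (Enrico's suggestion) is preferable when the goal is only to fix the inferred column types.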