Re: partitionBy creating a lot of small files

2022-06-04 Thread Enrico Minack
You refer to df.write.partitionBy, which creates a directory for each 
value of "col" and, in the worst case, writes one file per DataFrame 
partition into each of those directories. So the number of output files 
is controlled by the cardinality of "col" (which is your data, and hence 
out of your control) and by the number of partitions of your DataFrame 
(e.g., 200 DataFrame partitions and 1,000 distinct values of "col" can 
mean up to 200,000 files).


The only way to change the number of DataFrame partitions without 
repartitioning / shuffling all the data is to use coalesce (as you 
already mentioned in an earlier post).
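
For illustration, a minimal sketch (not from the thread; it assumes the 
same DataFrame ds and Parquet output as the snippets below, and the 10 
is arbitrary). coalesce merges existing partitions without a shuffle, 
so each "col1" directory then receives at most 10 files:

// cap the number of DataFrame partitions without shuffling
ds.coalesce(10)
  .write
  .partitionBy("col1")
  .parquet("data.parquet")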


Repartitioning the DataFrame by the same column that you pass to 
partitionBy will output a single file per col1 partition:


ds.repartition(100, $"col1")
  .write
  .partitionBy("col1")
  .parquet("data.parquet")


col1 values with much data will get a large file, and col1 values with 
little data will get a small file.


If evenly sized files are of great value to you, a repartition / shuffle, 
or even a range partition, might pay off:


ds.repartitionByRange(100, $"col1", $"col2")
  .write
  .partitionBy("col1")
  .parquet("data.parquet")


This will give you files of roughly equal size (given (col1, col2) is 
evenly distributed), with many files for large col1 partitions and few 
files for small col1 partitions.


You can even emulate some kind of bucketing with:

ds.withColumn("month", month($"timestamp"))
  .withColumn("year", year($"timestamp"))
  .repartitionByRange(100, $"year", $"month", $"id", $"time")
  .write
  .partitionBy("year", "month")
  .parquet("data.parquet")


Files will have similar sizes, while large months get more files than 
small months.


https://github.com/G-Research/spark-extension/blob/master/PARTITIONING.md

Enrico


On 04.06.22 at 18:44, Nikhil Goyal wrote:

Hi all,

Is there a way to use dataframe.partitionBy("col") and control the 
number of output files without doing a full repartition? The thing is, 
some partitions have more data while some have less. Doing a 
.repartition is a costly operation. We want to control the size of the 
output files. Is it even possible?


Thanks




Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread marc nicole
Yes, thanks Enrico, that was greatly helpful!
Note that I was looking for a similar option in the docs but couldn't
stumble upon one.
Thanks.

On Sat, Jun 4, 2022 at 19:29, Enrico Minack wrote:

> You could use .option("nullValue", "+") to tell the parser that '+' refers
> to "no value":
>
> spark.read
>  .option("inferSchema", "true")
>  .option("header", "true")
>  .option("nullvalue", "+")
>  .csv("path")
>
> Enrico
>
>
> On 04.06.22 at 18:54, marc nicole wrote:
>
> c1  | c2    | c3 | c4 | c5  | c6
> 1.2 | true  | A  | Z  | 120 | +
> 1.3 | false | B  | X  | 130 | F
> +   | true  | C  | Y  | 200 | G
> in the above table c1 has double values except on the last row, so:
>
> Dataset<Row> dataset =
> spark.read().format("csv").option("inferSchema","true").option("header","true").load("path");
>
> will yield StringType as the type for column c1, and similarly for c6.
> I want to return the true type of each column by first discarding the "+".
> I use Dataset<String> after filtering the rows (removing "+") because I
> can re-read the new dataset using the .csv() method.
> Any better idea to do that?
>
> On Sat, Jun 4, 2022 at 18:40, Enrico Minack wrote:
>
>> Can you provide an example string (row) and the expected inferred schema?
>>
>> Enrico
>>
>>
>> On 04.06.22 at 18:36, marc nicole wrote:
>>
>> How to do just that? I thought we can only inferSchema when we first
>> read the dataset, or am I wrong?
>>
>> On Sat, Jun 4, 2022 at 18:10, Sean Owen wrote:
>>
>>> It sounds like you want to interpret the input as strings, do some
>>> processing, then infer the schema. That has nothing to do with construing
>>> the entire row as a string like "Row[foo=bar, baz=1]"
>>>
>>> On Sat, Jun 4, 2022 at 10:32 AM marc nicole  wrote:
>>>
 Hi Sean,

Thanks, actually I have a dataset where I want to inferSchema after
discarding the specific String value of "+". I do this because the column
would be considered StringType, while if I remove that "+" value it will
be considered DoubleType, for example, or something else. Basically I
want to remove "+" from all dataset rows and then inferSchema.
Here my idea is to filter the rows to those not equal to "+" in the
target columns (potentially all of them) and then use spark.read().csv()
to read the new filtered dataset with the option inferSchema, which
would then yield the correct column types.
What do you think?

On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:

> I don't think you want to do that. You get a string representation of
> structured data without the structure, at best. This is part of the reason
> it doesn't work directly this way.
> You can use a UDF to call .toString on the Row of course, but, again
> what are you really trying to do?
>
> On Sat, Jun 4, 2022 at 7:35 AM marc nicole 
> wrote:
>
>> Hi,
>> How to convert a Dataset<Row> to a Dataset<String>?
>> What I have tried is:
>>
>> List<String> list = dataset.as(Encoders.STRING()).collectAsList();
>> Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
>> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
>> // map struct... to Tuple1, but failed as the number of fields does not
>> // line up
>>
>> All columns are of type String.
>> How to solve this?
>>
>
>>
>


Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread Enrico Minack
You could use .option("nullValue", "+") to tell the parser that '+' 
refers to "no value":


spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "+")
  .csv("path")

Enrico


On 04.06.22 at 18:54, marc nicole wrote:


c1  | c2    | c3 | c4 | c5  | c6
1.2 | true  | A  | Z  | 120 | +
1.3 | false | B  | X  | 130 | F
+   | true  | C  | Y  | 200 | G

in the above table c1 has double values except on the last row, so:

Dataset<Row> dataset =
spark.read().format("csv").option("inferSchema","true").option("header","true").load("path");

will yield StringType as the type for column c1, and similarly for c6.
I want to return the true type of each column by first discarding the "+".
I use Dataset<String> after filtering the rows (removing "+") because
I can re-read the new dataset using the .csv() method.

Any better idea to do that?

On Sat, Jun 4, 2022 at 18:40, Enrico Minack wrote:


Can you provide an example string (row) and the expected inferred
schema?

Enrico


On 04.06.22 at 18:36, marc nicole wrote:

How to do just that? I thought we can only inferSchema when we
first read the dataset, or am I wrong?

On Sat, Jun 4, 2022 at 18:10, Sean Owen wrote:

It sounds like you want to interpret the input as strings, do
some processing, then infer the schema. That has nothing to
do with construing the entire row as a string like
"Row[foo=bar, baz=1]"

On Sat, Jun 4, 2022 at 10:32 AM marc nicole
 wrote:

Hi Sean,

Thanks, actually I have a dataset where I want to inferSchema
after discarding the specific String value of "+". I do this
because the column would be considered StringType, while if I
remove that "+" value it will be considered DoubleType, for
example, or something else. Basically I want to remove "+"
from all dataset rows and then inferSchema.
Here my idea is to filter the rows to those not equal to "+"
in the target columns (potentially all of them) and then use
spark.read().csv() to read the new filtered dataset with the
option inferSchema, which would then yield the correct column
types.
What do you think?

On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:

I don't think you want to do that. You get a string
representation of structured data without the
structure, at best. This is part of the reason it
doesn't work directly this way.
You can use a UDF to call .toString on the Row of
course, but, again what are you really trying to do?

On Sat, Jun 4, 2022 at 7:35 AM marc nicole
 wrote:

Hi,
How to convert a Dataset<Row> to a Dataset<String>?
What I have tried is:

List<String> list = dataset.as(Encoders.STRING()).collectAsList();
Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
// But this line raises an org.apache.spark.sql.AnalysisException:
// Try to map struct... to Tuple1, but failed as the number of
// fields does not line up

All columns are of type String.
How to solve this?





Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread marc nicole
c1  | c2    | c3 | c4 | c5  | c6
1.2 | true  | A  | Z  | 120 | +
1.3 | false | B  | X  | 130 | F
+   | true  | C  | Y  | 200 | G
in the above table c1 has double values except on the last row, so:

Dataset<Row> dataset =
spark.read().format("csv").option("inferSchema","true").option("header","true").load("path");

will yield StringType as the type for column c1, and similarly for c6.
I want to return the true type of each column by first discarding the "+".
I use Dataset<String> after filtering the rows (removing "+") because I can
re-read the new dataset using the .csv() method.
Any better idea to do that?

On Sat, Jun 4, 2022 at 18:40, Enrico Minack wrote:

> Can you provide an example string (row) and the expected inferred schema?
>
> Enrico
>
>
> On 04.06.22 at 18:36, marc nicole wrote:
>
> How to do just that? I thought we can only inferSchema when we first read
> the dataset, or am I wrong?
>
> On Sat, Jun 4, 2022 at 18:10, Sean Owen wrote:
>
>> It sounds like you want to interpret the input as strings, do some
>> processing, then infer the schema. That has nothing to do with construing
>> the entire row as a string like "Row[foo=bar, baz=1]"
>>
>> On Sat, Jun 4, 2022 at 10:32 AM marc nicole  wrote:
>>
>>> Hi Sean,
>>>
>>> Thanks, actually I have a dataset where I want to inferSchema after
>>> discarding the specific String value of "+". I do this because the column
>>> would be considered StringType, while if I remove that "+" value it will
>>> be considered DoubleType, for example, or something else. Basically I
>>> want to remove "+" from all dataset rows and then inferSchema.
>>> Here my idea is to filter the rows to those not equal to "+" in the
>>> target columns (potentially all of them) and then use spark.read().csv()
>>> to read the new filtered dataset with the option inferSchema, which
>>> would then yield the correct column types.
>>> What do you think?
>>>
>>> On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:
>>>
 I don't think you want to do that. You get a string representation of
 structured data without the structure, at best. This is part of the reason
 it doesn't work directly this way.
 You can use a UDF to call .toString on the Row of course, but, again
 what are you really trying to do?

 On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:

> Hi,
> How to convert a Dataset<Row> to a Dataset<String>?
> What I have tried is:
>
> List<String> list = dataset.as(Encoders.STRING()).collectAsList();
> Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
> // map struct... to Tuple1, but failed as the number of fields does not
> // line up
>
> All columns are of type String.
> How to solve this?
>

>


partitionBy creating a lot of small files

2022-06-04 Thread Nikhil Goyal
Hi all,

Is there a way to use dataframe.partitionBy("col") and control the number
of output files without doing a full repartition? The thing is, some
partitions have more data while some have less. Doing a .repartition is a
costly operation. We want to control the size of the output files. Is it
even possible?

Thanks


Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread Enrico Minack

Can you provide an example string (row) and the expected inferred schema?

Enrico


On 04.06.22 at 18:36, marc nicole wrote:
How to do just that? I thought we can only inferSchema when we first
read the dataset, or am I wrong?


On Sat, Jun 4, 2022 at 18:10, Sean Owen wrote:

It sounds like you want to interpret the input as strings, do some
processing, then infer the schema. That has nothing to do with
construing the entire row as a string like "Row[foo=bar, baz=1]"

On Sat, Jun 4, 2022 at 10:32 AM marc nicole 
wrote:

Hi Sean,

Thanks, actually I have a dataset where I want to inferSchema
after discarding the specific String value of "+". I do this
because the column would be considered StringType, while if I
remove that "+" value it will be considered DoubleType, for
example, or something else. Basically I want to remove "+"
from all dataset rows and then inferSchema.
Here my idea is to filter the rows to those not equal to "+"
in the target columns (potentially all of them) and then use
spark.read().csv() to read the new filtered dataset with the
option inferSchema, which would then yield the correct column
types.
What do you think?

On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:

I don't think you want to do that. You get a string
representation of structured data without the structure,
at best. This is part of the reason it doesn't work
directly this way.
You can use a UDF to call .toString on the Row of course,
but, again what are you really trying to do?

On Sat, Jun 4, 2022 at 7:35 AM marc nicole
 wrote:

Hi,
How to convert a Dataset<Row> to a Dataset<String>?
What I have tried is:

List<String> list = dataset.as(Encoders.STRING()).collectAsList();
Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
// But this line raises an org.apache.spark.sql.AnalysisException:
// Try to map struct... to Tuple1, but failed as the number of
// fields does not line up

All columns are of type String.
How to solve this?



Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread marc nicole
How to do just that? I thought we can only inferSchema when we first read
the dataset, or am I wrong?

On Sat, Jun 4, 2022 at 18:10, Sean Owen wrote:

> It sounds like you want to interpret the input as strings, do some
> processing, then infer the schema. That has nothing to do with construing
> the entire row as a string like "Row[foo=bar, baz=1]"
>
> On Sat, Jun 4, 2022 at 10:32 AM marc nicole  wrote:
>
>> Hi Sean,
>>
>> Thanks, actually I have a dataset where I want to inferSchema after
>> discarding the specific String value of "+". I do this because the column
>> would be considered StringType, while if I remove that "+" value it will
>> be considered DoubleType, for example, or something else. Basically I
>> want to remove "+" from all dataset rows and then inferSchema.
>> Here my idea is to filter the rows to those not equal to "+" in the
>> target columns (potentially all of them) and then use spark.read().csv()
>> to read the new filtered dataset with the option inferSchema, which
>> would then yield the correct column types.
>> What do you think?
>>
>> On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:
>>
>>> I don't think you want to do that. You get a string representation of
>>> structured data without the structure, at best. This is part of the reason
>>> it doesn't work directly this way.
>>> You can use a UDF to call .toString on the Row of course, but, again
>>> what are you really trying to do?
>>>
>>> On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:
>>>
 Hi,
How to convert a Dataset<Row> to a Dataset<String>?
What I have tried is:

List<String> list = dataset.as(Encoders.STRING()).collectAsList();
Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
// But this line raises an org.apache.spark.sql.AnalysisException: Try to
// map struct... to Tuple1, but failed as the number of fields does not
// line up

All columns are of type String.
How to solve this?

>>>


Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread Sean Owen
It sounds like you want to interpret the input as strings, do some
processing, then infer the schema. That has nothing to do with construing
the entire row as a string like "Row[foo=bar, baz=1]"

On Sat, Jun 4, 2022 at 10:32 AM marc nicole  wrote:

> Hi Sean,
>
> Thanks, actually I have a dataset where I want to inferSchema after
> discarding the specific String value of "+". I do this because the column
> would be considered StringType, while if I remove that "+" value it will
> be considered DoubleType, for example, or something else. Basically I
> want to remove "+" from all dataset rows and then inferSchema.
> Here my idea is to filter the rows to those not equal to "+" in the
> target columns (potentially all of them) and then use spark.read().csv()
> to read the new filtered dataset with the option inferSchema, which
> would then yield the correct column types.
> What do you think?
>
> On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:
>
>> I don't think you want to do that. You get a string representation of
>> structured data without the structure, at best. This is part of the reason
>> it doesn't work directly this way.
>> You can use a UDF to call .toString on the Row of course, but, again
>> what are you really trying to do?
>>
>> On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:
>>
>>> Hi,
>>> How to convert a Dataset<Row> to a Dataset<String>?
>>> What I have tried is:
>>>
>>> List<String> list = dataset.as(Encoders.STRING()).collectAsList();
>>> Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
>>> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
>>> // map struct... to Tuple1, but failed as the number of fields does not
>>> // line up
>>>
>>> All columns are of type String.
>>> How to solve this?
>>>
>>


Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread marc nicole
Hi Sean,

Thanks, actually I have a dataset where I want to inferSchema after
discarding the specific String value of "+". I do this because the column
would be considered StringType, while if I remove that "+" value it will
be considered DoubleType, for example, or something else. Basically I want
to remove "+" from all dataset rows and then inferSchema.
Here my idea is to filter the rows to those not equal to "+" in the target
columns (potentially all of them) and then use spark.read().csv() to read
the new filtered dataset with the option inferSchema, which would then
yield the correct column types.
What do you think?
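
A minimal sketch of that round trip in Scala (a variant that nulls out
"+" in every column instead of filtering whole rows, so no rows are
lost; the paths and names are hypothetical):

import org.apache.spark.sql.functions.{col, lit, when}

// Read everything as plain strings first (no schema inference yet).
val raw = spark.read.option("header", "true").csv("path")

// Replace the "+" placeholder with null in every column.
val cleaned = raw.columns.foldLeft(raw) { (df, c) =>
  df.withColumn(c, when(col(c) === "+", lit(null)).otherwise(col(c)))
}

// Round-trip through CSV so the reader can re-infer the column types.
cleaned.write.option("header", "true").csv("cleaned")
val typed = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("cleaned")

(Enrico's .option("nullValue", "+") suggestion elsewhere in this thread
achieves the same thing in a single read, without the extra write.)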

On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:

> I don't think you want to do that. You get a string representation of
> structured data without the structure, at best. This is part of the reason
> it doesn't work directly this way.
> You can use a UDF to call .toString on the Row of course, but, again
> what are you really trying to do?
>
> On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:
>
>> Hi,
>> How to convert a Dataset<Row> to a Dataset<String>?
>> What I have tried is:
>>
>> List<String> list = dataset.as(Encoders.STRING()).collectAsList();
>> Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
>> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
>> // map struct... to Tuple1, but failed as the number of fields does not
>> // line up
>>
>> All columns are of type String.
>> How to solve this?
>>
>


Re: How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread Sean Owen
I don't think you want to do that. You get a string representation of
structured data without the structure, at best. This is part of the reason
it doesn't work directly this way.
You can use a UDF to call .toString on the Row of course, but, again
what are you really trying to do?
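
For what it's worth, a minimal sketch of that in Scala (assuming a
SparkSession named spark and a Dataset<Row> named ds; a typed map does
the same job as a UDF here):

import spark.implicits._  // provides the String encoder

// Render each Row via its toString, e.g. "[1.2,true,A,Z,120,null]".
val strings = ds.map(_.toString)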

On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:

> Hi,
> How to convert a Dataset<Row> to a Dataset<String>?
> What I have tried is:
>
> List<String> list = dataset.as(Encoders.STRING()).collectAsList();
> Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
> // map struct... to Tuple1, but failed as the number of fields does not
> // line up
>
> All columns are of type String.
> How to solve this?
>


How to convert a Dataset<Row> to a Dataset<String>?

2022-06-04 Thread marc nicole
Hi,
How to convert a Dataset<Row> to a Dataset<String>?
What I have tried is:

List<String> list = dataset.as(Encoders.STRING()).collectAsList();
Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
// But this line raises an org.apache.spark.sql.AnalysisException: Try to
// map struct... to Tuple1, but failed as the number of fields does not
// line up

All columns are of type String.
How to solve this?