Hello Ayan,
Thank you for the suggestion. However, I would lose the correlation of the JSON
file with the other identifier fields. Also, would it be an issue if there are
too many files? Plus, I may not have the same schema across all the files.
Hello,
Spark adds an entry to the pending microbatches queue at each batch interval.
Is there a config to set the maximum size of the pending microbatches queue?
Thanks
Another option is:
1. collect the dataframe with the file paths
2. create a list of paths
3. create a new dataframe with spark.read.json, passing the list of paths
This will save you a lot of headaches.
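Roughly like this (a minimal sketch in Scala; `df` and its `path` column are
illustrative placeholders for your dataframe holding the file paths):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// `df` is assumed to hold one file path per row in a column named "path".
val paths: Seq[String] = df.select("path").collect().map(_.getString(0)).toSeq

// Spark reads all files in one pass and infers a merged schema across them.
val jsonDf = spark.read.json(paths: _*)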
Ayan
On Wed, Jul 13, 2022 at 7:35 AM Enrico Minack wrote:
Hi,
how does RDD's mapPartitions make a difference regarding 1. and 2.,
compared to Dataset's mapPartitions / map function?
Enrico
On 12.07.22 at 22:13, Muthu Jayakumar wrote:
Hello Enrico,
Thanks for the reply. I found that I would have to use the `mapPartitions` API
of RDD to perform this safely, as I have to:
1. Read each file from GCS using the HDFS FileSystem API.
2. Parse each JSON record in a safe manner.
For (1) to work, I do have to broadcast the HadoopConfiguration from the driver.
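For what it's worth, a minimal sketch of that pattern (assuming Scala, an
existing SparkSession `spark`, an RDD[String] of file paths called `pathsRdd`,
and Jackson for the defensive parse; all names are illustrative):

import scala.collection.JavaConverters._
import scala.util.Try
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import com.fasterxml.jackson.databind.ObjectMapper

// Hadoop's Configuration is not serializable, so copy it into a plain Map
// on the driver, broadcast that, and rebuild the Configuration per partition.
val confMap = spark.sparkContext.hadoopConfiguration.asScala
  .map(e => e.getKey -> e.getValue).toMap
val bcConf = spark.sparkContext.broadcast(confMap)

val parsed = pathsRdd.mapPartitions { paths =>
  val conf = new Configuration(false)
  bcConf.value.foreach { case (k, v) => conf.set(k, v) }
  val mapper = new ObjectMapper()
  paths.flatMap { p =>
    val path = new Path(p)
    val fs = path.getFileSystem(conf) // resolves gs://, hdfs://, etc.
    // 1. read the file via the FileSystem API; 2. parse it defensively.
    // A failure at either step drops the file instead of failing the job.
    Try {
      val in = fs.open(path)
      try scala.io.Source.fromInputStream(in, "UTF-8").mkString
      finally in.close()
    }.flatMap(s => Try(mapper.readTree(s))).toOption.map(node => (p, node))
  }
}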
I have some problems, and I am trying to find out whether there is no solution
for them (due to the current implementation) or whether there is a way that I
was not aware of.
1)
Currently, we can enable and configure dynamic resource allocation based on the
documentation below.
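For context, enabling it typically looks something like this (a sketch with
illustrative values only; the full set of knobs is listed in the Spark
configuration documentation under "Dynamic Allocation"):

import org.apache.spark.sql.SparkSession

// Illustrative values; tune min/max executors and timeouts for your workload.
val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()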