Unable to load MongoDB atlas data via PySpark because of BsonString error

2024-06-08 Thread Perez
Hi Team,

this is the problem
https://stackoverflow.com/questions/78593858/unable-to-load-mongodb-atlas-data-via-pyspark-jdbc-in-glue

I can't go ahead with *StructType* approach since my input record is huge
and if the underlying attributes are added or removed my code might fail.

I can't change the source data either.

The only thing I can think of is loading via Python client with multiple
threads but do let me know if there is another solution for this.

TIA


Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-08 Thread Someshwar Kale
Hi Chhavi,

Currently there is no way to handle backtick(`) spark StructType. Hence the
field name a.b and `a.b` are completely different within StructType.

To handle that, I have added a custom implementation fixing StringIndexer#
validateAndTransformSchema. You can refer to the code on my github

.

*Regards,*
*Someshwar Kale *





On Sat, Jun 8, 2024 at 12:00 PM Chhavi Bansal 
wrote:

> Hi Someshwar,
> Thanks for the response, I have added my comments to the ticket
> .
>
>
> Thanks,
> Chhavi Bansal
>
> On Thu, 6 Jun 2024 at 17:28, Someshwar Kale  wrote:
>
>> As a fix, you may consider adding a transformer to rename columns
>> (perhaps replace all columns with dot to underscore) and use the renamed
>> columns in your pipeline as below-
>>
>> val renameColumn = new 
>> RenameColumn().setInputCol("location.longitude").setOutputCol("location_longitude")
>> val si = new 
>> StringIndexer().setInputCol("location_longitude").setOutputCol("longitutdee")
>> val pipeline = new Pipeline().setStages(Array(renameColumn, si))
>> pipeline.fit(flattenedDf).transform(flattenedDf).show()
>>
>>
>> refer my comment
>> 
>>  for
>> elaboration.
>> Thanks!!
>>
>> *Regards,*
>> *Someshwar Kale*
>>
>>
>>
>>
>>
>> On Thu, Jun 6, 2024 at 3:24 AM Chhavi Bansal 
>> wrote:
>>
>>> Hello team
>>> I was exploring feature transformation exposed via Mllib on nested
>>> dataset, and encountered an error while applying any transformer to a
>>> column with dot notation naming. I thought of raising a ticket on spark
>>> https://issues.apache.org/jira/browse/SPARK-48463, where I have
>>> mentioned the entire scenario.
>>>
>>> I wanted to get suggestions on what would be the best way to solve the
>>> problem while using the dot notation. One workaround is to use`_` while
>>> flattening the dataframe, but that would mean having an additional overhead
>>> to convert back to `.` (dot notation ) since that’s the convention for our
>>> other flattened data.
>>>
>>> I would be happy to make a contribution to the code if someone can shed
>>> some light on how this could be solved.
>>>
>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Chhavi Bansal
>>>
>>
>
> --
> Thanks and Regards,
> Chhavi Bansal
>


Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-08 Thread Chhavi Bansal
Hi Someshwar,
Thanks for the response, I have added my comments to the ticket
.


Thanks,
Chhavi Bansal

On Thu, 6 Jun 2024 at 17:28, Someshwar Kale  wrote:

> As a fix, you may consider adding a transformer to rename columns (perhaps
> replace all columns with dot to underscore) and use the renamed columns in
> your pipeline as below-
>
> val renameColumn = new 
> RenameColumn().setInputCol("location.longitude").setOutputCol("location_longitude")
> val si = new 
> StringIndexer().setInputCol("location_longitude").setOutputCol("longitutdee")
> val pipeline = new Pipeline().setStages(Array(renameColumn, si))
> pipeline.fit(flattenedDf).transform(flattenedDf).show()
>
>
> refer my comment
> 
>  for
> elaboration.
> Thanks!!
>
> *Regards,*
> *Someshwar Kale*
>
>
>
>
>
> On Thu, Jun 6, 2024 at 3:24 AM Chhavi Bansal 
> wrote:
>
>> Hello team
>> I was exploring feature transformation exposed via Mllib on nested
>> dataset, and encountered an error while applying any transformer to a
>> column with dot notation naming. I thought of raising a ticket on spark
>> https://issues.apache.org/jira/browse/SPARK-48463, where I have
>> mentioned the entire scenario.
>>
>> I wanted to get suggestions on what would be the best way to solve the
>> problem while using the dot notation. One workaround is to use`_` while
>> flattening the dataframe, but that would mean having an additional overhead
>> to convert back to `.` (dot notation ) since that’s the convention for our
>> other flattened data.
>>
>> I would be happy to make a contribution to the code if someone can shed
>> some light on how this could be solved.
>>
>>
>>
>> --
>> Thanks and Regards,
>> Chhavi Bansal
>>
>

-- 
Thanks and Regards,
Chhavi Bansal


Re: OOM issue in Spark Driver

2024-06-08 Thread Andrzej Zera
Hey, do you perform stateful operations? Maybe your state is growing
indefinitely - a screenshot with state metrics would help (you can find it
in Spark UI -> Structured Streaming -> your query). Do you have a
driver-only cluster or do you have workers too? What's the memory usage
profile at workers?

Regards,
Andrzej


sob., 8 cze 2024 o 10:39 Karthick Nk  napisał(a):

> Hi All,
>
> I am using the pyspark structure streaming with Azure Databricks for data
> load process.
>
> In the Pipeline I am using a Job cluster and I am running only one
> pipeline, I am getting the OUT OF MEMORY issue while running for a
> long time. When I inspect the metrics of the cluster I found that, the
> memory usage getting increased by time by time even when there is no
> huge volume of data.
>
> [image: image.png]
>
>
> [image: image.png]
>
> After 4 hours of running the pipeline continuously, I am getting out of
> memory issue where used memory in the driver getting increased from 47 GB
> to 111 GB which is almost triple, I am unable to understand why this many
> memory occcupied in the driver. Am I missing anything here to notice? Could
> you guide me to figure out the root cause?
>
> Note:
> 1. I confirmed persist and unpersist that I used in code taken care
> properly for every batch execution.
> 2. Data is not increasing when time passes, (stream data getting almost
> same amount of data for every batch)
>
>
> Thanks,
>
>
>
>