Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-08 Thread Someshwar Kale
Hi Chhavi,

Currently, Spark's StructType has no way to handle backtick (`) quoting, so
the field names a.b and `a.b` are completely different within a StructType.
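
To make the mismatch concrete, here is a minimal sketch in spark-shell style. The column name location.longitude is borrowed from the pipeline example elsewhere in this thread, and the val names are only illustrative. It checks StructType's literal name matching and nothing else; how a particular MLlib transformer resolves its input column is not shown.

```scala
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// A flattened schema whose single top-level column is literally named "location.longitude".
val schema = StructType(Seq(StructField("location.longitude", DoubleType)))

// StructType compares field names as plain strings, so the literal name is found.
val literalLookup = schema.fieldNames.contains("location.longitude")

// The backtick-quoted form is a different string, so it is not found.
val quotedLookup = schema.fieldNames.contains("`location.longitude`")

println(s"literal: $literalLookup, quoted: $quotedLookup")
```

Column resolution in the DataFrame API, by contrast, generally reads an unquoted dot as nested-field access, which is presumably why neither spelling satisfies both a transformer's schema check and its column lookup.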

To handle that, I have added a custom implementation that fixes
StringIndexer#validateAndTransformSchema. You can refer to the code on my
GitHub.

*Regards,*
*Someshwar Kale *







Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-08 Thread Chhavi Bansal
Hi Someshwar,
Thanks for the response. I have added my comments to the ticket.


Thanks,
Chhavi Bansal


-- 
Thanks and Regards,
Chhavi Bansal


Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-06 Thread Someshwar Kale
As a workaround, you could add a transformer that renames columns (for
example, replacing every dot in a column name with an underscore) and use the
renamed columns in your pipeline, as below:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

// RenameColumn is a custom transformer, not part of Spark MLlib.
val renameColumn = new RenameColumn().setInputCol("location.longitude").setOutputCol("location_longitude")
val si = new StringIndexer().setInputCol("location_longitude").setOutputCol("longitude_indexed")
val pipeline = new Pipeline().setStages(Array(renameColumn, si))
pipeline.fit(flattenedDf).transform(flattenedDf).show()
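
Since RenameColumn is not a class that ships with Spark, here is a minimal sketch of what such a transformer could look like. This is an assumption about its shape, not the implementation referred to in this thread: the parameter names simply mirror the setters used in the pipeline above, and persistence (MLWritable) is left out.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical transformer that renames one top-level column.
class RenameColumn(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("renameColumn"))

  final val inputCol = new Param[String](this, "inputCol", "name of the existing column")
  final val outputCol = new Param[String](this, "outputCol", "name to give the column")

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // withColumnRenamed matches the existing name against top-level column names,
  // so a dotted name such as "location.longitude" needs no backticks here.
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumnRenamed($(inputCol), $(outputCol))

  // Mirror the rename in the schema so later stages validate against the new name.
  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields.map { f =>
      if (f.name == $(inputCol)) f.copy(name = $(outputCol)) else f
    })

  override def copy(extra: ParamMap): RenameColumn = defaultCopy(extra)
}
```

Keeping the rename as a pipeline stage means the fitted PipelineModel applies it again at transform time, so callers can keep passing data with the dotted column names.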


Refer to my comment for elaboration.
Thanks!

*Regards,*
*Someshwar Kale*







[SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-05 Thread Chhavi Bansal
Hello team,
I was exploring the feature transformations exposed by MLlib on a nested
dataset and encountered an error when applying any transformer to a column
whose name uses dot notation. I have raised a ticket with Spark,
https://issues.apache.org/jira/browse/SPARK-48463, where I have described
the entire scenario.

I would like suggestions on the best way to solve the problem while keeping
the dot notation. One workaround is to use `_` while flattening the
dataframe, but that would add the overhead of converting back to `.` (dot
notation), since that's the convention for our other flattened data.
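
For reference, the underscore workaround and the conversion back could be sketched roughly as follows. The helper names (underscoreMapping, dotsToUnderscores, restoreDots) are made up for illustration, and the sketch assumes no existing column already uses the underscore form of a dotted name.

```scala
import org.apache.spark.sql.DataFrame

// Map each dot-free replacement name to the original dotted name.
def underscoreMapping(columns: Seq[String]): Map[String, String] =
  columns.filter(_.contains(".")).map(c => c.replace(".", "_") -> c).toMap

// Rename dotted top-level columns before the pipeline runs, returning the
// mapping needed to undo the rename afterwards.
def dotsToUnderscores(df: DataFrame): (DataFrame, Map[String, String]) = {
  val mapping = underscoreMapping(df.columns.toSeq)
  val renamed = mapping.foldLeft(df) { case (acc, (safe, original)) =>
    acc.withColumnRenamed(original, safe)
  }
  (renamed, mapping)
}

// Restore the dotted names on the transformed output.
def restoreDots(df: DataFrame, mapping: Map[String, String]): DataFrame =
  mapping.foldLeft(df) { case (acc, (safe, original)) =>
    acc.withColumnRenamed(safe, original)
  }
```

Each rename only adds a projection to the query plan, so the cost is mostly the bookkeeping of carrying the mapping around rather than any rewrite of the data.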

I would be happy to make a contribution to the code if someone can shed
some light on how this could be solved.



-- 
Thanks and Regards,
Chhavi Bansal