Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

Someshwar Kale Sat, 08 Jun 2024 21:08:25 -0700

Hi Chhavi,

Currently there is no way to handle backtick (`) quoting in Spark's
StructType. Hence the field names a.b and `a.b` are treated as completely
different names within StructType.
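
To make the point above concrete, here is a minimal sketch of how the two spellings compare inside a StructType. It assumes spark-sql is on the classpath; the object name DottedFieldLookup and the field name location.longitude are only illustrative (the field name is borrowed from the example further down the thread).

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical demo object; only the StructType calls are Spark API.
object DottedFieldLookup {
  def main(args: Array[String]): Unit = {
    // A flattened schema whose single field name literally contains a dot.
    val schema = StructType(Seq(StructField("location.longitude", StringType)))

    // StructType compares field names as plain strings, so the literal name is found.
    println(schema.fieldNames.contains("location.longitude"))

    // The backtick-quoted spelling is simply a different string, so it is not found.
    println(schema.fieldNames.contains("`location.longitude`"))
  }
}
```

The first check prints true and the second prints false, which is why a backtick-quoted input column does not match the flattened field during schema validation.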


To handle that, I have added a custom implementation that fixes
StringIndexer#validateAndTransformSchema. You can refer to the code on my
GitHub:
<https://github.com/skale1990/LearnSpark/blob/main/src/main/java/com/som/learnspark/TestCustomStringIndexer.scala>

*Regards,*
*Someshwar Kale *





On Sat, Jun 8, 2024 at 12:00 PM Chhavi Bansal <meetchhavi1...@gmail.com>
wrote:

> Hi Someshwar,
> Thanks for the response, I have added my comments to the ticket
> <https://issues.apache.org/jira/browse/SPARK-48463>.
>
>
> Thanks,
> Chhavi Bansal
>
> On Thu, 6 Jun 2024 at 17:28, Someshwar Kale <skale1...@gmail.com> wrote:
>
>> As a workaround, you may consider adding a transformer that renames
>> columns (for example, replacing every dot in a column name with an
>> underscore) and then using the renamed columns in your pipeline, as below:
>>
>> // RenameColumn is a custom transformer, not a built-in Spark ML stage.
>> val renameColumn = new RenameColumn()
>>   .setInputCol("location.longitude")
>>   .setOutputCol("location_longitude")
>> val si = new StringIndexer()
>>   .setInputCol("location_longitude")
>>   .setOutputCol("longitutdee")
>> val pipeline = new Pipeline().setStages(Array(renameColumn, si))
>> pipeline.fit(flattenedDf).transform(flattenedDf).show()
>>
>>
>> Refer to my comment
>> <https://issues.apache.org/jira/browse/SPARK-48463?focusedCommentId=17852751&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17852751>
>> for elaboration.
>> Thanks!!
>>
>> *Regards,*
>> *Someshwar Kale*
>>
>>
>>
>>
>>
>> On Thu, Jun 6, 2024 at 3:24 AM Chhavi Bansal <meetchhavi1...@gmail.com>
>> wrote:
>>
>>> Hello team,
>>> I was exploring the feature transformers exposed via MLlib on a nested
>>> dataset, and encountered an error while applying any transformer to a
>>> column whose name uses dot notation. I have raised a ticket on Spark,
>>> https://issues.apache.org/jira/browse/SPARK-48463, where I have
>>> described the entire scenario.
>>>
>>> I wanted to get suggestions on what would be the best way to solve the
>>> problem while using the dot notation. One workaround is to use `_` while
>>> flattening the dataframe, but that would add the overhead of converting
>>> back to `.` (dot notation), since that’s the convention for our
>>> other flattened data.
>>>
>>> I would be happy to make a contribution to the code if someone can shed
>>> some light on how this could be solved.
>>>
>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Chhavi Bansal
>>>
>>
>
> --
> Thanks and Regards,
> Chhavi Bansal
>
