Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-08 Thread Someshwar Kale
Hi Chhavi,

Currently there is no way to handle backticks (`) in a Spark StructType. Hence the
field names a.b and `a.b` are treated as completely different within StructType.

To handle that, I have added a custom implementation that fixes
StringIndexer#validateAndTransformSchema. You can refer to the code on my GitHub
<https://github.com/skale1990/LearnSpark/blob/main/src/main/java/com/som/learnspark/TestCustomStringIndexer.scala>.
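
For anyone hitting this for the first time, here is a minimal sketch of the mismatch
(the column name is made up for the example; the exact StringIndexer error message
depends on the Spark version):

import org.apache.spark.sql.functions.col

// A dataframe whose column name literally contains a dot.
val df = spark.range(3).selectExpr("cast(id as string) as `location.longitude`")

df.schema.fieldNames                           // Array(location.longitude): no backticks inside StructType
df.select(col("`location.longitude`")).show()  // column resolution needs the backticks
// df.select(col("location.longitude"))        // fails: interpreted as nested field access

// StringIndexer#validateAndTransformSchema looks the input column up in the
// StructType, where the backticked spelling does not exist, while the plain
// spelling later fails column resolution, hence the error in SPARK-48463.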

*Regards,*
*Someshwar Kale *





On Sat, Jun 8, 2024 at 12:00 PM Chhavi Bansal 
wrote:

> Hi Someshwar,
> Thanks for the response, I have added my comments to the ticket
> <https://issues.apache.org/jira/browse/SPARK-48463>.
>
>
> Thanks,
> Chhavi Bansal
>
> On Thu, 6 Jun 2024 at 17:28, Someshwar Kale  wrote:
>
>> As a fix, you may consider adding a transformer to rename columns
>> (perhaps replacing the dots in all column names with underscores) and use the
>> renamed columns in your pipeline, as below:
>>
>> val renameColumn = new RenameColumn().setInputCol("location.longitude").setOutputCol("location_longitude")
>> val si = new StringIndexer().setInputCol("location_longitude").setOutputCol("longitutdee")
>> val pipeline = new Pipeline().setStages(Array(renameColumn, si))
>> pipeline.fit(flattenedDf).transform(flattenedDf).show()
>>
>>
>> Refer to my comment
>> <https://issues.apache.org/jira/browse/SPARK-48463?focusedCommentId=17852751=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17852751>
>> for elaboration.
>> Thanks!!
>>
>> *Regards,*
>> *Someshwar Kale*
>>
>>
>>
>>
>>
>> On Thu, Jun 6, 2024 at 3:24 AM Chhavi Bansal 
>> wrote:
>>
>>> Hello team,
>>> I was exploring the feature transformers exposed via MLlib on a nested
>>> dataset, and encountered an error while applying any transformer to a
>>> column whose name uses dot notation. I have raised a ticket on Spark,
>>> https://issues.apache.org/jira/browse/SPARK-48463, where I have
>>> described the entire scenario.
>>>
>>> I wanted to get suggestions on the best way to solve the problem while
>>> keeping the dot notation. One workaround is to use `_` while flattening
>>> the dataframe, but that would add the overhead of converting back to `.`
>>> (dot notation), since that is the convention for our other flattened data.
>>>
>>> I would be happy to make a contribution to the code if someone can shed
>>> some light on how this could be solved.
>>>
>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Chhavi Bansal
>>>
>>
>
> --
> Thanks and Regards,
> Chhavi Bansal
>


Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-06 Thread Someshwar Kale
As a fix, you may consider adding a transformer to rename columns (perhaps
replacing the dots in all column names with underscores) and use the renamed
columns in your pipeline, as below:

val renameColumn = new RenameColumn().setInputCol("location.longitude").setOutputCol("location_longitude")
val si = new StringIndexer().setInputCol("location_longitude").setOutputCol("longitutdee")
val pipeline = new Pipeline().setStages(Array(renameColumn, si))
pipeline.fit(flattenedDf).transform(flattenedDf).show()
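
For readers who cannot pull in the linked code, a minimal sketch of what such a
rename transformer could look like (an approximation, not necessarily the exact
RenameColumn from the repository above):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class RenameColumn(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("renameColumn"))

  final val inputCol  = new Param[String](this, "inputCol", "column to rename")
  final val outputCol = new Param[String](this, "outputCol", "new column name")

  def setInputCol(value: String): this.type  = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // withColumnRenamed matches the literal field name, so a dotted name works here.
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.toDF().withColumnRenamed($(inputCol), $(outputCol))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.map(f => if (f.name == $(inputCol)) f.copy(name = $(outputCol)) else f))

  override def copy(extra: ParamMap): RenameColumn = defaultCopy(extra)
}

Note that pipeline persistence (Pipeline.write) would additionally need
DefaultParamsWritable; the sketch only covers fit/transform.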


Refer to my comment
<https://issues.apache.org/jira/browse/SPARK-48463?focusedCommentId=17852751=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17852751>
for elaboration.
Thanks!!

*Regards,*
*Someshwar Kale*





On Thu, Jun 6, 2024 at 3:24 AM Chhavi Bansal 
wrote:

> Hello team,
> I was exploring the feature transformers exposed via MLlib on a nested
> dataset, and encountered an error while applying any transformer to a
> column whose name uses dot notation. I have raised a ticket on Spark,
> https://issues.apache.org/jira/browse/SPARK-48463, where I have described
> the entire scenario.
>
> I wanted to get suggestions on the best way to solve the problem while
> keeping the dot notation. One workaround is to use `_` while flattening
> the dataframe, but that would add the overhead of converting back to `.`
> (dot notation), since that is the convention for our other flattened data.
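
For what it's worth, a small sketch of that underscore-based flattening (the helper
name is illustrative, and it assumes the original nested field names contain no
literal dots):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Flatten nested struct fields into top-level columns, joining the path with "_"
// instead of ".", e.g. location.longitude -> location_longitude.
def flattenWithUnderscore(df: DataFrame): DataFrame = {
  def paths(schema: StructType, prefix: Seq[String]): Seq[Seq[String]] =
    schema.fields.toSeq.flatMap { f =>
      f.dataType match {
        case s: StructType => paths(s, prefix :+ f.name)
        case _             => Seq(prefix :+ f.name)
      }
    }
  val cols = paths(df.schema, Nil).map(p => col(p.mkString(".")).as(p.mkString("_")))
  df.select(cols: _*)
}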
>
> I would be happy to make a contribution to the code if someone can shed
> some light on how this could be solved.
>
>
>
> --
> Thanks and Regards,
> Chhavi Bansal
>


Re: [Spark SQL]: Does Spark SQL support WAITFOR?

2022-05-19 Thread Someshwar Kale
Hi Ram,

Have you seen this Stack Overflow question and its responses:
https://stackoverflow.com/questions/39685744/apache-spark-how-to-cancel-job-in-code-and-kill-running-tasks
If not, please have a look; it seems to address a similar problem.
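
In case the link goes stale, the gist of that approach is job-group-based
cancellation; a rough sketch (the group name and timeout are made up):

import java.util.concurrent.{Executors, TimeUnit}

val sc = spark.sparkContext
val scheduler = Executors.newSingleThreadScheduledExecutor()

// Tag the query with a job group, then cancel the group if it is still running
// after the timeout; the blocked action fails with a SparkException when cancelled.
sc.setJobGroup("timed-query", "query with a timeout", interruptOnCancel = true)
scheduler.schedule(new Runnable {
  override def run(): Unit = sc.cancelJobGroup("timed-query")
}, 30, TimeUnit.SECONDS)

spark.sql("select count(*) from some_large_table").show()
scheduler.shutdownNow()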

*Regards,*
*Someshwar Kale*


On Fri, May 20, 2022 at 7:34 AM Artemis User  wrote:

> WAITFOR is part of Transact-SQL and is specific to Microsoft SQL Server; it is
> not supported by Spark SQL.  If you want to impose a delay in a
> Spark program, you may want to use the thread sleep function in Java or
> Scala.  Hope this helps...
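
Building on that suggestion, one way to get a sleep-like statement for testing
(not a built-in; the UDF name is made up) is to register a UDF that wraps
Thread.sleep and call it from SQL:

// Register a UDF that sleeps for the given number of milliseconds and returns it.
spark.udf.register("sleep", (millis: Long) => { Thread.sleep(millis); millis })

// The sleep runs on an executor, so the query stays open for roughly 30 seconds.
spark.sql("SELECT sleep(30000)").show()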
>
> On 5/19/22 1:45 PM, K. N. Ramachandran wrote:
>
> Hi Sean,
>
> I'm trying to test a timeout feature in a tool that uses Spark SQL.
> Basically, if a long-running query exceeds a configured threshold, then the
> query should be canceled.
> I couldn't see a simple way to make a "sleep" SQL statement to test the
> timeout. Instead, I just ran a "select count(*) from table" on a large
> table to act as a query with a long duration.
>
> Is there any way to trigger a "sleep" like behavior in Spark SQL?
>
> Regards,
> Ram
>
> On Tue, May 17, 2022 at 4:23 PM Sean Owen  wrote:
>
>> I don't think that is standard SQL? what are you trying to do, and why
>> not do it outside SQL?
>>
>> On Tue, May 17, 2022 at 6:03 PM K. N. Ramachandran 
>> wrote:
>>
>>> Gentle ping. Any info here would be great.
>>>
>>> Regards,
>>> Ram
>>>
>>> On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran 
>>> wrote:
>>>
>>>> Hello Spark Users Group,
>>>>
>>>> I've just recently started working on tools that use Apache Spark.
>>>> When I try WAITFOR in the spark-sql command line, I just get:
>>>>
>>>> Error: Error running query:
>>>> org.apache.spark.sql.catalyst.parser.ParseException:
>>>>
>>>> mismatched input 'WAITFOR' expecting (.. list of allowed commands..)
>>>>
>>>>
>>>> 1) Why is WAITFOR not allowed? Is there another way to get a process to
>>>> sleep for a desired period of time? I'm trying to test a timeout issue and
>>>> need to simulate a sleep behavior.
>>>>
>>>>
>>>> 2) Is there documentation that outlines why WAITFOR is not supported? I
>>>> did not find any good matches searching online.
>>>>
>>>> Thanks,
>>>> Ram
>>>>
>>>
>>>
>>> --
>>> K.N.Ramachandran
>>> Ph: 814-441-4279
>>>
>>
>
> --
> K.N.Ramachandran
> Ph: 814-441-4279
>
>
>


Re: Reading 7z file in spark

2020-01-14 Thread Someshwar Kale
I would suggest using another compression technique that is splittable, e.g.
bzip2, LZO, or LZ4.
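
For example (the path is illustrative), a bzip2-compressed text file can be read
directly and is split across partitions, unlike gzip or 7z:

val df = spark.read.text("s3://some-bucket/data/large-file.txt.bz2")
println(df.rdd.getNumPartitions)  // greater than 1 for a sufficiently large bzip2 file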

On Wed, Jan 15, 2020, 1:32 AM Enrico Minack  wrote:

> Hi,
>
> Spark does not support 7z natively, but you can read any file in Spark:
>
> def read(stream: PortableDataStream): Iterator[String] = { Seq(stream.getPath()).iterator }
>
> spark.sparkContext
>   .binaryFiles("*.7z")
>   .flatMap(file => read(file._2))
>   .toDF("path")
>   .show(false)
>
> This scales with the number of files. A single large 7z file would not
> scale well (a single partition).
>
> Any file that matches *.7z will be loaded via the read(stream:
> PortableDataStream) method, which returns an iterator over the rows. This
> method is executed on the executor and can implement the 7z specific code,
> which is independent of Spark and should not be too hard (here it does not
> open the input stream but returns the path only).
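
For completeness, a rough, untested sketch of what that 7z-specific read method
could look like, assuming Apache Commons Compress (1.20+ for getInputStream) is on
the executors' classpath and each archive is small enough to buffer in memory:

import org.apache.commons.compress.archivers.sevenz.SevenZFile
import org.apache.commons.compress.utils.{IOUtils, SeekableInMemoryByteChannel}
import org.apache.spark.input.PortableDataStream

def read(stream: PortableDataStream): Iterator[String] = {
  // Buffer the whole archive in memory; a large archive still lands in one partition.
  val sevenZ = new SevenZFile(new SeekableInMemoryByteChannel(stream.toArray()))
  Iterator.continually(sevenZ.getNextEntry)
    .takeWhile(_ != null)
    .filterNot(_.isDirectory)
    .flatMap { entry =>
      val bytes = IOUtils.toByteArray(sevenZ.getInputStream(entry))
      new String(bytes, "UTF-8").split("\n").iterator
    }  // closing the archive is omitted for brevity
}

With this version, the .toDF("path") in the snippet above would more naturally
be .toDF("line").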
>
> If you are planning to read the same files more than once, then it would
> be worth to first uncompress and convert them into files Spark supports.
> Then Spark can scale much better.
>
> Regards,
> Enrico
>
>
> On 13.01.20 at 13:31, HARSH TAKKAR wrote:
>
> Hi,
>
>
> Is it possible to read 7z compressed file in spark?
>
>
> Kind Regards
> Harsh Takkar
>
>
>