Re: [SPARK-48463] MLlib Feature transformer failing with nested dataset (Dot notation)
Hi Chhavi,

Currently there is no way to handle backticks (`) in a Spark StructType, so the field names a.b and `a.b` are treated as completely different within a StructType. To handle that, I have added a custom implementation fixing StringIndexer#validateAndTransformSchema. You can refer to the code on my GitHub:
https://github.com/skale1990/LearnSpark/blob/main/src/main/java/com/som/learnspark/TestCustomStringIndexer.scala

*Regards,*
*Someshwar Kale*

On Sat, Jun 8, 2024 at 12:00 PM Chhavi Bansal wrote:

> Hi Someshwar,
> Thanks for the response. I have added my comments to the ticket
> <https://issues.apache.org/jira/browse/SPARK-48463>.
>
> Thanks,
> Chhavi Bansal
>
> On Thu, 6 Jun 2024 at 17:28, Someshwar Kale wrote:
>
>> As a fix, you may consider adding a transformer to rename columns
>> (perhaps replace all dots in column names with underscores) and use the
>> renamed columns in your pipeline, as below:
>>
>> val renameColumn = new RenameColumn()
>>   .setInputCol("location.longitude")
>>   .setOutputCol("location_longitude")
>> val si = new StringIndexer()
>>   .setInputCol("location_longitude")
>>   .setOutputCol("longitude_index")
>> val pipeline = new Pipeline().setStages(Array(renameColumn, si))
>> pipeline.fit(flattenedDf).transform(flattenedDf).show()
>>
>> Refer to my comment
>> <https://issues.apache.org/jira/browse/SPARK-48463?focusedCommentId=17852751&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17852751>
>> for elaboration.
>> Thanks!!
>>
>> *Regards,*
>> *Someshwar Kale*
>>
>> On Thu, Jun 6, 2024 at 3:24 AM Chhavi Bansal wrote:
>>
>>> Hello team,
>>> I was exploring the feature transformers exposed via MLlib on a nested
>>> dataset and encountered an error while applying any transformer to a
>>> column whose name uses dot notation. I raised a ticket on Spark,
>>> https://issues.apache.org/jira/browse/SPARK-48463, where I have
>>> described the entire scenario.
>>>
>>> I wanted to get suggestions on the best way to solve this problem while
>>> using dot notation. One workaround is to use `_` while flattening the
>>> dataframe, but that would mean additional overhead to convert back to
>>> `.` (dot notation), since that is the convention for our other
>>> flattened data.
>>>
>>> I would be happy to contribute to the code if someone can shed some
>>> light on how this could be solved.
>>>
>>> --
>>> Thanks and Regards,
>>> Chhavi Bansal
>>
>
> --
> Thanks and Regards,
> Chhavi Bansal
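To illustrate the dot-vs-backtick ambiguity described above, here is a minimal spark-shell sketch (the column name and data are made up for this example): with a flat column literally named location.longitude, an unquoted reference is parsed as struct field access, while backticks quote the whole name.

// Minimal sketch; assumes a spark-shell session where `spark` is the
// active SparkSession. The DataFrame is hypothetical.
import spark.implicits._

val flattenedDf = Seq(("id1", 77.59)).toDF("id", "location.longitude")

// Fails with AnalysisException: the dot is parsed as access to a field
// "longitude" inside a (non-existent) struct column "location".
// flattenedDf.select("location.longitude").show()

// Works: backticks make Spark treat the dotted name as one top-level column.
flattenedDf.select("`location.longitude`").show()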
Re: [SPARK-48463] MLlib Feature transformer failing with nested dataset (Dot notation)
As a fix, you may consider adding a transformer to rename columns (perhaps replace all dots in column names with underscores) and use the renamed columns in your pipeline, as below:

val renameColumn = new RenameColumn()
  .setInputCol("location.longitude")
  .setOutputCol("location_longitude")
val si = new StringIndexer()
  .setInputCol("location_longitude")
  .setOutputCol("longitude_index")
val pipeline = new Pipeline().setStages(Array(renameColumn, si))
pipeline.fit(flattenedDf).transform(flattenedDf).show()

Refer to my comment
<https://issues.apache.org/jira/browse/SPARK-48463?focusedCommentId=17852751&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17852751>
for elaboration.
Thanks!!

*Regards,*
*Someshwar Kale*

On Thu, Jun 6, 2024 at 3:24 AM Chhavi Bansal wrote:

> Hello team,
> I was exploring the feature transformers exposed via MLlib on a nested
> dataset and encountered an error while applying any transformer to a
> column whose name uses dot notation. I raised a ticket on Spark,
> https://issues.apache.org/jira/browse/SPARK-48463, where I have described
> the entire scenario.
>
> I wanted to get suggestions on the best way to solve this problem while
> using dot notation. One workaround is to use `_` while flattening the
> dataframe, but that would mean additional overhead to convert back to `.`
> (dot notation), since that is the convention for our other flattened data.
>
> I would be happy to contribute to the code if someone can shed some light
> on how this could be solved.
>
> --
> Thanks and Regards,
> Chhavi Bansal
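The RenameColumn used above is Someshwar's custom class, not a Spark built-in. A minimal sketch of what such a transformer could look like, assuming Spark's ml Transformer API; the class body here is an illustrative stand-in, not the actual implementation from the linked repository:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class RenameColumn(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("renameColumn"))

  final val inputCol: Param[String] =
    new Param[String](this, "inputCol", "name of the column to rename")
  final val outputCol: Param[String] =
    new Param[String](this, "outputCol", "new name for the column")

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // withColumnRenamed matches the literal top-level column name, so a
  // flattened column called "location.longitude" is renamed safely.
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.toDF().withColumnRenamed($(inputCol), $(outputCol))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.map { field =>
      if (field.name == $(inputCol)) field.copy(name = $(outputCol)) else field
    })

  override def copy(extra: ParamMap): RenameColumn = defaultCopy(extra)
}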
Re: [Spark SQL]: Does Spark SQL support WAITFOR?
Hi Ram,

Have you seen this Stack Overflow question and its answers:
https://stackoverflow.com/questions/39685744/apache-spark-how-to-cancel-job-in-code-and-kill-running-tasks
If not, please have a look; it seems to deal with a similar problem.

*Regards,*
*Someshwar Kale*

On Fri, May 20, 2022 at 7:34 AM Artemis User wrote:

> WAITFOR is part of Transact-SQL and is Microsoft SQL Server specific; it
> is not supported by Spark SQL. If you want to impose a delay in a Spark
> program, you may want to use the thread sleep function in Java or Scala.
> Hope this helps...
>
> On 5/19/22 1:45 PM, K. N. Ramachandran wrote:
>
> Hi Sean,
>
> I'm trying to test a timeout feature in a tool that uses Spark SQL.
> Basically, if a long-running query exceeds a configured threshold, the
> query should be canceled.
> I couldn't see a simple way to write a "sleep" SQL statement to test the
> timeout. Instead, I just ran a "select count(*) from table" on a large
> table to act as a long-running query.
>
> Is there any way to trigger a "sleep"-like behavior in Spark SQL?
>
> Regards,
> Ram
>
> On Tue, May 17, 2022 at 4:23 PM Sean Owen wrote:
>
>> I don't think that is standard SQL? What are you trying to do, and why
>> not do it outside SQL?
>>
>> On Tue, May 17, 2022 at 6:03 PM K. N. Ramachandran wrote:
>>
>>> Gentle ping. Any info here would be great.
>>>
>>> Regards,
>>> Ram
>>>
>>> On Sun, May 15, 2022 at 5:16 PM K. N. Ramachandran wrote:
>>>
>>>> Hello Spark Users Group,
>>>>
>>>> I've just recently started working on tools that use Apache Spark.
>>>> When I try WAITFOR in the spark-sql command line, I just get:
>>>>
>>>> Error: Error running query:
>>>> org.apache.spark.sql.catalyst.parser.ParseException:
>>>> mismatched input 'WAITFOR' expecting (.. list of allowed commands ..)
>>>>
>>>> 1) Why is WAITFOR not allowed? Is there another way to get a process
>>>> to sleep for a desired period of time? I'm trying to test a timeout
>>>> issue and need to simulate a sleep behavior.
>>>>
>>>> 2) Is there documentation that outlines why WAITFOR is not supported?
>>>> I did not find any good matches searching online.
>>>>
>>>> Thanks,
>>>> Ram
>>>
>>> --
>>> K.N.Ramachandran
>>> Ph: 814-441-4279
>>
>
> --
> K.N.Ramachandran
> Ph: 814-441-4279
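For testing, one way to get sleep-like behavior into Spark SQL is to register a UDF that calls Thread.sleep and invoke it from SQL. A minimal sketch, assuming a spark-shell session where `spark` is the SparkSession; the function name "sleep" is our own choice here, not a Spark built-in:

import org.apache.spark.sql.functions.udf

// Each evaluation blocks the executor task for the given number of
// milliseconds and returns the argument, so the query takes at least
// that long to finish.
val sleepUdf = udf { millis: Long =>
  Thread.sleep(millis)
  millis
}
spark.udf.register("sleep", sleepUdf)

// On a one-row relation this behaves like a 5-second WAITFOR.
spark.sql("SELECT sleep(5000)").show()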
Re: Reading 7z file in Spark
I would suggest using another compression format that is splittable, e.g. bzip2, LZO, or LZ4.

On Wed, Jan 15, 2020, 1:32 AM Enrico Minack wrote:

> Hi,
>
> Spark does not support 7z natively, but you can read any file in Spark:
>
> import org.apache.spark.input.PortableDataStream
> import spark.implicits._
>
> def read(stream: PortableDataStream): Iterator[String] = {
>   Seq(stream.getPath()).iterator
> }
>
> spark.sparkContext
>   .binaryFiles("*.7z")
>   .flatMap(file => read(file._2))
>   .toDF("path")
>   .show(false)
>
> This scales with the number of files. A single large 7z file would not
> scale well (a single partition).
>
> Any file that matches *.7z will be loaded via the read(stream:
> PortableDataStream) method, which returns an iterator over the rows. This
> method is executed on the executor and can implement the 7z-specific
> code, which is independent of Spark and should not be too hard (here it
> does not open the input stream but returns the path only).
>
> If you are planning to read the same files more than once, then it would
> be worth first uncompressing them and converting them into files Spark
> supports. Then Spark can scale much better.
>
> Regards,
> Enrico
>
> On 13.01.20 at 13:31, HARSH TAKKAR wrote:
>
> Hi,
>
> Is it possible to read a 7z compressed file in Spark?
>
> Kind Regards
> Harsh Takkar
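Building on Enrico's skeleton, a sketch of what the 7z-specific part could look like, assuming Apache Commons Compress (org.apache.commons:commons-compress) is on the executor classpath and the archive entries are UTF-8 text; the helper name read7z is made up for this example:

// Sketch only: assumes commons-compress on the classpath and UTF-8 text
// entries; read7z is a hypothetical name, not part of Spark.
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.commons.compress.archivers.sevenz.SevenZFile
import org.apache.spark.input.PortableDataStream

def read7z(stream: PortableDataStream): Iterator[String] = {
  // SevenZFile needs random access, so spill the bytes to a local temp file.
  val tmp = Files.createTempFile("spark-7z-", ".7z")
  Files.write(tmp, stream.toArray())
  val archive = new SevenZFile(tmp.toFile)
  val lines = Iterator
    .continually(archive.getNextEntry)
    .takeWhile(_ != null)
    .filterNot(_.isDirectory)
    .flatMap { entry =>
      // SevenZFile.read consumes the payload of the current entry.
      val content = new Array[Byte](entry.getSize.toInt)
      archive.read(content, 0, content.length)
      new String(content, StandardCharsets.UTF_8).split("\n")
    }
    .toVector  // materialize before closing the archive
  archive.close()
  Files.delete(tmp)
  lines.iterator
}

// Plugged into the skeleton above:
// spark.sparkContext.binaryFiles("*.7z").flatMap(f => read7z(f._2)).toDF("line")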