[jira] [Updated] (SPARK-39307) Proposed spark.SQL.function to return an array of fixed sized collections from a collection (array or string)

Aishwarya Srivastava (Jira) Thu, 26 May 2022 18:49:19 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-39307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Aishwarya Srivastava updated SPARK-39307:
-----------------------------------------
    Description: 
In many scenarios, UDFs are written mainly to chunk collections into 
fixed*-size sub-arrays or arrays of strings.

The only alternative is a complex and expensive regex, which does not work 
for non-String use cases.
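
For illustration, the regex workaround might look like the sketch below. The helper name chunkWithRegex and the exact pattern are assumptions (the description does not show the regex in use); the point is only that the approach is tied to String input.

```scala
// Regex-based chunking: applies to strings only, not to arrays.
// (?s) lets '.' also match newlines; {1,size} allows a shorter last chunk.
// chunkWithRegex is a hypothetical helper name, not a Spark function.
def chunkWithRegex(size: Int, s: String): Array[String] = {
  require(size > 0, "size must be positive")
  s"(?s).{1,$size}".r.findAllIn(s).toArray
}

val regexChunks = chunkWithRegex(3, "abcdefgh") // Array("abc", "def", "gh")
```

There is no counterpart of this pattern for an array of integers or an array of structs, which is the gap the proposal targets.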

*Leftover elements form a final chunk that is shorter than the requested size.

Example

===

scala> def grouped_by_size[C](size: Int, collection: Iterable[C]) = 
collection.grouped(size).toArray
grouped_by_size: [C](size: Int, collection: Iterable[C])Array[Iterable[C]]

scala> grouped_by_size(2, "Vishal Singhrd")
res0: Array[Iterable[Char]] = Array(Vi, sh, al,  S, in, gh, rd)

scala> grouped_by_size(2, (0 to 4).toSeq )
res1: Array[Iterable[Int]] = Array(Vector(0, 1), Vector(2, 3), Vector(4))

scala> grouped_by_size(3, "Vishal Singhrd")
res2: Array[Iterable[Char]] = Array(Vis, hal,  Si, ngh, rd)

scala> grouped_by_size(3, Array("This","is","my","last","example"))
res3: Array[Iterable[String]] = Array(WrappedArray(This, is, my), 
WrappedArray(last, example))
 * Depending on the data domain, the elements of an array or string may 
have a natural periodicity that carries semantic meaning.
 * Being able to easily divide a large collection into an Array of Arrays or 
an Array of Strings helps in applying transforms, explodes, and other array 
functions (zip, etc.) at the level of periodicity appropriate for the data 
domain.
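
A small sketch of what working "at the appropriate level of periodicity" can mean, in plain Scala rather than Spark. The flat (id, width, height) layout is a made-up example, not data from this issue.

```scala
// Hypothetical flat layout: (id, width, height) repeated with period 3.
val flat: Seq[Int] = Seq(1, 10, 20, 2, 30, 40, 3, 50, 60)

// Chunk at the natural period, then transform each chunk as one record.
// collect skips a short trailing chunk that does not match the pattern.
val areas: List[(Int, Int)] = flat.grouped(3).collect {
  case Seq(id, w, h) => (id, w * h)
}.toList
// areas == List((1, 200), (2, 1200), (3, 3000))
```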

Pros:
 * Current methods for creating periodicity work only on Strings and require 
complex and inefficient regular expressions instead of a simpler and more 
direct solution. The proposed solution would work for both Strings and Arrays.
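
As a rough sketch of the intended semantics for both input kinds, again in plain Scala: the names chunkString and chunkSeq are hypothetical and are not existing or agreed Spark functions. Unlike the generic grouped_by_size above, the String variant returns an Array of String directly.

```scala
// Hypothetical helpers illustrating the proposed behavior; not a Spark API.
def chunkString(s: String, size: Int): Array[String] = {
  require(size > 0, "size must be positive")
  s.grouped(size).toArray // last chunk may be shorter
}

def chunkSeq[T](xs: Seq[T], size: Int): Array[Seq[T]] = {
  require(size > 0, "size must be positive")
  xs.grouped(size).toArray // last chunk may be shorter
}

val nameChunks = chunkString("Vishal Singhrd", 3) // Array("Vis", "hal", " Si", "ngh", "rd")
val intChunks = chunkSeq(0 to 4, 2)               // chunks: (0, 1), (2, 3), (4)
```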

  was:
In many scenarios, UDFs are written mainly to chunk collections into 
fixed*-size sub-arrays or arrays of strings.

The only alternative is a complex and expensive regex, which does not work 
for non-String use cases.

*Leftover elements form a final chunk that is shorter than the requested size.

Example

===

scala> def grouped_by_size[C](size: Int, collection: Iterable[C]) = 
collection.grouped(size).toArray
grouped_by_size: [C](size: Int, collection: Iterable[C])Array[Iterable[C]]

scala> grouped_by_size(2, "Samuel Shepard")
res0: Array[Iterable[Char]] = Array(Sa, mu, el,  S, he, pa, rd)

scala> grouped_by_size(2, (0 to 4).toSeq )
res1: Array[Iterable[Int]] = Array(Vector(0, 1), Vector(2, 3), Vector(4))

scala> grouped_by_size(3, "Samuel Shepard")
res2: Array[Iterable[Char]] = Array(Sam, uel,  Sh, epa, rd)

scala> grouped_by_size(3, Array("This","is","my","last","example"))
res3: Array[Iterable[String]] = Array(WrappedArray(This, is, my), 
WrappedArray(last, example))
 * Depending on the data domain, the elements of an array or string may 
have a natural periodicity that carries semantic meaning.
 * Being able to easily divide a large collection into an Array of Arrays or 
an Array of Strings helps in applying transforms, explodes, and other array 
functions (zip, etc.) at the level of periodicity appropriate for the data 
domain.

Pros:
 * Current methods for creating periodicity work only on Strings and require 
complex and inefficient regular expressions instead of a simpler and more 
direct solution. The proposed solution would work for both Strings and Arrays.

        Summary: Proposed spark.SQL.function to return an array of fixed sized 
collections from a collection (array or string)  (was:   Need a 
spark.sql.function to return array from a collection of arrays)

> Proposed spark.SQL.function to return an array of fixed sized collections 
> from a collection (array or string)
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39307
>                 URL: https://issues.apache.org/jira/browse/SPARK-39307
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: Aishwarya Srivastava
>            Priority: Major



