[ https://issues.apache.org/jira/browse/SPARK-54179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Boumalhab updated SPARK-54179:
------------------------------------------
    Description: 
Implement support for tuple sketches in Apache Spark to enable approximate set 
cardinality, frequency, and similarity computations over multiple dimensions 
efficiently. The feature should:
 * Integrate tuple sketches with Spark’s DataFrame and RDD APIs.

 * Provide functions for creating, updating, and querying tuple sketches.

 * Support common sketch operations such as union, intersection, and 
cardinality estimation.

 * Ensure compatibility with Spark SQL and allow usage within DataFrame 
transformations and aggregations.

 * Include unit and integration tests validating accuracy and performance.

 * Provide documentation and examples for developers.

*Acceptance Criteria:*

1. Sketches support aggregation and merging operations.

2. Queries return approximate cardinalities or other statistics with expected 
error bounds.

3. Performance benchmarks show scalability for large datasets.

4. Documentation includes API usage examples.

  was:
Implement support for tuple sketches in Apache Spark to enable approximate set 
cardinality, frequency, and similarity computations over multiple dimensions 
efficiently. The feature should:
 * Integrate tuple sketches with Spark’s DataFrame and RDD APIs.

 * Provide functions for creating, updating, and querying tuple sketches.

 * Support common sketch operations such as union, intersection, and 
cardinality estimation.

 * Ensure compatibility with Spark SQL and allow usage within DataFrame 
transformations and aggregations.

 * Include unit and integration tests validating accuracy and performance.

 * Provide documentation and examples for developers.

*Acceptance Criteria:*
 # Sketches support aggregation and merging operations.

 # Queries return approximate cardinalities or other statistics with expected 
error bounds.

 # Performance benchmarks show scalability for large datasets.

 # Documentation includes API usage examples.


> Add Native Support for Apache Tuple Sketches
> --------------------------------------------
>
>                 Key: SPARK-54179
>                 URL: https://issues.apache.org/jira/browse/SPARK-54179
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Christopher Boumalhab
>            Priority: Major
>              Labels: pull-request-available
>
> Implement support for tuple sketches in Apache Spark to enable approximate 
> set cardinality, frequency, and similarity computations over multiple 
> dimensions efficiently. The feature should:
>  * Integrate tuple sketches with Spark’s DataFrame and RDD APIs.
>  * Provide functions for creating, updating, and querying tuple sketches.
>  * Support common sketch operations such as union, intersection, and 
> cardinality estimation.
>  * Ensure compatibility with Spark SQL and allow usage within DataFrame 
> transformations and aggregations.
>  * Include unit and integration tests validating accuracy and performance.
>  * Provide documentation and examples for developers.
> *Acceptance Criteria:*
> 1. Sketches support aggregation and merging operations.
> 2. Queries return approximate cardinalities or other statistics with expected 
> error bounds.
> 3. Performance benchmarks show scalability for large datasets.
> 4. Documentation includes API usage examples.
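
For readers unfamiliar with the sketch operations the description lists (update, union, intersection, cardinality estimation), here is a minimal pure-Python sketch of the general idea. It is not the Apache DataSketches implementation and not the API proposed for Spark in this issue. The class name TupleSketch, its methods, and the choice of a summed float as the per-key summary are all assumptions made for illustration. It keeps the k smallest key hashes (a KMV/theta-style sample) and attaches a numeric summary to each retained key.

```python
import hashlib


class TupleSketch:
    """Toy tuple sketch (hypothetical, for illustration only).

    Retains the k smallest key hashes in [0, 1) and a summed float
    summary per retained key. Not the DataSketches or Spark API.
    """

    def __init__(self, k=4096):
        self.k = k            # nominal number of retained entries
        self.theta = 1.0      # sampling threshold; 1.0 means "exact mode"
        self.entries = {}     # hash value -> summary (sum of update values)

    @staticmethod
    def _hash(key):
        # Map a key to a pseudo-uniform float in [0, 1].
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def _trim(self):
        # On overflow, lower theta to the (k+1)-th smallest hash and
        # drop every entry at or above it, leaving exactly k entries.
        if len(self.entries) > self.k:
            ordered = sorted(self.entries)
            self.theta = ordered[self.k]
            for h in ordered[self.k:]:
                del self.entries[h]

    def update(self, key, value=1.0):
        h = self._hash(key)
        if h < self.theta:
            self.entries[h] = self.entries.get(h, 0.0) + value
            self._trim()

    def estimate(self):
        # Estimated number of distinct keys seen.
        return len(self.entries) / self.theta

    def summary_sum_estimate(self):
        # Estimated sum of all update values across all keys.
        return sum(self.entries.values()) / self.theta

    def union(self, other):
        out = TupleSketch(min(self.k, other.k))
        out.theta = min(self.theta, other.theta)
        for src in (self, other):
            for h, v in src.entries.items():
                if h < out.theta:
                    out.entries[h] = out.entries.get(h, 0.0) + v
        out._trim()
        return out

    def intersection(self, other):
        out = TupleSketch(min(self.k, other.k))
        out.theta = min(self.theta, other.theta)
        for h, v in self.entries.items():
            if h < out.theta and h in other.entries:
                out.entries[h] = v + other.entries[h]
        return out


if __name__ == "__main__":
    a, b = TupleSketch(), TupleSketch()
    for i in range(20000):
        a.update(i)
    for i in range(10000, 30000):
        b.update(i)
    # True values are 30000 (union) and 10000 (intersection);
    # the printed numbers are approximations.
    print(a.union(b).estimate(), a.intersection(b).estimate())
```

While fewer than k distinct keys have been seen, theta stays at 1.0 and the counts are exact. Beyond that the results are estimates, with relative error that shrinks roughly as 1/sqrt(k), and intersections are typically less accurate than unions of the same inputs.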



