WenliangCao opened a new pull request, #2499:
URL: https://github.com/apache/systemds/pull/2499

   ## Summary
   
   This pull request introduces an initial implementation of the 
PowerTransformer built-in functions in Apache SystemDS.
   
   The implementation follows a fit-and-apply structure, with separate 
functions for estimating transformation parameters and applying the 
transformation to new data.
   
   ## Changes
   
   * Add `powerTransform.dml` for estimating column-wise transformation 
parameters.
   * Add `powerTransformApply.dml` for applying the transformation with 
previously estimated parameters.
   * Implement the Yeo-Johnson transformation for positive, zero, and negative 
input values.
   * Estimate the optimal lambda value independently for each column.
   * Use golden-section search as the current lambda optimization method.
   * Add DML scripts for integration testing.
   * Add an R reference implementation for result validation.
   * Add a Java integration test that compares the SystemDS output with the R 
reference output.
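
For reviewers who want the transformation spelled out, here is a minimal Python sketch of the standard Yeo-Johnson definition with its four branches (non-negative or negative input, with the special cases lambda = 0 and lambda = 2). It is not the DML code from this PR. The function names `yeo_johnson` and `apply_columnwise` and the `eps` threshold for the special cases are assumptions made for illustration.

```python
import math


def yeo_johnson(x: float, lam: float, eps: float = 1e-12) -> float:
    """Yeo-Johnson transform of one value for a given lambda.

    eps is an illustrative threshold for treating lambda as exactly
    0 or 2; the PR's DML scripts may handle these cases differently.
    """
    if x >= 0.0:
        if abs(lam) < eps:
            # lambda == 0: logarithmic branch for non-negative input
            return math.log1p(x)
        return (math.pow(x + 1.0, lam) - 1.0) / lam
    if abs(lam - 2.0) < eps:
        # lambda == 2: logarithmic branch for negative input
        return -math.log1p(-x)
    return -(math.pow(1.0 - x, 2.0 - lam) - 1.0) / (2.0 - lam)


def apply_columnwise(rows, lambdas):
    """Apply one previously estimated lambda per column to a row-major matrix."""
    return [[yeo_johnson(v, lam) for v, lam in zip(row, lambdas)] for row in rows]
```

With lambda = 1 the transform reduces to the identity for both signs, which makes a convenient sanity check for the apply step.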
   
   ## Testing
   
   The implementation was tested with:
   
   ```bash
   mvn -Dtest=BuiltinPowerTransformTest test
   ```
   
   Test result:
   
   ```text
   Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
   BUILD SUCCESS
   ```
   
   The test workflow performs the following steps:
   
   1. Generate the input matrix in the Java test.
   2. Execute the PowerTransformer implementation in SystemDS.
   3. Execute the equivalent reference implementation in R.
   4. Compare the SystemDS and R output matrices with a numerical tolerance.
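
The comparison in step 4 can be pictured as an element-wise check against a tolerance. The following Python sketch only illustrates that idea; it is not the Java test code, and the helper names and the default tolerance `1e-6` are assumptions rather than values taken from this PR.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two row-major matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrix shapes differ")
    return max(
        (abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)),
        default=0.0,
    )


def matrices_close(a, b, tol=1e-6):
    """True if every pair of entries differs by at most tol (hypothetical default)."""
    return max_abs_diff(a, b) <= tol
```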
   
   ## Current Status
   
   The current implementation establishes the main transformation pipeline and 
verifies the numerical correctness of the Yeo-Johnson transformation against 
the R reference implementation.
   
   The following components are currently available:
   
   * Column-wise lambda estimation.
   * Yeo-Johnson transformation.
   * Separate fit and apply functions.
   * R-based reference validation.
   * Java integration testing through Maven.
   
   ## Current Limitations
   
   * Lambda estimation currently uses golden-section search.
   * A Brent-based optimization method is planned for the final implementation.
   * Box-Cox transformation is not yet implemented.
   * Additional edge cases and numerical stability tests are still required.
   * More datasets and end-to-end experiments will be added.
   * Comparisons with existing scaling methods will be completed during the 
final project phase.
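
As context for the golden-section limitation, the sketch below shows one way a per-column lambda can be estimated by maximizing the Yeo-Johnson profile log-likelihood with golden-section search. It is written in Python for illustration and is not the DML code from this PR. The search bracket `[-2, 2]`, the tolerance, and all function names are assumptions. Constant columns (zero variance) are not handled, which is one of the edge cases listed as open.

```python
import math


def yeo_johnson(x, lam, eps=1e-12):
    """Standard Yeo-Johnson transform of one value."""
    if x >= 0.0:
        return math.log1p(x) if abs(lam) < eps else (math.pow(x + 1.0, lam) - 1.0) / lam
    if abs(lam - 2.0) < eps:
        return -math.log1p(-x)
    return -(math.pow(1.0 - x, 2.0 - lam) - 1.0) / (2.0 - lam)


def log_likelihood(col, lam):
    """Profile log-likelihood of lambda for one column (additive constants dropped)."""
    n = len(col)
    t = [yeo_johnson(v, lam) for v in col]
    mean = sum(t) / n
    var = sum((v - mean) ** 2 for v in t) / n  # fails for constant columns (var == 0)
    jacobian = (lam - 1.0) * sum(
        math.copysign(1.0, v) * math.log1p(abs(v)) for v in col
    )
    return -0.5 * n * math.log(var) + jacobian


def golden_section_max(f, lo, hi, tol=1e-6, max_iter=200):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        if fc > fd:
            # Maximum lies in [lo, d]; the old c becomes the new d.
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:
            # Maximum lies in [c, hi]; the old d becomes the new c.
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)


def estimate_lambda(col, lo=-2.0, hi=2.0):
    """Estimate lambda for one column; the bracket [-2, 2] is an assumed default."""
    return golden_section_max(lambda lam: log_likelihood(col, lam), lo, hi)
```

Golden-section search needs only function evaluations and shrinks the bracket by a fixed factor per step. Brent's method combines the same bracketing with parabolic interpolation and usually converges in fewer evaluations on smooth objectives, which is the motivation for the planned replacement.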
   
   ## Future Work
   
   * Replace or extend golden-section search with Brent's optimization method.
   * Add support for the Box-Cox transformation.
   * Add standardization after the power transformation.
   * Extend tests for boundary lambda values, constant columns, zero values, 
and mixed-sign inputs.
   * Add numerical comparisons with scikit-learn.
   * Evaluate PowerTransformer against existing scaling methods for regression, 
classification, and clustering.
   * Add user documentation and usage examples.
   
   ## Related Issue
   
   SYSTEMDS-3863
   

