WenliangCao opened a new pull request, #2499: URL: https://github.com/apache/systemds/pull/2499
## Summary

This pull request introduces an initial implementation of the PowerTransformer built-in functions in Apache SystemDS. The implementation follows a fit-and-apply structure, with separate functions for estimating transformation parameters and applying the transformation to new data.

## Changes

* Add `powerTransform.dml` for estimating column-wise transformation parameters.
* Add `powerTransformApply.dml` for applying the transformation with previously estimated parameters.
* Implement the Yeo-Johnson transformation for positive, zero, and negative input values.
* Estimate the optimal lambda value independently for each column.
* Use golden-section search as the current lambda optimization method.
* Add DML scripts for integration testing.
* Add an R reference implementation for result validation.
* Add a Java integration test that compares the SystemDS output with the R reference output.

## Testing

The implementation was tested with:

```bash
mvn -Dtest=BuiltinPowerTransformTest test
```

Test result:

```text
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
BUILD SUCCESS
```

The test workflow performs the following steps:

1. Generate the input matrix in the Java test.
2. Execute the PowerTransformer implementation in SystemDS.
3. Execute the equivalent reference implementation in R.
4. Compare the SystemDS and R output matrices with a numerical tolerance.

## Current Status

The current implementation establishes the main transformation pipeline and verifies the numerical correctness of the Yeo-Johnson transformation.

The following components are currently available:

* Column-wise lambda estimation.
* Yeo-Johnson transformation.
* Separate fit and apply functions.
* R-based reference validation.
* Java integration testing through Maven.

## Current Limitations

* Lambda estimation currently uses golden-section search.
* A Brent-based optimization method is planned for the final implementation.
* Box-Cox transformation is not yet implemented.
* Additional edge cases and numerical stability tests are still required.
* More datasets and end-to-end experiments will be added.
* Comparisons with existing scaling methods will be completed during the final project phase.

## Future Work

* Replace or extend golden-section search with Brent's optimization method.
* Add support for the Box-Cox transformation.
* Add standardization after the power transformation.
* Extend tests for boundary lambda values, constant columns, zero values, and mixed-sign inputs.
* Add numerical comparisons with scikit-learn.
* Evaluate PowerTransformer against existing scaling methods for regression, classification, and clustering.
* Add user documentation and usage examples.

## Related Issue

SYSTEMDS-3863
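
For readers unfamiliar with the approach, the fit-and-apply structure with column-wise Yeo-Johnson lambda estimation via golden-section search can be sketched roughly as follows. This is an illustrative Python sketch, not the DML code from this pull request. The function names, the search bracket `[-2, 2]`, the tolerance, and the use of the standard Yeo-Johnson profile log-likelihood (under a normal model) as the objective are all assumptions made for the example.

```python
import math

def yeo_johnson(x, lam, eps=1e-12):
    """Yeo-Johnson transform of one value; handles x >= 0 and x < 0."""
    if x >= 0:
        if abs(lam) < eps:                      # limit case lambda -> 0
            return math.log1p(x)
        return (math.pow(x + 1.0, lam) - 1.0) / lam
    if abs(lam - 2.0) < eps:                    # limit case lambda -> 2
        return -math.log1p(-x)
    return -(math.pow(1.0 - x, 2.0 - lam) - 1.0) / (2.0 - lam)

def neg_log_likelihood(column, lam):
    """Negative profile log-likelihood (assumed objective, constants dropped)."""
    n = len(column)
    t = [yeo_johnson(v, lam) for v in column]
    mean = sum(t) / n
    var = sum((v - mean) ** 2 for v in t) / n   # population variance
    if var <= 0.0:                              # guard: constant column
        return math.inf
    # Jacobian term: (lambda - 1) * sum(sign(x) * log(|x| + 1))
    jac = (lam - 1.0) * sum(math.copysign(math.log1p(abs(v)), v) for v in column)
    return 0.5 * n * math.log(var) - jac

def golden_section_min(f, lo, hi, tol=1e-8, max_iter=200):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        if fc < fd:                             # minimum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:                                   # minimum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2.0

def power_transform_fit(X, lo=-2.0, hi=2.0):
    """Estimate one lambda per column of X (a list of rows)."""
    columns = list(zip(*X))
    return [
        golden_section_min(lambda lam, col=col: neg_log_likelihood(col, lam), lo, hi)
        for col in columns
    ]

def power_transform_apply(X, lambdas):
    """Apply previously estimated per-column lambdas to X."""
    return [[yeo_johnson(v, lam) for v, lam in zip(row, lambdas)] for row in X]

if __name__ == "__main__":
    X = [[0.5, -1.0], [1.5, 0.0], [4.0, 2.0], [9.0, 7.0], [0.1, -3.0]]
    lambdas = power_transform_fit(X)
    print(lambdas)
    print(power_transform_apply(X, lambdas))
```

Keeping the estimated lambdas separate from the transform step mirrors the split between `powerTransform.dml` and `powerTransformApply.dml`: parameters fitted on one matrix can be reused on new data. Golden-section search assumes the objective is unimodal on the bracket, which is one reason a Brent-based method is a natural follow-up.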
