rszper commented on code in PR #27709:
URL: https://github.com/apache/beam/pull/27709#discussion_r1300421093
##########
website/www/site/content/en/documentation/transforms/python/elementwise/mltransform.md:
##########
@@ -0,0 +1,120 @@
+---
+title: "MLTransform"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# MLTransform for data processing
+
+{{< localstorage language language-py >}}
+
+
+<table>
+  <tr>
+    <td>
+      <a>
+      {{< button-pydoc path="apache_beam.ml.transforms" class="MLTransform" >}}
+      </a>
+    </td>
+  </tr>
+</table>
+
+
+Use `MLTransform` to apply common machine learning (ML) processing tasks on keyed data. Apache Beam provides ML data processing transformations that you can use with `MLTransform`. For the full list of available data
+processing transformations, see the [tft.py file](https://github.com/apache/beam/blob/ab93fb1988051baac6c3b9dd1031f4d68bd9a149/sdks/python/apache_beam/ml/transforms/tft.py#L52) in GitHub.
+
+
+To define a data processing transformation by using `MLTransform`, create instances of data processing transforms with `columns` as input parameters. The data in the specified `columns` is transformed and outputted to the `beam.Row` object.
+
+The following example demonstrates how to use `MLTransform` to normalize your data between 0 and 1 by using the minimum and maximum values from your entire dataset. `MLTransform` uses the `ScaleTo01` transformation.
+
+
+```
+scale_to_z_score_transform = ScaleToZScore(columns=['x', 'y'])
+with beam.Pipeline() as p:
+  (data | MLTransform(write_artifact_location=artifact_location).with_transform(scale_to_z_score_transform))
+```
+
+In this example, `MLTransform` receives a value for `write_artifact_location`. `MLTransform` then uses this location value to write artifacts generated by the transform. To pass the data processing transform, you can use either the with_transform method of `MLTransform` or a list.
+
+```
+MLTransform(transforms=transforms, write_artifact_location=write_artifact_location)
+```
+
+The transforms passed to `MLTransform` are applied sequentially on the dataset. `MLTransform` expects a dictionary and return a transformed Row objecst with numpy arrays.

Review Comment:
```suggestion
The transforms passed to `MLTransform` are applied sequentially on the dataset. `MLTransform` expects a dictionary and returns a transformed row object with NumPy arrays.
```
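For context on the dictionary-in, row-out behavior the suggestion describes, here is a minimal sketch of the flow shown in the quoted snippet. It is not part of the PR under review; the import paths, element shapes, and artifact location are assumptions.

```python
# Illustrative sketch only; not part of the PR under review. Assumes
# MLTransform is importable from apache_beam.ml.transforms.base, the scaling
# transforms from apache_beam.ml.transforms.tft, and that the
# tensorflow_transform dependency is installed.
import tempfile

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ScaleToZScore

artifact_location = tempfile.mkdtemp()  # MLTransform writes its artifacts here.

# MLTransform expects dictionary elements; each key is a column name.
data = [
    {'x': [1.0, 2.0, 3.0], 'y': [10.0]},
    {'x': [4.0, 5.0, 6.0], 'y': [20.0]},
]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(data)
        | MLTransform(write_artifact_location=artifact_location).with_transform(
            ScaleToZScore(columns=['x', 'y']))
        # Each output element is a row-like object whose values are NumPy arrays.
        | beam.Map(print))
```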
##########
website/www/site/content/en/documentation/transforms/python/elementwise/mltransform.md:
##########
+In this example, `MLTransform` receives a value for `write_artifact_location`. `MLTransform` then uses this location value to write artifacts generated by the transform. To pass the data processing transform, you can use either the with_transform method of `MLTransform` or a list.

Review Comment:
```suggestion
In this example, `MLTransform` receives a value for `write_artifact_location`. `MLTransform` then uses this location value to write artifacts generated by the transform. To pass the data processing transform, you can use either the `with_transform` method of `MLTransform` or a list.
```
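As a hedged illustration of the two ways the quoted paragraph mentions for handing transforms to `MLTransform`, a sketch follows; the import paths and the temporary artifact location are assumptions, not the PR's code.

```python
# Sketch only; both forms are described in the quoted text: passing a list via
# the transforms parameter, or chaining with_transform calls.
import tempfile

from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ScaleTo01, ScaleToZScore

write_artifact_location = tempfile.mkdtemp()

# Option 1: pass a list; the transforms are applied sequentially.
ml_transform = MLTransform(
    transforms=[ScaleTo01(columns=['x']), ScaleToZScore(columns=['y'])],
    write_artifact_location=write_artifact_location)

# Option 2: chain with_transform calls; equivalent to the list form.
ml_transform = (
    MLTransform(write_artifact_location=write_artifact_location)
    .with_transform(ScaleTo01(columns=['x']))
    .with_transform(ScaleToZScore(columns=['y'])))
```

Either form should produce the same sequence of data processing transforms applied to the dataset.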
##########
website/www/site/content/en/documentation/transforms/python/elementwise/mltransform.md:
##########
+## Examples
+
+The following examples demonstrate how to to create pipelines that use `MLTransform` to preprocess data.
+
+MLTransform can do a full pass on the dataset, which is useful when you need to transform a single element only after analyzing the entire dataset.

Review Comment:
```suggestion
`MLTransform` can do a full pass on the dataset, which is useful when you need to transform a single element only after analyzing the entire dataset.
```
##########
website/www/site/content/en/documentation/transforms/python/elementwise/mltransform.md:
##########
+The first two examples require a full pass over the dataset to complete the data transformation.
+
+* For the `ComputeAndApplyVocabulary` transform, the transform needs access to all of the unique words in the dataset.
+* For the `ScaleTo01` transform, the transform needs to know the minimum and maximum values in the dataset.
+
+### Example 1
+
+This example creates a pipeline that uses `MLTransform` to scale data between 0 and 1.
+The example takes a list of ints and converts them into the range of 0 to 1 using the transform `ScaleTo01`.

Review Comment:
```suggestion
The example takes a list of integers and converts them into the range of 0 to 1 using the transform `ScaleTo01`.
```
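The snippet behind the `mltransform_scale_to_0_1` code_sample tag is not reproduced in this thread; as a rough sketch under assumed import paths and illustrative column names and data, the pipeline Example 1 describes might look like this.

```python
# Rough sketch of the kind of pipeline Example 1 describes; the real snippet
# lives behind the mltransform_scale_to_0_1 code_sample tag and may differ.
import tempfile

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ScaleTo01

# Hypothetical column name and values, used only for illustration.
data = [
    {'int_feature': [1, 5, 3]},
    {'int_feature': [4, 2, 8]},
]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(data)
        | MLTransform(write_artifact_location=tempfile.mkdtemp()).with_transform(
            ScaleTo01(columns=['int_feature']))
        # After a full pass to find the global min and max, every value is
        # rescaled into the range [0, 1].
        | beam.Map(print))
```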
##########
website/www/site/content/en/documentation/transforms/python/elementwise/mltransform.md:
##########
+{{< highlight language="py" file="sdks/python/apache_beam/examples/snippets/transforms/elementwise/mltransform.py"
+  class="notebook-skip" >}}
+{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/elementwise/mltransform.py" mltransform_scale_to_0_1 >}}
+{{</ highlight >}}
+
+{{< paragraph class="notebook-skip" >}}
+Output:
+{{< /paragraph >}}
+{{< highlight class="notebook-skip" >}}
+{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/elementwise/mltransform_test.py" mltransform_scale_to_0_1 >}}
+{{< /highlight >}}
+
+
+### Example 2
+
+This example creates a pipeline that use `MLTransform` to compute vocabulary on the entire dataset and assign indices to each unique vocabulary item.
+It takes a list of strings, computes vocabulary over the entire dataset, and then applies a unique index to each vocabulary item.
+
+
+{{< highlight language="py" file="sdks/python/apache_beam/examples/snippets/transforms/elementwise/mltransform.py"
+  class="notebook-skip" >}}
+{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/elementwise/mltransform.py" mltransform_compute_and_apply_vocabulary >}}
+{{</ highlight >}}
+
+{{< paragraph class="notebook-skip" >}}
+Output:
+{{< /paragraph >}}
+{{< highlight class="notebook-skip" >}}
+{{< code_sample "sdks/python/apache_beam/examples/snippets/transforms/elementwise/mltransform_test.py" mltransform_compute_and_apply_vocab >}}
+{{< /highlight >}}
+
+
+The above two examples requires a full pass over the dataset to transform the dataset. For `ComputeAndApplyVocabulary`, all the unqiue words in the dataset needs to be known before transforming the data. For `ScaleTo01`, the minimum and maximum of the dataset needs to be known before transforming the dataset. This is acheived by `MLTransform`.

Review Comment:
```suggestion
```
Delete this, because it's been moved to the intro.
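For the reader's context, a sketch of the kind of pipeline Example 2 describes follows; the actual snippet is referenced by the `mltransform_compute_and_apply_vocabulary` code_sample tag, and the import paths, column name, and data here are assumptions.

```python
# Rough sketch of the kind of pipeline Example 2 describes; not the website's
# actual snippet.
import tempfile

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ComputeAndApplyVocabulary

# Hypothetical string data; each element is a dict of column name to words.
data = [
    {'text': ['I', 'like', 'pie']},
    {'text': ['yum', 'yum', 'pie']},
]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(data)
        | MLTransform(write_artifact_location=tempfile.mkdtemp()).with_transform(
            ComputeAndApplyVocabulary(columns=['text']))
        # The vocabulary is computed over the whole dataset first; each word is
        # then replaced by its integer index.
        | beam.Map(print))
```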
