sekikn opened a new pull request, #1350:
URL: https://github.com/apache/bigtop/pull/1350
### Description of PR
This PR improves maintainability and readability by replacing the RDD API
and spark.mllib with the DataFrame API and the spark.ml library.
* Replace most of the RDD-based logic with equivalent DataFrame-based logic
* Replace the usage of spark.mllib.recommendation.ALS with
spark.ml.recommendation.ALS
* Adopt Parquet instead of SequenceFile as the intermediate file format
* Fix indentation and spacing by applying the formatter
This PR supersedes #1343.
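
To illustrate the direction of the changes listed above, here is a minimal sketch of a DataFrame-based flow that round-trips data through Parquet and trains `spark.ml.recommendation.ALS`. It is not the code in this PR: the object name `AlsDataFrameSketch`, the made-up purchases, the temporary path, the use of purchase counts as implicit ratings, and all hyperparameters are assumptions chosen only for illustration. The column names `customerId` and `productId` follow the ETL output shown in the testing section.

```scala
import java.nio.file.Files

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit}

object AlsDataFrameSketch {
  // Returns customerId -> recommended productIds for a tiny, made-up data set.
  def run(): Map[Int, Seq[Int]] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("als-dataframe-sketch")
      .getOrCreate()
    import spark.implicits._

    try {
      // Hypothetical purchases (customerId, productId); not taken from the generator.
      val purchases = Seq(
        (0, 23), (0, 729), (0, 23),
        (1, 23), (1, 40),
        (2, 729), (2, 40)
      ).toDF("customerId", "productId")

      // Round-trip through Parquet, standing in for the intermediate ETL output.
      val path = Files.createTempDirectory("bps-sketch").resolve("transactions").toString
      purchases.write.parquet(path)
      val loaded = spark.read.parquet(path)

      // Assumption: the purchase count per (customer, product) serves as an implicit rating.
      val ratings = loaded
        .groupBy("customerId", "productId")
        .agg(count(lit(1)).cast("float").as("rating"))

      // Hyperparameters are arbitrary values for a toy data set.
      val model = new ALS()
        .setUserCol("customerId")
        .setItemCol("productId")
        .setRatingCol("rating")
        .setImplicitPrefs(true)
        .setRank(5)
        .setMaxIter(5)
        .setSeed(42L)
        .fit(ratings)

      // recommendForAllUsers yields an array of (productId, rating) structs per user;
      // selecting the nested field keeps only the product ids.
      model.recommendForAllUsers(2)
        .select(col("customerId"), col("recommendations.productId").as("productIds"))
        .as[(Int, Seq[Int])]
        .collect()
        .toMap
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    run().toSeq.sortBy(_._1).foreach { case (c, ps) => println(s"$c -> ${ps.mkString(",")}") }
}
```

Compared with the RDD-based `spark.mllib` API, the estimator is configured through named columns, so the same DataFrame can flow from Parquet into training without being converted into `Rating` objects.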
### How was this patch tested?
Build bigtop-data-generators and install it into the local Maven repository:
```
$ cd bigtop-data-generators
$ ../gradlew clean publishToMavenLocal
$ cd -
```
Build bigpetstore-spark:
```
$ cd bigtop-bigpetstore/bigpetstore-spark
$ ../../gradlew clean shadowJar
```
Run the steps described in bigtop-bigpetstore/bigpetstore-spark/README.md.
First, generate test data:
```
$ spark-submit --master local[*] \
    --class org.apache.bigtop.bigpetstore.spark.generator.SparkDriver \
    build/libs/bigpetstore-spark-3.5.0-SNAPSHOT-all.jar generated_data 10 1000 365.0 345
$ head generated_data/transactions/part-00000
2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,1,Fri Jun 06 09:04:09 JST 2025,unitPrice=112.8;quantity=120.0;color=multicolor (solids);price=13536.0;category=poop bags;brand=Happy Pup;recycled material=true;
2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,1,Fri Jun 06 09:04:09 JST 2025,unitPrice=3.146000000000001;quantity=7.0;price=22.022000000000006;meat=Turkey;lifestage=Senior;category=dry cat food;brand=Feisty Feline;organic=true;hairball management=false;lifestyle=Indoor;
2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,1,Fri Jun 06 09:04:09 JST 2025,unitPrice=3.17;quantity=30.0;price=95.1;meat=Rabbit;lifestage=Adult;grain=Rice;category=dry dog food;brand=Wellfed;organic=false;
2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,2,Wed Jun 18 20:31:32 JST 2025,unitPrice=3.146000000000001;quantity=7.0;price=22.022000000000006;meat=Turkey;lifestage=Kitten;category=dry cat food;brand=Feisty Feline;organic=true;hairball management=false;lifestyle=Outdoor;
2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,3,Sat Jul 12 02:46:42 JST 2025,unitPrice=2.92;quantity=30.0;price=87.6;meat=Salmon;lifestage=Puppy;grain=Rice;category=dry dog food;brand=Happy Pup;organic=false;
2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,3,Sat Jul 12 02:46:42 JST 2025,unitPrice=152.4;quantity=120.0;color=multicolor (solids);price=18288.0;category=poop bags;brand=Dog Days;recycled material=true;
2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,3,Sat Jul 12 02:46:42 JST 2025,unitPrice=3.146000000000001;quantity=7.0;price=22.022000000000006;meat=Turkey;lifestage=Senior;category=dry cat food;brand=Feisty Feline;organic=true;hairball management=false;lifestyle=Outdoor;
2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,3,Sat Jul 12 02:46:42 JST 2025,unitPrice=1.73;clumping=true;quantity=14.0;material=pellets;price=24.22;category=kitty litter;brand=Feisty Feline;odor control=true;
2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,3,Sat Jul 12 02:46:42 JST 2025,unitPrice=152.4;quantity=120.0;color=designs;price=18288.0;category=poop bags;brand=Dog Days;recycled material=true;
2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,4,Mon Aug 04 22:07:34 JST 2025,unitPrice=2.77;quantity=30.0;price=83.1;meat=Lamb;lifestage=Adult;grain=Rice;category=dry dog food;brand=Happy Pup;organic=false;
```
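
For readers unfamiliar with the generated format, the following sketch shows one way such a line could be split into typed fields. It is not the parser used by the ETL job: the names `TransactionLineSketch`, `Transaction`, and `parse` are hypothetical, and the field positions (store id first, customer id fifth, transaction id eleventh, timestamp twelfth, product attributes last) are assumptions inferred from the sample output above and the column names of the ETL result.

```scala
object TransactionLineSketch {
  // Hypothetical record type; only a subset of the 13 comma-separated fields is kept.
  final case class Transaction(
      storeId: Int,
      customerId: Int,
      transactionId: Int,
      dateTime: String,
      product: Map[String, String])

  def parse(line: String): Transaction = {
    // Limit the split to 13 parts so the trailing attribute string stays in one piece.
    val fields = line.split(",", 13)
    require(fields.length == 13, s"expected 13 fields but got ${fields.length}")

    // The last field is a sequence of `key=value;` pairs.
    val product = fields(12)
      .split(";")
      .filter(_.nonEmpty)
      .flatMap { kv =>
        kv.split("=", 2) match {
          case Array(k, v) => Some(k -> v)
          case _           => None // ignore fragments without '='
        }
      }
      .toMap

    Transaction(fields(0).toInt, fields(4).toInt, fields(10).toInt, fields(11), product)
  }

  def main(args: Array[String]): Unit = {
    val sample =
      "2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,1," +
        "Fri Jun 06 09:04:09 JST 2025," +
        "unitPrice=112.8;quantity=120.0;color=multicolor (solids);price=13536.0;" +
        "category=poop bags;brand=Happy Pup;recycled material=true;"
    println(parse(sample))
  }
}
```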
Then run ETL:
```
$ spark-submit --master local[*] \
    --class org.apache.bigtop.bigpetstore.spark.etl.SparkETL \
    build/libs/bigpetstore-spark-3.5.0-SNAPSHOT-all.jar generated_data transformed_data
$ spark-shell
...
scala> spark.read.parquet("transformed_data/transactions").show()
+----------+-------------+-------+-------------------+---------+
|customerId|transactionId|storeId|           dateTime|productId|
+----------+-------------+-------+-------------------+---------+
|       499|            1|      8|2025-08-19 12:14:38|      729|
|       499|            2|      8|2025-11-24 01:47:16|       23|
|       499|            3|      8|2025-11-25 22:47:40|      729|
|       499|            4|      8|2026-02-05 12:55:25|       40|
|       499|            5|      8|2026-02-15 14:09:59|      729|
|       499|            6|      8|2026-03-16 03:34:49|       23|
|       499|            7|      8|2026-03-30 20:02:31|       23|
|       498|            4|      9|2025-06-18 18:40:40|      262|
|       498|            4|      9|2025-06-18 18:40:40|       54|
|       498|            4|      9|2025-06-18 18:40:40|      262|
|       498|            4|      9|2025-06-18 18:40:40|      257|
|       498|            4|      9|2025-06-18 18:40:40|       40|
|       498|            4|      9|2025-06-18 18:40:40|      262|
|       498|            4|      9|2025-06-18 18:40:40|      729|
|       498|            5|      9|2025-07-04 11:41:16|       54|
|       498|            5|      9|2025-07-04 11:41:16|       23|
|       498|            5|      9|2025-07-04 11:41:16|      262|
|       498|            5|      9|2025-07-04 11:41:16|       54|
|       498|            5|      9|2025-07-04 11:41:16|      257|
|       498|            6|      9|2025-07-11 08:30:43|       23|
+----------+-------------+-------+-------------------+---------+
only showing top 20 rows
```
Calculate statistics:
```
$ spark-submit --master local[*] \
    --class org.apache.bigtop.bigpetstore.spark.analytics.PetStoreStatistics \
    build/libs/bigpetstore-spark-3.5.0-SNAPSHOT-all.jar transformed_data PetStoreStats.json
$ jq .totalTransactions PetStoreStats.json
58322
```
Run recommendation:
```
$ spark-submit --master local[2] \
    --class org.apache.bigtop.bigpetstore.spark.analytics.RecommendProducts \
    build/libs/bigpetstore-spark-3.5.0-SNAPSHOT-all.jar transformed_data recommendations.json
$ jq .recommendations[0] recommendations.json
{
  "customerId": 0,
  "productIds": [
    207,
    62,
    828,
    1212,
    709
  ]
}
```
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'BIGTOP-3638. Your PR title ...')?
- [x] Make sure that newly added files do not have any licensing issues.
When in doubt refer to https://www.apache.org/licenses/