sekikn opened a new pull request, #1350:
URL: https://github.com/apache/bigtop/pull/1350

   
   ### Description of PR
   
   This PR improves maintainability and readability by replacing the RDD API and spark.mllib with the DataFrame API and the spark.ml library.
   
   * Replace most of the RDD-based logic with equivalent DataFrame-based logic
   * Replace spark.mllib.recommendation.ALS with spark.ml.recommendation.ALS
   * Adopt Parquet instead of SequenceFile as the intermediate file format
   * Fix indentation and spacing by applying a formatter
   
   This PR supersedes #1343.
   
   ### How was this patch tested?
   
   Build bigtop-data-generator and install it into the local Maven repository:
   
   ```
   $ cd bigtop-data-generators
   $ ../gradlew clean publishToMavenLocal
   $ cd -
   ```
   
   Build bps-spark:
   
   ```
   $ cd bigtop-bigpetstore/bigpetstore-spark
   $ ../../gradlew clean shadowJar
   ```
   
   Run the steps described in bigtop-bigpetstore/bigpetstore-spark/README.md. Generate test data first:
   
   ```
   $ spark-submit --master local[*] --class org.apache.bigtop.bigpetstore.spark.generator.SparkDriver build/libs/bigpetstore-spark-3.5.0-SNAPSHOT-all.jar generated_data 10 1000 365.0 345
   $ head generated_data/transactions/part-00000
   2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,1,Fri Jun 06 09:04:09 JST 2025,unitPrice=112.8;quantity=120.0;color=multicolor (solids);price=13536.0;category=poop bags;brand=Happy Pup;recycled material=true;
   2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,1,Fri Jun 06 09:04:09 JST 2025,unitPrice=3.146000000000001;quantity=7.0;price=22.022000000000006;meat=Turkey;lifestage=Senior;category=dry cat food;brand=Feisty Feline;organic=true;hairball management=false;lifestyle=Indoor;
   2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,1,Fri Jun 06 09:04:09 JST 2025,unitPrice=3.17;quantity=30.0;price=95.1;meat=Rabbit;lifestage=Adult;grain=Rice;category=dry dog food;brand=Wellfed;organic=false;
   2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,2,Wed Jun 18 20:31:32 JST 2025,unitPrice=3.146000000000001;quantity=7.0;price=22.022000000000006;meat=Turkey;lifestage=Kitten;category=dry cat food;brand=Feisty Feline;organic=true;hairball management=false;lifestyle=Outdoor;
   2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,3,Sat Jul 12 02:46:42 JST 2025,unitPrice=2.92;quantity=30.0;price=87.6;meat=Salmon;lifestage=Puppy;grain=Rice;category=dry dog food;brand=Happy Pup;organic=false;
   2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,3,Sat Jul 12 02:46:42 JST 2025,unitPrice=152.4;quantity=120.0;color=multicolor (solids);price=18288.0;category=poop bags;brand=Dog Days;recycled material=true;
   2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,3,Sat Jul 12 02:46:42 JST 2025,unitPrice=3.146000000000001;quantity=7.0;price=22.022000000000006;meat=Turkey;lifestage=Senior;category=dry cat food;brand=Feisty Feline;organic=true;hairball management=false;lifestyle=Outdoor;
   2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,3,Sat Jul 12 02:46:42 JST 2025,unitPrice=1.73;clumping=true;quantity=14.0;material=pellets;price=24.22;category=kitty litter;brand=Feisty Feline;odor control=true;
   2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,3,Sat Jul 12 02:46:42 JST 2025,unitPrice=152.4;quantity=120.0;color=designs;price=18288.0;category=poop bags;brand=Dog Days;recycled material=true;
   2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,4,Mon Aug 04 22:07:34 JST 2025,unitPrice=2.77;quantity=30.0;price=83.1;meat=Lamb;lifestage=Adult;grain=Rice;category=dry dog food;brand=Happy Pup;organic=false;
   ```
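
For reviewers who want to inspect the generated records by hand, the following is a minimal Python sketch (not part of this patch, which is written in Scala) that splits one line from the `head` output into its leading fields and its `key=value;` product attributes. The function name `parse_transaction_line` and the constant `LEADING_FIELDS` are hypothetical, and the assumption that exactly 12 comma-separated fields precede the product string is inferred only from the sample lines. The first, fifth and eleventh fields look like the store, customer and transaction IDs that appear in the ETL output, but that mapping is also an assumption.

```python
from typing import Dict, List, Tuple

# Number of comma-separated fields that precede the product attribute string
# in the sample lines (assumption: this count is the same for every record).
LEADING_FIELDS = 12


def parse_transaction_line(line: str) -> Tuple[List[str], Dict[str, str]]:
    """Split one generated line into its leading fields and product attributes."""
    # Split at most LEADING_FIELDS times so the product string stays in one piece.
    *leading, product = line.rstrip("\n").split(",", LEADING_FIELDS)
    attributes: Dict[str, str] = {}
    for pair in product.split(";"):
        if not pair:  # the sample lines end with a trailing ';'
            continue
        key, _, value = pair.partition("=")
        attributes[key] = value
    return leading, attributes


# First record from the `head` output, rejoined onto one line.
sample = (
    "2,20746,Suitland,MD,999,Coreen,Kipling,20003,Washington,DC,1,"
    "Fri Jun 06 09:04:09 JST 2025,"
    "unitPrice=112.8;quantity=120.0;color=multicolor (solids);"
    "price=13536.0;category=poop bags;brand=Happy Pup;recycled material=true;"
)
leading, attributes = parse_transaction_line(sample)
print(leading[0], leading[4], leading[10])  # first, fifth and eleventh fields
print(attributes["category"], attributes["price"])
```

Limiting the number of splits keeps the sketch robust if a product attribute value ever contains a comma, although none of the sample lines do.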
   
   Then run ETL:
   
   ```
   $ spark-submit --master local[*] --class org.apache.bigtop.bigpetstore.spark.etl.SparkETL build/libs/bigpetstore-spark-3.5.0-SNAPSHOT-all.jar generated_data transformed_data
   $ spark-shell 
   
   ...
   
   scala> spark.read.parquet("transformed_data/transactions").show()
   +----------+-------------+-------+-------------------+---------+
   |customerId|transactionId|storeId|           dateTime|productId|
   +----------+-------------+-------+-------------------+---------+
   |       499|            1|      8|2025-08-19 12:14:38|      729|
   |       499|            2|      8|2025-11-24 01:47:16|       23|
   |       499|            3|      8|2025-11-25 22:47:40|      729|
   |       499|            4|      8|2026-02-05 12:55:25|       40|
   |       499|            5|      8|2026-02-15 14:09:59|      729|
   |       499|            6|      8|2026-03-16 03:34:49|       23|
   |       499|            7|      8|2026-03-30 20:02:31|       23|
   |       498|            4|      9|2025-06-18 18:40:40|      262|
   |       498|            4|      9|2025-06-18 18:40:40|       54|
   |       498|            4|      9|2025-06-18 18:40:40|      262|
   |       498|            4|      9|2025-06-18 18:40:40|      257|
   |       498|            4|      9|2025-06-18 18:40:40|       40|
   |       498|            4|      9|2025-06-18 18:40:40|      262|
   |       498|            4|      9|2025-06-18 18:40:40|      729|
   |       498|            5|      9|2025-07-04 11:41:16|       54|
   |       498|            5|      9|2025-07-04 11:41:16|       23|
   |       498|            5|      9|2025-07-04 11:41:16|      262|
   |       498|            5|      9|2025-07-04 11:41:16|       54|
   |       498|            5|      9|2025-07-04 11:41:16|      257|
   |       498|            6|      9|2025-07-11 08:30:43|       23|
   +----------+-------------+-------+-------------------+---------+
   only showing top 20 rows
   ```
   
   Calculate statistics:
   
   ```
   $ spark-submit --master local[*] --class org.apache.bigtop.bigpetstore.spark.analytics.PetStoreStatistics build/libs/bigpetstore-spark-3.5.0-SNAPSHOT-all.jar transformed_data PetStoreStats.json
   $ jq .totalTransactions PetStoreStats.json
   58322
   ```
   
   Run recommendation:
   
   ```
   $ spark-submit --master local[2] --class org.apache.bigtop.bigpetstore.spark.analytics.RecommendProducts build/libs/bigpetstore-spark-3.5.0-SNAPSHOT-all.jar transformed_data recommendations.json
   $ jq .recommendations[0] recommendations.json
   {
     "customerId": 0,
     "productIds": [
       207,
       62,
       828,
       1212,
       709
     ]
   }
   ```
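
To check the output beyond the first entry, the JSON shape shown by `jq` (a top-level `recommendations` array of objects with `customerId` and `productIds`) can be indexed by customer. This is a minimal Python sketch and not part of the patch; the helper name `recommendations_by_customer` is hypothetical, and the inline sample only mirrors the single entry shown above rather than the full file.

```python
import json
from typing import Dict, List

# Inline sample mirroring the shape of the jq output; a real check would
# load recommendations.json instead.
SAMPLE = """
{"recommendations": [
  {"customerId": 0, "productIds": [207, 62, 828, 1212, 709]}
]}
"""


def recommendations_by_customer(document: dict) -> Dict[int, List[int]]:
    """Index the 'recommendations' array by customerId."""
    return {
        entry["customerId"]: entry["productIds"]
        for entry in document["recommendations"]
    }


lookup = recommendations_by_customer(json.loads(SAMPLE))
print(lookup[0])
```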
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'BIGTOP-3638. Your PR title ...')?
   - [x] Make sure that newly added files do not have any licensing issues. When in doubt, refer to https://www.apache.org/licenses/


