[jira] [Commented] (BIGTOP-1366) Updated, Richer Model for Generating Data for BigPetStore

RJ Nowling (JIRA) Mon, 29 Sep 2014 17:09:52 -0700

    [ 
https://issues.apache.org/jira/browse/BIGTOP-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152573#comment-14152573
 ]


RJ Nowling commented on BIGTOP-1366:
------------------------------------

Here's a link to the conference:
http://www.swinflow.org/confs/bdcloud2014/

You can review the Java code in the javaport branch on GitHub:
https://github.com/rnowling/bigpetstore-data-generator/tree/javaport

The Java port currently has:
* a build system with Gradle
* about 75 classes, including unit tests.  Every functional class has a 
corresponding unit test.
* ~4k lines of Java code

For the release, I need to:
* Implement about 4-5 more classes and their corresponding unit tests
* Implement the local command-line driver
* Move the simulation parameters from a class containing constants into an 
external configuration file with a Configuration class
* Run some analytics comparing the Java implementation to the Python 
implementation for correctness
* Write a Hadoop MapReduce or Spark driver to test out the public API and make 
any necessary changes

I expect the Java implementation to be available in anywhere from a few weeks 
to a couple of months, depending largely on my travel schedule and the time I 
spend finishing my Ph.D.

The design centers around 4 types of data: Stores, Customers, 
PurchasingProfiles, and Transactions.  They are generated in a pipeline of 
Stores -> Customers -> PurchasingProfiles -> Transactions.  For each type of 
data, there are simple data classes and corresponding generators which 
provide an API to the underlying logic.  The transactions and purchasing 
profiles are the most complex and computationally intensive components, so 
their generators are designed to be instantiated multiple times for 
parallelization.
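
The pipeline described above might be sketched roughly as follows.  Every class 
and method name here is hypothetical and does not reflect the actual classes in 
the javaport branch.  The sampling logic (a uniform mean purchase interval and 
exponential gaps between purchases) is only a placeholder, not the real model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Simple data classes, one per data type (all names are hypothetical).
final class Store {
    final int id;
    Store(int id) { this.id = id; }
}

final class Customer {
    final int id;
    final Store store;
    Customer(int id, Store store) { this.id = id; this.store = store; }
}

final class PurchasingProfile {
    final Customer customer;
    final double meanDaysBetweenPurchases;
    PurchasingProfile(Customer customer, double meanDaysBetweenPurchases) {
        this.customer = customer;
        this.meanDaysBetweenPurchases = meanDaysBetweenPurchases;
    }
}

final class Transaction {
    final Customer customer;
    final double day;
    Transaction(Customer customer, double day) { this.customer = customer; this.day = day; }
}

// Generators wrap the sampling logic behind a small API.
final class StoreGenerator {
    private int nextId = 0;
    Store generate() { return new Store(nextId++); }
}

final class CustomerGenerator {
    private final List<Store> stores;
    private final Random rng;
    private int nextId = 0;
    CustomerGenerator(List<Store> stores, long seed) {
        this.stores = stores;
        this.rng = new Random(seed);
    }
    // Assign each new customer to a randomly chosen store.
    Customer generate() {
        return new Customer(nextId++, stores.get(rng.nextInt(stores.size())));
    }
}

final class PurchasingProfileGenerator {
    private final Random rng;
    PurchasingProfileGenerator(long seed) { this.rng = new Random(seed); }
    // Placeholder: mean gap between purchases drawn uniformly from [1, 8) days.
    PurchasingProfile generate(Customer customer) {
        return new PurchasingProfile(customer, 1.0 + 7.0 * rng.nextDouble());
    }
}

final class TransactionGenerator {
    private final PurchasingProfile profile;
    private final Random rng;
    TransactionGenerator(PurchasingProfile profile, long seed) {
        this.profile = profile;
        this.rng = new Random(seed);
    }
    // Placeholder: exponentially distributed gaps between purchases.
    List<Transaction> generate(double simulationDays) {
        List<Transaction> result = new ArrayList<>();
        double day = nextGap();
        while (day < simulationDays) {
            result.add(new Transaction(profile.customer, day));
            day += nextGap();
        }
        return result;
    }
    private double nextGap() {
        return -profile.meanDaysBetweenPurchases * Math.log(1.0 - rng.nextDouble());
    }
}

final class Pipeline {
    // Stores -> Customers -> PurchasingProfiles -> Transactions.
    static List<Transaction> run(int storeCount, int customerCount,
                                 double simulationDays, long seed) {
        StoreGenerator storeGen = new StoreGenerator();
        List<Store> stores = new ArrayList<>();
        for (int i = 0; i < storeCount; i++) {
            stores.add(storeGen.generate());
        }
        CustomerGenerator customerGen = new CustomerGenerator(stores, seed);
        List<Transaction> all = new ArrayList<>();
        for (int i = 0; i < customerCount; i++) {
            Customer customer = customerGen.generate();
            // Separate generator instances per customer, each with its own seed,
            // so this loop body could run on different workers.
            long customerSeed = seed + 1 + customer.id;
            PurchasingProfile profile =
                new PurchasingProfileGenerator(customerSeed).generate(customer);
            all.addAll(new TransactionGenerator(profile, customerSeed).generate(simulationDays));
        }
        return all;
    }
}
```

Because the profile and transaction generators in this sketch hold no shared 
state, a Hadoop or Spark driver could create one pair per customer on each 
worker.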

I do not specify an on-disk file format -- the driver (local CLI, Hadoop, 
Spark, etc.) will be responsible for writing out the data in a format of its 
choice.
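
One way a driver could own the output format is through a small sink 
interface, sketched below.  The interface, the CSV implementation, and the 
record fields are all hypothetical and are not part of the current code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Locale;

// Hypothetical flattened record that a driver would write out.
final class TransactionRecord {
    final int customerId;
    final int storeId;
    final double day;
    TransactionRecord(int customerId, int storeId, double day) {
        this.customerId = customerId;
        this.storeId = storeId;
        this.day = day;
    }
}

// The generator library would only see this interface; each driver
// (local CLI, Hadoop, Spark) supplies its own implementation.
interface TransactionSink {
    void write(TransactionRecord record);
}

// Example implementation a local CLI driver might use: one CSV line per record.
final class CsvTransactionSink implements TransactionSink {
    private final Appendable out;

    CsvTransactionSink(Appendable out) {
        this.out = out;
    }

    @Override
    public void write(TransactionRecord r) {
        try {
            out.append(String.format(Locale.ROOT, "%d,%d,%.2f\n",
                                     r.customerId, r.storeId, r.day));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A Hadoop or Spark driver would provide a different TransactionSink without any 
change to the generators.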

I have a list of several improvements to make to the math model over the next 
6 months or so and expect the model to stabilize once those are done.  In the 
meantime, nothing will be removed from the data model, but some optional data 
may be added.


> Updated, Richer Model for Generating Data for BigPetStore 
> ----------------------------------------------------------
>
>                 Key: BIGTOP-1366
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1366
>             Project: Bigtop
>          Issue Type: Improvement
>          Components: blueprints
>    Affects Versions: backlog
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>            Priority: Minor
>   Original Estimate: 8,736h
>  Remaining Estimate: 8,736h
>
> BigPetStore uses synthetic data as the basis for its workflow.  BPS's current 
> model for generating customer data is sufficient for basic testing of the 
> Hadoop ecosystem, **but the model is very basic and lacks sufficient 
> complexity for embedding interesting patterns into the data**.  
> As a result, **more complex, scalable testing such as testing clustering 
> algorithms in Mahout on non-trivial data or multidimensional data with 
> factors influencing it** is not currently possible.
> Efforts are currently underway to incrementally improve the current model 
> (see BIGTOP-1271 and BIGTOP-1272).  
> To create a model that can incorporate **realistic, non-hierarchical 
> patterns** and input data to generate rich customer/transaction data with 
> interesting correlations will require a re-imagining of the current model and 
> its framework.
> To support the improvements to the model in BigPetStore, I have been working 
> on an **alternative ab initio model, developed from scratch**. Since the 
> development of a new model involves substantial R&D work with more 
> specialized tools (mathematical and plotting libraries), I'm doing the 
> current work outside of BPS using the iPython Notebook environment.  Due to 
> the long time frame, the model will be developed on a separate timeline to 
> prevent slowing the development of BPS.  
> Once the model has stabilized, I will begin incorporating the model into BPS 
> itself.  One option is to implement the model in Scala for clean 
> integration with **Spark**, which is likely to play an increasingly important 
> role in the Hadoop ecosystem, and thus will be an important part of 
> BigPetStore as a test/blueprint app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
