Nate: Good idea to abstract the interface one level higher....

How about a docker run command? That is probably the easiest way for Linux
folks to run one-off Java apps nowadays.

docker run bigtop/bigtop-data-gen --scheme weather --size 5GB --output data-dir \
  --etc foo --etc bar

I'm happy to curate such a Docker image. I'm already doing something like this
in kube for bigtop-transaction-queue, which continuously pumps data generator
output into a REST endpoint or a file queue, so it could be extended to
support other generators.
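
For what it's worth, here is a rough sketch of what the Java entry point behind
such an image might look like. Everything in it is an assumption for
illustration: the class name DataGenCli, the helpers parseArgs and parseSize,
the default values, and the choice of binary units (1 GB = 1024^3 bytes) are
all hypothetical and not part of any existing Bigtop code. It only parses the
flags from the command above and does not generate any data.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical CLI driver: parses "--key value" flags such as
// --scheme weather --size 5GB --output data-dir. Not an existing Bigtop class.
final class DataGenCli {

    // Collect "--key value" pairs in order. If a key repeats
    // (e.g. --etc foo --etc bar), the last value wins in this sketch.
    static Map<String, String> parseArgs(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("--") || i + 1 >= args.length) {
                throw new IllegalArgumentException(
                        "expected --key value, got: " + args[i]);
            }
            opts.put(args[i].substring(2), args[i + 1]);
        }
        return opts;
    }

    // Convert a size such as "5GB" into bytes.
    // Assumption: binary units, so 1 KB = 1024 bytes, 1 GB = 1024^3 bytes.
    static long parseSize(String text) {
        String s = text.trim().toUpperCase(Locale.ROOT);
        String[] units = {"KB", "MB", "GB", "TB"};
        long factor = 1024L;
        for (String unit : units) {
            if (s.endsWith(unit)) {
                String number = s.substring(0, s.length() - unit.length()).trim();
                return Long.parseLong(number) * factor;
            }
            factor *= 1024L;
        }
        // Plain byte counts, with or without a trailing "B".
        if (s.endsWith("B")) {
            s = s.substring(0, s.length() - 1).trim();
        }
        return Long.parseLong(s);
    }

    public static void main(String[] args) {
        Map<String, String> opts = parseArgs(args);
        // Defaults below are placeholders chosen for this sketch.
        String scheme = opts.getOrDefault("scheme", "weather");
        long bytes = parseSize(opts.getOrDefault("size", "1MB"));
        String output = opts.getOrDefault("output", "data-dir");
        System.out.printf("scheme=%s bytes=%d output=%s%n", scheme, bytes, output);
    }
}
```

A real driver would dispatch on the scheme to one of the generator libraries
and write to the output directory; the same class could serve both as the
container entry point and as a local CLI.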


> <[email protected]> wrote:
> 
> I could picture at some point supporting something like this for non-JVM
> folk who are just looking for test/demo data:
> 
> apt-get install bigtop-data-gen
> ~/ $ bigtop-data-gen --scheme weather --size 5GB --output data-dir --etc  foo 
> --etc bar
> 
> 
> 
> -----Original Message-----
> From: jay vyas [mailto:[email protected]] 
> Sent: Sunday, August 30, 2015 5:11 PM
> To: [email protected]
> Subject: Re: Proposal for "BigTop Data Generators"
> 
> Hola Nate.  Well, here are the use cases I know of where I have used the
> data generators.
> 
> Dockerfile:
> 
> (1) For testing Kubernetes.  For this, I just use the transaction-queue
> Dockerfile.
> (2) For testing GlusterFS small-file workloads, maybe with other analytics
> tools...
> 
> Maven repo:
> 
> (3) Java MapReduce/Ignite/Spark applications, which can just add a mvn repo
> when compiling.  Java developers never add jars through RPM repos.
> 
> RPM/DEB packages:
> 
> I could see people using an RPM/DEB data generator, and I'm not against it.
> But I simply don't know of any real-world projects which *currently* need
> RPM/DEB packages, which is why I haven't bothered to propose it as a
> requirement.  Nevertheless, Linux packages are always a welcome addition if
> someone wants to create them!
> 
> 
> 
> 
>> On Sun, Aug 30, 2015 at 4:34 PM, <[email protected]> wrote:
>> 
>> Would the container be in addition to deb/rpm, or instead of them?  If
>> the latter, can we do deb/rpm as the base and then have the container
>> created either from them or directly from the artifacts?
>> 
>> On the test usage side, it seems we could break up the tests into
>> base/required and optional/add-on tests or test suites.  I think I
>> remember seeing mention of certain tests that fail at times on
>> certain components in the core builds without meaning that the build
>> is broken, so it would make sense to have some cleanup around those
>> anyway.
>> 
>> -----Original Message-----
>> From: RJ Nowling [mailto:[email protected]]
>> Sent: Sunday, August 30, 2015 1:11 PM
>> To: [email protected]
>> Subject: Re: Proposal for "BigTop Data Generators"
>> 
>> I agree with the above. :)
>> 
>> On Sun, Aug 30, 2015 at 11:19 AM, Jay Vyas 
>> <[email protected]>
>> wrote:
>> 
>>> Hi RJ.
>>> 
>>> Maven repositories and docker containers for the transaction queue
>>> are good enough IMO.  That will give people a way to compose them in
>>> different idioms (one for Java folks, another for a broader Linux
>>> audience).
>>> 
>>> I think the lib designs are fairly intuitive.  I would say that we 
>>> should constrain them all to being written in Java or Groovy to keep 
>>> the bigtop theme of "JVM for everything" :).
>>> 
>>> Any particular questions you have around technical design can be
>>> followed up in a JIRA, or else maybe in a README spec that goes in
>>> the top level of the data-generators dir...
>>> 
>>>> On Aug 30, 2015, at 1:51 AM, RJ Nowling <[email protected]> wrote:
>>>> 
>>>> I'd like to keep this conversation going.
>>>> 
>>>> So here are a few discussion points:
>>>> 
>>>> 1. How do we want to make the data generators available?  Maven?
>>>> RPMs and Debs?
>>>> 
>>>> For now, I'm using a gradle multi-project build to easily build
>>>> and install the BPS data generators and their libraries into a
>>>> local maven repo.  This makes development easy.  Eventually, I
>>>> would like to post binaries through Maven for easy integration by
>>>> users.  RPMs / Debs could be interesting since I use a pattern
>>>> where the data generators are libraries (to support application
>>>> integration / parallelization by the host framework) but also
>>>> provide CLI drivers for local testing.
>>>> 
>>>> 2.  The idea of using the data generators as part of the smoke 
>>>> tests came up.  Since there is concern about making the data 
>>>> generators required, we could offer the blueprints (BigPetStore) 
>>>> as optional smoke tests.  Would that be a good compromise?
>>>> 
>>>> 3.  How will they be maintained?
>>>> 
>>>> I'll certainly add myself to the maintainers list and will be
>>>> taking responsibility.  I'm happy to have others help as well if
>>>> anyone wants to -- if not, that's cool, too.
>>>> 
>>>> 4. Is anyone interested at all in discussing library APIs and designs?
>>>> What about internal interfaces and such?
>>>> 
>>>> 
>>>> My plan was to add at least one more data generator (weather
>>>> simulator) to bigtop-data-generators in the short term.  However,
>>>> given the concerns raised by Cos (more discussion needed) and Olaf
>>>> (don't want to force data generators on unsuspecting users ;) ), I
>>>> would like to reach some consensus on what people are concerned
>>>> about and solutions.
>>>> 
>>>> On Thu, Aug 27, 2015 at 12:38 PM, Konstantin Boudnik
>>>> <[email protected]> wrote:
>>>> 
>>>>> Fine by me. I have linked this thread to the JIRA ticket that RJ
>>>>> created, so we have a way to connect one to another ;)
>>>>> 
>>>>>> On Thu, Aug 27, 2015 at 01:02PM, Olaf Flebbe wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I am not confident that moving important design discussions with
>>>>>> impact to the whole project to JIRA is a good idea.
>>>>>> 
>>>>>> In the current JIRA traffic storm it is not easy to identify and
>>>>>> follow important tickets.
>>>>>> 
>>>>>> Please keep discussions on the list or at least, please state on
>>>>>> this list which ticket to follow ...
>>>>>> 
>>>>>> Olaf
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 26.08.2015 at 22:56, Konstantin Boudnik <[email protected]> wrote:
>>>>>>> 
>>>>>>> On Wed, Aug 26, 2015 at 10:38PM, Olaf Flebbe wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Nice to have data generators in Bigtop.
>>>>>>>> 
>>>>>>>> But please do not include it in bigtop_utils, since this
>>>>>>>> package is mandatory. Not everyone needs a data generator.
>>>>>>> 
>>>>>>> Yup. And let's move further design discussion to the JIRA!
>>>>>>> 
>>>>>>>> Olaf
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 26.08.2015 at 11:25, Jay Vyas <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Publishing the jar to Bigtop's Maven is probably a good first
>>>>>>>>> step.  Then apps can just include it as needed...?
>>>>>>>>> 
>>>>>>>>> I'm not against packaging if someone wants packages for this.
>>>>>>>>> Maybe even include it in bigtop util?
>>>>>>>>> 
>>>>>>>>> Let's move to JIRA.
>>>>>>>>> 
>>>>>>>>>> On Aug 25, 2015, at 9:41 PM, Konstantin Boudnik
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>> It is pretty cool indeed!
>>>>>>>>>> 
>>>>>>>>>> I wonder how it needs to be structured so that it:
>>>>>>>>>> - is easy to access/use from other components wherever it is
>>>>>>>>>>   needed
>>>>>>>>>> - doesn't interfere with the rest of the stack
>>>>>>>>>> 
>>>>>>>>>> I guess one possible way would be to implement the generator
>>>>>>>>>> as a set of Maven artifacts that could be installed/consumed
>>>>>>>>>> transparently by just declaring a dependency, e.g. as
>>>>>>>>>> proposed via a top-level component.
>>>>>>>>>> 
>>>>>>>>>> Another way is to have a new package, like we do for
>>>>>>>>>> bigtop-utils and such.
>>>>>>>>>> 
>>>>>>>>>> Perhaps this discussion should be moved to JIRA, or shall we
>>>>>>>>>> continue on dev@?
>>>>>>>>>> 
>>>>>>>>>> Cos
>>>>>>>>>> 
>>>>>>>>>>> On Sun, Aug 23, 2015 at 11:53AM, RJ Nowling wrote:
>>>>>>>>>>> Hi BigTop,
>>>>>>>>>>> 
>>>>>>>>>>> I had a discussion with Jay yesterday, and we'd like to
>>>>>>>>>>> propose a new component for BigTop: BigTop Data Generators.
>>>>>>>>>>> 
>>>>>>>>>>> BigTop Data Generators would consist of a common set of
>>>>>>>>>>> libraries for building data generators and three example
>>>>>>>>>>> data generators:
>>>>>>>>>>> 
>>>>>>>>>>> * BigPetStore transaction generator (moved from BigPetStore)
>>>>>>>>>>> * BigTop Bazaar -- attendee movement and interactions with
>>>>>>>>>>>   booths on a showroom floor, at a conference, or at a mall
>>>>>>>>>>> * BigTop Weatherman -- stochastic weather simulation
>>>>>>>>>>>   (temperature, wind speed, wind chill, rainfall, etc.) per
>>>>>>>>>>>   zip code.  (From a model trained on NOAA historical
>>>>>>>>>>>   weather data)
>>>>>>>>>>> 
>>>>>>>>>>> We believe that creating a common set of libraries will have
>>>>>>>>>>> several benefits including:
>>>>>>>>>>> 
>>>>>>>>>>> * Easier for others to build their own data generators
>>>>>>>>>>> * Make data generators smaller and easier to maintain
>>>>>>>>>>> * Share improvements across the data generators
>>>>>>>>>>> 
>>>>>>>>>>> More details on the libraries are below.
>>>>>>>>>>> 
>>>>>>>>>>> BigPetStore will continue to focus on building and
>>>>>>>>>>> maintaining blueprints, powered by the BigTop Data Generators.
>>>>>>>>>>> 
>>>>>>>>>>> Our vision is that we get all of Apache coming to BigTop for
>>>>>>>>>>> tools for building better, more comprehensive blueprints.  We
>>>>>>>>>>> want to support these efforts through data generators and the
>>>>>>>>>>> initial set of blueprints we've been building.
>>>>>>>>>>> 
>>>>>>>>>>> If the community is generally in support of this, I can
>>>>>>>>>>> create a top-level "bigtop-data-generators" directory and put
>>>>>>>>>>> the data generators and libraries in there.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks!
>>>>>>>>>>> 
>>>>>>>>>>> RJ
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> -------
>>>>>>>>>>> Library details:
>>>>>>>>>>> 
>>>>>>>>>>> So far, I've extracted the following common libraries:
>>>>>>>>>>> 
>>>>>>>>>>> * Samplers -- provides classes for PDFs and various samplers
>>>>>>>>>>> * Name generator -- data set and samplers for generating names
>>>>>>>>>>> * Location data set -- data set and classes for US zip codes,
>>>>>>>>>>>   their GPS coordinates, median household incomes, and
>>>>>>>>>>>   population sizes
>>>>>>>>>>> * Product generator -- library for enumerating products from
>>>>>>>>>>>   a specification file.  Comes with default specifications
>>>>>>>>>>>   for BigPetStore
>>>>>>>>>>> I also expect that I'll add libraries for:
>>>>>>>>>>> 
>>>>>>>>>>>  * Particle simulation -- customer movement in a room
>>>>>>>>>>>  * Latent factor model generation -- generate latent factors
>>>>>>>>>>>    and customer weights to create something like MovieLens
>>>>>>>>>>>    data.  Used in Bazaar for booth preferences and
>>>>>>>>>>>    potentially in BigPetStore for customer item preferences
>>>>>>>>>>> 
>>>>>>>>>>> Most of these libraries came out of the BigPetStore data
>>>>>>>>>>> generator but the other generators have been refactored to be
>>>>>>>>>>> based off the standard set of libraries.
> 
> 
> --
> jay vyas
> 
