Nate: Good idea to abstract the interface one level higher. How about a docker run command? That is probably the easiest way for Linux folks to run one-off Java apps nowadays.

    docker run bigtop/bigtop-data-gen --scheme weather --size 5GB --output data-dir --etc foo --etc bar

I'm happy to curate such a docker image. I am already doing something like
this in kube for bigtop-transaction-queue, which continuously pumps data
generator outputs into a REST endpoint or file queue, so it could be
extended to support other generators.

> om> <[email protected]> wrote:
>
> Could picture at some point supporting something like this for non-jvm folk
> just looking for test/demo data:
>
> apt-get install bigtop-data-gen
> ~/ $ bigtop-data-gen --scheme weather --size 5GB --output data-dir --etc foo --etc bar
>
> -----Original Message-----
> From: jay vyas [mailto:[email protected]]
> Sent: Sunday, August 30, 2015 5:11 PM
> To: [email protected]
> Subject: Re: Proposal for "BigTop Data Generators"
>
> Hola nate. Well, here are the use cases I know of that I have used the data
> generators for.
>
> Dockerfile:
>
> (1) for testing kubernetes. For this, I just use transaction-queue docker
> file.
> (2) for testing GlusterFS small file workloads, maybe with other analytics
> tools...
>
> Maven repo:
>
> (3) Java mapreduce/ignite/spark applications, which can just add a mvn repo
> when compiling. Java developers never add jars through RPM repos.
>
> RPM/DEB packages:
>
> I could see people using an RPM/DEB data generator, and I'm not against it.
> But I simply don't know of any real world projects which *currently* need
> RPM/Deb packages, which is why I haven't bothered to propose it as a
> requirement. Nevertheless linux packages are always a welcome addition if
> someone wants to create 'em!
>
>> On Sun, Aug 30, 2015 at 4:34 PM, <[email protected]> wrote:
>>
>> Would container be in addition to deb/rpm, or instead of? If latter
>> can we do deb/rpm as base then have container either created from them
>> or directly from artifacts?
>>
>> On test usage side, seems could probably break up tests into
>> base/required and then optional/add-on tests/test-suites. Think
>> remember seeing mention of certain tests that are failing at times on
>> certain component(s) anyways in the core builds but don’t mean that
>> the build is broken, so would make sense to have some clean up around
>> those anyways.
>>
>> -----Original Message-----
>> From: RJ Nowling [mailto:[email protected]]
>> Sent: Sunday, August 30, 2015 1:11 PM
>> To: [email protected]
>> Subject: Re: Proposal for "BigTop Data Generators"
>>
>> I agree with the above. :)
>>
>> On Sun, Aug 30, 2015 at 11:19 AM, Jay Vyas <[email protected]> wrote:
>>
>>> Hi RJ.
>>>
>>> Maven repositories and docker containers for the transaction queue
>>> are good enough IMO. That will give people a way to compose them in
>>> different idioms (one for Java folks, another for broader Linux
>>> audience).
>>>
>>> I think the lib designs are fairly intuitive. I would say that we
>>> should constrain them all to being written in Java or Groovy to keep
>>> the bigtop theme of "JVM for everything" :).
>>>
>>> Any particular questions you have around technical design can be
>>> followed in a JIRA or else maybe a Readme spec that goes in a top
>>> level of the data-generators dir...
>>>
>>>> On Aug 30, 2015, at 1:51 AM, RJ Nowling <[email protected]> wrote:
>>>>
>>>> I'd like to keep this conversation going.
>>>>
>>>> So here are a few discussion points:
>>>>
>>>> 1. How do we want to make the data generators available? Maven?
>>>> RPMs and Debs?
>>>>
>>>> For now, I'm using a gradle multi-project build to easily build
>>>> and install the BPS data generators and its libraries into a local
>>>> maven repo. This makes development easy. Eventually, I would like
>>>> to post binaries through Maven for easy integration by users.
>>>> RPMs / Debs could be interesting since I use a pattern where the
>>>> data generators are libraries (to support application integration /
>>>> parallelization by the host framework) but also provide CLI drivers
>>>> for local testing.
>>>>
>>>> 2. The idea of using the data generators as part of the smoke
>>>> tests came up. Since there is concern about making the data
>>>> generators required, we could offer the blueprints (BigPetStore)
>>>> as optional smoke tests. Would that be a good compromise?
>>>>
>>>> 3. How will they be maintained?
>>>>
>>>> I'll certainly add myself to the maintainers list and will be
>>>> taking responsibility. I'm happy to have others help as well if
>>>> anyone wants to -- if not, that's cool, too.
>>>>
>>>> 4. Is anyone interested at all in discussing library APIs and designs?
>>>> What about internal interfaces and such?
>>>>
>>>> My plan was to add at least one more data generator (weather
>>>> simulator) to bigtop-data-generators in the short term. However,
>>>> given the concerns raised by Cos (more discussion needed) and Olaf
>>>> (don't want to force data generators on unsuspecting users ;) ), I
>>>> would like to reach some consensus on what people are concerned
>>>> about and solutions.
>>>>
>>>> On Thu, Aug 27, 2015 at 12:38 PM, Konstantin Boudnik
>>>> <[email protected]> wrote:
>>>>
>>>>> Fine by me. I have linked this thread to the JIRA ticket that RJ
>>>>> created, so we have a way to connect one to another ;)
>>>>>
>>>>>> On Thu, Aug 27, 2015 at 01:02PM, Olaf Flebbe wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am not confident that moving important design discussions with
>>>>>> impact to the whole project to jira is a good idea.
>>>>>>
>>>>>> In the current JIRA Traffic storm it is not easy to identify and
>>>>>> follow important tickets.
>>>>>>
>>>>>> Please keep discussions on the list or at least, please state on
>>>>>> this list which Ticket to follow ...
>>>>>>
>>>>>> Olaf
>>>>>>
>>>>>>> On 26.08.2015 at 22:56, Konstantin Boudnik <[email protected]> wrote:
>>>>>>>
>>>>>>> On Wed, Aug 26, 2015 at 10:38PM, Olaf Flebbe wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Nice to have data generators in Bigtop.
>>>>>>>>
>>>>>>>> But please do not include it in bigtop_utils, since this
>>>>>>>> package is mandatory. Not everyone needs a data generator.
>>>>>>>
>>>>>>> Yup. And let's move further design discussion to the JIRA!
>>>>>>>
>>>>>>>> Olaf
>>>>>>>>
>>>>>>>>> On 26.08.2015 at 11:25, Jay Vyas <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Publishing the jar to bigtop's maven is probably a good first
>>>>>>>>> step. Then apps can just include it as needed...?
>>>>>>>>>
>>>>>>>>> I'm not against packaging if someone wants packages for this.
>>>>>>>>> Maybe even include it in bigtop util?
>>>>>>>>>
>>>>>>>>> Let's move to jira,
>>>>>>>>>
>>>>>>>>>> On Aug 25, 2015, at 9:41 PM, Konstantin Boudnik
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> It is pretty cool indeed!
>>>>>>>>>>
>>>>>>>>>> I wonder how it needs to be structured to be:
>>>>>>>>>> - easy to access/use from other components wherever it is
>>>>>>>>>>   needed
>>>>>>>>>> - doesn't interfere with the rest of the stack
>>>>>>>>>>
>>>>>>>>>> I guess one possible way would be to implement the generator
>>>>>>>>>> as a set of maven artifacts, that could be installed/consumed
>>>>>>>>>> transparently by just declaring a dependency, e.g. as proposed
>>>>>>>>>> via top-level component.
>>>>>>>>>>
>>>>>>>>>> Another way is to have a new package like we do for
>>>>>>>>>> bigtop-utils and such.
>>>>>>>>>>
>>>>>>>>>> Perhaps this discussion should be moved to JIRA or shall we
>>>>>>>>>> continue on the dev@ ??
>>>>>>>>>>
>>>>>>>>>> Cos
>>>>>>>>>>
>>>>>>>>>>> On Sun, Aug 23, 2015 at 11:53AM, RJ Nowling wrote:
>>>>>>>>>>> Hi BigTop,
>>>>>>>>>>>
>>>>>>>>>>> I had a discussion with Jay yesterday, and we'd like to
>>>>>>>>>>> propose a new component for BigTop: BigTop Data Generators.
>>>>>>>>>>>
>>>>>>>>>>> BigTop Data Generators would consist of a common set of
>>>>>>>>>>> libraries for building data generators and three example
>>>>>>>>>>> data generators:
>>>>>>>>>>>
>>>>>>>>>>> * BigPetStore transaction generator (moved from
>>>>>>>>>>>   BigPetStore)
>>>>>>>>>>> * BigTop Bazaar -- attendee movement and interactions with
>>>>>>>>>>>   booths on a showroom floor, at a conference, or at a mall
>>>>>>>>>>> * BigTop Weatherman -- stochastic weather simulation
>>>>>>>>>>>   (temperature, wind speed, wind chill, rainfall, etc.) per
>>>>>>>>>>>   zip code. (From a model trained on NOAA historical
>>>>>>>>>>>   weather data)
>>>>>>>>>>>
>>>>>>>>>>> We believe that creating a common set of libraries will
>>>>>>>>>>> have several benefits including:
>>>>>>>>>>>
>>>>>>>>>>> * Easier for others to build their own data generators
>>>>>>>>>>> * Make data generators smaller and easier to maintain
>>>>>>>>>>> * Share improvements across the data generators
>>>>>>>>>>>
>>>>>>>>>>> More details on the libraries are below.
>>>>>>>>>>>
>>>>>>>>>>> BigPetStore will continue to focus on building and
>>>>>>>>>>> maintaining blueprints, powered by the BigTop Data
>>>>>>>>>>> Generators.
>>>>>>>>>>>
>>>>>>>>>>> Our vision is that we get all of Apache coming to BigTop
>>>>>>>>>>> for tools for building better, more comprehensive
>>>>>>>>>>> blueprints. We want to support these efforts through data
>>>>>>>>>>> generators and the initial set of blueprints we've been
>>>>>>>>>>> building.
>>>>>>>>>>>
>>>>>>>>>>> If the community is generally in support of this, I can
>>>>>>>>>>> create a top-level "bigtop-data-generators" directory and
>>>>>>>>>>> put the data generators and libraries in there.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> RJ
>>>>>>>>>>>
>>>>>>>>>>> -------
>>>>>>>>>>> Library details:
>>>>>>>>>>>
>>>>>>>>>>> So far, I've extracted the following common libraries:
>>>>>>>>>>>
>>>>>>>>>>> * Samplers -- provides classes for PDFs and various
>>>>>>>>>>>   samplers
>>>>>>>>>>> * Name generator -- data set and samplers for generating
>>>>>>>>>>>   names
>>>>>>>>>>> * Location data set -- data set and classes for US zip
>>>>>>>>>>>   codes, their GPS coordinates, median household incomes,
>>>>>>>>>>>   and population sizes
>>>>>>>>>>> * Product generator -- library for enumerating products
>>>>>>>>>>>   from a specification file. Comes with default
>>>>>>>>>>>   specifications for BigPetStore
>>>>>>>>>>>
>>>>>>>>>>> I also expect that I'll add libraries for:
>>>>>>>>>>>
>>>>>>>>>>> * Particle simulation -- customer movement in a room
>>>>>>>>>>> * Latent factor model generation -- generate latent
>>>>>>>>>>>   factors and customer weights to create something like
>>>>>>>>>>>   MovieLens data. Used in Bazaar for booth preferences and
>>>>>>>>>>>   potentially in BigPetStore for customer item preferences
>>>>>>>>>>>
>>>>>>>>>>> Most of these libraries came out of the BigPetStore data
>>>>>>>>>>> generator but the other generators have been refactored to
>>>>>>>>>>> be based off the standard set of libraries.
>
> --
> jay vyas
