Hi,

I can certainly follow the argument that different design ideas around a problem complex lead to different implementations.
My concern is a little bit different. I assume that developers are in general more interested in the problem complex than in the design. If I am correct, such projects will be competing for the same developers and might find it hard to grow.

I respect "internal competition"; it can be very fruitful. We just need to make sure that we don't split a good community into smaller communities that are too small to survive.

Just my little concern after having read the last couple of emails.

rgds
jan i.

On 29 June 2015 at 20:53, Gavin Li <lyo.ga...@gmail.com> wrote:

> Hi Andrew,
>
> I agree with you. I've updated the proposal to include a little bit more explanations about the difference with Hadoop.
>
> Purely pursuing novelty is never our interest. Instead I believe even for the same problem different design and implementation ideas can make big difference. I think that's why there are many "internal competitions" in ASF. Having looked at other systems like Ignite and Geode I believe Pistachio is still quite different in design and implementation when solving some common problems like in-memory distributed storage and co-locating computation and data.
>
> Thanks,
> Gavin Li
>
> On Fri, Jun 26, 2015 at 12:07 PM, Andrew Purtell <apurt...@apache.org> wrote:
>
> > Thanks Gavin.
> >
> > Please let me suggest that novelty is not a requirement for incubation, and a proposal doesn't need to make claims of novelty to be accepted.
> >
> > Should the proposal be accepted for incubation, you may find your new neighbors at Apache can do X where you weren't aware of it. It will be totally up to the new podling if you want to survey the landscape when figuring out how to differentiate, but I do recommend it, it may help you crystallize a community around a real difference and advantage provided by Pistachio.
> >
> > On Mon, Jun 22, 2015 at 7:54 PM, Gavin Li <lyo.ga...@gmail.com> wrote:
> >
> > > Hi Andrew,
> > >
> > > As we described more in
> > > http://yahooeng.tumblr.com/post/116291838351/pistachio-co-locate-the-data-and-compute-for ,
> > > a very common problem we saw in Hadoop use cases is we often need to persist the previous result of one map reduce job onto HDFS, then the next day we process the new data together with the previous result. Usually the most expensive part is the shuffling part where we need to join the previous data and the new data together. It's so expensive because HDFS doesn't store the data in a partitioned way. So data have to be transferred again and again in the shuffling phase. Instead, in Pistachio we do the computation right on top of the partitioned storage layer, so that the previous result is always stored in a partitioned way, so shuffling can be avoided. Expensive IO and roundtrips can thus be avoided so that much better performance can be achieved.
> > >
> > > The other difference is in Pistachio we can do computation based on in-memory storage with data replication. Different from the in-memory computation in Spark, the storage can be in-memory here.
> > >
> > > Please let me know if I'm not clear enough.
> > >
> > > Thanks,
> > > Gavin Li
> > >
> > > On Mon, Jun 22, 2015 at 7:53 PM, Andrew Purtell <apurt...@apache.org> wrote:
> > >
> > > > It was a simple question, and not meant to suggest anything one way or other regarding my opinion of this proposal.
> > > >
> > > > On Monday, June 22, 2015, John D. Ament <johndam...@apache.org> wrote:
> > > >
> > > > > On Mon, Jun 22, 2015 at 10:26 PM Andrew Purtell <apurt...@apache.org> wrote:
> > > > >
> > > > > > > Pistachio can easily embed computation to the storage layer to achieve the best data locality to improve the computation performance significantly which is an innovative model comparing with the normal ways where the storage and compute are independent to each other.
> > > > > >
> > > > > > Have you heard of something called Hadoop?
> > > > >
> > > > > Regardless of whether he has or not - what's your point? The ASF has historically not denied the entry of new projects just because their domain intersects with another project's.
> > > > >
> > > > > > On Thu, Jun 18, 2015 at 10:17 AM, Gavin Li <lyo.ga...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I want to propose project Pistachio to enter Apache Incubator.
> > > > > > >
> > > > > > > Below please find the proposal.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Gavin Li
> > > > > > >
> > > > > > > = Pistachio =
> > > > > > >
> > > > > > > == Abstract ==
> > > > > > >
> > > > > > > Pistachio is a fault-tolerant low latency distributed storage system which enables simple embedding the computation to the storage layer to achieve best data locality. It evolves from Yahoo’s global user profile storage system.
> > > > > > >
> > > > > > > == Proposal ==
> > > > > > >
> > > > > > > Pistachio is a distributed key value store system with fault tolerance and consistency guarantee.
> > > > > > > It supports multiple local storage engine including in-memory, kyoto cabinet, rocks DB etc. Pistachio is being used as the user profile storage for massive scale global ads products in Yahoo storing 10+ billion user profiles. The performance and reliability has been well proven on production.
> > > > > > >
> > > > > > > Pistachio can easily embed computation to the storage layer to achieve the best data locality to improve the computation performance significantly which is an innovative model comparing with the normal ways where the storage and compute are independent to each other.
> > > > > > >
> > > > > > > == Background ==
> > > > > > >
> > > > > > > Pistachio is originally designed and optimized for Yahoo’s large scale global open RTB(real-time bidding) use cases where latency is critical(the whole request needs to be finished within 100ms including network round trips). It stores 10+ billion user profiles in 8 data centers.
> > > > > > >
> > > > > > > Then because of the great performance and the flexibility of local storage choices, we evolved it to do distributed compute. Rich call back interfaces are added to supports easy compute directly on top of the storage system local to the data partition. This model is totally different from the traditional distributed computation model where the storage and compute are separated and independent.
> > > > > > > In the new model we found data locality can be improved significantly and lots of data access round trips can be reduced in computation, and the performance can be improved significantly.
> > > > > > >
> > > > > > > It was publicly announced in April 2015 and currently being hosted in Github.
> > > > > > >
> > > > > > > == Rationale ==
> > > > > > >
> > > > > > > As a key value store system Pistachio is unique in terms of low latency access with fault tolerance and consistency guarantee. The reliability, scalability, fault tolerance and performance has been well proven in global large scale revenue supporting production system in Yahoo.
> > > > > > >
> > > > > > > As a distributed computation system, it’s an innovative model where the compute layer is introduced on top of the storage layer natively and naturally to optimize the data locality of computation.
> > > > > > >
> > > > > > > Operating the project in “apache way” greatly aligns with the long-term vision of this project and can greatly help the development of the community.
> > > > > > >
> > > > > > > == Current Status ==
> > > > > > >
> > > > > > > Pistachio was open-sourced and announced in April 2015 and currently being hosted in Github, it was mainly being developed by the team from Yahoo and already attracted lots of external developers (20+ watches and forks on github).
> > > > > > >
> > > > > > > == Meritocracy ==
> > > > > > >
> > > > > > > We plan to build an environment following the Apache meritocracy principles.
> > > > > > > Many companies including Linkedin, GF securities, Microsoft and open source communities like deeplearning4j have already expressed interests or accepted the invitations to participate in this project.
> > > > > > >
> > > > > > > == Community ==
> > > > > > >
> > > > > > > Since the announcement of Pistachio we received lots of interests. And the concept of embedding computation to storage also got lots of recognitions. We also started to work with other communities like deeplearning4j to build more application use cases with Pistachio. We believe the community will grow fast.
> > > > > > >
> > > > > > > == Core Developers ==
> > > > > > >
> > > > > > > This project is created by Gavin Li. Core developers are currently mainly in Yahoo.
> > > > > > >
> > > > > > > == Alignment ==
> > > > > > >
> > > > > > > Pistachio depends on many Apache projects and dependencies including Kafka, Helix, Zookeeper, Curator, Apache Commons, etc.
> > > > > > >
> > > > > > > == Known Risks ==
> > > > > > >
> > > > > > > === Orphaned Products ===
> > > > > > >
> > > > > > > The risk of Pistachio being orphaned is small because Yahoo heavily invested in this system. It’s the internal storage standard for Yahoo’s global ads products and still being expanded. Migration cost from this project is very high. We are also working with external communities like deeplearning4j and other companies to expand the applications.
> > > > > > >
> > > > > > > === Inexperience with Open Source ===
> > > > > > >
> > > > > > > Core developers are experienced open source contributors in many projects including Druid, Spark, Storm, etc.
> > > > > > > Pistachio committers will be guided by the mentors with strong Apache open source project backgrounds.
> > > > > > >
> > > > > > > === Homogeneous Developers ===
> > > > > > >
> > > > > > > The initial committers include developers from several institutions including Microsoft, GF Securities, Linkedin and Yahoo.
> > > > > > >
> > > > > > > === Reliance on Salaried Developers ===
> > > > > > >
> > > > > > > We work on Pistachio on both salaried time and after hours. Many developers from other institutions already accepted the invitation to volunteer working on Pistachio.
> > > > > > >
> > > > > > > === Relationships with Other Apache Products ===
> > > > > > >
> > > > > > > As mentioned earlier, Pistachio depends on apache kafka, helix, zookeeper, curator, etc.
> > > > > > >
> > > > > > > === A Excessive Fascination with the Apache Brand ===
> > > > > > >
> > > > > > > Generating publicity is not the purpose of this proposal. We mainly want to join the ASF in order to increase our contacts and visibility in the open source world to attract great developers.
> > > > > > >
> > > > > > > == Document ==
> > > > > > >
> > > > > > > Current documentation can be found here:
> > > > > > > https://github.com/yahoo/Pistachio.
> > > > > > >
> > > > > > > == Initial source ==
> > > > > > >
> > > > > > > Initial source can be found here in the Github repo:
> > > > > > > https://github.com/yahoo/Pistachio.
> > > > > > >
> > > > > > > == External dependencies ==
> > > > > > >
> > > > > > > To the best of our knowledge, here is the list of dependencies:
> > > > > > > Rocks DB
> > > > > > > ICU4j
> > > > > > > Apache Curator
> > > > > > > netty
> > > > > > > google http client
> > > > > > > codahale.metrics
> > > > > > > apache helix
> > > > > > > apache zookeeper
> > > > > > > apache commons
> > > > > > > apache thrift
> > > > > > > apache kafka
> > > > > > > kyoto cabinet (GNU GPL)
> > > > > > > google protocol buffer
> > > > > > > kryo
> > > > > > > slf4j
> > > > > > >
> > > > > > > To the best of our knowledge, except kyoto cabinet others are all distributed under Apache compatible licenses:
> > > > > > > BSD
> > > > > > > ICU
> > > > > > > Apache License 2.0
> > > > > > > MIT
> > > > > > >
> > > > > > > Kytoto cabinet is under GNU GPL, but it is not a hard necessary dependency to Pistachio, it’s an optional pluggable storage engine. It’s designed in the way that it’s totally plugable and very loosely coupled. We can easily remove it in graduation.
> > > > > > >
> > > > > > > == Required Resources ==
> > > > > > >
> > > > > > > Mailing Lists
> > > > > > >
> > > > > > > pistachio-user
> > > > > > > pistachio-dev
> > > > > > > pistachio-commits
> > > > > > > pistachio-private (for private PMC discussions)
> > > > > > >
> > > > > > > Git
> > > > > > >
> > > > > > > The Pistachio team prefers Git for source version control: git://git.apache.org/pistachio
> > > > > > >
> > > > > > > Issue Tracking
> > > > > > >
> > > > > > > JIRA Pistachio (PISTACHIO)
> > > > > > >
> > > > > > > Other Resources
> > > > > > >
> > > > > > > Jenkins continuous integration testing
> > > > > > >
> > > > > > > == Initial Committers ==
> > > > > > >
> > > > > > > Gavin Li <lyo.gavin at gmail dot com>
> > > > > > > Lie Yang <lyang at yahoo-inc dot com>
> > > > > > > Jay Kim <pitecus at yahoo-inc dot com>
> > > > > > > Flavio Junqueira <fpj at apache dot org>
> > > > > > > Chihong Liang <chihong.liang at gmail dot com>
> > > > > > > Yong Liu <ly7110 at gmail dot com>
> > > > > > > Shengwu Yang <yangshengwu at gmail dot com>
> > > > > > >
> > > > > > > == Affiliations ==
> > > > > > >
> > > > > > > Gavin Li - Yahoo
> > > > > > > Flavio Junqueira - Microsoft
> > > > > > > Chihong Liang - GF securities
> > > > > > > Yong Liu - Yingmi Asset Management Corp.
> > > > > > > Lie Yang - Yahoo
> > > > > > > Jay Kim - Yahoo
> > > > > > > Shengwu Yang - Linkedin China
> > > > > > >
> > > > > > > == Sponsors ==
> > > > > > >
> > > > > > > === Champion ===
> > > > > > >
> > > > > > > Flavio Junqueira <fpj at apache dot org>
> > > > > > >
> > > > > > > === Nominated Mentors ===
> > > > > > >
> > > > > > > === Sponsoring Entity ===
> > > > > > >
> > > > > > > The Apache Incubator
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > >
> > > > > > - Andy
> > > > > >
> > > > > > Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > > - Andy
> > > >
> > > > Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
> >
> > --
> > Best regards,
> >
> > - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)