Hi Animesh,

On Thu, Sep 6, 2018 at 12:23 AM Animesh Trivedi <[email protected]> wrote:
>
> Hi Wes,
>
> Nice to connect to you too. We are happy to have your input on Albis and
> Arrow. Specifically:
>
> - We understand that Arrow is not a file format, but we chose to evaluate
> it in a mix with storage formats as Arrow is designed for in-memory
> columnar storage. The "in-memory" aspect of it is closer to flash/NVMe than
> disks in terms of performance. And personally I was curious to try out
> Arrow :) We coded a simple benchmark (how fast one can materialize values)
> because anything more complicated like relational queries would bring
> complexity from the underlying SQL engine.

Right, but what you did in your benchmarks was neither in-memory nor
memory-mapped IIUC -- you are accessing the memory through synchronous
Hadoop protobuf RPCs, which deeply confounds the results (even if the
HDFS nodes are running atop NVMe). Additionally, the Arrow Java library
does not even yet support memory mapping (we do in C++), so the only way
to fairly evaluate that code right now is to run on RAM-resident data.

- Wes

>
> - Yes, I will make it clear that the performance of Arrow that is evaluated
> in the blog is for the less beaten on-heap Java path.
>
> Now coming to the interesting bit. I can help to investigate Arrow storage
> performance tuning (on HDFS or Crail). This is a good starting point. I
> will update you all on the Crail and Arrow mailing lists. Beyond
> performance, the multi-file storage model is where I am most interested. It
> will help us to explore how different file types (column groups, metadata)
> can be mapped to the different storage types (NVMe, DRAM, 3DXP) that Crail
> supports. I think this is an interesting avenue to explore.
>
> Wes and Julian - thanks for the discussion.
>
> Cheers,
> --
> Animesh
>
> On Wed, Sep 5, 2018 at 8:57 PM Julian Hyde <[email protected]> wrote:
>
> > Animesh,
> >
> > Thanks for your thoughtful response.
> >
> > I think we’re now on the same page about the opportunities for
> > collaboration. And I saw that Wes posted to this thread too. I hope you
> > find ways to make Arrow and Crail work well together.
> >
> > Julian
> >
> >
> > > On Sep 5, 2018, at 3:49 AM, Animesh Trivedi <[email protected]> wrote:
> > >
> > > Hi Julian,
> > >
> > > Thanks for posting your thoughts.
> > >
> > > [As a Crail committer]: We agree that the notion of "we" creates
> > > confusion. The Crail blog follows the trend in community projects, where
> > > a blog post falls into one of two categories. The first type is where a
> > > developer talks about recent improvements, features, performance
> > > evaluation, etc.
> > > The second type is where "a user" presents how they used the system for
> > > their use-case. The Albis blog post falls into the second category. We
> > > can (and, for future reference, should) definitely categorize and mark
> > > it clearly that way. And we would encourage anyone in the community who
> > > tries Crail to reach out to us and present their story on the Crail
> > > blog. Crail is committed to providing the best possible performance to
> > > all its users, be it Albis, Arrow, ORC, or Parquet.
> > >
> > > [As a developer of Albis and user of Crail]: I understand your sentiment
> > > regarding the format wars, and it is not the aim of Albis to establish
> > > yet another file format. Albis started as a prototype to quickly
> > > "explore" various design choices for storing relational data for a
> > > variety of scenarios with high-performance storage/networking devices -
> > > the kind of devices Crail targets. This is something that I cannot
> > > easily do with Arrow, ORC, or Parquet with HDFS (or something similar)
> > > within a reasonable effort and time-frame, as they all have already
> > > chosen certain design points and trade-offs. Crail and Albis are not
> > > tied to each other (nor preferred over other choices), though since they
> > > come from the same set of developers, I can see why the confusion
> > > arises. Having said this, I will be happy to contribute back to the
> > > Arrow community about the findings from Albis, and would appreciate any
> > > help with that. I had a brief discussion with Julien Le Dem at the last
> > > DataWorks Summit in San Jose about Albis as well. I have not done a
> > > thorough investigation of Arrow over Crail, but perhaps that is
> > > something that can be picked up now as a starting point.
> > >
> > > I hope this clarifies the confusion. We will fix the blog post.
> > >
> > > Thanks,
> > > --
> > > Animesh
> > >
> > > On Tue, Sep 4, 2018 at 9:59 PM Julian Hyde <[email protected]> wrote:
> > >
> > >> I just read the blog post [1] about Crail and file formats. (I have to
> > >> declare my interests up front: I have been a huge supporter of Apache
> > >> Arrow, and I am a PMC member. I’m speaking here as an Arrow contributor
> > >> and enthusiast, not as a mentor of Crail.)
> > >>
> > >> I am a bit troubled about the endorsement of Albis in a Crail blog
> > >> post. For example, "we have developed a new file format called Albis".
> > >> Since the blog post is not signed, I take it that "we" means the
> > >> authors of the paper [2] mentioned in the blog post. But I hope that
> > >> "we" does not mean "we as Crail committers and PMC members".
> > >>
> > >> I know that there are different forces at play if you work for a
> > >> corporation, or are a researcher, or are an idealistic open source
> > >> contributor. As a researcher, you need to invent new stuff and prove
> > >> that it is better than everything that has been done before.
> > >>
> > >> But I’ve been through the file format wars (ORC vs Parquet), driven in
> > >> large part by two competing vendors. It was sickening, and a huge waste
> > >> of effort. Please, please don’t let this happen again. If you want to
> > >> make Crail successful, you should make it absolutely clear to the
> > >> Arrow, ORC and Parquet communities that you will help to make Crail
> > >> work as well as it possibly can.
> > >>
> > >> Also, on paper Albis looks very similar to Arrow, and the performance
> > >> gap is fairly narrow. If you have found insights that would improve
> > >> Arrow, I encourage you to share them and make Arrow better.
> > >> It may be good research practice to accentuate the differences between
> > >> the two, but it’s good open source practice to find consensus between
> > >> technologies, and merge communities. There is a lot of work to be done,
> > >> and too few people to do it.
> > >>
> > >> Lastly, I know I seem to be giving mixed messages here. I do believe
> > >> that content about Crail will help drive engagement and build community
> > >> (controversial content even more so). I am delighted that the Crail
> > >> team is writing blog posts and posting them to Twitter. But be careful
> > >> not to alienate communities that could help Crail gain widespread
> > >> adoption.
> > >>
> > >> Julian
> > >>
> > >> [1] http://crail.incubator.apache.org/blog/2018/08/sql-p1.html
> > >>
> > >> [2] https://www.usenix.org/conference/atc18/presentation/trivedi
