Awesome. Thanks Wes. I have now initiated the vote for both projects.
Best,
Jorge

On Sat, Jul 10, 2021 at 1:26 PM Wes McKinney <wesmck...@gmail.com> wrote:

> The process for updating the website is described on
>
> https://incubator.apache.org/guides/website.html
>
> It looks like you need to add the new entries to the index.xml file
> and then trigger a website build (which should be triggered by changes
> to SVN, but if not you can trigger one manually through Jenkins).
>
> After the new IP clearance pages are visible you should send an IP
> clearance lazy consensus vote to gene...@incubator.apache.org like
>
> https://lists.apache.org/thread.html/r319b85f0f24f9b0529865387ccfe1b2a00a16f394a48144ba25c3225%40%3Cgeneral.incubator.apache.org%3E
>
> On Sat, Jul 10, 2021 at 7:48 AM Jorge Cardoso Leitão
> <jorgecarlei...@gmail.com> wrote:
> >
> > Thanks a lot Wes,
> >
> > I am not sure how to proceed from here:
> >
> > 1. how do we generate the html from the xml? I.e.
> > https://incubator.apache.org/ip-clearance/arrow-rust-ballista.html
> > 2. how do I trigger the process to start? can I just email the
> > incubator with the proposal?
> >
> > Best,
> > Jorge
> >
> > On Mon, Jul 5, 2021 at 10:38 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > Great, thanks for the update and pushing this forward. Let us know if
> > > you need help with anything.
> > >
> > > On Sun, Jul 4, 2021 at 8:26 PM Jorge Cardoso Leitão
> > > <jorgecarlei...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Wes and Neils,
> > > >
> > > > Thank you for your feedback and offer. I have created the two .xml reports:
> > > >
> > > > http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-arrow.xml
> > > >
> > > > http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-rust-experimental-parquet.xml
> > > >
> > > > I based them on the report for Ballista. I also requested, on the PRs
> > > > [1,2], clarification with respect to every contributor's contributions to each.
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > > [1] https://github.com/apache/arrow-experimental-rs-arrow2/pull/1
> > > > [2] https://github.com/apache/arrow-experimental-rs-parquet2/pull/1
> > > >
> > > > On Mon, Jun 7, 2021 at 11:55 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > > On Sun, Jun 6, 2021 at 1:47 AM Jorge Cardoso Leitão
> > > > > <jorgecarlei...@gmail.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Thanks a lot for your feedback. I agree with all the arguments put forward,
> > > > > > including Andrew's point about the large change.
> > > > > >
> > > > > > I tried a gradual migration 4 months ago, but it was really difficult and I
> > > > > > gave up. I estimate that the work involved is half the work of writing
> > > > > > parquet2 and arrow2 in the first place. The internal dependency on ArrayData
> > > > > > (the main culprit of the unsafe) on arrow-rs is so prevalent that all core
> > > > > > components need to be re-written from scratch (IPC, FFI, IO,
> > > > > > array/transform/*, compute, SIMD). I personally do not have the motivation
> > > > > > to do it, though.
> > > > > >
> > > > > > Jed, the public API changes are small for end users. A typical migration is
> > > > > > [1]. I agree that we can further reduce the change-set by keeping legacy
> > > > > > interfaces available.
> > > > > >
> > > > > > Andy, on my machine, the current benchmarks on query 1 yield:
> > > > > >
> > > > > > type, master (ms), PR [2] for arrow2+parquet2 (ms)
> > > > > > memory (-m): 332.9, 239.6
> > > > > > load (the initial time in -m with --format parquet): 5286.0, 3043.0
> > > > > > parquet format: 1316.1, 930.7
> > > > > > tbl format: 5297.3, 5383.1
> > > > > >
> > > > > > i.e. I am observing some improvements. Queries with joins are still slower.
> > > > > > The pruning of parquet groups and pages based on stats is not yet there; I
> > > > > > am working on it.
> > > > > >
> > > > > > I agree that this should go through IP clearance. I will start this
> > > > > > process. My thinking would be to create two empty repos on apache/*, and
> > > > > > create 2 PRs from the main branches of each of my repos to those repos, and
> > > > > > only merge them once IP is cleared. Would that be a reasonable process, Wes?
> > > > >
> > > > > This sounds plenty fine to me; I'm happy to assist with the IP
> > > > > clearance process having done it several times in the past. I don't
> > > > > have an opinion about the names, but having experimental- in the name
> > > > > sounds in line with the previous discussion we had about this.
> > > > >
> > > > > > Names: arrow-experimental-rs2 and arrow-experimental-rs-parquet2, or?
> > > > > >
> > > > > > Best,
> > > > > > Jorge
> > > > > >
> > > > > > [1] https://github.com/apache/arrow-datafusion/pull/68/files#diff-2ec0d66fd16c73ff72a23d40186944591e040507c731228ad70b4e168e2a4660
> > > > > > [2] https://github.com/apache/arrow-datafusion/pull/68
> > > > > >
> > > > > > On Fri, May 28, 2021 at 5:22 AM Josh Taylor <joshuatayl...@gmail.com> wrote:
> > > > > >
> > > > > > > I played around with it. For my use case I really like the new way of
> > > > > > > writing CSVs; it's much more obvious. I love the `read_stream_metadata`
> > > > > > > function as well.
> > > > > > >
> > > > > > > I'm seeing a very slight speed improvement (~8ms) on my end. I read a
> > > > > > > bunch of files in a directory and spit out a CSV; the bottleneck is the
> > > > > > > parsing of lots of files, but it's pretty quick per file.
> > > > > > >
> > > > > > > old:
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224 bytes took 1ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144 bytes took 1ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10 17127928 bytes took 159ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11 17127144 bytes took 160ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12 17130352 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13 17128544 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14 17128664 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15 17128328 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16 17129288 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17 17131056 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18 17130344 bytes took 158ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19 17128432 bytes took 160ms
> > > > > > >
> > > > > > > new:
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_0 120224 bytes took 1ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_1 123144 bytes took 1ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_10 17127928 bytes took 157ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_11 17127144 bytes took 152ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_12 17130352 bytes took 154ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_13 17128544 bytes took 153ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_14 17128664 bytes took 154ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_15 17128328 bytes took 153ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_16 17129288 bytes took 152ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_17 17131056 bytes took 153ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_18 17130344 bytes took 155ms
> > > > > > > /home/josh/staging/019c4715-3200-48fa-0000-4105000cd71e/data_0_0_19 17128432 bytes took 153ms
> > > > > > >
> > > > > > > I'm going to chunk the dirs to speed up the reads and throw it into a
> > > > > > > par iter.
> > > > > > >
> > > > > > > On Fri, 28 May 2021 at 09:09, Josh Taylor <joshuatayl...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi!
> > > > > > > >
> > > > > > > > I've been using arrow/arrow-rs for a while now; my use case is to parse
> > > > > > > > Arrow streaming files and convert them into CSV.
> > > > > > > >
> > > > > > > > Rust has been an absolutely fantastic tool for this; the performance is
> > > > > > > > outstanding and I have had no issues using it for my use case.
> > > > > > > >
> > > > > > > > I would be happy to test out the branch and let you know what the
> > > > > > > > performance is like, as I was going to improve the current implementation
> > > > > > > > that I have for the CSV writer, as it takes a while for bigger datasets
> > > > > > > > (multi-GB).
> > > > > > > >
> > > > > > > > Josh
> > > > > > > >
> > > > > > > > On Thu, 27 May 2021 at 22:49, Jed Brown <j...@jedbrown.org> wrote:
> > > > > > > >
> > > > > > > >> Andy Grove <andygrov...@gmail.com> writes:
> > > > > > > >>
> > > > > > > >> > Looking at this purely from the DataFusion/Ballista point of view, what I
> > > > > > > >> > would be interested in would be having a branch of DF that uses arrow2 and
> > > > > > > >> > once that branch has all tests passing and can run queries with performance
> > > > > > > >> > that is at least as good as the original arrow crate, then cut over.
> > > > > > > >> >
> > > > > > > >> > However, for developers using the arrow APIs directly, I don't see an easy
> > > > > > > >> > path. We either try and gradually PR the changes in (which seems really
> > > > > > > >> > hard given that there are significant changes to APIs and internal data
> > > > > > > >> > structures) or we port some portion of the existing tests over to arrow2
> > > > > > > >> > and then make that the official crate once all tests pass.
> > > > > > > >>
> > > > > > > >> How feasible would it be to make a legacy module in arrow2 that would
> > > > > > > >> enable (some large subset of) existing arrow users to try arrow2 after
> > > > > > > >> adjusting their use statements?
> > > > > > > >> (That is, implement the public-facing
> > > > > > > >> legacy interfaces in terms of arrow2's new, safe interface.) This would
> > > > > > > >> make it easier to test with DataFusion/Ballista and external users of the
> > > > > > > >> current arrow crate, then cut over and let those packages update
> > > > > > > >> incrementally from legacy to modern arrow2.
> > > > > > > >>
> > > > > > > >> I think it would be okay to tolerate some performance degradation when
> > > > > > > >> working through these legacy interfaces, so long as there was confidence
> > > > > > > >> that modernizing the callers would recover the performance (as tests have
> > > > > > > >> been showing).
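
The legacy-module idea quoted above can be sketched roughly as follows. This is only an illustration under loud assumptions: none of these types or methods are the real arrow or arrow2 API. The names `modern::PrimitiveArray`, `legacy::Int32Array`, `from_options`, `into_modern` and `sum_non_null` are all hypothetical. The sketch shows a legacy-style interface implemented purely in terms of a newer, safe one, so that a caller would only need to adjust its `use` statement.

```rust
// Hypothetical sketch: these are NOT the real arrow / arrow2 types.

/// Stand-in for a "modern", safe array interface.
mod modern {
    /// An array of optional 32-bit integers; `None` marks a null slot.
    #[derive(Debug, Clone, PartialEq)]
    pub struct PrimitiveArray {
        values: Vec<Option<i32>>,
    }

    impl PrimitiveArray {
        pub fn from_options(values: Vec<Option<i32>>) -> Self {
            Self { values }
        }

        pub fn len(&self) -> usize {
            self.values.len()
        }

        /// Bounds-checked access: outer `None` means out of range,
        /// inner `None` means the slot is null.
        pub fn get(&self, index: usize) -> Option<Option<i32>> {
            self.values.get(index).copied()
        }
    }
}

/// Stand-in for a legacy-style interface, built only on top of `modern`.
mod legacy {
    use super::modern::PrimitiveArray;

    pub struct Int32Array {
        inner: PrimitiveArray,
    }

    impl Int32Array {
        pub fn from_options(values: Vec<Option<i32>>) -> Self {
            Self { inner: PrimitiveArray::from_options(values) }
        }

        pub fn len(&self) -> usize {
            self.inner.len()
        }

        /// Panics when `index` is out of bounds.
        pub fn is_null(&self, index: usize) -> bool {
            self.inner.get(index).expect("index out of bounds").is_none()
        }

        /// Returns 0 for null slots (an arbitrary choice for this sketch);
        /// panics when `index` is out of bounds.
        pub fn value(&self, index: usize) -> i32 {
            self.inner.get(index).expect("index out of bounds").unwrap_or(0)
        }

        /// Escape hatch for callers that are ready to modernize.
        pub fn into_modern(self) -> PrimitiveArray {
            self.inner
        }
    }
}

// A caller written against the legacy interface: only this `use` line
// would need to change to try the new implementation.
use legacy::Int32Array;

fn sum_non_null(array: &Int32Array) -> i64 {
    (0..array.len())
        .filter(|&i| !array.is_null(i))
        .map(|i| i64::from(array.value(i)))
        .sum()
}

fn main() {
    let array = Int32Array::from_options(vec![Some(1), None, Some(3)]);
    println!("sum of non-null values: {}", sum_non_null(&array));

    let modern = array.into_modern();
    println!("modern length: {}", modern.len());
}
```

In this sketch every legacy access pays for a bounds check and an `Option` unwrap, which is the kind of overhead the thread suggests could be tolerable while callers migrate; a caller that moves to the modern type via `into_modern` would no longer go through the shim.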