MichaelChirico commented on PR #46878: URL: https://github.com/apache/arrow/pull/46878#issuecomment-3095283261
> > I basically had Gemini write this (on my free personal account). It did 95% of the work, from one prompt, then I tidied up the results and fixed the tests. It looks reasonably similar to the `dplyr::case_when()` code. I'm not sure arrow/Apache policy on AI-generated code. The part I understand least is the `mask` and the `Expression` class.
>
> Thanks for being transparent on the use of genAI for the code. The ASF guidance on AI-generated code is [here](https://www.apache.org/legal/generative-tooling.html).
>
> The biggest risk here is if Gemini reproduces copyrighted code or code under incompatible licenses.

My best guess is that it's probably fine in this case, given that the generated code has to conform to the existing structure for bindings and sits in the context of arrow and data.table, as opposed to a totally new project being generated from scratch. IANAL but that is basically how I'm thinking of it -- the LLM is very likely mostly mirroring the patterns in the existing nearby code.

Do you want me to force-push with a `Generated-by` commit message? (It could possibly be done on your end during the squash-and-merge step in the UI?)

> I wonder if having just one data.table function might be a little awkward / folks might be looking for more like this.

Yea, it's perfectly reasonable to expect very limited {data.table} support to be as awkward as it is welcome, and it will inevitably lead to some follow-on FRs for broader implementations. I don't have all that much bandwidth to implement/maintain, but I am happy to vibe code some more & act more as a PM here. I think the MVP scope that makes sense is to support all the {data.table} vector functions: `fcoalesce()`, `fifelse()`, `frank()`, `shift()`, `uniqueN()` come to mind as being ~straightforward.

> If we're adding more code to maintain, would we want to consider a separate package, as per dtplyr or keep it here?

This one I will leave to the maintainers -- the advantage of ballooning {arrow} further is that users like the one in the original issue get the support "for free".

FWIW by means of quantitative evidence I see:

- 10/48 {arrow} reverse imports also import {data.table} [details for code]
- O(2,000) `lang:R` files on GitHub invoking both {arrow} and {data.table} (combinations of `{arrow,data.table}::` and/or `library({arrow,data.table})`; about 1/3-1/2 the quantity for {arrow}+{dplyr}): [1](https://github.com/search?q=lang%3AR%20%2Fdata%5C.table%3A%3A%2F%20%2Farrow%3A%3A%2F&type=code) [2](https://github.com/search?q=lang%3AR+%2Fdata%5C.table%3A%3A%2F+%2F%28library%7Crequire%29%5C%28%5B%27%22%5D%3Farrow%2F&type=code) [3](https://github.com/search?q=lang%3AR+%2F%28library%7Crequire%29%5C%28%5B%27%22%5D%3Fdata%5C.table%2F+%2Farrow%3A%3A%2F&type=code) [4](https://github.com/search?q=lang%3AR+%2F%28library%7Crequire%29%5C%28%5B%27%22%5D%3Fdata%5C.table%2F+%2F%28library%7Crequire%29%5C%28%5B%27%22%5D%3Farrow%2F&type=code)

<details>

```r
rev_imports = tools::CRAN_package_db() |>
  subset(Package %in% c("arrow", "data.table"), "Reverse imports", drop=TRUE) |>
  strsplit(",\\s*")
lengths(rev_imports)
# [1] 48 1662
length(do.call(intersect, rev_imports))
# [1] 10
```

</details>