MichaelChirico commented on PR #46878:
URL: https://github.com/apache/arrow/pull/46878#issuecomment-3095283261

   > > I basically had Gemini write this (on my free personal account). It did 
95% of the work, from one prompt, then I tidied up the results and fixed the 
tests. It looks reasonably similar to the `dplyr::case_when()` code. I'm not 
sure arrow/Apache policy on AI-generated code. The part I understand least is 
the `mask` and the `Expression` class.
   > 
   > Thanks for being transparent on the use of genAI for the code. The ASF 
guidance on AI-generated code is 
[here](https://www.apache.org/legal/generative-tooling.html).
   > 
   > The biggest risk here is if Gemini reproduces copyrighted code or code 
under incompatible licenses. My best guess is that it's probably fine in this 
case given that the generated code has to conform to the existing structure for 
bindings and in the context of arrow and data.table, as opposed to a totally 
new project being generated from scratch.
   
   IANAL but that is basically how I'm thinking of it -- the LLM is very likely 
mostly mirroring the patterns in the existing nearby code. Do you want me to 
force-push with a `Generated-by` commit message? (It could possibly be done on 
your end during the squash-and-merge step in the UI?)
   
   >  I wonder if having just one data.table function might be a little awkward 
/ folks might be looking for more like this.
   
   Yea, it's perfectly reasonable to expect very limited {data.table} support 
to be as awkward as it is welcome, and it will inevitably lead to some 
follow-on FRs for broader implementations.
   
   I don't have all that much bandwidth to implement/maintain, but I am happy 
to vibe code some more & act more as a PM here. I think the MVP scope that 
makes sense is to support all the {data.table} vector functions: `fcoalesce()`, 
`fifelse()`, `frank()`, `shift()`, `uniqueN()` come to mind as being 
~straightforward.
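   
   For reference, a minimal sketch of what these five functions do on plain 
vectors in {data.table} itself, independent of any arrow binding. The input 
vectors `x` and `y` are made up for illustration:
   
   ```r
   library(data.table)

   # Illustrative inputs only (not taken from the PR or its tests)
   x = c(NA, 2L, NA, 4L)
   y = c(1L, 5L, NA, 7L)

   fcoalesce(x, y)              # element-wise first non-missing value
   fifelse(y > 3L, "hi", "lo")  # type-strict vectorized if/else; NA test stays NA
   frank(c(10, 30, 20, 30))     # ranks; ties get the average rank by default
   shift(y, n = 1L)             # lag by one position, padding with NA
   uniqueN(y)                   # count of distinct values (NA counts as one)
   ```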
   
   > If we're adding more code to maintain, would we want to consider a 
separate package, as per dtplyr or keep it here?
   
   This one I will leave to the maintainers -- the advantage of ballooning 
{arrow} further is that users like the one in the original issue get the 
support "for free". FWIW by means of quantitative evidence I see:
   
    - 10/48 {arrow} reverse imports also import {data.table} (code in the 
details block below)
    - O(2,000) `lang:R` files on GitHub invoking both {arrow} and {data.table} 
(combinations of `{arrow,data.table}::` and/or `library({arrow,data.table})`), 
which is about 1/3-1/2 the quantity for {arrow}+{dplyr}: 
[1](https://github.com/search?q=lang%3AR%20%2Fdata%5C.table%3A%3A%2F%20%2Farrow%3A%3A%2F&type=code)
 
[2](https://github.com/search?q=lang%3AR+%2Fdata%5C.table%3A%3A%2F+%2F%28library%7Crequire%29%5C%28%5B%27%22%5D%3Farrow%2F&type=code)
 
[3](https://github.com/search?q=lang%3AR+%2F%28library%7Crequire%29%5C%28%5B%27%22%5D%3Fdata%5C.table%2F+%2Farrow%3A%3A%2F&type=code)
 
[4](https://github.com/search?q=lang%3AR+%2F%28library%7Crequire%29%5C%28%5B%27%22%5D%3Fdata%5C.table%2F+%2F%28library%7Crequire%29%5C%28%5B%27%22%5D%3Farrow%2F&type=code)
   
   <details>
   
   ```r
   rev_imports = tools::CRAN_package_db() |>
     subset(Package %in% c("arrow", "data.table"), "Reverse imports",
            drop=TRUE) |>
     strsplit(",\\s*")
   lengths(rev_imports)
   # [1]   48 1662
   length(do.call(intersect, rev_imports))
   # [1] 10
   ```
   
   </details>


