TysonStanley commented on PR #43634:
URL: https://github.com/apache/arrow/pull/43634#issuecomment-2295015587
@jonkeane I'm late to adding this, but notably a team member sent me
evidence that a data.table index can cause problems when reading the Parquet
file back in (`Error: IOError: Couldn't deserialize thrift: TProtocolException:
Exceeded size limit`) and that it explodes the size of the file (in her example,
from 400MB to 2GB, with the only change being the index). See reprex below:
```r
library(data.table)
library(arrow)
dt <- data.table(x = 1:1e8, y = round(runif(n = 1e8, min = 1, max = 5)))
#Looking at rows where y == 3
dt[y == 3,]
#Creating a new variable, which is done uniformly across all rows
#(suggesting the previous row index isn't applicable?)
dt[, z := 1]
#Save the dt
write_parquet(dt, "example.parquet")
gc()
#Cannot open the dt
dt_open<-read_parquet("example.parquet")
#Removing indexing that was created when looking at the y==3 subset before
#saving allows the file to be opened after re-saving.
setindex(dt, NULL)
write_parquet(dt, "example2.parquet")
dt_open<-read_parquet("example2.parquet")
```
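For anyone who wants to check whether a table is carrying an index before writing it, here is a minimal sketch on a much smaller table. It assumes data.table's default auto-indexing is on (the `datatable.auto.index` option); the 1e5 row count is an arbitrary stand-in for the 1e8 rows above, and nothing here touches arrow itself.

```r
library(data.table)

# Small stand-in for the 1e8-row table in the reprex (size is an assumption)
dt <- data.table(x = 1:1e5, y = round(runif(n = 1e5, min = 1, max = 5)))

# Subsetting with == on a single column is what triggers auto-indexing
invisible(dt[y == 3, ])

# Names of any secondary indices currently attached to dt
idx_before <- indices(dt)
print(idx_before)

# Drop all secondary indices, as in the workaround above
setindex(dt, NULL)
idx_after <- indices(dt)
print(idx_after)
```

If the index should never be created in the first place, setting `options(datatable.auto.index = FALSE)` before the subset is another option, at the cost of losing the speed-up on repeated subsets.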
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.