TysonStanley commented on PR #43634:
URL: https://github.com/apache/arrow/pull/43634#issuecomment-2295015587

   @jonkeane I'm late in adding this, but a team member sent me 
evidence that the index in data.table can cause problems when reading the parquet 
file back in (`Error: IOError: Couldn't deserialize thrift: TProtocolException: 
Exceeded size limit`) and that it inflates the size of the file (in her example from 
400MB to 2GB, with the only change being the index). See reprex below:
   
   ```r
   library(data.table)
   library(arrow)
   
   dt <- data.table(x = 1:1e8, y = round(runif(n = 1e8, min = 1, max = 5)))
   
   #Looking at rows where y == 3
   dt[y == 3,]
   
   #Creating a new variable, which is done uniformly across all rows
   #(suggesting the previous row index isn't applicable?)
   dt[, z := 1]
   
   #Save the dt
   write_parquet(dt, "example.parquet")
   gc()
   
   #Cannot open the dt
   dt_open<-read_parquet("example.parquet")
   
   #Removing the index that was created when looking at the y==3 subset
   #before saving allows the file to be opened after re-saving.
   setindex(dt, NULL)
   
   write_parquet(dt, "example2.parquet")
   dt_open<-read_parquet("example2.parquet")
   ```
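
   As a rough sketch of how one might check for a leftover index before writing, the snippet below uses data.table's `indices()` and `setindex()` on a small table. It sets the index explicitly with `setindex(dt, y)` as a stand-in for the automatic index from the reprex, and the variable names `has_index_before` and `has_index_after` are made up for illustration:

   ```r
   library(data.table)

   # Small table so the check runs quickly; the reprex above uses 1e8 rows.
   dt <- data.table(x = 1:1000, y = rep(1:5, times = 200))

   # Assumption: an explicit index stands in for the one that
   # subsetting with dt[y == 3, ] may create automatically.
   setindex(dt, y)

   # indices() returns the index names, or NULL when there are none.
   has_index_before <- !is.null(indices(dt))

   # Drop every index by reference before handing dt to write_parquet().
   setindex(dt, NULL)
   has_index_after <- !is.null(indices(dt))
   ```

   Because `setindex(dt, NULL)` modifies `dt` in place, no copy of the table is needed, which matters at the sizes in the reprex.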
   

