TysonStanley commented on PR #43634: URL: https://github.com/apache/arrow/pull/43634#issuecomment-2295015587
@jonkeane I'm late in adding this, but notably, a team member sent me evidence that the index in data.table can cause problems when reading the parquet file back in (`Error: IOError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit`), and it explodes the size of the file (in her example, from 400MB to 2GB, with the only change being the index). See the reprex below:

```r
library(data.table)
library(arrow)

dt <- data.table(x = c(1:1e8), y = round(runif(n = 1:1e8, min = 1, max = 5)))

# Looking at rows where y == 3
dt[y == 3, ]

# Creating a new variable, which is done uniformly across all rows
# (suggesting the previous row index isn't applicable?)
dt[, z := 1]

# Save the dt
write_parquet(dt, "example.parquet")
gc()

# Cannot open the dt
dt_open <- read_parquet("example.parquet")

# Removing indexing that was created when looking at the y==3 subset before saving allows
# the file to be opened after re-saving.
setindex(dt, NULL)
write_parquet(dt, "example2.parquet")
dt_open <- read_parquet("example2.parquet")
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org