Hi, @Ted:
> Is it possible to prune (unneeded) field(s) so that heap requirement is
> lower ?

The XmlInputFormat [0] splits the raw data into smaller chunks, which are
then processed further. I don't think I can reduce the size of the fields
(Tuple2<LongWritable, Text>). The major difference from Mahout's
XmlInputFormat is the support for compressed files, which Mahout's version
does not seem to have [1].

@Stephan, @Kurt:

> - You would rather need MORE managed memory, not less, because the sorter
> uses that.
> I think the only way is adding more managed memory

Ah, okay. It seems I misunderstood that, but I tested with up to 0.8 of a
46 GB RAM allocation anyway. Does that mean I have to scale the amount of
RAM proportionally to the dataset's size in this case? I would have
expected Flink to start caching and just slow down instead.

> - We added the "large record handler" to the sorter for exactly these use
> cases.

Okay, so spilling to disk is theoretically possible and the crashes should
not occur then?

> [...] it is thrown during the combining phase which only uses an in-memory
> sorter, which doesn't have the large record handler mechanism.

Are there ways to circumvent this restriction (the sorting step?) or to
otherwise optimize the process?

> Can you check in the code whether it is enabled? You'll have to go
> through a bit of the code to see that.

Although I'm not deeply familiar with Flink's internal source code, I'll
try my best to figure that out.

Thanks,
Sebastian

[0] http://paste.gehaxelt.in/?336f8247fa50171e#DSH0poFcVIR29X7lb98qRhUG/jrkKkUrfkUs7ECSyeE=
[1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Reading-compressed-XML-data-td10985.html
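
P.S.: For reference, these are the memory-related settings I have been
varying (the values correspond to the 0.8 / 46 GB test mentioned above;
this is only an excerpt of my flink-conf.yaml and the comments reflect my
current understanding, so please correct me if I got the semantics wrong):

# flink-conf.yaml (excerpt)
# JVM heap of each TaskManager (the ~46 GB mentioned above)
taskmanager.heap.mb: 47104
# Fraction of the free heap handed to Flink's managed memory, which the
# sorter uses (default is 0.7; I raised it to 0.8 for the test)
taskmanager.memory.fraction: 0.8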
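P.P.S.: To make the combiner question concrete, here is a stripped-down
sketch of the job shape. It is not my actual code: the path and the
per-record logic are placeholders, ToKeyedCount/SumPerKey are made-up
helper classes, and I'm assuming the XmlInputFormat from [0] can be wired
in via the mapreduce overload of HadoopInputs.readHadoopFile. The idea (if
I understand the DataSet docs correctly) is to use a plain
GroupReduceFunction that does NOT implement GroupCombineFunction, so that
no combining phase is generated and the sorting only happens in the
reduce-side sort-merger:

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.HadoopInputs;
import org.apache.flink.util.Collector;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class XmlJobSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The job's configuration would also carry the XmlInputFormat's
        // start/end tag settings; omitted here.
        Job job = Job.getInstance();

        // (offset, xml-record) pairs produced by the custom XmlInputFormat [0].
        DataSet<Tuple2<LongWritable, Text>> records = env.createInput(
                HadoopInputs.readHadoopFile(
                        new XmlInputFormat(),
                        LongWritable.class, Text.class,
                        "hdfs:///path/to/dump.xml.bz2",  // placeholder path
                        job));

        records
                // Placeholder per-record work: key each record by a size bucket.
                .map(new ToKeyedCount())
                .groupBy(0)
                // Plain GroupReduceFunction, deliberately not combinable, so
                // the combining phase (in-memory-only sorter) is skipped.
                .reduceGroup(new SumPerKey())
                .print();
    }

    /** Placeholder map: (offset, record) -> (record size in KiB, 1). */
    public static class ToKeyedCount
            implements MapFunction<Tuple2<LongWritable, Text>, Tuple2<Long, Long>> {
        @Override
        public Tuple2<Long, Long> map(Tuple2<LongWritable, Text> in) {
            return new Tuple2<>((long) in.f1.getLength() / 1024, 1L);
        }
    }

    /** Sums the counts per key; does not implement GroupCombineFunction. */
    public static class SumPerKey
            implements GroupReduceFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {
        @Override
        public void reduce(Iterable<Tuple2<Long, Long>> values,
                           Collector<Tuple2<Long, Long>> out) {
            long key = 0L;
            long sum = 0L;
            for (Tuple2<Long, Long> v : values) {
                key = v.f0;
                sum += v.f1;
            }
            out.collect(new Tuple2<>(key, sum));
        }
    }
}

Of course, dropping the combiner means every raw record travels over the
network before it is reduced, so I'm not sure it is a good trade; the
sketch is mainly meant to clarify what I mean by circumventing the
combining phase.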