Hi,

@Ted:

> Is it possible to prune (unneeded) field(s) so that heap requirement is
> lower ?

The XmlInputFormat [0] splits the raw data into smaller chunks, which
are then processed further. I don't think I can reduce the size of the
fields (Tuple2<LongWritable, Text>). The major difference to Mahout's
XmlInputFormat is the compressed file support, which Mahout's version
does not seem to have [1].
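
For context, the input side of the job looks roughly like the sketch
below. It is simplified: XmlInputFormat stands for the class from [0]
(a mapreduce-API FileInputFormat<LongWritable, Text> is assumed), the
path is a placeholder, and the map() just illustrates the kind of
field pruning Ted suggested, i.e. dropping the offset key right after
reading:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class XmlJobSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env =
                ExecutionEnvironment.getExecutionEnvironment();

            // Wrap the Hadoop-style XmlInputFormat [0] for Flink's DataSet API.
            Job job = Job.getInstance();
            FileInputFormat.addInputPath(job, new Path("hdfs:///path/to/dump.xml.bz2"));
            // start/end tag configuration of the format omitted here

            DataSet<Tuple2<LongWritable, Text>> raw = env.createInput(
                new HadoopInputFormat<LongWritable, Text>(
                    new XmlInputFormat(), LongWritable.class, Text.class, job));

            // Prune the unused offset key right away, so only the XML
            // snippet travels through the rest of the pipeline as a String.
            DataSet<String> records = raw.map(
                new MapFunction<Tuple2<LongWritable, Text>, String>() {
                    @Override
                    public String map(Tuple2<LongWritable, Text> value) {
                        return value.f1.toString();
                    }
                });

            records.first(10).print();
        }
    }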

@Stephan, @Kurt

>   - You would rather need MORE managed memory, not less, because the sorter
> uses that.

> I think the only way is adding more managed memory.
Ah, okay, it seems I misunderstood that, but I tested with a managed
memory fraction of up to 0.8 on a 46 GB RAM allocation anyway. Does
that mean I have to scale the amount of RAM proportionally to the
dataset's size in this case? I would have expected Flink to start
spilling to disk and just slow down instead.
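
For reference, the flink-conf.yaml settings I was varying looked
roughly like this (key names as in the 1.x series, the numbers just
reflect the test above):

    taskmanager.heap.mb: 47104          # ~46 GB task manager heap
    taskmanager.memory.fraction: 0.8    # share handed to managed memory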

>  - We added the "large record handler" to the sorter for exactly these use
> cases.

Okay, so spilling to disk is theoretically possible and the crashes
should not occur then?

>  [...] it is thrown during the combining phase which only uses an in-memory 
> sorter, which doesn't have the large record handling mechanism.

Are there ways to circumvent this restriction (the sorting step?) or
otherwise optimize the process?
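
For example, would it help to shrink the records before they ever
reach the combiner, roughly like in the sketch below? It continues
from the `records` DataSet in the sketch above; parseKey() is just a
placeholder for whatever the job actually extracts from each XML
snippet:

    // Shrink each record up front so the combiner's in-memory sorter
    // only ever sees small tuples instead of multi-MB Text blobs.
    DataSet<Tuple2<String, Long>> small = records.map(
        new MapFunction<String, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(String xmlSnippet) {
                return Tuple2.of(parseKey(xmlSnippet), 1L);  // parseKey(): placeholder
            }
        });

    DataSet<Tuple2<String, Long>> counts = small
        .groupBy(0)
        .sum(1);   // combinable aggregation over the small records only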

> Can you check in the code whether it is enabled? You'll have to go
> through a bit of the code to see that.

Although I'm not deeply familiar with Flink's internal source code,
I'll try my best to figure that out.

Thanks,
Sebastian

[0] http://paste.gehaxelt.in/?336f8247fa50171e#DSH0poFcVIR29X7lb98qRhUG/jrkKkUrfkUs7ECSyeE=
[1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Reading-compressed-XML-data-td10985.html
