Hi Verne,

Thanks for the review! Here are my quick thoughts on the two points:


*1. put(String key, String archiveContent) vs. T get(String key)*

The asymmetry here is actually intentional:

   -

   *put:* The archive content is always a serialized JSON String (rather
   than a File, byte[], or InputStream), so accepting a String is
   sufficient. I’ll add an annotation to clarify this.


   -

   *get:* Returning T allows FileArchiveStorage to return a File handle
   directly, enabling the request handler to stream the file straight to the
   response. It also keeps the API flexible for other storage backend handlers
   in the future.

*2. getByPrefix returning List<T>*

That’s a fair concern in general. However, I believe the memory risk is
well under control here:

   -

   This method is specifically used to fetch and update job overview
   information, which is then deserialized into JobDetails and merged into
   a MultipleJobsDetails set.
   -

   These overview objects are very small, and this data would need to be
   fully loaded into memory for merging anyway.

Besides, if we need to handle larger datasets down the road, we can easily
introduce a streaming/iterator-based method if necessary.


Thanks again for looking into this!


Best,

Zihao

Fan Deng <[email protected]> 于2026年5月18日周一 10:54写道:

> Hi Zihao,
>
> Thanks for driving this — the pluggable ArchiveStorage abstraction looks
> like a clean fit for the HistoryServer. I have two questions on the
> ArchiveStorage interface that I'd like to understand better before the vote:
>
> Asymmetric value type between read and write. get / getByPrefix return the
> generic type T, but put hard-codes the value to String:
>
> ···
> T get(String key);
> void put(String key, String archiveContent);
> ···
>
> Could you share the rationale? Making the write side symmetric (either
> also T, or unifying both sides on byte[] / InputStream) would feel more
> consistent and avoid forcing every backend to materialize the archive as a
> String. Is there a specific reason String was chosen for put?
>
> OOM risk of getByPrefix returning List<T>. In production a single prefix
> (e.g. all entries under one job, or under /jobs/) can easily expand to
> thousands of entries with non-trivial JSON payloads. Returning a fully
> materialized List<T> means the whole result set is loaded into heap at
> once, which I'm worried could cause OOM on busy HistoryServers.
>
> Have you considered exposing it as Iterator<T> / CloseableIterator<T> (or
> a Stream<T>) instead? It maps very naturally to RocksDB's prefix iterator,
> and FileArchiveStorage can implement it lazily as well. If there's a
> concrete call site that really needs the full list, it can always do
> Lists.newArrayList(iter) locally.
>
> Other than these two points, +1 from me on the overall direction.
>
> Best, Verne
>
> On 2026/05/09 03:37:08 zihao chen wrote:
> > Hi all,
> >
> > I’d like to start a discussion on FLIP-XXX:
> >
> > *Support Pluggable Storage Backend forHistoryServer*.
> >
> > This FLIP proposes improving the HistoryServer
> > to address excessive *small files* when handling
> > large numbers of archived jobs.
> >
> > [Proposal]
> > Optional *RocksDB-based storage* to reduce
> > small files
> >
> > [Compatibility]
> > Full backward compatibility (FILE as default)
> >
> > The detailed design is described in the
> > FLIP document:
> >
> >
> https://docs.google.com/document/d/1idHu5bq0GOsUuUAEIJSJ2UuekcDjbW0tHLNbsQfugDg/edit?usp=sharing
> >
> > This FLIP is split from the earlier discussion [1].
> >
> > Looking forward to your feedback.
> >
> > [1] https://lists.apache.org/thread/6thlq9c5twyvzmcw7q24nm4q0rcbz5qp
> >
> >
> > Best regards,
> >
> > Zihao Chen
> >
>

Reply via email to