Thank you. I really appreciate your feedback as I don't always know the detailed use case for a feature. (For me, it's mostly "hey, this thing is broken, fix it")
What does the rest of the community think? This is a great opportunity to
share your thoughts. My answers are inline:

On Wed, Jun 12, 2019 at 1:12 AM Julien Laurenceau <
julien.laurenc...@pepitedata.com> wrote:

> Hi,
>
> I am not absolutely sure that these aren't already on a roadmap or
> supported, but I would appreciate these two features:
>
> - First feature: I would like to be able to use a dedicated directory in
> HDFS as a /tmp directory, leveraging RAMFS for high-performance
> checkpointing of Spark jobs without using Alluxio or Ignite.
> My current issue is that RAMFS is only useful with replication factor x1
> (in order to avoid the network).
> My default replication factor is x3, but I would need a way to set
> replication factor x1 on a specific directory (/tmp) for all new writes
> coming to this directory.
> Currently, if I use "hdfs setrep 1 /tmp", it only works for blocks
> already written.
> For example, this could be done by specifying the replication factor at
> the storage policy level.
> In my view this would dramatically increase the appeal of the
> Lazy-persist storage policy.

I am told LAZY_PERSIST was never considered a complete feature, and two
Hadoop distros, CDH and HDP, don't support it. But now that I understand
the use case, it looks useful.

> From the doc:
> Note 1: The Lazy_Persist policy is useful only for single
> replica blocks. For blocks with more than one replica, all the replicas
> will be written to DISK, since writing only one of the replicas to RAM_DISK
> does not improve the overall performance.
>
> In the current state of HDFS configuration, I only see the following hack
> (not tested) to implement such a solution: configure HDFS replication x1
> as the default configuration and use Erasure Coding RS(6,3) for the main
> storage by attaching an EC storage policy on all directories except /tmp.
>
> hdfs ec -setPolicy -path <directory> [-policy <policyName>]
>
> - Second feature: bandwidth throttling dedicated to re-replication in
> case of a failed datanode.
> Something similar to the option dedicated to the balancing algorithm,
> dfs.datanode.balance.bandwidthPerSec, but only for re-replication.

I am pretty sure I've had people asking about this a few times before.

> Thanks and regards
> JL
>
> On Mon, Jun 10, 2019 at 19:08, Wei-Chiu Chuang <weic...@cloudera.com.invalid>
> wrote:
>
>> Hi!
>>
>> I am soliciting feedback on HDFS roadmap items and wish-list items for
>> future Hadoop releases. A community meetup
>> <https://www.meetup.com/Hadoop-Contributors/events/262055924/?rv=ea1_v2&_xtd=gatlbWFpbF9jbGlja9oAJGJiNTE1ODdkLTY0MDAtNDFiZS1iOTU5LTM5ZWYyMDU1N2Q4Nw>
>> is happening soon, and perhaps we can use this thread to converge on
>> things we should talk about there.
>>
>> I am aware of several major features that merged into trunk, such as RBF
>> and Consistent Standby Serving Reads, as well as some recent features
>> that merged into the 3.2.0 release (Storage Policy Satisfier).
>>
>> What else should we be doing? I have a laundry list of supportability
>> improvement projects, mostly about improving performance or making
>> performance diagnostics easier. I can share the list if folks are
>> interested.
>>
>> Are there things we should do to make developers' lives easier, or
>> things that would be nice to have for downstream applications? I know
>> Sahil Takiar made a series of improvements in HDFS for Impala recently,
>> and those improvements are applicable to other downstream projects such
>> as HBase. Or would it help if we provide more Hadoop API examples?
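For anyone following along, the first feature request can be sketched with today's CLI. This is a sketch only; the paths and file names are hypothetical, and it requires a running cluster with the RS-6-3-1024k policy available:

```shell
# Today: -setrep only changes files already written under /tmp;
# new writes still use the client's dfs.replication (default x3).
hdfs dfs -setrep -w 1 /tmp

# Per-write workaround: override dfs.replication on the client side.
hdfs dfs -D dfs.replication=1 -put checkpoint.dat /tmp/checkpoint.dat

# The hack JL describes: keep replication x1 as the cluster default and
# protect the main storage with erasure coding instead. The EC policy
# must be enabled before it can be attached to a directory.
hdfs ec -enablePolicy -policy RS-6-3-1024k
hdfs ec -setPolicy -path /data -policy RS-6-3-1024k
hdfs ec -getPolicy -path /data
```

The missing piece the feature request points at is a way to make the x1 default apply only to /tmp, without per-write overrides.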
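On the second feature: as far as I know there is no re-replication-specific bandwidth cap today, but there are a few existing knobs that bound recovery work. A sketch, with illustrative values rather than recommendations:

```shell
# The cap mentioned above (dfs.datanode.balance.bandwidthPerSec) applies
# to balancing only; it can also be adjusted at runtime:
hdfs dfsadmin -setBalancerBandwidth 10485760   # 10 MB/s per datanode

# For re-replication after a datanode failure, the closest existing
# controls are NameNode-side concurrency limits in hdfs-site.xml.
# Note these cap the number of replication streams, not bytes/sec:
#   dfs.namenode.replication.max-streams
#   dfs.namenode.replication.work.multiplier.per.iteration
```

A dedicated bytes-per-second throttle for re-replication, symmetric with the balancer one, is the gap being described.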