On Tue, Mar 22, 2016 at 3:04 AM, Talat Uyarer <ta...@uyarer.com> wrote:
> Hi All,
>
> I am Talat UYARER. I am a PMC member and committer on Nutch and Gora. I
> have a few contributions to HBase and want to work on HBase in GSoC
> 2016. As far as I know, you haven't selected any issue for GSoC.
>

I didn't sign up for GSoC, Talat. Not sure anyone else did either. Is it
too late for us to participate now?

> I am wondering if there is anybody who can be a mentor for GSoC in HBase?
>

I'd mentor you (it'd be easy-peasy -- smile) but I think I've missed the
mentor signup deadline.

> BTW I talked with Enis Soztutar. He offered some topics for GSoC. These
> are:
> - He mentioned that data blocks are stored with PREFIX, FAST_DIFF, etc.
> encodings, but these encodings can only be used in the HFile context. In
> RPC and the WAL we use KeyValueEncoding for Cell blocks. He said, "You can
> improve them or use the HFile encodings in RPC and the WAL" (he didn't
> give the issue number, but I guess it is HBASE-12883 Support block
> encoding based on knowing set of column qualifiers up front).

Sounds like a fine project (someone was just asking about this offline...).

> - HBASE-14379 Replication V2
> - HBASE-8691 High-Throughput Streaming Scan API
> - HBASE-3529 Native Solr Indexer for HBase (he just mentioned HBase ->
> SOLR indexing; I guess it could be this issue).
>
> Could you help me with selecting topics, or could you offer another issue?
>

All of the above are good. Here are a few others made for another context:

+ Become a Jepsen distributed-systems test-tool expert: run it against
HBase and HDFS and analyze the results. E.g. see
https://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen

+ Deep dive on HBase compactions. Own it. Review the current options: the
defaults, the experimental, and the stale. Build tooling and surface
metrics that give better insight into the effectiveness of compaction
mechanics and policies. Develop tunings and alternate, new policies. For
further credit, develop a master-orchestrated compaction algorithm.

+ Reimplement HBase append and increment as write-only with rollup on
read, or using CRDTs
(https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type); see
the rough sketch after this list.

+ Make the HBase server async/event-driven/SEDA, moving it off its current
thread-per-request basis.

+ UI: build out more pages and tabs on the HBase master exposing more of
our cluster metrics (make the master into a metrics sink). Extra points
for views, histograms, or dashboards that are both informative AND pretty
(D3, etc.). A good benchmark would be subsuming the Hannibal tool
https://github.com/sentric/hannibal

+ Build an example application on HBase for test and illustration: e.g.
use Jimmy Lin's/The Internet Archive's https://github.com/lintool/warcbase
to load Common Crawl regular web crawls https://commoncrawl.org/ or load
HBase with Wikipedia, the Flickr dataset, or any dataset that appeals.
Extra credit for documenting the steps involved and filing issues where
the API is awkward or hard to follow.

+ Add actionable statistics to HBase internals that capture vitals about
the data being served and that we can exploit when responding to queries;
e.g. rough sizes of rows, column families, columns-per-row-per-region,
etc. For example, if a client has been stepping sequentially through the
data, the stats would allow us to recognize this state so we could switch
to a different scan type, one that is optimal for a sequential
progression.

+ Review and redo our fundamental merge sort, the basis of our read. There
are a few techniques to try, such as a "loser tree merge"
(http://sandbox.mc.edu/~bennet/cs402/lec/losedex.html), but ideally we'd
make our merge sort block-based rather than Cell-based. Set yourself up in
a rig and try different Cell formats to get yourself to a cache-friendly
Cell format that maximizes instructions per cycle.

+ Our client is heavy-weight and has accumulated lots of logic over time.
E.g. it is hard to set a single timeout for a request because the client
is layered, each layer with its own running timeouts. At its core is a
mostly-done async engine. Review and finish the async work. Rewrite where
it makes sense after analysis.

+ Our RPC is based on protobuf Service, where we plugged in our own RPC
transport. An exploratory PoC putting HBase up on grpc was done by the
grpc team. Bring this project home. Extra points if you reveal a streaming
interface between client and server.

+ Tiering... if regions are cold, close them so they don't occupy
resources (close files, purge their data from cache...); reopen when a
request comes in.

+ Dynamic configuration of a running HBase.
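To give a flavor of the CRDT approach in the append/increment item above,
here is a minimal, illustrative sketch of a grow-only counter (G-Counter).
It assumes positive deltas, and none of the names are real HBase APIs:
each replica blindly bumps its own slot (write-only), the value is rolled
up at read time, and merging two replicas' states is a commutative
per-slot max.

// Illustrative G-Counter sketch only; class and method names are made up
// and are not HBase APIs. Assumes deltas are positive (grow-only).
import java.util.HashMap;
import java.util.Map;

public class GCounter {
  private final String replicaId;                 // this replica's slot
  private final Map<String, Long> counts = new HashMap<>();

  public GCounter(String replicaId) {
    this.replicaId = replicaId;
  }

  // Write-only increment: no read-modify-write of a shared value.
  public void increment(long delta) {
    counts.merge(replicaId, delta, Long::sum);
  }

  // Rollup on read: the counter's value is the sum of all replica slots.
  public long value() {
    return counts.values().stream().mapToLong(Long::longValue).sum();
  }

  // Merging replica states is a per-slot max, so it is commutative,
  // associative, and idempotent -- concurrent increments never conflict.
  public void merge(GCounter other) {
    other.counts.forEach((id, v) -> counts.merge(id, v, Math::max));
  }
}

A scheme like this trades read cost (summing the slots) for
contention-free writes, which is the same trade the
write-only-with-rollup-on-read idea makes.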
St.Ack

> Thanks
> --
> Talat UYARER