[ 
https://issues.apache.org/jira/browse/SOLR-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258466#comment-17258466
 ] 

Kevin Risden commented on SOLR-15051:
-------------------------------------

{quote}"Hadoop filesystem interface" is an ideal choice{quote}

Not sure I would go so far as an "ideal choice" - it definitely brings in a lot 
of dependencies sadly. The nice part is that it has implementations for local, 
hdfs, s3, adls, and gcp at least. So you get multi cloud for free. What I do 
not know is how efficient each of those implementations are - especially the 
"local" one for file:// type paths.

Just wanted to point out the Hadoop filesystem connector stuff is separate from 
the full running of HDFS. Here are a few references:

* General overview: 
http://hadoop.apache.org/docs/current3/hadoop-project-dist/hadoop-common/filesystem/index.html
* AWS: 
http://hadoop.apache.org/docs/current3/hadoop-aws/tools/hadoop-aws/index.html
* Azure: http://hadoop.apache.org/docs/current3/hadoop-azure/index.html
* Google: https://github.com/GoogleCloudDataproc/hadoop-connectors
* 

> Shared storage -- BlobDirectory (de-duping)
> -------------------------------------------
>
>                 Key: SOLR-15051
>                 URL: https://issues.apache.org/jira/browse/SOLR-15051
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This proposal is a way to accomplish shared storage in SolrCloud with a few 
> key characteristics: (A) using a Directory implementation, (B) delegates to a 
> backing local file Directory as a kind of read/write cache (C) replicas have 
> their own "space", (D) , de-duplication across replicas via reference 
> counting, (E) uses ZK but separately from SolrCloud stuff.
> The Directory abstraction is a good one, and helps isolate shared storage 
> from the rest of SolrCloud that doesn't care.  Using a backing normal file 
> Directory is faster for reads and is simpler than Solr's HDFSDirectory's 
> BlockCache.  Replicas having their own space solves the problem of multiple 
> writers (e.g. of the same shard) trying to own and write to the same space, 
> and it implies that any of Solr's replica types can be used along with what 
> goes along with them like peer-to-peer replication (sometimes faster/cheaper 
> than pulling from shared storage).  A de-duplication feature solves needless 
> duplication of files across replicas and from parent shards (i.e. from shard 
> splitting).  The de-duplication feature requires a place to cache directory 
> listings so that they can be shared across replicas and atomically updated; 
> this is handled via ZooKeeper.  Finally, some sort of Solr daemon / 
> auto-scaling code should be added to implement "autoAddReplicas", especially 
> to provide for a scenario where the leader is gone and can't be replicated 
> from directly but we can access shared storage.
> For more about shared storage concepts, consider looking at the description 
> in SOLR-13101 and the linked Google Doc.
> *[PROPOSAL 
> DOC|https://docs.google.com/document/d/1kjQPK80sLiZJyRjek_Edhokfc5q9S3ISvFRM2_YeL8M/edit?usp=sharing]*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to