We used URIs as file paths in Hadoop.  I think it was a mistake, for a
few different reasons.

URIs are actually very complex.  You probably know about scheme, host,
and port, but did you know about authority, user-info, query, fragment,
scheme-specific-part?  Do you know what they do in Hadoop?  The mapping
isn't obvious (and it wouldn't be obvious in Kafka either).
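
For the curious, here's a quick sketch with plain java.net.URI (the URI
itself is made up) showing just how many pieces one string can carry:

    import java.net.URI;

    public class UriParts {
        public static void main(String[] args) throws Exception {
            URI uri = new URI(
                "hdfs://alice@nn1.example.com:8020/user/alice/data?op=open#part-0");
            System.out.println(uri.getScheme());     // hdfs
            System.out.println(uri.getAuthority());  // alice@nn1.example.com:8020
            System.out.println(uri.getUserInfo());   // alice
            System.out.println(uri.getHost());       // nn1.example.com
            System.out.println(uri.getPort());       // 8020
            System.out.println(uri.getPath());       // /user/alice/data
            System.out.println(uri.getQuery());      // op=open
            System.out.println(uri.getFragment());   // part-0
            // everything between "hdfs:" and the "#":
            System.out.println(uri.getSchemeSpecificPart());
        }
    }

That's eight-plus components, and the filesystem has to decide what
each one means for every scheme it supports.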

When you flip back and forth between URIs and strings (and you
inevitably will do this, when serializing or sending things over the
wire), you run into tons of really hard problems.  Should you preserve
the "fragment" (the thing after the hash mark) for your URI, or not?  It
may not do anything now, but maybe it will do something later.  URIs
also have complex string escaping rules.  Parsing URIs is very messy,
especially when you start talking about non-Java programming languages.
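
Here's a minimal sketch of the round-trip problem, again with plain
java.net.URI (the path and fragment are made up):

    import java.net.URI;

    public class UriRoundTrip {
        public static void main(String[] args) throws Exception {
            // The multi-part constructor escapes the space for you.
            URI uri = new URI("hdfs", "namenode", "/reports/2017 Q3", "latest");
            String wire = uri.toString();
            System.out.println(wire);  // hdfs://namenode/reports/2017%20Q3#latest

            URI parsed = new URI(wire);
            System.out.println(parsed.getPath());      // /reports/2017 Q3  (decoded)
            System.out.println(parsed.getRawPath());   // /reports/2017%20Q3  (escaped)
            System.out.println(parsed.getFragment());  // latest -- keep it or drop it?
        }
    }

Every component has a decoded form and a "raw" form, and every piece of
code that stringifies a URI has to agree on which one it is using.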

URIs are designed for a world where you talk to a single host over a
single port.  That isn't the world distributed systems live in.  You
don't want your clients to fail to bootstrap because the single server
you specified is having a bad day, even when the other 8 servers are up.
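
For what it's worth, Kafka's existing bootstrap.servers config already
gets this right: it's a plain comma-separated list.  A sketch, with
made-up broker names:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class BootstrapList {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Any one of these can answer the initial metadata request;
            // one broker having a bad day doesn't block bootstrap.
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                      "broker1:9092,broker2:9092,broker3:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(
                     props, new ByteArrayDeserializer(), new ByteArrayDeserializer())) {
                // use the consumer...
            }
        }
    }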

What we want is a list of servers.  But URIs don't give us that.  That's
why in HDFS, we introduced another layer of indirection, so that the
"URI hostname" maps to an entry in a configuration file, which then maps
to a list of hostnames.
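
Roughly, the HDFS HA version of that indirection looks like this (the
nameservice and hosts are made up; shown through the Configuration API
rather than the usual hdfs-site.xml):

    import org.apache.hadoop.conf.Configuration;

    public class NameserviceIndirection {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // "myhdfs1" is a logical name, not a real host...
            conf.set("dfs.nameservices", "myhdfs1");
            // ...it maps to a set of namenode ids...
            conf.set("dfs.ha.namenodes.myhdfs1", "nn1,nn2");
            // ...and each id maps to an actual host:port.
            conf.set("dfs.namenode.rpc-address.myhdfs1.nn1", "host1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.myhdfs1.nn2", "host2.example.com:8020");
            // Clients say hdfs://myhdfs1/path; the "URI hostname" is
            // resolved through this config, not through DNS.
            System.out.println(conf.get("dfs.ha.namenodes.myhdfs1"));
        }
    }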

Later on, we found out that people wanted a unified namespace.  They
wanted to be able to access /foo without caring whether it was on s3,
the first hdfs cluster, or the second hdfs cluster.  But our use of URIs
for paths had made that impossible.  If the path was on s3, it had to be
accessed via s3://mybucketname/foo.  If it was on the first hdfs
cluster, it had to be accessed by hdfs://myhdfs1name/foo.  And so on. 
We had re-invented the equivalent of DOS drive letters: ugly, clunky
prefixes that have to chaperone every path name.
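
Concretely, with Hadoop's Path API the prefix travels with every path
(bucket and cluster names as above):

    import org.apache.hadoop.fs.Path;

    public class DriveLetters {
        public static void main(String[] args) {
            // The same logical file needs a different "drive letter"
            // depending on where it happens to live:
            Path onS3    = new Path("s3://mybucketname/foo");
            Path onHdfs1 = new Path("hdfs://myhdfs1name/foo");
            Path onHdfs2 = new Path("hdfs://myhdfs2name/foo");
            // There is no single namespace in which a plain /foo
            // resolves to any of them.
            System.out.println(onS3.toUri());
            System.out.println(onHdfs1.toUri());
            System.out.println(onHdfs2.toUri());
        }
    }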

The bottom line is that URIs are the wrong abstraction for the job. 
They just don't express what we really want, and they introduce a lot of
complexity and ambiguity.

best,
Colin


On Thu, Oct 5, 2017, at 08:08, Clebert Suconic wrote:
> I can start a KIP discussion on this.. or not if you really think this
> is against basic rules...
> 
> 
> I will need authorization to create the page.. could you grant me
> access regardless, so I have it for next time?
> 
> On Thu, Oct 5, 2017 at 10:31 AM, Clebert Suconic
> <clebert.suco...@gmail.com> wrote:
> > Just as a facility for users... I think it would be easier to
> > prototype consumers and producers by simply doing new
> > Consumer("tcp://HOST:PORT") or new Producer("tcp://HOST:PORT")...
> >
> > on the other project I work on (ActiveMQ Artemis) we used to do
> > something similar to what Kafka does.. we then provided the URI
> > support and I now think the URI approach is a lot easier.
> >
> > I'm just trying to leverage my experience here... I'm an Apache
> > committer at ActiveMQ Artemis.. I think I could bring some goodies
> > into Kafka.. I see no reason to be a competitor.. instead I'm all
> > for contributing here as well.  And I was looking for something small
> > and easy to start with.
> >
> >
> >
> >
> >
> >
> > On Thu, Oct 5, 2017 at 10:15 AM, Jay Kreps <j...@confluent.io> wrote:
> >> Hey Clebert,
> >>
> >> Is there a motivation for adding a second way? We generally try to avoid
> >> having two ways to do something unless it's really needed...I suspect you
> >> have a reason for wanting this, though.
> >>
> >> -Jay
> >>
> >> On Mon, Oct 2, 2017 at 6:15 AM Clebert Suconic <clebert.suco...@gmail.com>
> >> wrote:
> >>
> >>> At ActiveMQ and ActiveMQ Artemis, ConnectionFactories have an
> >>> interesting feature where you can pass parameters through a URI.
> >>>
> >>> I was looking at the Producer and Consumer APIs, and these two classes
> >>> are using an approach I considered dated back in Artemis, one
> >>> resembling HornetQ:
> >>>
> >>> Instead of passing a Properties (aka HashMaps), users would be able to
> >>> create a Consumer or Producer by simply doing:
> >>>
> >>> new Consumer("tcp::/host:port?properties=values;properties=values...etc");
> >>>
> >>> Example:
> >>>
> >>>
> >>> Instead of the following:
> >>>
> >>> Map<String, Object> config = new HashMap<>();
> >>> config.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9999");
> >>> config.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, -2);
> >>> new KafkaConsumer<>(config,
> >>>     new ByteArrayDeserializer(), new ByteArrayDeserializer());
> >>>
> >>>
> >>>
> >>> Someone could do
> >>>
> >>> new KafkaConsumer<>("tcp://localhost:9999?receive.buffer.bytes=-2",
> >>> new ByteArrayDeserializer(), new ByteArrayDeserializer());
> >>>
> >>>
> >>>
> >>> I don't know if that little API improvement would be welcome.  I would
> >>> be able to send a Pull Request, but I don't want to do it if it
> >>> wouldn't be welcome in the first place.
> >>>
> >>>
> >>> Just an idea...  let me know if that is welcomed or not.
> >>>
> >>> If so I can forward the discussion into how I would implement it.
> >>>
> >
> >
> >
> > --
> > Clebert Suconic
> 
> 
> 
> -- 
> Clebert Suconic
