Thank you for your response, Grandjean. Frameless looks great, but it is not quite what I need. From what I can tell, Frameless provides a layer of type safety on top of existing Spark facilities, like column expressions and encoders. There are also some great quality-of-life enhancements in Frameless, like Injections for creating custom encoders. What I need is support for network types and their fundamental operators, just like in Postgres (https://www.postgresql.org/docs/current/static/functions-net.html) and Cassandra (http://cassandra.apache.org/doc/latest/cql/types.html).
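To make the emulation I keep mentioning concrete, here is a minimal plain-Scala sketch (the object and method names are mine, not from any existing API) of treating IPv4 addresses as 32-bit integers, with subnet containment as a mask comparison. This is the easy, fixed-width case; variable-length IPv6 subnets don't fit in an Int, which is where this approach gets messy:

```scala
// Sketch only: IPv4 addresses emulated as 32-bit integers.
// Names (Ipv4Emulation, toInt, contains) are illustrative, not a real API.
object Ipv4Emulation {
  // Pack dotted-quad "a.b.c.d" into an Int, one octet per byte.
  def toInt(ip: String): Int =
    ip.split('.').foldLeft(0)((acc, octet) => (acc << 8) | (octet.toInt & 0xff))

  // A subnet is (network address, prefix length); containment is a
  // mask-and-compare on the high prefixLen bits.
  def contains(net: Int, prefixLen: Int, ip: Int): Boolean = {
    // Guard prefixLen == 0: shifting an Int by 32 is a no-op on the JVM.
    val mask = if (prefixLen == 0) 0 else -1 << (32 - prefixLen)
    (ip & mask) == (net & mask)
  }
}
```

Wrapping `contains` in a UDF gives a usable predicate for joins, but Catalyst sees it as an opaque function, so you get the inefficient join strategies described below.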
Specifically, I'm looking for the following:

- Column expressions for manipulating network values like IP addresses and variable-length subnets.
- Tungsten support for optimal data representations of network types. While this is easy to emulate for IPv4 addresses (32-bit integers), it is messy to emulate variable-length IPv6 subnets.
- Support for custom Catalyst optimization rules for predicates like subnet containment.

Can UDTs even support the above? Or would we need to add network types to the list of built-ins to achieve these features?

On Sat, Nov 18, 2017 at 8:51 PM Grandjean Patrick <patg...@yahoo.fr> wrote:
> Hi Michael,
>
> Having faced the same limitation, I have found these two libraries to be
> helpful:
>
> - Frameless (https://github.com/typelevel/frameless)
> - struct-type-encoder (https://benfradet.github.io/blog/2017/06/14/Deriving-Spark-Dataframe-schemas-with-Shapeless)
>
> Both use Shapeless to derive Datasets.
>
> I hope it helps.
>
> Patrick.
>
> On Nov 14, 2017, at 20:38, mlopez <michael.lopez....@gmail.com> wrote:
>
> Hello everyone!
>
> I'm a developer at a security ratings company. We've been moving to Spark
> for our data analytics, and nearly every dataset we have contains IP
> addresses or variable-length subnets. Katherine's descriptions of use cases
> and attempts to emulate networking types overlap with ours. I would add
> that we also need to write complex queries over subnets in addition to IP
> addresses.
>
> Has there been any update on this topic?
> https://github.com/apache/spark/pull/16478 was last updated in February of
> this year.
>
> I would also like to know whether it would be better to work toward IP
> networking types. Supposing Spark had UDT support, would it be just as
> good as built-in support for networking types? Where would they fall
> short? Would it be possible to pass custom rules to Catalyst for
> optimizing expressions with networking types?
>
> We have to write complex joins over predicates like subnet containment and
> have to resort to difficult-to-read tricks to ensure that Spark doesn't
> fall back to an inefficient join strategy. For example, it would be great
> to simply write `df1.join(df2, contains($"src_net", $"dst_net"))` to join
> records from one dataset that have subnets contained in another.
>
> -----
> Michael Lopez
> Cheerful Engineer!
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org