Thank you for your response, Grandjean.

Frameless looks great, but it is not quite what I need. From what I can
tell, Frameless provides a layer of type safety on top of Spark facilities
like column expressions and encoders. There are also some great
quality-of-life enhancements in Frameless, like Injections for creating
custom encoders. What I need is support for network types and their
fundamental operators, just as in Postgres
(https://www.postgresql.org/docs/current/static/functions-net.html)
and Cassandra (http://cassandra.apache.org/doc/latest/cql/types.html).

Specifically, I'm looking for the following.

- Column expressions for manipulating network values like IP addresses and
variable-length subnets.
- Tungsten support for optimal data representations of network types. While
this is easy to emulate for IPv4 addresses (32-bit integers), it is messy
to emulate for variable-length IPv6 subnets.
- Support for custom Catalyst optimization rules for predicates like subnet
containment.
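
To make the first two points concrete, here is roughly what the IPv4
emulation mentioned above looks like in plain Scala. This is just the
logic one would wrap in a UDF today; the names (`ipv4ToInt`, `Cidr`,
`parseCidr`) are illustrative, not an existing API:

```scala
// Sketch of emulating IPv4 addresses as 32-bit integers, as described
// above. All names here are hypothetical helpers, not a real library.
object Ipv4Emulation {
  // Pack a dotted-quad IPv4 address into a single Int.
  def ipv4ToInt(addr: String): Int =
    addr.split('.').foldLeft(0)((acc, octet) => (acc << 8) | octet.toInt)

  // A subnet in CIDR form: network address plus prefix length.
  final case class Cidr(network: Int, prefixLen: Int) {
    // Guard prefixLen == 0: on the JVM, shifting an Int by 32 is a no-op.
    private def mask: Int = if (prefixLen == 0) 0 else -1 << (32 - prefixLen)
    // Subnet containment: the address masked to the prefix must match.
    def contains(addr: Int): Boolean = (addr & mask) == (network & mask)
  }

  def parseCidr(s: String): Cidr = {
    val Array(net, len) = s.split('/')
    Cidr(ipv4ToInt(net), len.toInt)
  }
}
```

This works because an IPv4 address fits a primitive Int, so Tungsten
already stores it compactly; nothing comparable exists for 128-bit IPv6
values with a variable prefix length.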

Can UDTs even support the above? Or would we need to add network types
to the list of built-ins to achieve these features?
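
For contrast, here is a sketch of the messy IPv6 case: the address no
longer fits a primitive, so any emulation (or UDT) has to manage a
128-bit value plus a prefix length by hand. Again, the names
(`Ipv6Cidr`, `ipv6ToBigInt`) are illustrative only:

```scala
// Sketch of emulating a variable-length IPv6 subnet without a native
// type. A BigInt stands in for the 128-bit address; a real UDT would
// need to choose and manage such an encoding itself.
import java.net.InetAddress

final case class Ipv6Cidr(network: BigInt, prefixLen: Int) {
  // Mask with the top `prefixLen` of 128 bits set.
  private val mask: BigInt =
    if (prefixLen == 0) BigInt(0)
    else ((BigInt(1) << prefixLen) - 1) << (128 - prefixLen)
  def contains(addr: BigInt): Boolean = (addr & mask) == (network & mask)
}

object Ipv6Emulation {
  // Parse a literal IPv6 address into an unsigned 128-bit BigInt.
  def ipv6ToBigInt(addr: String): BigInt =
    BigInt(1, InetAddress.getByName(addr).getAddress)

  def parseCidr(s: String): Ipv6Cidr = {
    val Array(net, len) = s.split('/')
    Ipv6Cidr(ipv6ToBigInt(net), len.toInt)
  }
}
```

None of this is visible to Catalyst, which is why predicates built this
way can't participate in optimization the way a built-in type's
operators could.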

On Sat, Nov 18, 2017 at 8:51 PM Grandjean Patrick <patg...@yahoo.fr> wrote:

> Hi Michael,
>
> Having faced the same limitation, I have found these two libraries to be
> helpful:
>
> - Frameless (https://github.com/typelevel/frameless)
> - struct-type-encoder (
> https://benfradet.github.io/blog/2017/06/14/Deriving-Spark-Dataframe-schemas-with-Shapeless
> )
>
> Both use Shapeless to derive Datasets.
>
> I hope it helps.
>
> Patrick.
>
>
> On Nov 14, 2017, at 20:38, mlopez <michael.lopez....@gmail.com> wrote:
>
> Hello everyone!
>
> I'm a developer at a security ratings company. We've been moving to Spark
> for our data analytics and nearly every dataset we have contains IP
> addresses or variable-length subnets. Katherine's descriptions of use cases
> and attempts to emulate networking types overlap with ours. I would add
> that
> we also need to write complex queries over subnets in addition to IP
> addresses.
>
> Has there been any update on this topic?
> https://github.com/apache/spark/pull/16478 was last updated in February of
> this year.
>
> I would also like to know whether it would be better to work toward
> built-in IP networking types. Supposing Spark had UDT support, would it be
> just as good as built-in support for networking types? Where would UDTs
> fall short? Would it be possible to pass custom rules to Catalyst for
> optimizing expressions with networking types?
>
> We have to write complex joins over predicates like subnet containment and
> have to resort to difficult-to-read tricks to ensure that Spark doesn't
> fall back to an inefficient join strategy. For example, it would be great
> to simply write `df1.join(df2, contains($"src_net", $"dst_net"))` to join
> records from one dataset that have subnets contained in another.
>
>
>
> -----
> Michael Lopez
> Cheerful Engineer!
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
