Hello everyone!

I'm a developer at a security ratings company. We've been moving to Spark
for our data analytics and nearly every dataset we have contains IP
addresses or variable-length subnets. Katherine's descriptions of use
cases and of attempts to emulate networking types overlap with ours. I
would add that we also need to write complex queries over subnets, not
just over individual IP addresses.

Has there been any update on this topic?
https://github.com/apache/spark/pull/16478 was last updated in February of
this year.

I would also like to know whether it would be better to work toward
built-in IP networking types. Supposing Spark had UDT support, would UDTs
be just as good as built-in networking types? Where would they fall short?
Would it be possible to pass custom rules to Catalyst for optimizing
expressions over networking types?

We have to write complex joins over predicates like subnet containment,
and we resort to difficult-to-read tricks to keep Spark from falling back
to an inefficient join strategy. For example, it would be great to simply
write `df1.join(df2, contains($"src_net", $"dst_net"))` to join records
from one dataset whose subnets are contained in subnets from another.
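Concretely, the containment predicate reduces to a pair of range comparisons
over the numeric bounds of each subnet, which is roughly the kind of trick I
mean. A standalone Python sketch (outside Spark, standard-library `ipaddress`
only; the function names are hypothetical):

```python
import ipaddress

def subnet_contains(outer: str, inner: str) -> bool:
    # True when every address of `inner` lies inside `outer` --
    # the semantics I'd want from a contains() expression.
    return ipaddress.ip_network(inner).subnet_of(ipaddress.ip_network(outer))

def to_range(net: str):
    # Map a subnet to its (first, last) addresses as integers. Containment
    # then becomes two range comparisons, which a planner could turn into a
    # range join instead of a cartesian product plus a filter over a UDF.
    n = ipaddress.ip_network(net)
    return int(n.network_address), int(n.broadcast_address)

outer_lo, outer_hi = to_range("10.0.0.0/8")
inner_lo, inner_hi = to_range("10.1.0.0/16")
assert subnet_contains("10.0.0.0/8", "10.1.0.0/16")
assert outer_lo <= inner_lo and inner_hi <= outer_hi
```

With native networking types (or Catalyst rules aware of them), Spark could
apply that rewrite itself instead of each user hand-rolling it.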



-----
Michael Lopez
Cheerful Engineer!
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
