Hello everyone! I'm a developer at a security ratings company. We've been moving our data analytics to Spark, and nearly every dataset we have contains IP addresses or variable-length subnets. Katherine's descriptions of use cases and attempts to emulate networking types overlap with ours. I would add that we also need to write complex queries over subnets, not just individual IP addresses.
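To make "complex queries over subnets" concrete, here is a minimal sketch of the kind of predicate I mean, in plain Python with the stdlib `ipaddress` module (the function names are my own, purely illustrative, and not anything Spark provides today). The second helper shows the flavor of trick we currently rely on: expanding each subnet into all of its enclosing prefixes so that a containment join can be rewritten as an equi-join on the supernet key.

```python
import ipaddress

def contains(outer: str, inner: str) -> bool:
    """True if subnet `inner` is contained in subnet `outer`.

    This is the predicate we would like to use directly in a join
    condition, e.g. contains(src_net, dst_net).
    """
    return ipaddress.ip_network(inner).subnet_of(ipaddress.ip_network(outer))

def supernets(net: str) -> list:
    """All enclosing prefixes of `net`, from its own length down to /0.

    Expanding each row into its supernets lets a containment join be
    expressed as an equality join on the supernet key, which Spark can
    execute efficiently -- at the cost of readability and a row blowup.
    """
    n = ipaddress.ip_network(net)
    return [str(n.supernet(new_prefix=p)) for p in range(n.prefixlen, -1, -1)]
```

The readable version of the query would just be the `contains` predicate; the supernet expansion is the workaround we'd love to retire if networking types (and join planning for them) existed natively.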
Has there been any update on this topic? https://github.com/apache/spark/pull/16478 was last updated in February of this year.

I would also like to know whether it would be better to work toward first-class IP networking types. Supposing Spark had UDT support, would a UDT be just as good as built-in networking types? Where would it fall short? Would it be possible to pass custom rules to Catalyst for optimizing expressions over networking types?

We have to write complex joins over predicates like subnet containment, and we have to resort to difficult-to-read tricks to keep Spark from falling back to an inefficient join strategy. For example, it would be great to simply write `df1.join(df2, contains($"src_net", $"dst_net"))` to join records from one dataset whose subnets are contained in subnets from another.

-----
Michael Lopez
Cheerful Engineer!