Hi Spark users, I'm trying to filter a JSON file with the following schema using Spark SQL:
root
 |-- user_id: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- name: string (nullable = true)

I would like to find the distinct user_ids based on the items they contain. For instance, I would like to find the distinct user_ids that have an item whose name equals "apple" in the following 'user' table:

    user_id | item
    1       | ([1, apple], [1, apple], [2, orange])
    2       | ([2, orange])

The result should be 1.

I tried this HQL:

    select user_id
    from user
    lateral view explode(item) itemTable as itemColumn
    where itemColumn.name = 'apple'
    group by user_id

but it seems inefficient: I could stop scanning the item array as soon as I find the first item whose name is 'apple'. Also, the "lateral view explode" and "group by" are unnecessary. I'm thinking of processing the 'user' table as a SchemaRDD. Ideally, I would love to do (assuming user is a SchemaRDD):

    val ids = user.select('user_id).where(contain('item, "name", "apple")).collect()

where the contain function loops through the item array looking for "name" = "apple", with early stopping. Is this possible? If so, how would one implement the contain function?

Best Regards,
Jerry
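P.S. To make the early-stopping requirement concrete, here is a minimal Spark-independent sketch of the check the contain function would need, in plain Scala. The ContainSketch object and the (item_id, name) pair model are illustrative, not Spark API; the point is that Seq.exists short-circuits on the first match:

    // Sketch: early-stopping scan over an item array.
    // Items are modeled as (item_id, name) pairs for illustration only.
    object ContainSketch {
      // Returns true as soon as one item's name matches `value`;
      // Seq.exists stops scanning at the first hit, so the rest of
      // the array is never examined.
      def contain(items: Seq[(String, String)], value: String): Boolean =
        items.exists { case (_, name) => name == value }

      def main(args: Array[String]): Unit = {
        val user1 = Seq(("1", "apple"), ("1", "apple"), ("2", "orange"))
        val user2 = Seq(("2", "orange"))
        println(contain(user1, "apple")) // true, stops after the first element
        println(contain(user2, "apple")) // false
      }
    }

In Spark SQL, a predicate like this could presumably be registered as a UDF and used in the where clause; whether that avoids the explode/group-by entirely depends on the Spark version and how the optimizer handles the UDF.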