Hi everyone, I am working on a dataset like this user_id url 1 lao.com/buy 2 bao.com/sell 2 cao.com/market 1 lao.com/sell 3 vui.com/sell
I have to find all user_id with url not contain sell. Which means I need to query all user_id contains sell and put it into a set then do another query to find all user_id not in that set. SELECT user_id FROM data WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’; My data is about 20 million records and it’s growing. When I tried in zeppelin I need to set spark.sql.crossJoin.enabled = true Then I ran the query and the driver got extremely high CPU percentage and the process get stuck and I need to kill it. I am running at client mode that submit to a Mesos cluster. I am using Spark 2.0.2 and my data store in HDFS with parquet format. Any advices for me in this situation? Thank you in advance!. Regards, Chanh