Hi,
Thanks for the hint. I have tried to remove the limit from the query, but the 
result is still the same. If I understand correctly, the function "sample()" 
takes a sample of the result of the query rather than sampling the original 
table that I am querying.
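One way to check this, assuming the my_df DataFrame from the sketch below, is to 
look at the physical plan; the Sample step shows up above the scan and filter, so 
the full date range is still read:

my_df.sample(False, 0.1).explain()  # Sample sits on top of the scan/filter, not pushed into the table read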

I have a business use case to sample a lot of different queries in prod, so I 
can't just insert 100 rows into another table.

I am thinking about doing something like this (with my sqlContext):

from pyspark.sql import functions as F

my_df = sqlContext.sql(
    "SELECT * FROM my_table "
    "WHERE `event_date` >= '2015-11-14' AND `event_date` <= '2016-02-19'")
my_sample = my_df.sample(False, 0.1)
result = my_sample.groupBy("Category").agg(F.sum("bookings"), F.sum("dealviews"))


Thanks for your answer.



From: James Barney <jamesbarne...@gmail.com>
Date: Tuesday, March 1, 2016 at 7:01 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Sample sql query using pyspark

Maurin,

I don't know the technical reason why, but try removing the 'limit 100' part of 
your query. I was trying to do something similar the other week, and what I 
found is that each executor doesn't necessarily get the same 100 rows. Joins 
would fail or result in a bunch of nulls when keys weren't found between the 
slices of 100 rows.

Once I removed the 'limit xxxx' part of my query, all the results were the same 
across the board and taking samples worked again.
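A rough way to convince yourself of this (the table and column names here are 
just placeholders) is to run the same limited query twice and compare what comes 
back; without an 'order by', 'limit' makes no guarantee about which rows you get:

rows1 = sqlContext.sql("SELECT id FROM my_table LIMIT 100").collect()
rows2 = sqlContext.sql("SELECT id FROM my_table LIMIT 100").collect()
print(set(r.id for r in rows1) == set(r.id for r in rows2))  # can come back False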

If the amount of data is too large, or you just want to test on a smaller data 
set, define another table and insert only 100 rows into that table.
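If you go that route, something like this is roughly what I mean (assuming you 
are on a HiveContext so CREATE TABLE ... AS SELECT is available; the table name 
is just an example):

sqlContext.sql("CREATE TABLE groupon_dropbox_small AS "
               "SELECT * FROM groupon_dropbox LIMIT 100")

Then point your queries at the small table while you test.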

I hope that helps!

On Tue, Mar 1, 2016 at 3:10 AM, Maurin Lenglart 
<mau...@cuberonlabs.com> wrote:
Hi,
I am trying to take a sample of a SQL query in order to make the query run faster.
My query looks like this:
SELECT `Category` as `Category`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews`
FROM groupon_dropbox
WHERE `event_date` >= '2015-11-14' AND `event_date` <= '2016-02-19'
GROUP BY `Category` LIMIT 100

The table is partitioned by event_date, and the code I am using is:

df = self.df_from_sql(sql, srcs)
results = df.sample(False, 0.5).collect()

 The results are a little bit different, but the execution time is almost the 
same. Am I missing something?


thanks
