Hi,
I am using Spark SQL like this:

sqlContext.sql("select * from table limit 10000").map(...).collect()

The problem is that the limit clause collects all 10,000 records into a 
single partition, so the map that follows runs in only one partition and is 
really slow. I tried using repartition, but it seems wasteful to collect all 
those records into one partition, shuffle them around, and then collect them 
again.
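
For reference, my repartition attempt looked roughly like this (the partition 
count of 100 is just a placeholder, and the map body stands in for my real 
logic):

val limited = sqlContext.sql("select * from table limit 10000").repartition(100)  // shuffles the single partition back out
val result = limited.map(row => row).collect()  // real map logic goes here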

Is there a way to work around this? 
BTW, there is no order by clause, and I do not care which 10,000 records I get 
as long as the total number is less than or equal to 10,000.
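
For example, even something that just caps rows per partition instead of doing 
a global limit would be fine for my purposes, along these lines (a rough 
sketch; the per-partition math assumes far fewer than 10,000 partitions):

val rdd = sqlContext.sql("select * from table").rdd
val perPartition = 10000 / rdd.partitions.length  // integer division keeps the total <= 10000
val capped = rdd.mapPartitions(_.take(perPartition))  // no shuffle, keeps the original partitioning
capped.map(row => row).collect()  // real map logic goes here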
