Re: Can not control bucket files number if it was specified

2016-09-19 Thread Qiang Li
.com *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any mone

Re: Can not control bucket files number if it was specified

2016-09-17 Thread Qiang Li
this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On 17 September 2016 at 13:59, Qiang Li <q...@appannie.com> wrote: > Hi,

Can not control bucket files number if it was specified

2016-09-17 Thread Qiang Li
Hi, I use Spark to generate data, and then we use Hive/Pig/Presto/Spark to analyze it. However, I found that even when I use bucketBy and sortBy with a bucket number in Spark, the number of result files generated by Spark under each partition is always far more than the bucket number, so Presto can not recognize the
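
The file-count symptom described in the message above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code. It assumes that each write task emits one file per bucket it happens to hold, which would explain why the file count exceeds the bucket number. The names `bucket_of` and `count_files` are hypothetical, and a single output partition is assumed.

```python
from collections import defaultdict


def bucket_of(key, num_buckets):
    # Hypothetical stand-in for the writer's bucket hash.
    return key % num_buckets


def count_files(tasks, num_buckets):
    """Count output files, assuming each task writes one file per bucket it holds."""
    files = set()
    for task_id, rows in enumerate(tasks):
        for key in rows:
            files.add((task_id, bucket_of(key, num_buckets)))
    return len(files)


num_buckets = 4
keys = list(range(40))

# Rows spread round-robin over 10 tasks: every task sees several buckets.
scattered = [keys[i::10] for i in range(10)]

# Rows regrouped first so that each task holds exactly one bucket.
grouped = defaultdict(list)
for k in keys:
    grouped[bucket_of(k, num_buckets)].append(k)
aligned = list(grouped.values())

print(count_files(scattered, num_buckets))  # more files than buckets
print(count_files(aligned, num_buckets))    # one file per bucket
```

In this model the scattered layout produces more files than buckets, while the regrouped layout produces exactly one file per bucket. If the assumption holds for the Spark version in use, repartitioning the data by the bucket column before writing would be the analogous step, but that should be verified against the actual job.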

Re: Spark output data to S3 is very slow

2016-09-17 Thread Qiang Li
https://www.mail-archive.com/user@spark.apache.org/msg56791.html // maropu On Sat, Sep 17, 2016 at 11:34 AM, Qiang Li <q...@appannie.com> wrote: > Hi, I ran some jobs with Spark 2.0 on Yarn, I found all tasks finished very qui

Spark output data to S3 is very slow

2016-09-16 Thread Qiang Li
Hi, I ran some jobs with Spark 2.0 on YARN and found that all tasks finished very quickly, but in the last step Spark spends a lot of time renaming or moving data from the S3 temporary directory to the real directory. Then I tried to set