Try using the Hadoop setting mapreduce.input.fileinputformat.split.maxsize to control the RDD partition size. I've heard that the DataFrame reader can combine several files into one task.
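A minimal sketch of that suggestion, assuming a Spark 1.6-era API (the app name, path, and 128 MB value are illustrative, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-count"))

    // As suggested above, set the Hadoop max input-split size so the reader
    // has some control over how input is split into RDD partitions.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize",
      (128L * 1024 * 1024).toString)

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("/data/logs/parquet")  // hypothetical path
    println(df.count())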
On Thu, May 19, 2016 at 8:50 PM, 王晓龙/01111515 <roland8...@cmbchina.com> wrote:
> I'm using a Spark Streaming program to store log messages into parquet files
> every 10 minutes.
> Now, when I query the parquet data, it usually takes hundreds of thousands of
> stages to compute a single count.
> I looked into the parquet files' path and found a great number of small files.
>
> Did the small files cause the problem? Can I merge them, or is there a
> better way to solve it?
>
> Lots of thanks.
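For the "can I merge them" question, one common approach is to compact each window of small files by reading them back and rewriting them with fewer partitions. A sketch under Spark 1.6-era APIs, assuming a spark-shell where `sc` already exists; the paths and partition count are hypothetical:

    import org.apache.spark.sql.SQLContext

    // Assumes an existing SparkContext `sc` (e.g. in spark-shell).
    val sqlContext = new SQLContext(sc)

    // Compact the small files: read everything for one window, reduce the
    // number of partitions, and write it back out as fewer, larger files.
    val df = sqlContext.read.parquet("/data/logs/parquet/2016-05-19")  // hypothetical input path

    df.coalesce(8)  // hypothetical target number of output files
      .write
      .parquet("/data/logs/parquet-compacted/2016-05-19")  // hypothetical output path

Queries would then be pointed at the compacted location, which keeps the task count proportional to the number of (larger) files rather than to every 10-minute micro-batch output.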