I am currently doing this with a filter inside a loop over all severity levels, which I think is pretty inefficient: it has to read the entire data set once per severity. Is there a more efficient way that takes just one pass over the data? Thanks.
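For concreteness, what I have now looks roughly like the following. The severity list and the HDFS paths are placeholders, not my real ones:

    import org.apache.spark.{SparkConf, SparkContext}

    object SplitWithFilter {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SplitWithFilter"))
        val logs = sc.textFile("hdfs:///logs/input")  // placeholder input path

        // One filter + save per severity: every iteration rescans the input.
        for (sev <- Seq("ERROR", "INFO")) {
          logs.filter(_.startsWith("[" + sev + "]"))
              .saveAsTextFile("hdfs:///logs/" + sev.toLowerCase + "_file")
        }
        sc.stop()
      }
    }

Caching logs would cut down the rereads when the data fits in memory, but these files are large. One single-pass idea I am wondering about is sketched after the quoted thread below.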
Best,
Hao Wang

> On Jun 13, 2015, at 3:48 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
> Are you looking for something like filter? See a similar example here
> https://spark.apache.org/examples.html
>
> Thanks
> Best Regards
>
> On Sat, Jun 13, 2015 at 3:11 PM, Hao Wang <bill...@gmail.com> wrote:
> Hi,
>
> I have a bunch of large log files on Hadoop. Each line contains a log and
> its severity. Is there a way that I can use Spark to split the entire data
> set into different files on Hadoop according to the severity field? Thanks.
> Below is an example of the input and output.
>
> Input:
> [ERROR] log1
> [INFO] log2
> [ERROR] log3
> [INFO] log4
>
> Output:
> error_file
> [ERROR] log1
> [ERROR] log3
>
> info_file
> [INFO] log2
> [INFO] log4
>
> Best,
> Hao Wang
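The single-pass idea: key each line by its severity and let a custom MultipleTextOutputFormat route records to per-severity files inside one saveAsHadoopFile job. This is a rough, untested sketch, not something from the thread; the class, the paths, and the assumption that the severity is always the leading bracketed token are all placeholders:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    // Routes each (severity, line) pair to a per-severity subdirectory,
    // e.g. .../error_file/part-00000, so one job writes every output.
    class SeverityOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Suppress the key so each line is written back unchanged.
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
      // `name` is the usual part-file name; prefix it with the severity.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String] + "_file/" + name
    }

    object SplitBySeverity {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SplitBySeverity"))
        sc.textFile("hdfs:///logs/input")  // placeholder input path
          // "[ERROR] log1" -> ("error", "[ERROR] log1")
          .map(line => (line.substring(1, line.indexOf(']')).toLowerCase, line))
          .saveAsHadoopFile("hdfs:///logs/by_severity",  // placeholder output path
            classOf[String], classOf[String], classOf[SeverityOutputFormat])
        sc.stop()
      }
    }

Would something along these lines be a reasonable way to get it down to one pass?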