I am currently using filter inside a loop over all severity levels to do this, 
which I think is pretty inefficient: it has to read the entire data set once 
for each severity. Is there a more efficient way that takes just one pass 
over the data? Thanks.
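
For reference, this is roughly what the loop looks like on my end (the HDFS 
paths below are just placeholders, and sc is the usual SparkContext):

val logs = sc.textFile("hdfs:///logs/input")  // placeholder input path

for (sev <- Seq("ERROR", "INFO")) {
  // each saveAsTextFile action re-reads the full input, so this is one scan per severity
  logs.filter(_.startsWith(s"[$sev]"))
      .saveAsTextFile(s"hdfs:///logs/output/${sev.toLowerCase}_file")  // placeholder output path
}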

Best,
Hao Wang

> On Jun 13, 2015, at 3:48 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> 
> Are you looking for something like filter? See a similar example here: 
> https://spark.apache.org/examples.html
> 
> Thanks
> Best Regards
> 
> On Sat, Jun 13, 2015 at 3:11 PM, Hao Wang <bill...@gmail.com> wrote:
> Hi,
> 
> I have a bunch of large log files on Hadoop. Each line contains a log and its 
> severity. Is there a way that I can use Spark to split the entire data set 
> into different files on Hadoop according to the severity field? Thanks. Below is 
> an example of the input and output.
> 
> Input:
> [ERROR] log1
> [INFO] log2
> [ERROR] log3
> [INFO] log4
> 
> Output:
> error_file
> [ERROR] log1
> [ERROR] log3
> 
> info_file
> [INFO] log2
> [INFO] log4
> 
> 
> Best,
> Hao Wang
> 
