This code makes the job run 2x as long. Is there a way to improve it?

2017-09-28 Thread Noppanit Charassinvichai
We're trying to write a filtered subset of our output records to another table in ORC, and the job takes twice as long. Is there a better way to do this? Here's the code: jsonRows.foreachRDD(r => { val jsonDf = sqlSession.read.schema(sparrowSchema.schema).json(r) val cnsDf =

Re: ClassNotFoundException for Workers

2017-07-31 Thread Noppanit Charassinvichai
amazonaws" % "aws-java-sdk-core" % "1.11.155" Not sure if I need special configuration? On Tue, 25 Jul 2017 at 04:17 周康 <zhoukang199...@gmail.com> wrote: > Ensure com.amazonaws.services.s3.AmazonS3ClientBuilder in your classpath > which include your appli

ClassNotFoundException for Workers

2017-07-19 Thread Noppanit Charassinvichai
I have a Spark job that uses an S3 client in mapPartition, and I get this error: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 74, ip-10-90-78-177.ec2.internal, executor 11): java.lang.NoClassDefFoundError: Could not

[Spark Streaming] How to make this code work?

2017-07-18 Thread Noppanit Charassinvichai
I'm super new to Spark and I'm writing this job to parse nginx logs into ORC file format so they can be read from Presto. We wrote LogLine2Json, which parses a line of nginx log into JSON, and that works fine. val sqs = streamContext.receiverStream(new SQSReceiver("elb") //.credentials("key",