Not all KafkaReceivers are processing the data. Why?

2016-09-14 Thread Rachana Srivastava
Hello all,

I have created a Kafka topic with 5 partitions, and I am using the createStream 
receiver API as shown below. But somehow only one receiver is getting the 
input data; the rest of the receivers are not processing anything. Can you please help?

JavaPairDStream<String, String> messages = null;

if (sparkStreamCount > 0) {
    // We create an input DStream for each partition of the topic,
    // unify those streams, and then repartition the unified stream.
    List<JavaPairDStream<String, String>> kafkaStreams =
        new ArrayList<JavaPairDStream<String, String>>(sparkStreamCount);
    for (int i = 0; i < sparkStreamCount; i++) {
        kafkaStreams.add(KafkaUtils.createStream(jssc,
            contextVal.getString(KAFKA_ZOOKEEPER), contextVal.getString(KAFKA_GROUP_ID),
            kafkaTopicMap));
    }
    messages = jssc.union(kafkaStreams.get(0),
        kafkaStreams.subList(1, kafkaStreams.size()));
} else {
    messages = KafkaUtils.createStream(jssc,
        contextVal.getString(KAFKA_ZOOKEEPER), contextVal.getString(KAFKA_GROUP_ID),
        kafkaTopicMap);
}









How do we process/scale variable-size batches in Apache Spark Streaming?

2016-08-23 Thread Rachana Srivastava
I am running a Spark Streaming process where I get a batch of data every n 
seconds. I am using repartition to scale the application. Since the repartition 
size is fixed, we end up with lots of small files when the batch is very small. 
Is there any way I can change the partitioner logic based on the input batch 
size in order to avoid lots of small files?
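
One idea, offered only as a sketch and not something from the original thread, is to 
pick the partition count per batch inside foreachRDD from that batch's record count, 
at the cost of an extra pass over the data. Here `lines` is assumed to be the 
JavaDStream<String> being written, and recordsPerPartition and the output path are 
placeholders:

// Sketch: size the repartition per batch so small batches yield few files.
// Assumed imports: org.apache.spark.api.java.JavaRDD,
// org.apache.spark.api.java.function.Function
final long recordsPerPartition = 10000L;        // hypothetical tuning knob

lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    public Void call(JavaRDD<String> rdd) {
        long count = rdd.count();               // extra action: one pass to size the batch
        int parts = (int) Math.max(1, count / recordsPerPartition);
        rdd.repartition(parts)
           .saveAsTextFile("hdfs://namenode:8020/output/batch-" + System.currentTimeMillis()); // placeholder path
        return null;
    }
});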


Number of tasks on executors become negative after executor failures

2016-08-15 Thread Rachana Srivastava
Summary:
I am running Spark 1.5 on CDH 5.5.1. Under extreme load I intermittently get the 
connection-failure exception below, and afterwards the Spark UI shows a negative 
number of tasks on the executors.

Exception:
TRACE: org.apache.hadoop.hbase.ipc.AbstractRpcClient - Call: Multi, callTime: 
76ms
INFO : org.apache.spark.network.client.TransportClientFactory - Found inactive 
connection to /xxx.xxx.xxx., creating a new one.
ERROR: org.apache.spark.network.shuffle.RetryingBlockFetcher - Exception while 
beginning fetch of 1 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to /xxx.xxx.xxx.
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /xxx.xxx.xxx.
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more


Related Defects:
https://issues.apache.org/jira/browse/SPARK-2319
https://issues.apache.org/jira/browse/SPARK-9591




Missing Executor Logs From Yarn After Spark Failure

2016-07-19 Thread Rachana Srivastava
I am trying to find the root cause of a recent Spark application failure in 
production. While the Spark application is running I can check the NodeManager's 
yarn.nodemanager.log-dir property to get the Spark executor container logs.

The container directory has logs for both of the running Spark applications.

Here is the view of the container logs:
drwx--x--- 3 yarn yarn  51 Jul 19 09:04 application_1467068598418_0209
drwx--x--- 5 yarn yarn 141 Jul 19 09:04 application_1467068598418_0210

But when the application is killed, both application log directories are deleted 
automatically. I have set all the log retention settings in YARN to very large 
values, yet these logs are still deleted as soon as the Spark applications crash.

Question: How can we retain these Spark application logs in YARN for debugging 
when the Spark application crashes for some reason?


Spark process failing to receive data from the Kafka queue in yarn-client mode.

2016-02-05 Thread Rachana Srivastava
I am trying to run the following code in yarn-client mode but I am getting the slow 
ReadProcessor error shown below; the code works just fine in local mode. 
Any pointer is really appreciated.

Line of code to receive data from the Kafka queue:
JavaPairReceiverInputDStream<String, String> messages =
    KafkaUtils.createStream(jssc, String.class, String.class, StringDecoder.class,
        StringDecoder.class, kafkaParams, kafkaTopicMap, StorageLevel.MEMORY_ONLY());

JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
    public String call(Tuple2<String, String> tuple2) {
        LOG.info("Input json stream data: " + tuple2._2);
        return tuple2._2();
    }
});


Error Details:
2016-02-05 11:44:00 WARN DFSClient:975 - Slow ReadProcessor read fields took 30011ms (threshold=3ms); ack: seqno: 1960 reply: 0 reply: 0 reply: 0 downstreamAckTimeNanos: 1227280, targets: [DatanodeInfoWithStorage[10.0.0.245:50010,DS-a55d9212-3771-4936-bbe7-02035e7de148,DISK], DatanodeInfoWithStorage[10.0.0.243:50010,DS-231b9915-c2e2-4392-b075-8a52ba1820ac,DISK], DatanodeInfoWithStorage[10.0.0.244:50010,DS-6b8b5814-7dd7-4315-847c-b73bd375af0e,DISK]]
2016-02-05 11:44:00 INFO BlockManager:59 - Removing RDD 1954
2016-02-05 11:44:00 INFO MapPartitionsRDD:59 - Removing RDD 1955 from persisten


RE: Random Forest FeatureImportance throwing NullPointerException

2016-01-14 Thread Rachana Srivastava
Tried using the 1.6 version of Spark, whose API takes numberOfFeatures as a fifth 
argument, but I am still getting featureImportances as null.

RandomForestClassifier rfc = getRandomForestClassifier(numTrees, maxBinSize,
    maxTreeDepth, seed, impurity);
RandomForestClassificationModel rfm = RandomForestClassificationModel.fromOld(
    model, rfc, categoricalFeatures, numberOfClasses, numberOfFeatures);
System.out.println(rfm.featureImportances());

Stack Trace:
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.spark.ml.tree.impl.RandomForest$.computeFeatureImportance(RandomForest.scala:1152)
at 
org.apache.spark.ml.tree.impl.RandomForest$$anonfun$featureImportances$1.apply(RandomForest.scala:)
at 
org.apache.spark.ml.tree.impl.RandomForest$$anonfun$featureImportances$1.apply(RandomForest.scala:1108)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
org.apache.spark.ml.tree.impl.RandomForest$.featureImportances(RandomForest.scala:1108)
at 
org.apache.spark.ml.classification.RandomForestClassificationModel.featureImportances$lzycompute(RandomForestClassifier.scala:237)
at 
org.apache.spark.ml.classification.RandomForestClassificationModel.featureImportances(RandomForestClassifier.scala:237)
at 
com.markmonitor.antifraud.ce.ml.CheckFeatureImportance.main(CheckFeatureImportance.java:49)
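
For comparison, here is a sketch of the DataFrame-based path in the 1.6 ml API, where 
featureImportances is populated by fit() rather than reconstructed via fromOld. This is 
only an illustration: trainingData (a DataFrame with "label" and "features" columns) is 
an assumption, and the other parameters reuse the names above.

// Sketch: train through RandomForestClassifier.fit so featureImportances
// is computed during training instead of via fromOld.
RandomForestClassifier rfc = new RandomForestClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setNumTrees(numTrees)
    .setMaxBins(maxBinSize)
    .setMaxDepth(maxTreeDepth)
    .setImpurity(impurity)
    .setSeed(seed);

RandomForestClassificationModel fitted = rfc.fit(trainingData);
System.out.println(fitted.featureImportances());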

From: Rachana Srivastava
Sent: Wednesday, January 13, 2016 3:30 PM
To: 'u...@spark.apache.org'; 'dev@spark.apache.org'
Subject: Random Forest FeatureImportance throwing NullPointerException

I have a random forest model for which I am trying to get the featureImportances 
vector.

Map<Integer, Integer> categoricalFeaturesParam = new HashMap<>();
scala.collection.immutable.Map categoricalFeatures =
    (scala.collection.immutable.Map) scala.collection.immutable.Map$.MODULE$
        .apply(JavaConversions.mapAsScalaMap(categoricalFeaturesParam).toSeq());
int numberOfClasses = 2;
RandomForestClassifier rfc = new RandomForestClassifier();
RandomForestClassificationModel rfm = RandomForestClassificationModel.fromOld(
    model, rfc, categoricalFeatures, numberOfClasses);
System.out.println(rfm.featureImportances());

When I run the above code I find featureImportances is null. Do I need to set 
anything specific to get the feature importance for the random forest model?

Thanks,

Rachana


Random Forest FeatureImportance throwing NullPointerException

2016-01-13 Thread Rachana Srivastava
I have a random forest model for which I am trying to get the featureImportances 
vector.

Map<Integer, Integer> categoricalFeaturesParam = new HashMap<>();
scala.collection.immutable.Map categoricalFeatures =
    (scala.collection.immutable.Map) scala.collection.immutable.Map$.MODULE$
        .apply(JavaConversions.mapAsScalaMap(categoricalFeaturesParam).toSeq());
int numberOfClasses = 2;
RandomForestClassifier rfc = new RandomForestClassifier();
RandomForestClassificationModel rfm = RandomForestClassificationModel.fromOld(
    model, rfc, categoricalFeatures, numberOfClasses);
System.out.println(rfm.featureImportances());

When I run the above code I find featureImportances is null. Do I need to set 
anything specific to get the feature importance for the random forest model?

Thanks,

Rachana


Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Rachana Srivastava
I have a very simple two-line program. I get input from Kafka, save the input to a 
file, and count the input received. My code looks like the following; when I run it 
I get two accumulator counts for each input.

HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
kafkaParams.put("zookeeper.connect", "localhost:2181");
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, topicsSet);
final Accumulator<Integer> accum = jssc.sparkContext().accumulator(0);
JavaDStream<String> lines = messages.map(
    new Function<Tuple2<String, String>, String>() {
        public String call(Tuple2<String, String> tuple2) {
            accum.add(1);
            return tuple2._2();
        }
    });
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    public Void call(JavaRDD<String> rdd) throws Exception {
        if (!rdd.isEmpty() || !rdd.partitions().isEmpty()) {
            rdd.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/cloudera/testDirJan4/test1.text");
        }
        System.out.println(" & COUNT OF ACCUMULATOR IS " + accum.value());
        return null;
    }
});
jssc.start();

If I comment out rdd.saveAsTextFile I get the correct count, but with 
rdd.saveAsTextFile each input is counted multiple times by the accumulator.
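
A hedged guess at the cause: rdd.isEmpty() and saveAsTextFile each evaluate the map 
that calls accum.add, so records in the recomputed partitions are counted more than 
once. A minimal sketch of one workaround is to cache the batch so the mapped RDD is 
evaluated only once (assuming each batch fits the default storage level):

// Sketch: persist each batch so the mapped RDD (which increments the
// accumulator) is not recomputed when it is used by more than one action.
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    public Void call(JavaRDD<String> rdd) throws Exception {
        rdd.cache();
        if (!rdd.isEmpty()) {
            rdd.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/cloudera/testDirJan4/test1.text");
        }
        System.out.println(" & COUNT OF ACCUMULATOR IS " + accum.value());
        rdd.unpersist();
        return null;
    }
});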

Thanks,

Rachana


Spark Streaming Application is Stuck Under Heavy Load Due to DeadLock

2016-01-04 Thread Rachana Srivastava
Hello All,

I am running my application on a Spark cluster, but under heavy load the system 
hangs due to a deadlock. I found a similar issue resolved here 
https://datastax-oss.atlassian.net/browse/JAVA-555 in Spark version 2.1.3. 
But I am running on Spark 1.3 and still getting the same issue.

Here is the stack trace for reference:

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
org.apache.spark.streaming.ContextWaiter.waitForStopOrError(ContextWaiter.scala:63)
org.apache.spark.streaming.StreamingContext.awaitTermination(StreamingContext.scala:521)
org.apache.spark.streaming.api.java.JavaStreamingContext.awaitTermination(JavaStreamingContext.scala:592)

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:62)
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60)

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
org.spark-project.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
org.spark-project.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
org.spark-project.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
java.lang.Thread.run(Thread.java:745)


Thanks,

Rachana



java.lang.NoSuchMethodError while saving a random forest model in Spark version 1.5

2015-12-15 Thread Rachana Srivastava
I have recently upgraded my Spark version, but when I try to save a random forest 
model using the model save command I get a NoSuchMethodError. My code works fine 
with the 1.3.x version.



model.save(sc.sc(), "modelsavedir");


ERROR: org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation - 
Aborting job.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 
(TID 230, localhost): java.lang.NoSuchMethodError: 
parquet.schema.Types$GroupBuilder.addField(Lparquet/schema/Type;)Lparquet/schema/Types$BaseGroupBuilder;
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convertField$1.apply(CatalystSchemaConverter.scala:517)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convertField$1.apply(CatalystSchemaConverter.scala:516)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at 
scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:516)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
org.apache.spark.sql.types.StructType.map(StructType.scala:92)
at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
at 
org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
at 
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:287)
at 
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:261)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
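
A NoSuchMethodError on parquet.schema.Types usually means two Parquet versions are on 
the classpath (for example an older parquet jar bundled with the Hadoop distribution 
shadowing the one Spark 1.5 expects). A small hedged diagnostic, not from the original 
thread, to print which jar actually provides the class at runtime:

// Sketch: print the jar that provides parquet.schema.Types at runtime;
// an old Hadoop-bundled parquet jar here would explain the missing method.
try {
    java.net.URL source = Class.forName("parquet.schema.Types")
        .getProtectionDomain().getCodeSource().getLocation();
    System.out.println("parquet.schema.Types loaded from: " + source);
} catch (ClassNotFoundException e) {
    System.out.println("parquet.schema.Types not found on the classpath: " + e);
}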



spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Rachana Srivastava
Issue:
I have a random forest model that I am trying to load during streaming using the 
following code. The code works fine when I run it from Eclipse, but I get an NPE 
when running it with spark-submit.

JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(duration));
System.out.println("& trying to get the context &&& ");
final RandomForestModel model =
    RandomForestModel.load(jssc.sparkContext().sc(), MODEL_DIRECTORY); // line 116 causing the issue
System.out.println("& model debug &&& " + model.toDebugString());


Exception Details:
INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 2.0, 
whose tasks have all completed, from pool
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData.toSplit(DecisionTreeModel.scala:144)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$16.apply(DecisionTreeModel.scala:291)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$16.apply(DecisionTreeModel.scala:291)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:291)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:286)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:287)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:286)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructTree(DecisionTreeModel.scala:268)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$12.apply(DecisionTreeModel.scala:251)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$12.apply(DecisionTreeModel.scala:250)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructTrees(DecisionTreeModel.scala:250)
at 
org.apache.spark.mllib.tree.model.TreeEnsembleModel$SaveLoadV1_0$.loadTrees(treeEnsembleModels.scala:340)
at 
org.apache.spark.mllib.tree.model.RandomForestModel$.load(treeEnsembleModels.scala:72)
at 
org.apache.spark.mllib.tree.model.RandomForestModel.load(treeEnsembleModels.scala)
at 
com.markmonitor.antifraud.ce.KafkaURLStreaming.main(KafkaURLStreaming.java:116)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Nov 19, 2015 1:10:56 PM WARNING: parquet.hadoop.ParquetRecordReader: Can not 
initialize counter due to context is not a instance of TaskInputOutputContext, 
but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

Spark Source Code:
case class PredictData(predict: Double, prob: Double) {
  def toPredict: Predict = new Predict(predict, prob)
}

Thanks,

Rachana




Frozen exception while dynamically creating classes inside Spark using the Javassist API

2015-11-03 Thread Rachana Srivastava
I am trying to dynamically create a new class in Spark using the Javassist API. 
The code seems very simple, just invoking the makeClass API with a hardcoded class 
name. The code works fine outside the Spark environment, but I get this 
checkNotFrozen exception when I run the code inside Spark.
Code Excerpt:
ClassPool pool = ClassPool.getDefault();
CtClass regExClass = pool.makeClass("TestClass14", baseFeatureProcessor);
Exception Details:
## Exception make Class
java.lang.RuntimeException: TestClass14: frozen class (cannot edit)
at javassist.ClassPool.checkNotFrozen(ClassPool.java:617)
at javassist.ClassPool.makeClass(ClassPool.java:859)
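
A hedged sketch of one way around the frozen-class error, assuming it appears because 
the same class name is created more than once in the same JVM/ClassPool (for example 
when the closure runs repeatedly on an executor):

// Sketch: reuse or defrost the cached CtClass instead of calling makeClass
// again with the same name; pool and baseFeatureProcessor are as above.
ClassPool pool = ClassPool.getDefault();
CtClass regExClass = pool.getOrNull("TestClass14");
if (regExClass == null) {
    regExClass = pool.makeClass("TestClass14", baseFeatureProcessor);
} else if (regExClass.isFrozen()) {
    regExClass.defrost();    // allow the already-loaded definition to be edited again
}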



yarn-cluster mode throwing NullPointerException

2015-10-11 Thread Rachana Srivastava
I am trying to submit a job in yarn-cluster mode using the spark-submit command. 
My code works fine when I use yarn-client mode.

Cloudera Version:
CDH-5.4.7-1.cdh5.4.7.p0.3

Command Submitted:
spark-submit --class "com.markmonitor.antifraud.ce.KafkaURLStreaming"  \
--driver-java-options 
"-Dlog4j.configuration=file:///etc/spark/myconf/log4j.sample.properties" \
--conf 
"spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/myconf/log4j.sample.properties"
 \
--conf 
"spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/myconf/log4j.sample.properties"
 \
--num-executors 2 \
--executor-cores 2 \
../target/mm-XXX-ce-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
yarn-cluster 10 "XXX:2181" "XXX:9092" groups kafkaurl 5 \
"hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/ec2-user/urlFeature.properties"
 \
"hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/ec2-user/urlFeatureContent.properties"
 \
"hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/ec2-user/hdfsOutputNEWScript/OUTPUTYarn2"
  false


Log Details:
INFO : org.apache.spark.SparkContext - Running Spark version 1.3.0
INFO : org.apache.spark.SecurityManager - Changing view acls to: ec2-user
INFO : org.apache.spark.SecurityManager - Changing modify acls to: ec2-user
INFO : org.apache.spark.SecurityManager - SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(ec2-user); users 
with modify permissions: Set(ec2-user)
INFO : akka.event.slf4j.Slf4jLogger - Slf4jLogger started
INFO : Remoting - Starting remoting
INFO : Remoting - Remoting started; listening on addresses 
:[akka.tcp://sparkdri...@ip-10-0-0-xxx.us-west-2.compute.internal:49579]
INFO : Remoting - Remoting now listens on addresses: 
[akka.tcp://sparkdri...@ip-10-0-0-xxx.us-west-2.compute.internal:49579]
INFO : org.apache.spark.util.Utils - Successfully started service 'sparkDriver' 
on port 49579.
INFO : org.apache.spark.SparkEnv - Registering MapOutputTracker
INFO : org.apache.spark.SparkEnv - Registering BlockManagerMaster
INFO : org.apache.spark.storage.DiskBlockManager - Created local directory at 
/tmp/spark-1c805495-c7c4-471d-973f-b1ae0e2c8ff9/blockmgr-fff1946f-a716-40fc-a62d-bacba5b17638
INFO : org.apache.spark.storage.MemoryStore - MemoryStore started with capacity 
265.4 MB
INFO : org.apache.spark.HttpFileServer - HTTP File server directory is 
/tmp/spark-8ed6f513-854f-4ee4-95ea-87185364eeaf/httpd-75cee1e7-af7a-4c82-a9ff-a124ce7ca7ae
INFO : org.apache.spark.HttpServer - Starting HTTP Server
INFO : org.spark-project.jetty.server.Server - jetty-8.y.z-SNAPSHOT
INFO : org.spark-project.jetty.server.AbstractConnector - Started 
SocketConnector@0.0.0.0:46671
INFO : org.apache.spark.util.Utils - Successfully started service 'HTTP file 
server' on port 46671.
INFO : org.apache.spark.SparkEnv - Registering OutputCommitCoordinator
INFO : org.spark-project.jetty.server.Server - jetty-8.y.z-SNAPSHOT
INFO : org.spark-project.jetty.server.AbstractConnector - Started 
SelectChannelConnector@0.0.0.0:4040
INFO : org.apache.spark.util.Utils - Successfully started service 'SparkUI' on 
port 4040.
INFO : org.apache.spark.ui.SparkUI - Started SparkUI at 
http://ip-10-0-0-XXX.us-west-2.compute.internal:4040
INFO : org.apache.spark.SparkContext - Added JAR 
file:/home/ec2-user/CE/correlationengine/scripts/../target/mm-anti-fraud-ce-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 at 
http://10.0.0.XXX:46671/jars/mm-anti-fraud-ce-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 with timestamp 1444620509463
INFO : org.apache.spark.scheduler.cluster.YarnClusterScheduler - Created 
YarnClusterScheduler
ERROR: org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - 
Application ID is not set.
INFO : org.apache.spark.network.netty.NettyBlockTransferService - Server 
created on 33880
INFO : org.apache.spark.storage.BlockManagerMaster - Trying to register 
BlockManager
INFO : org.apache.spark.storage.BlockManagerMasterActor - Registering block 
manager ip-10-0-0-XXX.us-west-2.compute.internal:33880 with 265.4 MB RAM, 
BlockManagerId(<driver>, ip-10-0-0-XXX.us-west-2.compute.internal, 33880)
INFO : org.apache.spark.storage.BlockManagerMaster - Registered BlockManager
INFO : org.apache.spark.scheduler.EventLoggingListener - Logging events to 
hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/spark/applicationHistory/spark-application-1444620509497
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:580)
at 
org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:541)
at 
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at 
com.markmonitor.antifraud.ce.KafkaURLStreaming.main(KafkaURLStreaming.java:91)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.Na

Where are logs for Spark Kafka Yarn on Cloudera

2015-09-28 Thread Rachana Srivastava
Hello all,

I am trying to test JavaKafkaWordCount on YARN; to make sure YARN is working fine 
I am saving the output to HDFS. The example works fine in local mode but not in 
yarn mode. I cannot see any output logged when I change the mode to yarn-client or 
yarn-cluster, nor can I find any errors logged. For my application id I was looking 
for logs under /var/log/hadoop-yarn/containers (e.g. 
/var/log/hadoop-yarn/containers/application_1439517792099_0010/container_1439517792099_0010_01_03/stderr)
 but I cannot find any useful information. Is this the only location where 
application logs are written?

I also tried setting the log output to spark.yarn.app.container.log.dir but got an 
access-denied error.

Question: Do we need some special setup to run Spark Streaming on YARN? How do we 
debug? Where can I find more details on testing streaming on YARN?

Thanks,

Rachana


spark-submit classloader issue...

2015-09-28 Thread Rachana Srivastava
Hello all,

Goal: I want to use APIs from the HttpClient library 4.4.1. I am using the maven 
shade plugin to generate the JAR.



Findings: When I run my program as a Java application within Eclipse, everything 
works fine. But when I run the program using spark-submit I get the 
following error:

URL content Could not initialize class 
org.apache.http.conn.ssl.SSLConnectionSocketFactory

java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.http.conn.ssl.SSLConnectionSocketFactory



When I tried to find which JAR the class is actually loaded from, it points to a 
Hadoop JAR; I am assuming this is something set by spark-submit.



ClassLoader classLoader = HttpEndPointClient.class.getClassLoader();

URL resource = 
classLoader.getResource("org/apache/http/message/BasicLineFormatter.class");

This prints the following jar:

jar:file:/usr/lib/hadoop/lib/httpcore-4.2.5.jar!/org/apache/http/message/BasicLineFormatter.class



After research I found that I can override the classpath ordering with --conf 
spark.files.userClassPathFirst=true --conf spark.yarn.user.classpath.first=true



But when I do that I get the following error:

ERROR: org.apache.spark.executor.Executor - Exception in task 0.0 in stage 0.0 
(TID 0)

java.io.InvalidClassException: org.apache.spark.scheduler.Task; local class 
incompatible: stream classdesc serialVersionUID = -4703555755588060120, local 
class serialVersionUID = -1589734467697262504

at 
java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)

at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)

at 
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)

at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)

at 
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)

at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)

at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)

at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)

at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)

at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)



I am running on CDH 5.4.  Here is my complete pom file:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

  <modelVersion>4.0.0</modelVersion>
  <groupId>test</groupId>
  <artifactId>test</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <dependencies>

    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
      <version>4.4.1</version>
    </dependency>

    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.4.1</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.5.0</version>
      <exclusions>
        <exclusion>
          <artifactId>httpcore</artifactId>
          <groupId>org.apache.httpcomponents</groupId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.5.0</version>
      <exclusions>
        <exclusion>
          <artifactId>httpcore</artifactId>
          <groupId>org.apache.httpcomponents</groupId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.0</version>
      <exclusions>
        <exclusion>
          <artifactId>httpcore</artifactId>
          <groupId>org.apache.httpcomponents</groupId>
        </exclusion>
      </exclusions>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.10</artifactId>

JavaRDD using Reflection

2015-09-14 Thread Rachana Srivastava
Hello all,

I am working on a problem that requires us to create different sets of JavaRDDs 
based on different input arguments. We get the following error when we try to use 
a factory to create the JavaRDDs. The error message is clear, but I am wondering 
whether there is any workaround.

Question:
How can we create different sets of JavaRDDs dynamically based on different input 
arguments? We are trying to implement something like a factory pattern.

Error Message:
RDD transformations and actions can only be invoked by the driver, not inside 
of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) 
is invalid because the values transformation and count action cannot be 
performed inside of the rdd1.map transformation. For more information, see 
SPARK-5063.
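
The constraint behind that message is that RDDs can only be created and acted on from 
the driver, so a factory works as long as it is invoked on the driver with the 
JavaSparkContext and never from inside another transformation. A rough sketch follows; 
the class and argument names are illustrative, not from the original thread:

// Sketch: a driver-side factory; the returned RDDs can then be used in
// transformations, but create(...) itself must never run inside one.
public class RddFactory {                         // hypothetical name
    private final JavaSparkContext jsc;

    public RddFactory(JavaSparkContext jsc) {
        this.jsc = jsc;
    }

    public JavaRDD<String> create(String inputType, String arg) {
        if ("text".equals(inputType)) {
            return jsc.textFile(arg);             // arg is an HDFS/local path
        } else if ("list".equals(inputType)) {
            return jsc.parallelize(java.util.Arrays.asList(arg.split(",")));
        }
        throw new IllegalArgumentException("Unknown input type: " + inputType);
    }
}

// Driver-side usage:
JavaRDD<String> rdd = new RddFactory(jsc).create("text", "hdfs://namenode:8020/input"); // placeholder path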

Thanks,

Rachana


Multithreaded vs Spark Executor

2015-09-11 Thread Rachana Srivastava
Hello all,

We are getting a stream of input data from a Kafka queue using the Spark Streaming 
API. For each data element we want to run parallel threads to process a set of 
feature lists (nearly 100 features or more). Since the feature list computations are 
independent of each other, we would like to execute them in parallel on the input 
data that we get from the Kafka queue.

The questions are:

1. Should we write a thread pool and manage the execution of these features on 
different threads in parallel? The only concern is that, because of data locality, 
we are confined to the node that is assigned the input data from the Kafka stream, 
so we cannot leverage distributed nodes to process these features for a single 
input record.

2. Or, since we are using a JavaRDD as the feature list, will this feature execution 
be managed internally by the Spark executors? (A third option is sketched below.)
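
The sketch below is only an illustration under stated assumptions: `lines` is the 
JavaDStream<String> coming from Kafka, featureIds stands in for the real feature list, 
and the per-record feature evaluation is replaced by a placeholder. The idea is to fan 
each record out into one element per feature and let Spark schedule the work after a 
repartition:

// Sketch: expand each record into (featureId, record) pairs, then repartition
// so the ~100 per-record feature computations can run on other executors too.
// Assumed imports: java.util.*, scala.Tuple2,
// org.apache.spark.api.java.function.PairFlatMapFunction,
// org.apache.spark.api.java.function.Function
final List<String> featureIds = Arrays.asList("feature1", "feature2");   // stand-in list
final int parallelism = 8;                                               // hypothetical setting

JavaPairDStream<String, String> recordFeaturePairs = lines.flatMapToPair(
    new PairFlatMapFunction<String, String, String>() {
        public Iterable<Tuple2<String, String>> call(String record) {
            List<Tuple2<String, String>> pairs = new ArrayList<Tuple2<String, String>>();
            for (String featureId : featureIds) {
                pairs.add(new Tuple2<String, String>(featureId, record));
            }
            return pairs;
        }
    });

JavaDStream<String> featureResults = recordFeaturePairs
    .repartition(parallelism)
    .map(new Function<Tuple2<String, String>, String>() {
        public String call(Tuple2<String, String> pair) {
            // placeholder for the real evaluation of feature pair._1() on record pair._2()
            return pair._1() + "=" + pair._2().length();
        }
    });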

Thanks,

Rachana


New JavaRDD Inside JavaPairDStream

2015-09-11 Thread Rachana Srivastava
Hello all,

Can we create a JavaRDD while processing a stream from Kafka, for example? The 
following code is throwing a serialization exception. I am not sure whether this 
is feasible.

JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(5));
JavaPairReceiverInputDStream<String, String> messages =
    KafkaUtils.createStream(jssc, zkQuorum, group, topicMap);
JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
    public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
    }
});
JavaPairDStream<String, Double> wordCounts = lines.mapToPair(
    new PairFunction<String, String, Double>() {
        public Tuple2<String, Double> call(String urlString) {
            String propertiesFile = "/home/cloudera/Desktop/sample/input/featurelist.properties";
            JavaRDD<String> propertiesFileRDD = jsc.textFile(propertiesFile);
            JavaPairRDD<String, String> featureKeyClassPair = propertiesFileRDD.mapToPair(
                new PairFunction<String, String, String>() {
                    public Tuple2<String, String> call(String property) {
                        return new Tuple2<String, String>(property.split("=")[0], property.split("=")[1]);
                    }
                });
            featureKeyClassPair.count();
            return new Tuple2<String, Double>(urlString, featureScore);
        }
    });
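
The serialization failure is expected here because jsc (the SparkContext) is captured 
inside the mapToPair closure, and contexts cannot be shipped to or used on executors. 
A hedged rework of the same snippet: load the properties file once on the driver, 
broadcast the resulting map, and reference only the broadcast inside the stream 
(featureScore stays a placeholder, since its computation is not shown above):

// Sketch: driver-side load + broadcast; only the broadcast value is used
// inside the streaming closure, so nothing non-serializable is captured.
JavaPairRDD<String, String> featureKeyClassPair = jsc.textFile(
        "/home/cloudera/Desktop/sample/input/featurelist.properties")
    .mapToPair(new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String property) {
            return new Tuple2<String, String>(property.split("=")[0], property.split("=")[1]);
        }
    });
final Broadcast<Map<String, String>> featureMap =
    jsc.broadcast(new HashMap<String, String>(featureKeyClassPair.collectAsMap()));

JavaPairDStream<String, Double> wordCounts = lines.mapToPair(
    new PairFunction<String, String, Double>() {
        public Tuple2<String, Double> call(String urlString) {
            double featureScore = featureMap.value().size();   // placeholder scoring
            return new Tuple2<String, Double>(urlString, featureScore);
        }
    });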