wholeTextFiles() API extremely SLOW under high load (How to fix?)

2021-10-02 Thread Rachana Srivastava
Issue: We are using the wholeTextFiles() API to read files from S3, but this API is extremely SLOW for the reasons mentioned below. Question: how do we fix this issue? Here is our analysis so far: the wholeTextFiles() API works in two steps.
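
If the one-record-per-file grouping that wholeTextFiles() provides is not actually required, a common workaround is to read the same prefix with textFile(), which splits and parallelizes the read instead of materializing each file as a single record. A minimal sketch, assuming line-oriented data; the bucket, prefix, and partition count are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadS3AsLines {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("ReadS3AsLines"));
        // textFile() computes line-level splits and reads them in parallel,
        // whereas wholeTextFiles() must load every file as one record.
        JavaRDD<String> lines = sc.textFile("s3a://my-bucket/input/*", 200);
        System.out.println("lines read: " + lines.count());
        sc.stop();
      }
    }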

S3-SQS vs Auto Loader With Apache Spark Structured Streaming

2020-12-20 Thread Rachana Srivastava
Problem Statement: I want to read files from S3 and write files to S3 using Spark Structured Streaming. I looked at the reference architecture recommended by the Spark team, which recommends S3 -> SNS -> SQS with the S3-SQS file source. Question: - S3-SQS file source: Is the S3-SQS file source
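
For comparison, the plain file-listing source that ships with open-source Spark (the S3-SQS source and Auto Loader are Databricks-specific) looks roughly like the sketch below; the schema, paths, and trigger limit are placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.StructType;

    public class S3FileSource {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("S3FileSource").getOrCreate();
        StructType schema = new StructType().add("id", "string").add("ts", "timestamp");
        // The OSS file source re-lists the input directory on every trigger,
        // which is exactly the cost the SQS-based sources are designed to avoid.
        Dataset<Row> input = spark.readStream()
            .format("json")
            .schema(schema)                      // streaming file sources require an explicit schema
            .option("maxFilesPerTrigger", 1000)  // bound each micro-batch
            .load("s3a://my-bucket/input/");
        input.writeStream()
            .format("parquet")
            .option("path", "s3a://my-bucket/output/")
            .option("checkpointLocation", "s3a://my-bucket/checkpoints/app1/")
            .start()
            .awaitTermination();
      }
    }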

Need your help!! (URGENT: Code works fine when submitted as Java main but part of the data is missing when running via spark-submit)

2020-07-21 Thread Rachana Srivastava
I am unable to identify the root cause of why my code is missing data when I run it via spark-submit, but the code works fine when I run it as a Java main. Any idea

OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Rachana Srivastava
Issue: I am trying to periodically process 5,000+ gzipped JSON files from S3 using Structured Streaming code. Here are the key steps: - Read the JSON schema and broadcast it to the executors - Read Stream Dataset inputDS = sparkSession.readStream() .format("text")
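
A sketch of the parsing step being described, assuming the schema has already been read into a StructType (names and paths are illustrative). Note that gzip is not a splittable codec, so each file is decompressed by a single task regardless of partitioning:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.from_json;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.StructType;

    public class ParseJsonStream {
      public static Dataset<Row> parse(SparkSession spark, StructType schema) {
        Dataset<Row> inputDS = spark.readStream()
            .format("text")
            .option("maxFilesPerTrigger", 500)  // cap files per micro-batch to bound memory
            .load("s3a://my-bucket/input/");
        // from_json parses each line against the schema; unparseable records become null
        return inputDS
            .select(from_json(col("value"), schema).alias("data"))
            .select("data.*");
      }
    }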

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
…millions of files. Please consider it as part of your solution. On Wed, Jun 17, 2020 at 9:50 AM Rachana Srivastava wrote: I have written a simple Spark Structured Streaming app to move data from Kafka to S3. Found that in order to support the exactly-once guarantee Spark creates a _spark_metadata folder, w

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
Frankly speaking I do not care about EXACTLY ONCE... I am OK with AT LEAST ONCE as long as the system does not fail every 5 to 7 days with no recovery option. On Wednesday, June 17, 2020, 02:31:50 PM PDT, Rachana Srivastava wrote: Thanks so much TD. Thanks for forwarding your datalake

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
…(I am involved in it) was built to exactly solve this problem: get exactly-once and ACID guarantees on files, but also scale to handling millions of files. Please consider it as part of your solution. On Wed, Jun 17, 2020 at 9:50 AM Rachana Srivastava wrote: I have written a simple Spark

How to manage offsets in Spark Structured Streaming?

2020-06-17 Thread Rachana Srivastava
Background: I have written a simple Spark Structured Streaming app to move data from Kafka to S3. Found that in order to support the exactly-once guarantee Spark creates a _spark_metadata folder, which ends up growing too large; when the streaming app runs for a long time the metadata folder grows
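
For the Kafka source the usual answer is to let the checkpoint directory own the offsets rather than tracking them by hand; on restart the query resumes from the last committed batch. A minimal sketch (brokers, topic, and paths are placeholders):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class KafkaOffsetsViaCheckpoint {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("KafkaToS3").getOrCreate();
        Dataset<Row> kafka = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "events")
            .option("startingOffsets", "earliest")  // only consulted on the very first run
            .load();
        // Offsets are committed to the checkpoint directory with each batch.
        kafka.selectExpr("CAST(value AS STRING) AS value")
            .writeStream()
            .format("parquet")
            .option("path", "s3a://my-bucket/output/")
            .option("checkpointLocation", "s3a://my-bucket/checkpoints/kafka-to-s3/")
            .start()
            .awaitTermination();
      }
    }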

Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
I have written a simple Spark Structured Streaming app to move data from Kafka to S3. Found that in order to support the exactly-once guarantee Spark creates a _spark_metadata folder, which ends up growing too large as the streaming app is SUPPOSED TO run FOREVER. But when the streaming app runs for a
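
If at-least-once delivery is acceptable, as the reply above says it is, one way to sidestep the file sink's _spark_metadata folder entirely is foreachBatch (available from Spark 2.4), which writes each micro-batch with the plain batch writer. A sketch; the paths are placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class ForeachBatchSink {
      public static void start(Dataset<Row> stream) throws Exception {
        stream.writeStream()
            .foreachBatch((Dataset<Row> batch, Long batchId) -> {
                // Plain batch write: no _spark_metadata folder is created, but a
                // batch replayed after a failure may be written a second time.
                batch.write().mode("append").parquet("s3a://my-bucket/output/");
            })
            .option("checkpointLocation", "s3a://my-bucket/checkpoints/sink/")
            .start()
            .awaitTermination();
      }
    }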

Not all KafkaReceivers are processing the data. Why?

2016-09-14 Thread Rachana Srivastava
Hello all, I have created a Kafka topic with 5 partitions, and I am using the createStream receiver API like the following. But somehow only one receiver is getting the input data; the rest of the receivers are not processing anything. Can you please help? JavaPairDStream messages = null;
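
One call to the receiver-based createStream creates a single receiver no matter how many partitions the topic has. The usual pattern is one stream per partition, unioned together, as in the sketch below (ZooKeeper quorum, group, and topic are placeholders); the receiver-free createDirectStream API avoids the issue altogether:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class MultiReceiverKafka {
      public static JavaPairDStream<String, String> build(JavaStreamingContext jssc) {
        Map<String, Integer> topicMap = new HashMap<>();
        topicMap.put("mytopic", 1);  // one consumer thread per receiver
        List<JavaPairDStream<String, String>> streams = new ArrayList<>();
        for (int i = 0; i < 5; i++) {  // one receiver per topic partition
          streams.add(KafkaUtils.createStream(jssc, "zk1:2181", "mygroup", topicMap));
        }
        // Union the five receiver streams into a single DStream for processing.
        return jssc.union(streams.get(0), streams.subList(1, streams.size()));
      }
    }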

How do we process/scale variable size batches in Apache Spark Streaming

2016-08-23 Thread Rachana Srivastava
I am running a Spark Streaming process where I am getting a batch of data every n seconds. I am using repartition to scale the application. Since the repartition size is fixed, we are getting lots of small files when the batch size is very small. Is there any way I can change the partitioner logic
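
Since the partition count passed to repartition is fixed, one workaround is to compute it per batch inside transform(), at the cost of an extra pass to count the records. A sketch; the 10,000-records-per-partition target is arbitrary:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.streaming.api.java.JavaDStream;

    public class AdaptiveRepartition {
      public static JavaDStream<String> resize(JavaDStream<String> stream) {
        return stream.transform((JavaRDD<String> rdd) -> {
          long records = rdd.count();  // extra action: one additional pass per batch
          int partitions = (int) Math.max(1, records / 10000);
          // Small batches collapse to a few partitions, avoiding many tiny files.
          return rdd.repartition(partitions);
        });
      }
    }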

Number of tasks on executors becomes negative after executor failures

2016-08-15 Thread Rachana Srivastava
Summary: I am running Spark 1.5 on CDH 5.5.1. Under extreme load I am intermittently getting the connection failure exception below, and later negative task counts for executors in the Spark UI. Exception: TRACE: org.apache.hadoop.hbase.ipc.AbstractRpcClient - Call: Multi, callTime: 76ms INFO :

Missing Executor Logs From Yarn After Spark Failure

2016-07-19 Thread Rachana Srivastava
I am trying to find the root cause of a recent Spark application failure in production. When the Spark application is running, I can check the NodeManager's yarn.nodemanager.log-dir property to get the Spark executor container logs. The container has logs for both of the running Spark applications. Here

Spark Text Streaming Does not Recognize Folder using RegEx

2016-04-01 Thread Rachana Srivastava
Hello All, I have written a simple program to get data from JavaDStream<String> textStream = jssc.textFileStream(); JavaDStream<String> ceRDD = textStream.map(new Function<String, String>() { public String call(String ceData) throws Exception { System.out.println(ceData); return ceData; } }); My code works fine when we
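
Depending on the Spark version, glob patterns in the monitored path are only honored at the directory level; a more explicit alternative is fileStream() with a path filter, sketched below (directory and file-name pattern are made up):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class FilteredFileStream {
      public static JavaDStream<String> build(JavaStreamingContext jssc) {
        JavaPairInputDStream<LongWritable, Text> files = jssc.fileStream(
            "hdfs:///incoming",
            LongWritable.class, Text.class, TextInputFormat.class,
            (Path p) -> p.getName().matches("events-\\d{8}\\.log"),  // regex on the file name
            true);  // newFilesOnly
        return files.map(pair -> pair._2().toString());
      }
    }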

How to obtain JavaHBaseContext to connect Spark Streaming with HBase

2016-03-09 Thread Rachana Srivastava
I am trying to integrate Spark Streaming with HBase. I am calling the following APIs to connect to HBase: HConnection hbaseConnection = HConnectionManager.createConnection(conf); hBaseTable = hbaseConnection.getTable(hbaseTable); Since I cannot get the connection and broadcast the connection each
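
HBase connections are not serializable and cannot be broadcast; the usual pattern is to open one connection per partition on the executor. A sketch using the newer ConnectionFactory API (the older HConnectionManager works the same way); the table and column names are placeholders:

    import java.util.Iterator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.spark.api.java.JavaRDD;

    public class HBasePerPartition {
      public static void write(JavaRDD<String> rdd) {
        rdd.foreachPartition((Iterator<String> rows) -> {
          Configuration conf = HBaseConfiguration.create();  // created on the executor, never shipped
          try (Connection conn = ConnectionFactory.createConnection(conf);
               Table table = conn.getTable(TableName.valueOf("my_table"))) {
            while (rows.hasNext()) {
              String row = rows.next();
              Put put = new Put(Bytes.toBytes(row));
              put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(row));
              table.put(put);
            }
          }
        });
      }
    }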

HADOOP_HOME is not set when trying to run a Spark application in yarn-cluster mode

2016-02-09 Thread Rachana Srivastava
I am trying to run an application in yarn-cluster mode (spark-submit with Yarn cluster). Here are the settings of the shell script: spark-submit --class "com.Myclass" \ --num-executors 2 \ --executor-cores 2 \ --master yarn \ --supervise \ --deploy-mode cluster \ ../target/ \ My application is working

Spark process failing to receive data from the Kafka queue in yarn-client mode.

2016-02-05 Thread Rachana Srivastava
I am trying to run the following code in yarn-client mode, but I am getting the slow ReadProcessor error mentioned below; the code works just fine in local mode. Any pointer is really appreciated. Line of code to receive data from the Kafka queue: JavaPairReceiverInputDStream

RE: Random Forest FeatureImportance throwing NullPointerException

2016-01-14 Thread Rachana Srivastava
.scala:237) at com.markmonitor.antifraud.ce.ml.CheckFeatureImportance.main(CheckFeatureImportance.java:49) From: Rachana Srivastava Sent: Wednesday, January 13, 2016 3:30 PM To: 'user@spark.apache.org'; 'd...@spark.apache.org' Subject: Random Forest FeatureImportance throwing NullPointerException I have a Random fo

Random Forest FeatureImportance throwing NullPointerException

2016-01-13 Thread Rachana Srivastava
I have a Random Forest model for which I am trying to get the featureImportance vector. Map categoricalFeaturesParam = new HashMap<>(); scala.collection.immutable.Map categoricalFeatures = (scala.collection.immutable.Map)
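
Casting a java.util.HashMap to scala.collection.immutable.Map fails at runtime; a way to avoid the conversion entirely on Spark 1.5 is the spark.ml RandomForestClassificationModel, which exposes featureImportances() directly (categorical columns are conveyed through VectorIndexer metadata instead of a map). A sketch with placeholder column names:

    import org.apache.spark.ml.classification.RandomForestClassificationModel;
    import org.apache.spark.ml.classification.RandomForestClassifier;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.sql.DataFrame;

    public class FeatureImportanceMl {
      public static Vector importances(DataFrame training) {
        RandomForestClassifier rf = new RandomForestClassifier()
            .setLabelCol("label")
            .setFeaturesCol("features");
        RandomForestClassificationModel model = rf.fit(training);
        return model.featureImportances();  // one weight per feature, summing to 1.0
      }
    }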

RE: Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Rachana Srivastava
…scheduler.DAGScheduler - Job 1 finished: foreachRDD at KafkaURLStreaming.java:90, took 0.151210 s &&&&&&&&&&&&&&&&&&&&& AFTER COUNT OF ACCUMULATOR IS 1 -Original Message- From: Jean-Baptiste Onofré [mailto:j...@nanthr

Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Rachana Srivastava
I have a very simple two-line program: I am getting input from Kafka, saving the input in a file, and counting the input received. My code looks like this; when I run it I am getting two accumulator counts for each input. HashMap kafkaParams = new HashMap
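
A common cause of doubled accumulator counts is running two actions (the save and the count) over an uncached lineage: the transformation that increments the accumulator is then executed twice. Caching the RDD inside foreachRDD is the usual fix, sketched below with a placeholder output path (note that task retries can still inflate accumulators updated inside transformations):

    import org.apache.spark.Accumulator;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.streaming.Time;
    import org.apache.spark.streaming.api.java.JavaDStream;

    public class SingleCountPerRecord {
      public static void process(JavaDStream<String> stream, Accumulator<Integer> acc) {
        stream.foreachRDD((JavaRDD<String> rdd, Time time) -> {
          JavaRDD<String> counted = rdd.map(line -> { acc.add(1); return line; });
          counted.cache();  // without this, each action below recomputes the map (and the accumulator)
          counted.saveAsTextFile("hdfs:///out/batch-" + time.milliseconds());
          System.out.println("records: " + counted.count());
          counted.unpersist();
        });
      }
    }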

Spark Streaming Application is Stuck Under Heavy Load Due to Deadlock

2016-01-04 Thread Rachana Srivastava
Hello All, I am running my application on a Spark cluster, but under heavy load the system hangs due to a deadlock. I found a similar issue resolved here: https://datastax-oss.atlassian.net/browse/JAVA-555 in version 2.1.3. But I am running on Spark 1.3 and still getting the same issue. Here

java.lang.NoSuchMethodError while saving a random forest model (Spark version 1.5)

2015-12-15 Thread Rachana Srivastava
I have recently upgraded my Spark version, but when I try to save a random forest model using the model save command I am getting a NoSuchMethodError. My code works fine with version 1.3.x. model.save(sc.sc(), "modelsavedir"); ERROR:

spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Rachana Srivastava
Issue: I have a random forest model that I am trying to load during streaming using the following code. The code works fine when I run it from Eclipse, but I get an NPE when running the code using spark-submit. JavaStreamingContext jssc = new JavaStreamingContext(jsc,

Frozen exception while dynamically creating classes inside Spark using the Javassist API

2015-11-03 Thread Rachana Srivastava
I am trying to dynamically create a new class in Spark using the Javassist API. The code seems very simple, just invoking the makeClass API on a hardcoded class name. The code works fine outside the Spark environment, but I am getting this checkNotFrozen exception when I am running the code inside Spark. Code
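
A CtClass becomes frozen once toClass() or writeFile() has been called on it, and calling makeClass() twice for the same name in the same ClassPool fails as well, which is easy to hit when Spark reuses an executor JVM across tasks. Checking isFrozen() and calling defrost() is the standard Javassist remedy; the class name below is illustrative:

    import javassist.ClassPool;
    import javassist.CtClass;

    public class SafeMakeClass {
      public static CtClass getOrMake(String name) {
        ClassPool pool = ClassPool.getDefault();
        CtClass ct = pool.getOrNull(name);  // reuse the class if this pool already knows it
        if (ct == null) {
          ct = pool.makeClass(name);
        } else if (ct.isFrozen()) {
          ct.defrost();                     // permit modification again after toClass()/writeFile()
        }
        return ct;
      }
    }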

yarn-cluster mode throwing NullPointerException

2015-10-11 Thread Rachana Srivastava
I am trying to submit a job in yarn-cluster mode using the spark-submit command. My code works fine when I use yarn-client mode. Cloudera Version: CDH-5.4.7-1.cdh5.4.7.p0.3 Command Submitted: spark-submit --class "com.markmonitor.antifraud.ce.KafkaURLStreaming" \ --driver-java-options

Where are the logs for Spark Kafka on Yarn on Cloudera

2015-09-29 Thread Rachana Srivastava
Hello all, I am trying to test JavaKafkaWordCount on Yarn; to make sure Yarn is working fine I am saving the output to HDFS. The example works fine in local mode but not in yarn mode. I cannot see any output logged when I change the mode to yarn-client or yarn-cluster, nor can I find any

spark-submit classloader issue...

2015-09-28 Thread Rachana Srivastava
Hello all, Goal: I want to use APIs from the HttpClient library 4.4.1. I am using the Maven shade plugin to generate the JAR. Findings: When I run my program as a Java application within Eclipse everything works fine, but when I run the program using spark-submit I am getting the following
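
Spark ships its own, older httpclient, which usually wins on the classpath unless the shaded jar relocates org.apache.http (or spark.driver.userClassPathFirst / spark.executor.userClassPathFirst is set). A quick diagnostic for which jar actually supplied the class under spark-submit:

    public class WhichJar {
      public static void main(String[] args) {
        // Prints the jar from which the running JVM loaded the HttpClient interface.
        java.net.URL source = org.apache.http.client.HttpClient.class
            .getProtectionDomain().getCodeSource().getLocation();
        System.out.println("HttpClient loaded from: " + source);
      }
    }

If the printed location is a Spark or Hadoop jar rather than the application jar, the shade plugin's relocation setting is the usual way out.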

JavaRDD using Reflection

2015-09-14 Thread Rachana Srivastava
Hello all, I am working on a problem that requires us to create different sets of JavaRDDs based on different input arguments. We are getting the following error when we try to use a factory to create the JavaRDD. The error message is clear, but I am wondering whether there is any workaround. Question: How to create
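
A serialization error from an RDD factory usually means the factory object (or its enclosing class) is captured by an anonymous Function and shipped to the executors. Keeping the factory static, so no enclosing instance is captured, is one way around it; a sketch with invented names:

    import java.util.Arrays;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RddFactory {
      // A static method captures no enclosing instance, so nothing
      // non-serializable is dragged into downstream closures.
      public static JavaRDD<String> create(JavaSparkContext sc, String kind) {
        switch (kind) {
          case "hdfs":  return sc.textFile("hdfs:///data/input");
          case "local": return sc.parallelize(Arrays.asList("a", "b", "c"));
          default:      throw new IllegalArgumentException("unknown kind: " + kind);
        }
      }
    }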

New JavaRDD Inside JavaPairDStream

2015-09-11 Thread Rachana Srivastava
Hello all, Can we invoke a JavaRDD while processing a stream from Kafka, for example? The following code is throwing a serialization exception; not sure if this is feasible. JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(5)); JavaPairReceiverInputDStream
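
Other RDDs can only be created or referenced in driver-side code such as transform() or foreachRDD(), never inside a per-record function. Joining the stream against a reference RDD looks roughly like this (names are placeholders):

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    import scala.Tuple2;

    public class StreamJoinReference {
      public static JavaPairDStream<String, Tuple2<String, String>> join(
          JavaPairDStream<String, String> stream, JavaPairRDD<String, String> reference) {
        // The transform function runs on the driver once per batch, where
        // using another RDD is allowed; inside map()/mapToPair() it is not.
        return stream.transformToPair(rdd -> rdd.join(reference));
      }
    }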

Can Spark Provide Multiple Context Support?

2015-09-08 Thread Rachana Srivastava
Question: How does Spark support multiple contexts? Background: I have a stream of data coming into Spark from Kafka. For each record in the stream I want to download some files from HDFS and process the file data. I have written code to process the file from HDFS and I have code written to
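
Only one SparkContext may exist per JVM, so this pattern does not need a second context: the body of foreachRDD runs on the driver, where the existing context can be recovered from the RDD and used to read the HDFS files. A sketch; it assumes each record names a file, and the paths are placeholders:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.streaming.api.java.JavaDStream;

    public class ReuseSingleContext {
      public static void process(JavaDStream<String> kafkaStream) {
        kafkaStream.foreachRDD(rdd -> {
          // Recover the one existing context; do not create a new one.
          JavaSparkContext sc = JavaSparkContext.fromSparkContext(rdd.context());
          for (String fileName : rdd.collect()) {  // collect() is fine for modest batch sizes
            JavaRDD<String> fileData = sc.textFile("hdfs:///files/" + fileName);
            System.out.println(fileName + ": " + fileData.count());
          }
        });
      }
    }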