Issue: We are using the wholeTextFiles() API to read files from S3, but this API
is extremely slow for the reasons mentioned below. The question is how to fix this
issue.
Here is our analysis so far:
The issue is that we are using the Spark wholeTextFiles() API to read S3 files.
The wholeTextFiles() API works in two steps (an alternative read pattern is sketched below).
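As a rough alternative sketch (not the author's solution), one option for many small S3 objects is to skip wholeTextFiles() entirely: distribute the list of object paths and read each object inside the executors with the Hadoop FileSystem API. This assumes the Spark 2.x Java API and an s3a:// connector on the classpath; the class name, method name, and paths below are illustrative.
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ParallelS3Read {
  // Distribute the object paths first, then read each object inside the executors,
  // instead of letting wholeTextFiles() list and read everything itself.
  public static JavaRDD<Tuple2<String, String>> readAll(
      JavaSparkContext sc, List<String> s3Paths, int parallelism) {
    return sc.parallelize(s3Paths, parallelism).mapPartitions(paths -> {
      Configuration conf = new Configuration();               // built on the executor
      List<Tuple2<String, String>> out = new ArrayList<>();
      while (paths.hasNext()) {
        String p = paths.next();                              // e.g. "s3a://my-bucket/key" (placeholder)
        FileSystem fs = FileSystem.get(URI.create(p), conf);
        try (InputStream in = fs.open(new Path(p))) {
          out.add(new Tuple2<>(p, IOUtils.toString(in, StandardCharsets.UTF_8)));
        }
      }
      return out.iterator();
    });
  }
}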
Problem Statement: I want to read files from S3 and write files to S3 using Spark
Structured Streaming. I looked at the reference architecture recommended by the
Spark team, which suggests S3 -> SNS -> SQS with the S3-SQS file source.
Question:
- S3-SQS file source: Is S3-SQS file source
I am unable to identify the root cause of why my code is missing data when I
run it with spark-submit, but the code works fine when I run it as a Java main.
Any ideas?
Issue: I am trying to periodically process 5000+ gzipped JSON files from S3
using Structured Streaming code.
Here are the key steps (the read step is sketched more fully below):
- Read the JSON schema and broadcast it to the executors
- Read the stream:
Dataset<Row> inputDS = sparkSession.readStream().format("text")
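Not the original code, but a rough sketch of what that read step could look like if the file source is given an explicit schema and the 5000+ file backlog is throttled per micro-batch. Reading the files directly as JSON (rather than as text and parsing later) is one option; the session, schema fields, path, and trigger size below are placeholders.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession sparkSession = SparkSession.builder().appName("S3JsonStream").getOrCreate();

// Hypothetical schema; in the original design this would come from the broadcast schema step.
StructType schema = new StructType()
    .add("id", DataTypes.StringType)
    .add("payload", DataTypes.StringType);

Dataset<Row> inputDS = sparkSession.readStream()
    .schema(schema)                       // file-source streams require an explicit schema
    .option("maxFilesPerTrigger", 500)    // limit how many of the queued files each batch reads
    .json("s3a://my-bucket/input/");      // placeholder path; .json.gz files are decompressed by the codec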
On Wed, Jun 17, 2020 at 9:50 AM Rachana Srivastava
wrote:
I have written a simple Spark Structured Streaming app to move data from Kafka
to S3. I found that in order to support the exactly-once guarantee Spark creates a
_spark_metadata folder, which ends up growing too large.
Frankly speaking, I do not care about EXACTLY ONCE... I am OK with AT LEAST ONCE
as long as the system does not fail every 5 to 7 days with no recovery option.
On Wednesday, June 17, 2020, 02:31:50 PM PDT, Rachana Srivastava
wrote:
Thanks so much TD. Thanks for forwarding your datalake
re, I am involved in it) was built to
exactly solve this problem - get exactly-once and ACID guarantees on files, but
also scale to handling millions of files. Please consider it as part of your
solution.
Background: I have written a simple Spark Structured Streaming app to move
data from Kafka to S3. I found that in order to support the exactly-once guarantee
Spark creates a _spark_metadata folder, which ends up growing too large as the
streaming app is SUPPOSED TO run FOREVER. When the streaming app runs for a long
time the metadata folder keeps growing (one possible mitigation is sketched below).
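This is not the author's fix, but one mitigation sometimes suggested for an ever-growing _spark_metadata log is to let the file-sink metadata log compact and delete old entries; whether it is sufficient depends on the Spark version and retention needs. The values below are illustrative, not recommendations.
import org.apache.spark.sql.SparkSession;

// Sketch only: allow the file-sink metadata log to be compacted and old entries deleted,
// so _spark_metadata grows more slowly.
SparkSession spark = SparkSession.builder()
    .appName("KafkaToS3")
    .config("spark.sql.streaming.fileSink.log.deletion", "true")
    .config("spark.sql.streaming.fileSink.log.compactInterval", "10")
    .config("spark.sql.streaming.fileSink.log.cleanupDelay", "600000")   // milliseconds
    .config("spark.sql.streaming.minBatchesToRetain", "100")
    .getOrCreate();
Even with these settings the compacted log can still grow over very long runs, so this is a mitigation rather than a complete fix.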
Hello all,
I have created a Kafka topic with 5 partitions, and I am using the createStream
receiver API like the following. But somehow only one receiver is getting the
input data; the rest of the receivers are not processing anything. Can you please
help? (A multi-receiver sketch is shown below.)
JavaPairDStream<String, String> messages = null;
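A single createStream() call gives one receiver regardless of how many partitions the topic has; the usual receiver-based pattern is to create several streams and union them. This is a rough sketch, not the original code: it assumes the old spark-streaming-kafka receiver API and an existing JavaStreamingContext named jssc, and the ZooKeeper quorum, group id, and topic name are placeholders.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

// One receiver per createStream() call: create several and union them so more
// than one executor consumes from the topic.
int numReceivers = 5;                         // e.g. one receiver per topic partition
Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put("my-topic", 1);                  // placeholder topic, 1 consumer thread per receiver

List<JavaPairDStream<String, String>> streams = new ArrayList<JavaPairDStream<String, String>>();
for (int i = 0; i < numReceivers; i++) {
  JavaPairReceiverInputDStream<String, String> s =
      KafkaUtils.createStream(jssc, "zk-host:2181", "my-consumer-group", topicMap);  // placeholders
  streams.add(s);
}

JavaPairDStream<String, String> messages = streams.get(0);
for (int i = 1; i < streams.size(); i++) {
  messages = messages.union(streams.get(i));  // merge the receivers into one stream
}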
I am running a Spark Streaming process where I get a batch of data every n
seconds. I am using repartition to scale the application. Since the repartition
size is fixed, we get lots of small files when the batch size is very small.
Is there any way I can change the partitioner logic? (A per-batch sizing sketch is below.)
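One way around a fixed repartition count, sketched here rather than taken from the original code, is to size the number of output partitions per batch from the record count before writing. It assumes the Spark 2.x Java API and that lines is the JavaDStream<String> for the batch; recordsPerFile and the output path are illustrative.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Time;

final long recordsPerFile = 100000L;          // illustrative target number of records per file

lines.foreachRDD((JavaRDD<String> rdd, Time time) -> {
  long count = rdd.count();                   // one extra pass over the batch
  if (count == 0) {
    return;
  }
  // Size the number of partitions (and therefore output files) to this batch.
  int parts = (int) Math.max(1L, (count + recordsPerFile - 1) / recordsPerFile);
  rdd.coalesce(parts)
     .saveAsTextFile("hdfs:///output/batch-" + time.milliseconds());   // placeholder path
});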
Summary:
I am running Spark 1.5 on CDH 5.5.1. Under extreme load I am intermittently
getting the connection failure exception below and, later, negative executor
counts in the Spark UI.
Exception:
TRACE: org.apache.hadoop.hbase.ipc.AbstractRpcClient - Call: Multi, callTime:
76ms
INFO :
I am trying to find the root cause of a recent Spark application failure in
production. While the Spark application is running I can check the NodeManager's
yarn.nodemanager.log-dir property to get the Spark executor container logs.
The container has logs for both the running Spark applications
Here
Hello All,
I have written a simple program to get data from a text file stream:
JavaDStream<String> textStream = jssc.textFileStream(inputDir);  // inputDir: placeholder for the monitored directory
JavaDStream<String> ceRDD = textStream.map(
    new Function<String, String>() {
      public String call(String ceData) throws Exception {
        System.out.println(ceData);
        return ceData;  // pass the record through unchanged
      }
    });
My code works fine when we
I am trying to integrate Spark Streaming with HBase. I am calling the following
APIs to connect to HBase:
HConnection hbaseConnection = HConnectionManager.createConnection(conf);
hBaseTable = hbaseConnection.getTable(hbaseTable);
Since I cannot get the connection and broadcast the connection each
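HConnection is not serializable, so it cannot be broadcast; a common workaround, sketched here rather than taken from the original code, is to open the connection inside each partition and close it when the partition finishes. This assumes the Spark 2.x Java API, the old HConnection/HConnectionManager API the post already uses, and a JavaDStream<String> of row keys named dstream; the table, column family, and qualifier names are placeholders.
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

dstream.foreachRDD(rdd -> rdd.foreachPartition((Iterator<String> rows) -> {
  // Create the (non-serializable) connection on the executor, once per partition.
  Configuration conf = HBaseConfiguration.create();
  HConnection connection = HConnectionManager.createConnection(conf);
  HTableInterface table = connection.getTable("my_table");                      // placeholder table name
  try {
    while (rows.hasNext()) {
      String row = rows.next();
      Put put = new Put(Bytes.toBytes(row));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(row));   // placeholder cf/qualifier
      table.put(put);
    }
  } finally {
    table.close();
    connection.close();
  }
}));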
I am trying to run an application in yarn cluster mode.
Spark-Submit with Yarn Cluster
Here are the settings of the shell script:
spark-submit --class "com.Myclass" \
--num-executors 2 \
--executor-cores 2 \
--master yarn \
--supervise \
--deploy-mode cluster \
../target/ \
My application is working
I am trying to run the following code in yarn-client mode, but I am getting the
slow ReadProcessor error mentioned below; the code works just fine in local mode.
Any pointer is really appreciated.
Line of code to receive data from the Kafka Queue:
JavaPairReceiverInputDStream
.scala:237)
at
com.markmonitor.antifraud.ce.ml.CheckFeatureImportance.main(CheckFeatureImportance.java:49)
From: Rachana Srivastava
Sent: Wednesday, January 13, 2016 3:30 PM
To: 'user@spark.apache.org'; 'd...@spark.apache.org'
Subject: Random Forest FeatureImportance throwing NullPointerException
I have a Random fo
duler.DAGScheduler - Job 1 finished: foreachRDD at
KafkaURLStreaming.java:90, took 0.151210 s
&&&&&&&&&&&&&&&&&&&&& AFTER COUNT OF ACCUMULATOR IS 1
-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthr
I have a very simple two-line program. I get input from Kafka, save the input to
a file, and count the input received. My code looks like the following; when I
run this code I get two accumulator counts for each input (a sketch of one likely
cause is below).
HashMap<String, String> kafkaParams = new HashMap<String, String>();
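One common cause of a doubled accumulator count (not necessarily this program's bug) is that the transformation incrementing the accumulator is recomputed because two separate actions, saving and counting, each trigger it; persisting the transformed stream makes the increment happen once per record. This is a sketch assuming the Spark 1.x/2.x receiver-style setup, with jssc and the Kafka pair stream messages taken as given from the original code; the output path is a placeholder.
import org.apache.spark.Accumulator;
import org.apache.spark.streaming.api.java.JavaDStream;

// If the map() below increments the accumulator, running both saveAsTextFiles()
// and count() without caching recomputes the map and double-counts.
final Accumulator<Integer> counter = jssc.sparkContext().accumulator(0);

JavaDStream<String> parsed = messages.map(record -> {
  counter.add(1);                              // side effect: runs once per evaluation of this RDD
  return record._2();
});

parsed.cache();                                                     // reuse the computed RDDs for both actions
parsed.dstream().saveAsTextFiles("hdfs:///output/kafka", "txt");    // placeholder path
parsed.count().print();                                             // second action no longer re-runs the map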
Hello All,
I am running my application on a Spark cluster, but under heavy load the system
hangs due to a deadlock. I found a similar issue resolved here
https://datastax-oss.atlassian.net/browse/JAVA-555 in Spark version 2.1.3.
But I am running on Spark 1.3 and am still getting the same issue.
Here
I have recently upgraded my Spark version, but when I try to save a random
forest model using the model save command I get a NoSuchMethodError. My code
works fine with the 1.3.x version.
model.save(sc.sc(), "modelsavedir");
ERROR:
Issue:
I have a random forest model that I am trying to load during streaming using the
following code. The code works fine when I run it from Eclipse, but I get an NPE
when running the code using spark-submit (a driver-side loading sketch is below).
JavaStreamingContext jssc = new JavaStreamingContext(jsc,
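This is not a diagnosis of the actual NPE, but a pattern that often avoids driver/executor initialization differences: load the model once on the driver, broadcast it, and reference only the broadcast value inside the stream operations. It assumes the MLlib RandomForestModel API, an existing JavaSparkContext jsc, and a JavaDStream<Vector> named features; the model path is a placeholder.
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.streaming.api.java.JavaDStream;

// Load the model once on the driver, then broadcast it so executors never try
// to load it themselves.
RandomForestModel model = RandomForestModel.load(jsc.sc(), "hdfs:///models/rf");   // placeholder path
final Broadcast<RandomForestModel> bcModel = jsc.broadcast(model);

JavaDStream<Double> predictions = features.map((Vector v) -> bcModel.value().predict(v));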
I am trying to dynamically create a new class in Spark using the javassist API.
The code seems very simple, just invoking the makeClass API on a hardcoded class
name. The code works fine outside the Spark environment, but I get this
checkNotFrozen exception when I run the code inside Spark.
Code
I am trying to submit a job in yarn-cluster mode with the spark-submit command.
My code works fine when I use yarn-client mode.
Cloudera Version:
CDH-5.4.7-1.cdh5.4.7.p0.3
Command Submitted:
spark-submit --class "com.markmonitor.antifraud.ce.KafkaURLStreaming" \
--driver-java-options
Hello all,
I am trying to test JavaKafkaWordCount on YARN. To make sure YARN is working
fine I am saving the output to HDFS. The example works fine in local mode but
not in yarn mode. I cannot see any output logged when I change the mode to
yarn-client or yarn-cluster, and cannot find any
Hello all,
Goal: I want to use APIs from the HttpClient library 4.4.1. I am using the Maven
Shade plugin to generate the JAR.
Findings: When I run my program as a Java application within Eclipse everything
works fine. But when I run the program using spark-submit I am getting the
following
Hello all,
I am working on a problem that requires us to create different sets of JavaRDDs
based on different input arguments. We are getting the following error when we
try to use a factory to create the JavaRDD. The error message is clear, but I am
wondering if there is any workaround.
Question:
How to create
Hello all,
Can we invoke a JavaRDD operation while processing a stream from Kafka, for
example? The following code is throwing a serialization exception. Not sure if
this is feasible.
JavaStreamingContext jssc = new JavaStreamingContext(jsc,
Durations.seconds(5));
JavaPairReceiverInputDStream
Question: How does Spark support multiple contexts?
Background: I have a stream of data coming into Spark from Kafka. For each
record in the stream I want to download some files from HDFS and process the file
data. I have written code to process the file from HDFS and I have code
written to
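Spark does not let tasks create or use a second SparkContext on the executors, so one pattern (a sketch, not the author's code) is to read the HDFS files inside the stream's own tasks with the plain Hadoop FileSystem API. It assumes the Spark 2.x Java API and that messages is the Kafka pair stream whose value carries an HDFS path; that path convention is illustrative.
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.streaming.api.java.JavaDStream;

// For each Kafka record, read the referenced HDFS file inside the task itself,
// using the Hadoop FileSystem API rather than a second SparkContext.
JavaDStream<String> fileContents = messages.map(record -> {
  String hdfsPath = record._2();                 // assumes the message value carries the file path
  Configuration conf = new Configuration();      // created on the executor, never serialized
  FileSystem fs = FileSystem.get(conf);
  try (InputStream in = fs.open(new Path(hdfsPath))) {
    return IOUtils.toString(in, StandardCharsets.UTF_8);
  }
});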