There is documentation on running Spark on YARN here:
http://spark.apache.org/docs/latest/running-on-yarn.html
As I said before, you can use either the application logs or the Spark UI to
understand how many executors are running at any given time. I don't think I
can help much
Generally, please avoid System.out.println; use a logger, even in examples.
People may take these examples from here and put them in their production code.
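For instance, a minimal sketch using SLF4J, which is already on the classpath
of a typical Spark application:

import org.slf4j.LoggerFactory

object Example {
  private val log = LoggerFactory.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    val count = 42
    // Prefer this over System.out.println: it is level-controlled and ends
    // up in the driver/executor logs where operators expect to find it.
    log.info(s"records processed: $count")
  }
}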
> On 09.10.2018, at 15:39, Shubham Chaurasia wrote:
>
> Alright, so it is a big project which uses a SQL store underneath.
> I extracted
I took a look at the code.
val source = classOf[MyDataSource].getCanonicalName
spark.read.format(source).load().collect()
It does indeed look like the reader is created twice.
First call: it looks like Spark creates it first to read the schema for the
logical plan.
All,
Currently, I am using PySpark Streaming (the classic DStream style, not
Structured Streaming). Now, our remote Kafka is secured with Kerberos.
To enable PySpark Streaming to access the secured Kafka, what steps should I
perform? Can I pass the principal/keytab and jaas config in
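For what it's worth, one common pattern on YARN is sketched below; the file
names are placeholders, and the details should be verified against your Kafka
client version:

spark-submit \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  --files jaas.conf \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf" \
  my_streaming_app.py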
Thanks Shuporno. That mode worked. I found a couple of records with quotes
inside quotes that needed to be escaped.
On Tue, Oct 9, 2018 at 1:40 PM Taylor Cox wrote:
> Hey Nirav,
>
> Here’s an idea:
>
> Suppose your file.csv has N records, one for each line. Read the csv
>
There is currently no such option, but this has been raised before in
https://issues.apache.org/jira/browse/SPARK-25515.
On Tue, Oct 9, 2018 at 2:17 PM Li Gao wrote:
> Hi,
>
> Is there an option to keep the executor pods on k8s after the job
> finishes? We want to extract the logs and stats
Hi Dillon,
I do think there is a setting where, once YARN sets up the containers, they
are not deallocated. I had used it previously with Hive, and it saves
processing time in allocating containers. That said, I am still trying to
understand how we determine
Hi,
Is there an option to keep the executor pods on k8s after the job finishes?
We want to extract the logs and stats before removing the executor pods.
Thanks,
Li
Hey Nirav,
Here’s an idea:
Suppose your file.csv has N records, one for each line. Read the csv
line-by-line (without spark) and attempt to parse each line. If a record is
malformed, catch the exception and rethrow it with the line number. That
should show you where the problematic record(s) are.
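A minimal sketch of that idea; parseLine is a hypothetical stand-in for
whatever parsing your schema actually needs:

import scala.io.Source

// Hypothetical parser for one CSV line; replace with your real format logic.
def parseLine(line: String): Array[String] =
  line.split(",").map(_.trim)

val source = Source.fromFile("file.csv")
try {
  for ((line, idx) <- source.getLines().zipWithIndex) {
    try {
      parseLine(line)
    } catch {
      case e: Exception =>
        // Rethrow with the 1-based line number so the bad record is easy to find.
        throw new RuntimeException(s"Malformed record at line ${idx + 1}: $line", e)
    }
  }
} finally {
  source.close()
}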
I'm still not sure exactly what you mean by saying that you have 6 yarn
containers. YARN should just be aware of the total available resources in
your cluster and then be able to launch containers based on the executor
requirements you set when you submit your job. If you can, I think it
hi,
Maybe I am not quite clear in my head on this one, but how do we know that
1 yarn container = 1 executor?
Regards,
Gourav Sengupta
On Tue, Oct 9, 2018 at 8:53 PM Dillon Dukek wrote:
> Can you send how you are launching your streaming process? Also what
> environment is this cluster
Can you send how you are launching your streaming process? Also what
environment is this cluster running in (EMR, GCP, self managed, etc)?
On Tue, Oct 9, 2018 at 10:21 AM kant kodali wrote:
> Hi All,
>
> I am using Spark 2.3.1 and using YARN as a cluster manager.
>
> I currently got
>
> 1) 6
Hello,
I'm trying to calculate the Pearson correlation between two DStreams using a
sliding window in PySpark, but I keep getting the following error:
Traceback (most recent call last):
File
"/home/zeinab/spark-2.3.1-bin-hadoop2.7/examples/src/main/python/streaming/Cross-Corr.py",
line 63, in
Does spark.streaming.concurrentJobs still exist?
spark.streaming.concurrentJobs (default: 1) is the number of concurrent
jobs, i.e. threads, in the streaming-job-executor thread pool.
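For reference, a sketch of how it is typically set; note this is an
undocumented knob, so treat it with care:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("concurrent-jobs-example")
  // Allow up to 2 streaming output jobs to run at the same time.
  .set("spark.streaming.concurrentJobs", "2")
val ssc = new StreamingContext(conf, Seconds(10))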
Yes, each of the executors has 60 GB.
Hi All,
I am using Spark 2.3.1 and using YARN as a cluster manager.
I currently got
1) 6 YARN containers (executors=6) with 4 executor cores for each container.
2) 6 Kafka partitions from one topic.
3) You can assume every other configuration is set to whatever the default
values are.
Spawned a
Hi Venkat,
do your executors have that much memory?
Regards,
Gourav Sengupta
On Tue, Oct 9, 2018 at 4:44 PM V0lleyBallJunki3 wrote:
> Hello,
>I have set the value of spark.sql.autoBroadcastJoinThreshold to a very
> high value of 20 GB. I am joining a table that I am sure is below
Hello,
I have set the value of spark.sql.autoBroadcastJoinThreshold to a very high
value of 20 GB. I am joining a table that I am sure is below this threshold,
yet Spark does a SortMergeJoin. If I set a broadcast hint, then Spark does a
broadcast join and the job finishes much faster.
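For what it's worth, a minimal sketch of both approaches. The threshold is in
bytes, and the auto-broadcast decision relies on Spark's size estimate of the
table, so without up-to-date statistics it can fall back to SortMergeJoin; the
table names here are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.getOrCreate()

// Raise the auto-broadcast threshold (the value is in bytes).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20L * 1024 * 1024 * 1024)

val big = spark.table("big_table")     // hypothetical
val small = spark.table("small_table") // hypothetical

// Explicit hint: forces a broadcast join regardless of the size estimate.
val joined = big.join(broadcast(small), "id")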
Hi all:
I have set spark.kryo.registrationRequired=true, but an exception occurred
when I run the program:
java.lang.IllegalArgumentException: Class is not registered:
org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage
I tried to register it manually by kryo.register() and
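For reference, a minimal sketch of registering that class up front, assuming
you build the SparkConf yourself. The class is Spark-internal, so it is looked
up by name here rather than referenced directly:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  // registerKryoClasses appends to spark.kryo.classesToRegister.
  .registerKryoClasses(Array(
    Class.forName("org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage")
  ))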
Hi,
There is a way to obtain these malformed/rejected records. Rejection can
happen not only because of a column number mismatch but also if the data does
not match the data type mentioned in the schema.
To obtain the rejected records, you can do the following:
1. Add an extra
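The message is cut off above; for reference, a sketch of the usual
corrupt-record-column approach it likely describes, assuming CSV input with a
known schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.getOrCreate()

// Schema with an extra column that will hold any row that fails to parse.
val schema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("file.csv")

// In Spark 2.3+ you must cache before filtering on only the corrupt column.
val rejected = df.cache().filter(df("_corrupt_record").isNotNull)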
Alright, so it is a big project which uses a SQL store underneath.
I extracted out the minimal code and made a smaller project out of it, and it
is still creating multiple instances.
Here is my project:
├── my-datasource.iml
├── pom.xml
├── src
│ ├── main
│ │ ├── java
│ │ │ └── com
│
I am using v2.4.0-RC2.
The code as is wouldn’t run (e.g. planBatchInputPartitions returns null). How
are you calling it?
When I do:
spark.read.format(mypackage).load().show()
I am getting a single creation; how are you creating the reader?
Thanks,
Assaf
From: Shubham Chaurasia
Thanks Assaf, did you try with *tags/v2.4.0-rc2*?
Full Code:
MyDataSource is the entry point which simply creates Reader and Writer
public class MyDataSource implements DataSourceV2, WriteSupport,
ReadSupport, SessionConfigSupport {
@Override public DataSourceReader
Could you add a fuller code example? I tried to reproduce it in my environment
and I am getting just one instance of the reader…
Thanks,
Assaf
From: Shubham Chaurasia [mailto:shubh.chaura...@gmail.com]
Sent: Tuesday, October 9, 2018 9:31 AM
To: user@spark.apache.org
Subject:
Hi All,
--Spark built with *tags/v2.4.0-rc2*
Consider following DataSourceReader implementation:
public class MyDataSourceReader implements DataSourceReader,
SupportsScanColumnarBatch {
StructType schema = null;
Map<String, String> options;
public MyDataSourceReader(Map<String, String> options) {
Hi
We are seeing some weird behaviour in SparkR.
We created an R data frame with 600K records and 29 columns. Then we tried to
convert the R data frame to a SparkDataFrame using
df <- SparkR::createDataFrame(rdf)
from RStudio. It hung, and we had to kill the process after 1-2 hours.
We also tried the following:
df <-