Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Dillon Dukek
There is documentation here http://spark.apache.org/docs/latest/running-on-yarn.html about running Spark on YARN. Like I said before, you can use either the logs from the application or the Spark UI to understand how many executors are running at any given time. I don't think I can help much

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Jörn Franke
Generally, please avoid System.out.println and use a logger instead, even for examples. People may take these examples from here and put them in their production code. > On 09.10.2018 at 15:39, Shubham Chaurasia wrote: > > Alright, so it is a big project which uses a SQL store underneath. > I extracted

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Hyukjin Kwon
I took a look at the code. val source = classOf[MyDataSource].getCanonicalName; spark.read.format(source).load().collect() It does indeed look like it is called twice. First of all, it looks like it is created first to read the schema for a logical plan

PySpark Streaming : Accessing the Remote Secured Kafka

2018-10-09 Thread Ramaswamy, Muthuraman
All, Currently, I am using PySpark Streaming (classic regular DStream style, not Structured Streaming). Now, our remote Kafka is secured with Kerberos. To enable PySpark Streaming to access the secured Kafka, what steps should I perform? Can I pass the principal/keytab and JAAS config in
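A hedged sketch of the mechanism that is commonly used for this (not confirmed in this thread): ship a JAAS file and keytab to the driver and executors and point both JVMs at the JAAS file. File names and keys below are illustrative, and whether this is sufficient also depends on which Kafka integration (0.8 vs 0.10) the DStream job uses.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Hypothetical file names; the JAAS file would reference the shipped keytab.
jaas_opt = "-Djava.security.auth.login.config=kafka_client_jaas.conf"

conf = (
    SparkConf()
    .setAppName("secured-kafka-dstream-sketch")
    # make both the driver and the executors load the JAAS configuration
    .set("spark.driver.extraJavaOptions", jaas_opt)
    .set("spark.executor.extraJavaOptions", jaas_opt)
    # distribute the JAAS file and keytab with the job
    # (equivalent to spark-submit --files kafka_client_jaas.conf,user.keytab)
    .set("spark.files", "kafka_client_jaas.conf,user.keytab")
)

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # the Kafka DStream would then be created from ssc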

Re: CSV parser - is there a way to find malformed csv record

2018-10-09 Thread Nirav Patel
Thanks Shuporno. That mode worked. I found a couple of records with quotes inside quotes that needed to be escaped. On Tue, Oct 9, 2018 at 1:40 PM Taylor Cox wrote: > Hey Nirav, > > Here’s an idea: > > Suppose your file.csv has N records, one for each line. Read the csv >
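For reference, a minimal sketch of the option that usually handles quotes inside quotes (the file name and header option are assumptions): Spark's CSV reader escapes embedded quotes with a backslash by default, so data that doubles the quote character (the "" convention) typically needs escape set to the quote character itself.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-escape-sketch").getOrCreate()

df = (
    spark.read
    .option("header", "true")
    .option("quote", '"')
    .option("escape", '"')   # treat "" inside a quoted field as a literal quote
    .csv("file.csv")         # hypothetical path
)
df.show()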

Re: [K8S] Option to keep the executor pods after job finishes

2018-10-09 Thread Yinan Li
There is currently no such option. But this has been raised before in https://issues.apache.org/jira/browse/SPARK-25515. On Tue, Oct 9, 2018 at 2:17 PM Li Gao wrote: > Hi, > > Is there an option to keep the executor pods on k8s after the job > finishes? We want to extract the logs and stats

Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Gourav Sengupta
Hi Dillon, I do think that there is a setting available wherein, once YARN sets up the containers, you do not deallocate them. I had used it previously in Hive, and it just saves processing time in terms of allocating containers. That said, I am still trying to understand how we determine

[K8S] Option to keep the executor pods after job finishes

2018-10-09 Thread Li Gao
Hi, Is there an option to keep the executor pods on k8s after the job finishes? We want to extract the logs and stats before removing the executor pods. Thanks, Li

RE: CSV parser - is there a way to find malformed csv record

2018-10-09 Thread Taylor Cox
Hey Nirav, Here’s an idea: Suppose your file.csv has N records, one for each line. Read the CSV line by line (without Spark) and attempt to parse each line. If a record is malformed, catch the exception and rethrow it with the line number. That should show you where the problematic record(s)
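A simplified sketch of this idea in plain Python (the file name and expected column count are assumptions; a real check might attempt full type parsing rather than just counting fields):

import csv

EXPECTED_COLUMNS = 5   # hypothetical: set this to the width of your schema

with open("file.csv", newline="") as f:
    for line_no, row in enumerate(csv.reader(f), start=1):
        if len(row) != EXPECTED_COLUMNS:
            # report (or raise) with the line number so the bad record is easy to find
            print(f"possibly malformed record at line {line_no}: {row}")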

Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Dillon Dukek
I'm still not sure exactly what you mean when you say that you have 6 YARN containers. YARN should just be aware of the total available resources in your cluster and then be able to launch containers based on the executor requirements you set when you submit your job. If you can, I think it
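As an illustration of the executor requirements mentioned above, a minimal sketch (the numbers are examples only, and in practice these values are usually passed to spark-submit rather than hard-coded):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-executor-sizing-sketch")
    .master("yarn")
    .config("spark.executor.instances", "6")  # with static allocation, one executor per YARN container
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)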

Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Gourav Sengupta
Hi, maybe I am not quite clear in my head on this one, but how do we know that 1 YARN container = 1 executor? Regards, Gourav Sengupta On Tue, Oct 9, 2018 at 8:53 PM Dillon Dukek wrote: > Can you send how you are launching your streaming process? Also what > environment is this cluster

Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Dillon Dukek
Can you send how you are launching your streaming process? Also, what environment is this cluster running in (EMR, GCP, self-managed, etc.)? On Tue, Oct 9, 2018 at 10:21 AM kant kodali wrote: > Hi All, > > I am using Spark 2.3.1 and using YARN as a cluster manager. > > I currently got > > 1) 6

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers

2018-10-09 Thread zakhavan
Hello, I'm trying to calculate the Pearson correlation between two DStreams using a sliding window in PySpark, but I keep getting the following error: Traceback (most recent call last): File "/home/zeinab/spark-2.3.1-bin-hadoop2.7/examples/src/main/python/streaming/Cross-Corr.py", line 63, in
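The error quoted in the subject typically means the SparkContext (directly, or through something like Statistics or another RDD) is being referenced inside a transformation that executes on the workers. A minimal sketch of one way to keep the correlation on the driver, assuming two keyed streams windowed the same way (sources, ports, and window sizes are hypothetical):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="dstream-pearson-sketch")
ssc = StreamingContext(sc, 5)

# Hypothetical sources: two socket streams of "key value" lines.
s1 = (ssc.socketTextStream("localhost", 9999)
         .map(lambda line: (line.split()[0], float(line.split()[1])))
         .window(60, 5))
s2 = (ssc.socketTextStream("localhost", 9998)
         .map(lambda line: (line.split()[0], float(line.split()[1])))
         .window(60, 5))

paired = s1.join(s2)  # DStream of (key, (x, y))

def corr_on_driver(time, rdd):
    # foreachRDD runs this function on the driver, so Statistics (which uses the
    # SparkContext) can be called here without triggering the error above.
    if not rdd.isEmpty():
        xs = rdd.map(lambda kv: float(kv[1][0]))
        ys = rdd.map(lambda kv: float(kv[1][1]))
        print(time, Statistics.corr(xs, ys, method="pearson"))

paired.foreachRDD(corr_on_driver)
ssc.start()
ssc.awaitTermination()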

Does spark.streaming.concurrentJobs still exist?

2018-10-09 Thread kant kodali
Does spark.streaming.concurrentJobs still exist? spark.streaming.concurrentJobs (default: 1) is the number of concurrent jobs, i.e., threads in the streaming-job-executor thread pool
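For what it's worth, it is an undocumented setting, so whether it is still honoured is best verified against the Spark version in use; if it does apply, it is just a plain configuration key, e.g.:

from pyspark import SparkConf

# Hedged sketch: spark.streaming.concurrentJobs is not in the official docs, so
# treat this as experimental. The default is 1 (jobs from each batch run one at a time).
conf = SparkConf().set("spark.streaming.concurrentJobs", "2")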

Re: Any way to see the size of the broadcast variable?

2018-10-09 Thread V0lleyBallJunki3
Yes, each of the executors has 60 GB.

Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread kant kodali
Hi All, I am using Spark 2.3.1 and YARN as the cluster manager. I currently have: 1) 6 YARN containers (executors=6) with 4 executor cores for each container. 2) 6 Kafka partitions from one topic. 3) You can assume every other configuration is set to its default value. Spawned a

Re: Any way to see the size of the broadcast variable?

2018-10-09 Thread Gourav Sengupta
Hi Venkat, do your executors have that much memory? Regards, Gourav Sengupta On Tue, Oct 9, 2018 at 4:44 PM V0lleyBallJunki3 wrote: > Hello, > I have set the value of spark.sql.autoBroadcastJoinThreshold to a very > high value of 20 GB. I am joining a table that I am sure is below

Any way to see the size of the broadcast variable?

2018-10-09 Thread V0lleyBallJunki3
Hello, I have set the value of spark.sql.autoBroadcastJoinThreshold to a very high value of 20 GB. I am joining a table that I am sure is below this threshold; however, Spark is doing a SortMergeJoin. If I set a broadcast hint, then Spark does a broadcast join and the job finishes much faster.
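A minimal sketch of both knobs (table and column names are hypothetical). One common reason an "obviously small" table still gets a SortMergeJoin is that the threshold is compared against Spark's own size estimate of the relation (based on statistics or file sizes) rather than its true size; the explicit broadcast() hint bypasses that estimate.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# The threshold is in bytes; -1 disables automatic broadcasting entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(20 * 1024 * 1024 * 1024))

big = spark.table("big_table")       # hypothetical tables
small = spark.table("small_table")

# Explicit hint: forces a broadcast join regardless of the size estimate.
joined = big.join(broadcast(small), "key")
joined.explain()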

Internal Spark class is not registered by Kryo

2018-10-09 Thread 曹礼俊
Hi all: I have set spark.kryo.registrationRequired=true, but an exception occurred: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage when I run the program. I tried to register it manually by kryo.register() and
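A hedged sketch of the usual workaround (the class name comes from the error message; everything else is illustrative): with spark.kryo.registrationRequired=true, Spark-internal classes have to be registered too, and they can be registered by name through configuration instead of kryo.register().

from pyspark import SparkConf

conf = (
    SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true")
    # register the internal class named in the IllegalArgumentException
    .set("spark.kryo.classesToRegister",
         "org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage")
)

In practice more internal classes tend to show up one by one, so a custom registrator (spark.kryo.registrator) or relaxing registrationRequired is often the more maintainable choice.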

Re: CSV parser - is there a way to find malformed csv record

2018-10-09 Thread Shuporno Choudhury
Hi, There is a way to obtain these malformed/rejected records. Rejection can happen not only because of a column-count mismatch but also when the data does not match the data type declared in the schema. To obtain the rejected records, you can do the following: 1. Add an extra
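A minimal sketch of the corrupt-record-column approach being outlined here (schema, file name, and column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("corrupt-record-sketch").getOrCreate()

# The schema gets one extra string column that will hold the raw text of any
# row that fails to parse against the declared types.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .schema(schema)
    .csv("file.csv")
)

df.cache()  # Spark 2.3+ disallows queries that touch only the corrupt-record column on uncached raw CSV
bad = df.filter(df["_corrupt_record"].isNotNull())
bad.show(truncate=False)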

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Shubham Chaurasia
Alright, so it is a big project which uses a SQL store underneath. I extracted out the minimal code and made a smaller project out of it and still it is creating multiple instances. Here is my project: ├── my-datasource.iml ├── pom.xml ├── src │ ├── main │ │ ├── java │ │ │ └── com │

Internal Spark class is not registered by Kryo

2018-10-09 Thread BOT
Hi developers: I have set spark.kryo.registrationRequired=true, but an exception occurred: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage when I run the program. I tried to register it manually by kryo.register()

Internal Spark class is not registered by Kryo

2018-10-09 Thread Lijun Cao
Hi developers: I have set spark.kryo.registrationRequired=true, but an exception occurred: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage when I run the program. I tried to register it manually by kryo.register()

Spark internal class is not registered by Kryo

2018-10-09 Thread Lijun Cao
Hi developers: I have set spark.kryo.registrationRequired=true, but an exception occurred: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage when I run the program. I tried to register it manually by kryo.register()

RE: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Mendelson, Assaf
I am using v2.4.0-RC2. The code as-is wouldn’t run (e.g. planBatchInputPartitions returns null). How are you calling it? When I do: val df = spark.read.format(mypackage).load().show() I am getting a single creation; how are you creating the reader? Thanks, Assaf From: Shubham Chaurasia

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Shubham Chaurasia
Thanks Assaf, you tried with *tags/v2.4.0-rc2?* Full Code: MyDataSource is the entry point which simply creates Reader and Writer public class MyDataSource implements DataSourceV2, WriteSupport, ReadSupport, SessionConfigSupport { @Override public DataSourceReader

RE: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Mendelson, Assaf
Could you add a fuller code example? I tried to reproduce it in my environment and I am getting just one instance of the reader… Thanks, Assaf From: Shubham Chaurasia [mailto:shubh.chaura...@gmail.com] Sent: Tuesday, October 9, 2018 9:31 AM To: user@spark.apache.org Subject:

DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Shubham Chaurasia
Hi All, --Spark built with *tags/v2.4.0-rc2* Consider the following DataSourceReader implementation: public class MyDataSourceReader implements DataSourceReader, SupportsScanColumnarBatch { StructType schema = null; Map options; public MyDataSourceReader(Map options) {

SparkR issue

2018-10-09 Thread ayan guha
Hi, We are seeing some weird behaviour in SparkR. We created an R DataFrame with 600K records and 29 columns. Then we tried to convert the R DF to a Spark DF using df <- SparkR::createDataFrame(rdf) from RStudio. It hung; we had to kill the process after 1-2 hours. We also tried the following: df <-