Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Dillon Dukek
There is documentation here http://spark.apache.org/docs/latest/running-on-yarn.html about running Spark on YARN. Like I said before, you can use either the logs from the application or the Spark UI to understand how many executors are running at any given time. I don't think I can help much

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Jörn Franke
Generally, please avoid System.out.println and use a logger instead, even for examples. People may take these examples from here and put them in their production code. > On 09.10.2018 at 15:39, Shubham Chaurasia wrote: > > Alright, so it is a big project which uses a SQL store underneath. > I extracted

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Hyukjin Kwon
I took a look at the code. val source = classOf[MyDataSource].getCanonicalName; spark.read.format(source).load().collect() It does indeed look like it is called twice. First of all, it looks like it is created first to read the schema for a logical plan

PySpark Streaming : Accessing the Remote Secured Kafka

2018-10-09 Thread Ramaswamy, Muthuraman
All, Currently, I am using PySpark Streaming (classic regular DStream style, not Structured Streaming). Now, our remote Kafka is secured with Kerberos. To enable PySpark Streaming to access the secured Kafka, what steps should I perform? Can I pass the principal/keytab and JAAS config in
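A hedged sketch of the mechanism that is commonly used for this (not confirmed in this thread): ship a JAAS file and keytab to the driver and executors and point both JVMs at the JAAS file. File names and keys below are illustrative, and whether this is sufficient also depends on which Kafka integration (0.8 vs 0.10) the DStream job uses.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Hypothetical file names; the JAAS file would reference the shipped keytab.
jaas_opt = "-Djava.security.auth.login.config=kafka_client_jaas.conf"

conf = (
    SparkConf()
    .setAppName("secured-kafka-dstream-sketch")
    # make both the driver and the executors load the JAAS configuration
    .set("spark.driver.extraJavaOptions", jaas_opt)
    .set("spark.executor.extraJavaOptions", jaas_opt)
    # distribute the JAAS file and keytab with the job
    # (equivalent to spark-submit --files kafka_client_jaas.conf,user.keytab)
    .set("spark.files", "kafka_client_jaas.conf,user.keytab")
)

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # the Kafka DStream would then be created from ssc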

Re: CSV parser - is there a way to find malformed csv record

2018-10-09 Thread Nirav Patel
Thanks Shuporno. That mode worked. I found a couple of records with quotes inside quotes that needed to be escaped. On Tue, Oct 9, 2018 at 1:40 PM Taylor Cox wrote: > Hey Nirav, > > Here’s an idea: > > Suppose your file.csv has N records, one for each line. Read the csv >
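For reference, a minimal sketch of the option that usually handles quotes inside quotes (the file name and header option are assumptions): Spark's CSV reader escapes embedded quotes with a backslash by default, so data that doubles the quote character (the "" convention) typically needs escape set to the quote character itself.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-escape-sketch").getOrCreate()

df = (
    spark.read
    .option("header", "true")
    .option("quote", '"')
    .option("escape", '"')   # treat "" inside a quoted field as a literal quote
    .csv("file.csv")         # hypothetical path
)
df.show()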

Re: [K8S] Option to keep the executor pods after job finishes

2018-10-09 Thread Yinan Li
There is currently no such option. But this has been raised before in https://issues.apache.org/jira/browse/SPARK-25515. On Tue, Oct 9, 2018 at 2:17 PM Li Gao wrote: > Hi, > > Is there an option to keep the executor pods on k8s after the job > finishes? We want to extract the logs and stats

Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Gourav Sengupta
Hi Dillon, I do think that there is a setting available wherein, once YARN sets up the containers, you do not deallocate them. I had used it previously in Hive, and it just saves processing time in terms of allocating containers. That said, I am still trying to understand how we determine

[K8S] Option to keep the executor pods after job finishes

2018-10-09 Thread Li Gao
Hi, Is there an option to keep the executor pods on k8s after the job finishes? We want to extract the logs and stats before removing the executor pods. Thanks, Li

RE: CSV parser - is there a way to find malformed csv record

2018-10-09 Thread Taylor Cox
Hey Nirav, Here’s an idea: Suppose your file.csv has N records, one for each line. Read the CSV line by line (without Spark) and attempt to parse each line. If a record is malformed, catch the exception and rethrow it with the line number. That should show you where the problematic record(s)
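A simplified sketch of this idea in plain Python (the file name and expected column count are assumptions; a real check might attempt full type parsing rather than just counting fields):

import csv

EXPECTED_COLUMNS = 5   # hypothetical: set this to the width of your schema

with open("file.csv", newline="") as f:
    for line_no, row in enumerate(csv.reader(f), start=1):
        if len(row) != EXPECTED_COLUMNS:
            # report (or raise) with the line number so the bad record is easy to find
            print(f"possibly malformed record at line {line_no}: {row}")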

Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Dillon Dukek
I'm still not sure exactly what you mean when you say that you have 6 YARN containers. YARN should just be aware of the total available resources in your cluster and then be able to launch containers based on the executor requirements you set when you submit your job. If you can, I think it
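As an illustration of the executor requirements mentioned above, a minimal sketch (the numbers are examples only, and in practice these values are usually passed to spark-submit rather than hard-coded):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-executor-sizing-sketch")
    .master("yarn")
    .config("spark.executor.instances", "6")  # with static allocation, one executor per YARN container
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)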

Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Gourav Sengupta
Hi, maybe I am not quite clear in my head on this one, but how do we know that 1 YARN container = 1 executor? Regards, Gourav Sengupta On Tue, Oct 9, 2018 at 8:53 PM Dillon Dukek wrote: > Can you send how you are launching your streaming process? Also what > environment is this cluster

Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Dillon Dukek
Can you send how you are launching your streaming process? Also, what environment is this cluster running in (EMR, GCP, self-managed, etc.)? On Tue, Oct 9, 2018 at 10:21 AM kant kodali wrote: > Hi All, > > I am using Spark 2.3.1 and using YARN as a cluster manager. > > I currently got > > 1) 6

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers

2018-10-09 Thread zakhavan
Hello, I'm trying to calculate the Pearson correlation between two DStreams using a sliding window in PySpark, but I keep getting the following error: Traceback (most recent call last): File "/home/zeinab/spark-2.3.1-bin-hadoop2.7/examples/src/main/python/streaming/Cross-Corr.py", line 63, in
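The error quoted in the subject typically means the SparkContext (directly, or through something like Statistics or another RDD) is being referenced inside a transformation that executes on the workers. A minimal sketch of one way to keep the correlation on the driver, assuming two keyed streams windowed the same way (sources, ports, and window sizes are hypothetical):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="dstream-pearson-sketch")
ssc = StreamingContext(sc, 5)

# Hypothetical sources: two socket streams of "key value" lines.
s1 = (ssc.socketTextStream("localhost", 9999)
         .map(lambda line: (line.split()[0], float(line.split()[1])))
         .window(60, 5))
s2 = (ssc.socketTextStream("localhost", 9998)
         .map(lambda line: (line.split()[0], float(line.split()[1])))
         .window(60, 5))

paired = s1.join(s2)  # DStream of (key, (x, y))

def corr_on_driver(time, rdd):
    # foreachRDD runs this function on the driver, so Statistics (which uses the
    # SparkContext) can be called here without triggering the error above.
    if not rdd.isEmpty():
        xs = rdd.map(lambda kv: float(kv[1][0]))
        ys = rdd.map(lambda kv: float(kv[1][1]))
        print(time, Statistics.corr(xs, ys, method="pearson"))

paired.foreachRDD(corr_on_driver)
ssc.start()
ssc.awaitTermination()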

Does spark.streaming.concurrentJobs still exist?

2018-10-09 Thread kant kodali
Does spark.streaming.concurrentJobs still exist? spark.streaming.concurrentJobs (default: 1) is the number of concurrent jobs, i.e., threads in the streaming-job-executor thread pool
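For what it's worth, it is an undocumented setting, so whether it is still honoured is best verified against the Spark version in use; if it does apply, it is just a plain configuration key, e.g.:

from pyspark import SparkConf

# Hedged sketch: spark.streaming.concurrentJobs is not in the official docs, so
# treat this as experimental. The default is 1 (jobs from each batch run one at a time).
conf = SparkConf().set("spark.streaming.concurrentJobs", "2")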

Re: Any way to see the size of the broadcast variable?

2018-10-09 Thread V0lleyBallJunki3
Yes, each of the executors has 60 GB.

Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread kant kodali
Hi All, I am using Spark 2.3.1 and YARN as the cluster manager. I currently have: 1) 6 YARN containers (executors=6) with 4 executor cores for each container. 2) 6 Kafka partitions from one topic. 3) You can assume every other configuration is set to its default value. Spawned a

Re: Any way to see the size of the broadcast variable?

2018-10-09 Thread Gourav Sengupta
Hi Venkat, do your executors have that much memory? Regards, Gourav Sengupta On Tue, Oct 9, 2018 at 4:44 PM V0lleyBallJunki3 wrote: > Hello, > I have set the value of spark.sql.autoBroadcastJoinThreshold to a very > high value of 20 GB. I am joining a table that I am sure is below

Any way to see the size of the broadcast variable?

2018-10-09 Thread V0lleyBallJunki3
Hello, I have set the value of spark.sql.autoBroadcastJoinThreshold to a very high value of 20 GB. I am joining a table that I am sure is below this threshold; however, Spark is doing a SortMergeJoin. If I set a broadcast hint, then Spark does a broadcast join and the job finishes much faster.
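A minimal sketch of both knobs (table and column names are hypothetical). One common reason an "obviously small" table still gets a SortMergeJoin is that the threshold is compared against Spark's own size estimate of the relation (based on statistics or file sizes) rather than its true size; the explicit broadcast() hint bypasses that estimate.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# The threshold is in bytes; -1 disables automatic broadcasting entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(20 * 1024 * 1024 * 1024))

big = spark.table("big_table")       # hypothetical tables
small = spark.table("small_table")

# Explicit hint: forces a broadcast join regardless of the size estimate.
joined = big.join(broadcast(small), "key")
joined.explain()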

Internal Spark class is not registered by Kryo

2018-10-09 Thread 曹礼俊
Hi all: I have set spark.kryo.registrationRequired=true, but an exception occurred: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage when I run the program. I tried to register it manually by kryo.register() and
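A hedged sketch of the usual workaround (the class name comes from the error message; everything else is illustrative): with spark.kryo.registrationRequired=true, Spark-internal classes have to be registered too, and they can be registered by name through configuration instead of kryo.register().

from pyspark import SparkConf

conf = (
    SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true")
    # register the internal class named in the IllegalArgumentException
    .set("spark.kryo.classesToRegister",
         "org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage")
)

In practice more internal classes tend to show up one by one, so a custom registrator (spark.kryo.registrator) or relaxing registrationRequired is often the more maintainable choice.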

Re: CSV parser - is there a way to find malformed csv record

2018-10-09 Thread Shuporno Choudhury
Hi, There is a way to obtain these malformed/rejected records. Rejection can happen not only because of a column-count mismatch but also when the data does not match the data type declared in the schema. To obtain the rejected records, you can do the following: 1. Add an extra
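A minimal sketch of the corrupt-record-column approach being outlined here (schema, file name, and column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("corrupt-record-sketch").getOrCreate()

# The schema gets one extra string column that will hold the raw text of any
# row that fails to parse against the declared types.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .schema(schema)
    .csv("file.csv")
)

df.cache()  # Spark 2.3+ disallows queries that touch only the corrupt-record column on uncached raw CSV
bad = df.filter(df["_corrupt_record"].isNotNull())
bad.show(truncate=False)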

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Shubham Chaurasia
Alright, so it is a big project which uses a SQL store underneath. I extracted out the minimal code and made a smaller project out of it and still it is creating multiple instances. Here is my project: ├── my-datasource.iml ├── pom.xml ├── src │ ├── main │ │ ├── java │ │ │ └── com │

Internal Spark class is not registered by Kryo

2018-10-09 Thread BOT
Hi developers: I have set spark.kryo.registrationRequired=true, but an exception occurred: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage when I run the program. I tried to register it manually by kryo.register()

Internal Spark class is not registered by Kryo

2018-10-09 Thread Lijun Cao
Hi developers: I have set spark.kryo.registrationRequired=true, but an exception occurred: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage when I run the program. I tried to register it manually by kryo.register()

Spark internal class is not registered by Kryo

2018-10-09 Thread Lijun Cao
Hi developers: I have set spark.kryo.registrationRequired=true, but an exception occurred: java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage when I run the program. I tried to register it manually by kryo.register()

RE: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Mendelson, Assaf
I am using v2.4.0-RC2. The code as-is wouldn’t run (e.g. planBatchInputPartitions returns null). How are you calling it? When I do: val df = spark.read.format(mypackage).load().show() I am getting a single creation; how are you creating the reader? Thanks, Assaf From: Shubham Chaurasia

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Shubham Chaurasia
Thanks Assaf, you tried with *tags/v2.4.0-rc2?* Full Code: MyDataSource is the entry point which simply creates Reader and Writer public class MyDataSource implements DataSourceV2, WriteSupport, ReadSupport, SessionConfigSupport { @Override public DataSourceReader

RE: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Mendelson, Assaf
Could you add a fuller code example? I tried to reproduce it in my environment and I am getting just one instance of the reader… Thanks, Assaf From: Shubham Chaurasia [mailto:shubh.chaura...@gmail.com] Sent: Tuesday, October 9, 2018 9:31 AM To: user@spark.apache.org Subject:

DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Shubham Chaurasia
Hi All, --Spark built with *tags/v2.4.0-rc2* Consider the following DataSourceReader implementation: public class MyDataSourceReader implements DataSourceReader, SupportsScanColumnarBatch { StructType schema = null; Map options; public MyDataSourceReader(Map options) {

SparkR issue

2018-10-09 Thread ayan guha
Hi, We are seeing some weird behaviour in SparkR. We created an R DataFrame with 600K records and 29 columns. Then we tried to convert the R DF to a Spark DF using df <- SparkR::createDataFrame(rdf) from RStudio. It hung; we had to kill the process after 1-2 hours. We also tried the following: df <-