Re: SizeEstimator

2018-02-26 Thread 叶先进
What type is the buffer you mentioned? > On 27 Feb 2018, at 11:46 AM, David Capwell wrote: > > advancedxy, I don't remember the code as well > anymore but what we hit was a very simple schema (string, long). The issue is > the buffer had

Re: SizeEstimator

2018-02-26 Thread David Capwell
advancedxy, I don't remember the code that well anymore, but what we hit was a very simple schema (string, long). The issue is the buffer had a million of these, so SizeEstimator on the buffer had to keep recalculating the same elements over and over again. SizeEstimator was

Re: SizeEstimator

2018-02-26 Thread Xin Liu
Thanks David. Another solution is to convert the protobuf object to a byte array; it does speed up SizeEstimator. On Mon, Feb 26, 2018 at 5:34 PM, David Capwell wrote: > This is used to predict the current cost of memory so spark knows to flush > or not. This is very costly for
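A minimal sketch of the byte-array workaround Xin describes, assuming `records: RDD[(String, MyProto)]` where `MyProto` is a hypothetical generated protobuf message class: shuffling the serialized bytes means SizeEstimator only has to size flat arrays rather than walking the protobuf object graph.

    import org.apache.spark.rdd.RDD

    // Assumption: records: RDD[(String, MyProto)], MyProto a generated proto class.
    // Serialize before the shuffle so SizeEstimator sees only flat byte arrays.
    val byKey: RDD[(String, Array[Byte])] =
      records.mapValues(_.toByteArray)

    // The shuffle operates on byte arrays, which are cheap to size...
    val grouped = byKey.groupByKey()

    // ...and the protos are only re-materialized afterwards.
    val restored = grouped.mapValues(_.map(MyProto.parseFrom))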

Re: SizeEstimator

2018-02-26 Thread Xin Liu
Thanks! Our protobuf object is fairly complex. Even O(N) takes a lot of time. On Mon, Feb 26, 2018 at 6:33 PM, 叶先进 wrote: > Hi Xin Liu, > > Could you provide a concrete use case if possible (code to reproduce > protobuf object and comparisons between protobuf and normal

Re: SizeEstimator

2018-02-26 Thread 叶先进
Hi Xin Liu, Could you provide a concrete use case if possible (code to reproduce the protobuf object and comparisons between protobuf and normal objects)? I contributed a bit to SizeEstimator years ago, and to my understanding, the time complexity should be O(N) where N is the num of referenced

Re: SizeEstimator

2018-02-26 Thread David Capwell
This is used to predict the current cost of memory so Spark knows whether to flush or not. This is very costly for us, so we use a flag marked in the code as private to lower the cost: spark.shuffle.spill.numElementsForceSpillThreshold (on phone, hope no typo) - how many records before a flush. This lowers
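For reference, a sketch of setting the flag David names. It is an internal, undocumented knob, and the threshold below is purely illustrative:

    import org.apache.spark.SparkConf

    // Force a spill after this many records instead of relying on
    // SizeEstimator-based memory tracking (internal flag; illustrative value).
    val conf = new SparkConf()
      .set("spark.shuffle.spill.numElementsForceSpillThreshold", "1000000")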

SizeEstimator

2018-02-26 Thread Xin Liu
Hi folks, We have a situation where shuffled data is protobuf-based, and SizeEstimator is taking a lot of time. We have tried overriding SizeEstimator to return a constant value, which speeds things up a lot. My question: what is the side effect of disabling SizeEstimator? Is it just spark

how to add columns to row when column has a different encoder?

2018-02-26 Thread David Capwell
I have a row that looks like the following pojo: case class Wrapper(var id: String, var bytes: Array[Byte]). Those bytes are a serialized pojo that looks like this: case class Inner(var stuff: String, var moreStuff: String). I right now have encoders for both types, but I don't see how to
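One possible shape for this, sketched under the assumption that you can deserialize the bytes yourself (`decodeInner` below is a hypothetical stand-in for that step) and combine the two types into a third case class with its own encoder:

    import org.apache.spark.sql.{Dataset, Encoder, Encoders}

    case class Wrapper(var id: String, var bytes: Array[Byte])
    case class Inner(var stuff: String, var moreStuff: String)
    case class Combined(id: String, inner: Inner)

    // Hypothetical: however the Inner bytes were produced, reverse it here.
    def decodeInner(bytes: Array[Byte]): Inner = ???

    implicit val combinedEnc: Encoder[Combined] = Encoders.product[Combined]

    def addColumns(wrappers: Dataset[Wrapper]): Dataset[Combined] =
      wrappers.map(w => Combined(w.id, decodeInner(w.bytes)))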

Re: Out of memory Error when using Collection Accumulator Spark 2.2

2018-02-26 Thread naresh Goud
What is your driver memory? Thanks, Naresh www.linkedin.com/in/naresh-dulam http://hadoopandspark.blogspot.com/ On Mon, Feb 26, 2018 at 3:45 AM, Patrick wrote: > Hi, > > We were getting an OOM error when accumulating the results of each > worker. We were trying to

Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread yncxcw
Hi all, I also noticed this problem. The reason is that YARN accounts each executor for only 1 vcore, no matter how many cores you configured, because YARN uses only memory as the primary metric for resource allocation. It means that YARN will pack as many executors on each node as long as the

Re: partitionBy with partitioned column in output?

2018-02-26 Thread Alex Nastetsky
Yeah, was just discussing this with a co-worker and came to the same conclusion -- need to essentially create a copy of the partition column. Thanks. Hacky, but it works. Seems counter-intuitive that Spark would remove the column from the output... should at least give you an option to keep it.
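A minimal sketch of that copy-column workaround (column names illustrative): partition on the duplicate, so the original column survives in the written files while the copy lives only in the directory names.

    import org.apache.spark.sql.functions.col

    // Spark drops "foo_part" from the data files (it is encoded in the
    // directory names), but the original "foo" column is kept.
    df.withColumn("foo_part", col("foo"))
      .write
      .partitionBy("foo_part")
      .json("json-out")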

Re: partitionBy with partitioned column in output?

2018-02-26 Thread naresh Goud
Does this help? sc.parallelize(List((1, 10), (2, 20))).toDF("foo", "bar").withColumn("foo2", col("foo")).write.partitionBy("foo2").json("json-out") On Mon, Feb 26, 2018 at 4:28 PM, Alex Nastetsky wrote: > Is there a way to make outputs created with "partitionBy" to

Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread Patrick Alwell
+1 AFAIK, vCores are not the same as Cores in AWS. https://samrueby.com/2015/01/12/what-are-amazon-aws-vcpus/ I’ve always understood it as cores = number of concurrent threads. These posts might help with your research, and explain why exceeding 5 cores per executor doesn’t make sense.

partitionBy with partitioned column in output?

2018-02-26 Thread Alex Nastetsky
Is there a way to make outputs created with "partitionBy" contain the partitioned column? When reading the output with Spark or Hive or similar, it's less of an issue because those tools know how to perform partition discovery. But if I were to load the output into an external data warehouse or

Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread Vadim Semenov
Yeah, for some reason (unknown to me, but you can find it on the AWS forums) they double the actual number of cores for nodemanagers. I assume that's done to maximize utilization, but it doesn't really matter to me since I only run Spark, so I personally set `total number of cores - 1/2`

Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread Selvam Raman
Thanks. That makes sense. I want to know one more thing: available vcores per machine is 16, but threads per node is 8. Am I missing how to relate these? What I'm thinking now is number of vcores = number of threads. On Mon, 26 Feb 2018 at 18:45, Vadim Semenov wrote: > All used

Re: Spark on K8s - using files fetched by init-container?

2018-02-26 Thread Jenna Hoole
Oh, duh. I completely forgot that file:// is a prefix I can use. Up and running now :) Thank you so much! Jenna On Mon, Feb 26, 2018 at 1:00 PM, Yinan Li wrote: > OK, it looks like you will need to use > `file:///var/spark-data/spark-files/flights.csv` > instead. The

Re: Unsubscribe

2018-02-26 Thread Romero, Saul
Unsubscribe On Mon, Feb 26, 2018 at 8:58 AM, Mallanagouda Patil < mallanagouda.c.pa...@gmail.com> wrote: > Unsubscribe >

Re: Spark on K8s - using files fetched by init-container?

2018-02-26 Thread Yinan Li
The files specified through --files are localized by the init-container to /var/spark-data/spark-files by default. So in your case, the file should be located at /var/spark-data/spark-files/flights.csv locally in the container. On Mon, Feb 26, 2018 at 10:51 AM, Jenna Hoole
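In other words, a sketch assuming the default localization directory: pass the file at submit time with `--files`, then read it back inside the container with an explicit `file://` scheme.

    // Submitted with: --files hdfs://<namenode>/path/flights.csv
    // The init-container localizes it to /var/spark-data/spark-files/ by default.
    val flights = spark.read
      .option("header", "true") // assumption about the CSV layout
      .csv("file:///var/spark-data/spark-files/flights.csv")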

Spark on K8s - using files fetched by init-container?

2018-02-26 Thread Jenna Hoole
This is probably stupid user error, but I can't for the life of me figure out how to access the files that are staged by the init-container. I'm trying to run the SparkR example data-manipulation.R which requires the path to its datafile. I supply the hdfs location via --files and then the full

Unsubscribe

2018-02-26 Thread purna pradeep
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread akshay naidu
Putting in all the cores alone won't serve the purpose; you'll have to set the number of executors as well as executor memory accordingly. On Tue 27 Feb, 2018, 12:15 AM Vadim Semenov, wrote: > All used cores aren't getting reported correctly in EMR, and YARN itself > has no control over

Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread Vadim Semenov
All used cores aren't getting reported correctly in EMR, and YARN itself has no control over it, so whatever you put in `spark.executor.cores` will be used, but in the ResourceManager you will only see 1 vcore used per nodemanager. On Mon, Feb 26, 2018 at 5:20 AM, Selvam Raman
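So, as a sketch (values illustrative, not recommendations): the cores you request still take effect for Spark's task scheduling even though the ResourceManager UI shows 1 vcore per container.

    import org.apache.spark.sql.SparkSession

    // Spark schedules spark.executor.cores tasks per executor regardless of
    // what the EMR ResourceManager UI reports (illustrative values).
    val spark = SparkSession.builder()
      .config("spark.executor.cores", "5")
      .config("spark.executor.memory", "20g")
      .getOrCreate()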

Re: Can spark handle this scenario?

2018-02-26 Thread Lian Jiang
Thanks Vijay. After changing the programming model (creating a context class for the workers), it finally worked for me. Cheers. On Fri, Feb 23, 2018 at 5:42 PM, vijay.bvp wrote: > when an HTTP connection is opened you are opening a connection between > a specific > machine (with
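A minimal sketch of that "context class for the workers" pattern, with `HttpSession` and `openSession` as hypothetical stand-ins for the real client library: a `lazy val` in an object is initialized once per executor JVM instead of being serialized from the driver.

    // Hypothetical stand-ins for whatever HTTP client is actually used.
    class HttpSession { def get(url: String): String = s"response from $url" }
    def openSession(): HttpSession = new HttpSession

    object WorkerContext {
      // Initialized lazily, once per executor JVM; never shipped from the driver.
      @transient lazy val session: HttpSession = openSession()
    }

    // Assumption: urls is an RDD[String]; each task reuses its executor's session.
    val responses = urls.map(url => WorkerContext.session.get(url))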

Unsubscribe

2018-02-26 Thread Mallanagouda Patil
Unsubscribe

RE: spark 2 new stuff

2018-02-26 Thread Stefan Panayotov
To me Delta is very valuable. Stefan Panayotov, PhD spanayo...@outlook.com spanayo...@comcast.net Cell: 610-517-5586 From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Monday, February 26, 2018 9:26 AM To: user @spark

spark 2 new stuff

2018-02-26 Thread Mich Talebzadeh
Just a quick query. From a practitioner's point of view, what new features of Spark 2 have been the best value for money? I hear different and often conflicting stuff. However, I was wondering if the user group has more practical takes. regards, Dr Mich Talebzadeh LinkedIn *

Re: Trigger.ProcessingTime("10 seconds") & Trigger.Continuous(10.seconds)

2018-02-26 Thread naresh Goud
Thanks, I'll check it out. On Mon, Feb 26, 2018 at 12:11 AM Tathagata Das wrote: > The continuous one is our new low latency continuous processing engine in > Structured Streaming (to be released in 2.3). > Here is the pre-release doc - >
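For anyone comparing the two triggers in the subject line, a sketch against the 2.3 API (the console sink and the intervals are illustrative):

    import org.apache.spark.sql.streaming.Trigger

    // Classic micro-batching: a batch is planned every 10 seconds.
    val microBatch = df.writeStream.format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    // Continuous processing (new in 2.3): long-running tasks; the interval
    // is a checkpoint interval, not a batch boundary.
    val continuous = df.writeStream.format("console")
      .trigger(Trigger.Continuous("10 seconds"))
      .start()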

Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread Selvam Raman
Hi Fawze, Yes, it is true that I am running in yarn mode; 5 containers represent 4 executors and 1 master. But I am not expecting these details, as I am already aware of this. What I want to know is the relationship between vcores (EMR YARN) vs executor-cores (Spark). From my slave configuration I

Unsubscribe

2018-02-26 Thread purna pradeep
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Data loss in spark job

2018-02-26 Thread Faraz Mateen
Hi, I think I have a situation where spark is silently failing to write data to my Cassandra table. Let me explain my current situation. I have a table consisting of around 402 million records. The table consists of 84 columns. Table schema is something like this: *id (text) | datetime

Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread Fawze Abujaber
It's recommended to use executor-cores of 5. Each executor here will utilize 20 GB, which means the spark job will utilize 50 cpu cores and 100GB memory. You cannot run more than 4 executors because your cluster doesn't have enough memory. You see 5 executors because 4 are for the job and one is for the

Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread Selvam Raman
Master Node details: lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 4 Core(s) per socket: 1 Socket(s): 1 NUMA node(s): 1 Vendor ID:

Spark EMR executor-core vs Vcores

2018-02-26 Thread Selvam Raman
Hi, spark version - 2.0.0, spark distribution - EMR 5.0.0, Spark Cluster - one master, 5 slaves. Master node - m3.xlarge - 8 vCore, 15 GiB memory, 80 GB SSD storage. Slave node - m3.2xlarge - 16 vCore, 30 GiB memory, 160 GB SSD storage. Cluster Metrics: Apps Submitted | Apps Pending | Apps Running | Apps

Out of memory Error when using Collection Accumulator Spark 2.2

2018-02-26 Thread Patrick
Hi, We were getting an OOM error when accumulating the results of each worker. We were trying to avoid collecting data to the driver node and instead used an accumulator, as per the code snippet below. Is there any spark config for the accumulator settings, or am I going the wrong way about collecting the huge
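For context, a minimal sketch of the pattern in question (names illustrative): a CollectionAccumulator merges every added element into a single collection on the driver, which is why driver memory becomes the limit.

    import org.apache.spark.util.CollectionAccumulator

    // Every element added on the executors is merged on the driver, so the
    // accumulated collection must fit in driver memory.
    val acc: CollectionAccumulator[String] =
      spark.sparkContext.collectionAccumulator[String]("results")

    df.foreach(row => acc.add(row.mkString(",")))  // runs on executors
    val gathered = acc.value                       // materialized on the driver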

Is there a way to query dataframe views directly without going through scheduler?

2018-02-26 Thread kant kodali
Hi All, I wonder if there is a way to query data frame views directly without going through the scheduler? For example, say I have the following code: Dataset kafkaDf = session.readStream().format("kafka").load(); kafkaDf.createOrReplaceTempView("table"). Now can I query the view "table" without going
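For reference, a sketch of the snippet from the question using the standard API name (`createOrReplaceTempView`); the Kafka options and topic are illustrative. The registered view is then queried through ordinary Spark SQL.

    // Assumption: session is an existing SparkSession.
    val kafkaDf = session.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092") // illustrative
      .option("subscribe", "events")                  // illustrative topic
      .load()

    kafkaDf.createOrReplaceTempView("table")

    // Queries on the view are planned and executed like any other Spark SQL job.
    val result = session.sql("SELECT * FROM table")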