Hi,
I know we can set HADOOP_CONF_DIR in spark-env.sh, but we want to set
HADOOP_CONF_DIR and HIVE_HOME for Spark from Java code so we can target
different clusters. Is there a way to set spark-env settings
programmatically? Thanks for any
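One approach that might work (a sketch only, since HADOOP_CONF_DIR and HIVE_HOME
are read from the environment rather than from SparkConf) is to launch the Spark
application as a child process and hand it the environment for the cluster you
want; the SparkLauncher API accepts an environment map. The paths, jar and class
names below are purely illustrative, and the same calls are available from Java:

import org.apache.spark.launcher.SparkLauncher
import scala.collection.JavaConverters._

// Illustrative paths only; point them at the cluster you want to target.
val env = Map(
  "HADOOP_CONF_DIR" -> "/etc/hadoop/conf-cluster-a",
  "HIVE_HOME"       -> "/opt/hive"
).asJava

val handle = new SparkLauncher(env)
  .setAppResource("/path/to/my-app.jar")   // hypothetical application jar
  .setMainClass("com.example.MyApp")       // hypothetical main class
  .setMaster("yarn")
  .startApplication()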
I am using spark in Jupyter as below:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlCtx = SQLContext(sc)
df = sqlCtx.read.parquet("oci://mybucket@mytenant/myfile.parquet")
The error is:
Py4JJavaError: An error occurred while calling o198.parquet.
:
I looked at Java's mechanism for creating temporary local files. I believe
they can be created, written to, and passed to other programs on the system.
I wrote a proof of concept that sends some strings out and uses the local
program cat to concatenate them and write the result to a local file.
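For what it's worth, a rough sketch of that proof of concept (the strings and
file prefixes are illustrative, and it assumes cat is on the PATH):

import java.io.File
import java.nio.file.Files
import scala.sys.process._

// Write a couple of strings to temporary local files...
val inputs = Seq("hello\n", "world\n").map { s =>
  val f = File.createTempFile("poc-", ".txt")
  f.deleteOnExit()
  Files.write(f.toPath, s.getBytes("UTF-8"))
  f
}

// ...pipe them through the local cat program and capture the result in another temp file.
val out = File.createTempFile("poc-out-", ".txt")
((Seq("cat") ++ inputs.map(_.getAbsolutePath)) #> out).!

println(new String(Files.readAllBytes(out.toPath), "UTF-8"))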
So 164 GB of parquet data can potentially explode to up to 1000 GB if the
data is compressed (in practice it would be more like 400-600 GB).
Your executors have about 96 GB of data.
With that kind of volume, 100-300 executors is OK (I would do tests with
100-300), but 30k shuffle partitions are
My previous answers to this question can be found in the archives, along
with some other responses:
http://apache-spark-user-list.1001560.n3.nabble.com/testing-frameworks-td32251.html
https://www.mail-archive.com/user%40spark.apache.org/msg48032.html
I have made a couple of presentations on the
Small update, my initial estimate was incorrect. I have one location with
16*4 GB = 64 GB of parquets (in Snappy) + 20*5 GB = 100 GB of parquets, so a
total of 164 GB.
I am running on Databricks.
Here are some settings:
spark.executor.extraJavaOptions=-XX:ReservedCodeCacheSize=256m -XX:+UseCodeCacheFlushing
Hard to answer in a succinct manner, but I'll give it a shot.
Cucumber is a tool for writing *behaviour*-driven tests (closely related to
behaviour-driven development, BDD).
It is not a mere *technical* approach to testing but a mindset, a way of
working and a different (different, whether it is
While there is some merit to that thought process, I would steer away from
premature JVM GC optimization of this kind.
What are the memory, CPU and other settings (e.g. any JVM/GC settings) for
the executors and driver?
So assuming that you are reading about 16 files of say 2-4 GB each, that’s
Agree, and I will try it. One clarification though: the number of
partitions also affects their in-memory size, so fewer partitions may
result in higher memory pressure and OOMs. I think this was the original
intention.
So the motivation for partitioning is also to break down volumes to fit the
Sparklens from Qubole is a good source. Other tests are to be handled by the
developer.
Best,
Ravi
On Thu, Nov 15, 2018, 12:45 PM
> Hi all,
> How are you testing your Spark applications?
>
> We are writing features by using Cucumber. This is testing the behaviours.
> Is this called functional
Hi,
I got the TF-IDF vectors for the documents, stored them in an RDD and
converted it into a RowMatrix:
val mat = new RowMatrix(tweets_tfidf)
Every element of the RDD is a sparse Vector corresponding to a document.
The problem is that the cosine similarity (*columnSimilarities*) is computed
between columns. Is there any
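If the goal is similarity between documents (rows) rather than between columns,
one possible workaround, sketched below under the assumption that tweets_tfidf
is the RDD of vectors above (and noting that this can get expensive for a large
number of documents), is to transpose the matrix via a CoordinateMatrix so that
each document becomes a column:

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val indexed = new IndexedRowMatrix(
  tweets_tfidf.zipWithIndex.map { case (v, i) => IndexedRow(i, v) }
)
// After the transpose, column i corresponds to document i,
// so columnSimilarities() now compares documents.
val docSims = indexed
  .toCoordinateMatrix()
  .transpose()
  .toRowMatrix()
  .columnSimilarities()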
Testing the *columnSimilarities* method in Spark, I create a *RowMatrix*
object:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val temp = sc.parallelize(Array((5.0, 1.0, 4.0), (2.0, 3.0, 8.0),
  (4.0, 5.0, 10.0), (1.0, 3.0, 6.0)))
val rows = temp.map(line => Vectors.dense(Array(line._1, line._2, line._3)))
val mat = new RowMatrix(rows)
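For completeness, a small sketch of how the result could then be inspected
(columnSimilarities returns an upper-triangular CoordinateMatrix of cosine
similarities between the columns of mat):

val sims = mat.columnSimilarities()
sims.entries.collect().foreach { e =>
  println(s"columns ${e.i} and ${e.j}: ${e.value}")
}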
30k SQL shuffle partitions is extremely high. Core to partition is 1 to 1,
and the default value of spark.sql.shuffle.partitions is 200. Set it to 300
or leave it at the default, see which one gives the best performance, and
after you do that, see how the cores are being used.
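For reference, one way to experiment with that suggestion (300 is just the
value mentioned above, and spark here is the SparkSession):

// Takes effect for subsequent shuffles in this session.
spark.conf.set("spark.sql.shuffle.partitions", "300")
// or at submit time: --conf spark.sql.shuffle.partitions=300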
Regards,
Shahbaz
On Thu, Nov 15, 2018 at 10:58
Hi all,
How are you testing your Spark applications?
We are writing features using Cucumber. This is testing the behaviours. Is
this called a functional test or an integration test?
We are also planning to write unit tests.
For instance we have a class like the one below. It has one method. This method
Oh, regarding shuffle.partitions being 30k: I don't know. I inherited the
workload from an engineer who is no longer around and am trying to make
sense of things in general.
On Thu, Nov 15, 2018 at 7:26 PM Vitaliy Pisarev <
vitaliy.pisa...@biocatch.com> wrote:
> The quest is dual:
The quest is dual:
- Increase utilisation, because cores cost money and I want to make sure
that I fully utilise what I pay for. This is very blunt of course, because
there is always I/O and at least some degree of skew. The bottom line is to
do the same thing over the same time but with
For that little data, I find spark.sql.shuffle.partitions = 30k to be very
high.
Any reason for that high value?
Do you have a baseline observation with the default value?
Also, setting the job group and job description through the API and observing
them in the UI will help you understand the code
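A minimal sketch of what that could look like (the group id, description and
DataFrame are illustrative):

// Tag a phase of the application so its jobs are easy to find in the Spark UI.
sc.setJobGroup("stage-1-read-and-filter", "read parquet and filter", interruptOnCancel = false)
val filtered = df.filter("value > 0")   // df is whatever DataFrame you are working with
filtered.count()
sc.clearJobGroup()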
I am working with parquets, and the metadata reading there is quite fast as
there are at most 16 files (a couple of gigs each).
I find it very hard to answer the question "how many partitions do you
have?"; many Spark operations do not preserve partitioning, and I have a lot
of filtering and
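One quick way to answer that question at any given point in the pipeline
(df is whatever DataFrame is being inspected):

// Number of partitions of the underlying RDD at this point in the plan.
println(df.rdd.getNumPartitions)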
Can you shed more light on what kind of processing you are doing?
One common pattern I have seen for active core/executor utilization dropping
to zero is reading ORC data, where the driver seems (I think) to be doing
schema validation.
In my case I would have hundreds of thousands of
I recently came across this (haven’t tried it out yet) but maybe it can help
guide you to identify the root cause.
https://github.com/groupon/sparklint
From: Vitaliy Pisarev
Date: Thursday, November 15, 2018 at 10:08 AM
To: user
Cc: David Markovitz
Subject: How to address seemingly low
I have a workload that runs on a cluster of 300 cores.
Below is a plot of the amount of active tasks over time during the
execution of this workload:
[image: plot of the number of active tasks over time during the workload]
What I deduce is that there are substantial intervals where the cores are
heavily under-utilised.
What actions can I take to:
That is precisely my question: what kind of leads can I look at to get a
hint of where the inefficiencies lie?
On Thu, Nov 15, 2018 at 4:56 PM David Markovitz <
dudu.markov...@microsoft.com> wrote:
> It seems it is almost fully utilized – when it is active.
>
> What happens in the gaps, where
If folks are interested, while it's not on Amazon, I've got a live stream
of getting client mode with a Jupyter notebook to work on GCP/GKE:
https://www.youtube.com/watch?v=eMj0Pv1-Nfo=3=PLRLebp9QyZtZflexn4Yf9xsocrR_aSryx
On Wed, Oct 31, 2018 at 5:55 PM Zhang, Yuqi wrote:
> Hi Li,
>
> Thank
Hi,
I have this JSON document:
{ "b": [ { "e": 1 } ] }
When I do:
df.select(expr("transform( b, v1 -> struct(v1.e) )"))
I get this error :
cannot resolve 'named_struct(NamePlaceholder(), namedlambdavariable().e)'
due to data type mismatch: Only foldable string expressions are allowed to
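One workaround that may be worth trying (an assumption on my part, not verified
on every Spark version) is to give the struct field an explicit, literal name
inside the lambda, e.g. via named_struct:

import org.apache.spark.sql.functions.expr
import spark.implicits._

val df = spark.read.json(Seq("""{ "b": [ { "e": 1 } ] }""").toDS())
// The field name is now a foldable string literal rather than a placeholder.
df.select(expr("transform(b, v1 -> named_struct('e', v1.e))")).show(false)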
Hello all,
I am running a simple word count application using storage level
MEMORY_ONLY in one case and OFF_HEAP in the other. I see that the execution
time when I run my application off-heap is higher than on-heap, so I am
looking into where this time goes.
One thought at first is that
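For context, a minimal sketch of the comparison being described (the input path
is illustrative, and as far as I know OFF_HEAP also requires
spark.memory.offHeap.enabled=true and a non-zero spark.memory.offHeap.size):

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("input.txt").flatMap(_.split("\\s+")).map((_, 1))

val onHeap  = words.reduceByKey(_ + _).persist(StorageLevel.MEMORY_ONLY)
val offHeap = words.reduceByKey(_ + _).persist(StorageLevel.OFF_HEAP)

onHeap.count()   // time this run...
offHeap.count()  // ...against this one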