programmatically set hadoop_conf_dir for spark

2018-11-15 Thread 数据与人工智能产品开发部
Hi, I know we can set HADOOP_CONF_DIR in spark-env.sh, but we want to set HADOOP_CONF_DIR and HIVE_HOME for Spark in Java code so we can target different clusters. Is there a way to set the spark-env settings programmatically? Thanks for any
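
One possible approach (a hedged sketch, not a confirmed answer from this thread) is to launch the application through SparkLauncher, whose constructor accepts an environment map, so HADOOP_CONF_DIR and HIVE_HOME can be chosen per target cluster at runtime. All paths and class names below are hypothetical:

    import org.apache.spark.launcher.SparkLauncher
    import scala.collection.JavaConverters._

    // Environment seen by the launched driver, instead of values from spark-env.sh.
    val env = Map(
      "HADOOP_CONF_DIR" -> "/etc/clusters/prod/hadoop/conf", // hypothetical path
      "HIVE_HOME"       -> "/opt/hive"                       // hypothetical path
    ).asJava

    val handle = new SparkLauncher(env)
      .setAppResource("/path/to/app.jar") // hypothetical application jar
      .setMainClass("com.example.Main")   // hypothetical main class
      .setMaster("yarn")
      .startApplication()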

spark in jupyter cannot find a class in a jar

2018-11-15 Thread Lian Jiang
I am using spark in Jupyter as below: import findspark findspark.init() from pyspark import SQLContext, SparkContext sqlCtx = SQLContext(sc) df = sqlCtx.read.parquet("oci://mybucket@mytenant/myfile.parquet") The error is: Py4JJavaError: An error occurred while calling o198.parquet. :
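
A hedged sketch of the usual first thing to check when a class in a connector jar cannot be found: the jar must be on the driver and executor classpath before the session starts. Shown in Scala with a hypothetical jar path; from Jupyter/findspark the same spark.jars key can be passed through PYSPARK_SUBMIT_ARGS or spark-defaults.conf, and the oci:// scheme may need additional Hadoop fs settings that are not shown here:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("oci-parquet-read")
      .config("spark.jars", "/path/to/oci-hdfs-connector.jar") // hypothetical connector jar
      .getOrCreate()

    val df = spark.read.parquet("oci://mybucket@mytenant/myfile.parquet")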

Re: writing to local files on a worker

2018-11-15 Thread Steve Lewis
I looked at Java's mechanism for creating temporary local files. I believe they can be created, written to, and passed to other programs on the system. I wrote a proof of concept to send some Strings out and use the local program cat to concatenate them and write the result to a local file.
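
A minimal sketch of that idea (not the author's actual proof of concept), assuming a POSIX worker with cat on the PATH: each partition writes its strings to a worker-local temp file, runs cat over it, and reads the result back:

    import java.nio.file.Files
    import scala.sys.process._
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("local-file-poc").getOrCreate()
    val sc = spark.sparkContext

    val result = sc.parallelize(Seq("alpha", "beta", "gamma"), 2).mapPartitions { iter =>
      val in  = Files.createTempFile("poc-in", ".txt")  // worker-local temp files
      val out = Files.createTempFile("poc-out", ".txt")
      Files.write(in, iter.mkString("\n").getBytes("UTF-8"))
      (Seq("cat", in.toString) #> out.toFile).!         // run a local program against the file
      Iterator(new String(Files.readAllBytes(out), "UTF-8"))
    }.collect()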

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Thakrar, Jayesh
So 164 GB of Parquet data can potentially explode to up to 1000 GB if the data is compressed (in practice it would be more like 400-600 GB). Your executors have about 96 GB of data. With that kind of volume, 100-300 executors is OK (I would do tests with 100-300), but 30k shuffle partitions are

Re: Testing Apache Spark applications

2018-11-15 Thread Lars Albertsson
My previous answers to this question can be found in the archives, along with some other responses: http://apache-spark-user-list.1001560.n3.nabble.com/testing-frameworks-td32251.html https://www.mail-archive.com/user%40spark.apache.org/msg48032.html I have made a couple of presentations on the

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
Small update: my initial estimate was incorrect. I have one location with 16 * 4 GB = 64 GB of Parquet files (Snappy-compressed) + 20 * 5 GB = 100 GB of Parquet files, so a total of 164 GB. I am running on Databricks. Here are some settings: spark.executor.extraJavaOptions=-XX:ReservedCodeCacheSize=256m -XX:+UseCodeCacheFlushing

Re: Testing Apache Spark applications

2018-11-15 Thread Vitaliy Pisarev
Hard to answer in a succinct manner, but I'll give it a shot. Cucumber is a tool for writing *Behaviour* Driven Tests (closely related to behaviour-driven development, BDD). It is not a mere *technical* approach to testing but a mindset, a way of working, and a different (different, whether it is

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Thakrar, Jayesh
While there is some merit to that thought process, I would steer away from premature JVM GC optimization of this kind. What are the memory, CPU, and other settings (e.g. any JVM/GC settings) for the executors and driver? So, assuming that you are reading about 16 files of say 2-4 GB each, that’s

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
Agree, and I will try it. One clarification though: the number of partitions also affects their in-memory size, so fewer partitions may result in higher memory pressure and OOMs. I think this was the original intention. So the motivation for partitioning is also to break down volumes to fit the

Re: Testing Apache Spark applications

2018-11-15 Thread ☼ R Nair
Sparklens from Qubole is a good source. Other tests are to be handled by the developer. Best, Ravi On Thu, Nov 15, 2018, 12:45 PM: > Hi all, > How are you testing your Spark applications? > We are writing features by using Cucumber. This is testing the behaviours. > Is this called functional

Using cosinSimilarity method for getting pairwise documents similarity

2018-11-15 Thread Soheil Pourbafrani
Hi, I got the TF-IDF vectors for the documents, stored them in an RDD, and converted it into a RowMatrix: val mat = new RowMatrix(tweets_tfidf) Every element of the RDD is a sparse Vector representing a document. The problem is that *columnSimilarities* computes the similarity between columns. Is there any
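
A hedged sketch of one way to get document-to-document (row) similarities, not necessarily the thread's eventual answer: transpose the matrix so documents become columns, then call columnSimilarities(). It assumes an existing SparkContext sc; the vectors are stand-ins for the real TF-IDF output:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    val tweets_tfidf = sc.parallelize(Seq(               // stand-in for the real TF-IDF RDD
      Vectors.sparse(3, Seq((0, 1.2), (2, 0.7))),
      Vectors.sparse(3, Seq((1, 0.4), (2, 0.9)))))

    val indexed = new IndexedRowMatrix(
      tweets_tfidf.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })
    val docsAsColumns = indexed.toCoordinateMatrix.transpose.toRowMatrix
    val docSimilarities = docsAsColumns.columnSimilarities() // now document-to-document cosine similarity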

Using columnSimilarity with threshold result in greater than one

2018-11-15 Thread Soheil Pourbafrani
Testing the *columnSimilarities* method in Spark, I create a *RowMatrix* object: val temp = sc.parallelize(Array((5.0, 1.0, 4.0), (2.0, 3.0, 8.0), (4.0, 5.0, 10.0), (1.0, 3.0, 6.0))) val rows = temp.map(line => { Vectors.dense(Array(line._1, line._2, line._3)) }) val mat = new RowMatrix(rows)
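
For context, a hedged sketch of the likely explanation (my understanding, not confirmed in this thread): columnSimilarities(threshold) uses DIMSUM sampling, so its entries are estimates that can exceed 1.0, whereas the no-argument version is an exact brute-force computation. An existing SparkContext sc is assumed:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.dense(5.0, 1.0, 4.0),
      Vectors.dense(2.0, 3.0, 8.0),
      Vectors.dense(4.0, 5.0, 10.0),
      Vectors.dense(1.0, 3.0, 6.0)))
    val mat = new RowMatrix(rows)

    val exact  = mat.columnSimilarities()    // brute force, exact cosine similarities
    val approx = mat.columnSimilarities(0.1) // DIMSUM sampling: estimates, may overshoot 1.0
    exact.entries.collect().foreach(println)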

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Shahbaz
30k SQL shuffle partitions is extremely high. Core to partition is 1 to 1; the default value of spark.sql.shuffle.partitions is 200. Set it to 300 or leave it at the default, see which one gives the best performance, and after you do that, see how the cores are being used. Regards, Shahbaz On Thu, Nov 15, 2018 at 10:58
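
A minimal sketch of trying that suggestion, assuming an existing SparkSession named spark:

    // The default is 200; the workload currently uses 30k, far more than 300 cores need.
    spark.conf.set("spark.sql.shuffle.partitions", "300")
    println(spark.conf.get("spark.sql.shuffle.partitions"))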

Testing Apache Spark applications

2018-11-15 Thread Omer.Ozsakarya
Hi all, How are you testing your Spark applications? We are writing features by using Cucumber. This is testing the behaviours. Is this called a functional test or an integration test? We are also planning to write unit tests. For instance, we have a class like below. It has one method. This method
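
As a rough illustration of the unit-test side (the class mentioned in the message is not shown, so the transformation below is a hypothetical stand-in), a local-master ScalaTest sketch:

    import org.apache.spark.sql.SparkSession
    import org.scalatest.FunSuite

    class CleanAgesSpec extends FunSuite {
      private lazy val spark = SparkSession.builder()
        .master("local[2]").appName("unit-test").getOrCreate()

      test("negative ages are filtered out") {
        import spark.implicits._
        val in  = Seq(("a", 30), ("b", -1)).toDF("name", "age")
        val out = in.filter($"age" >= 0) // stand-in for the method under test
        assert(out.count() == 1)
      }
    }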

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
Oh, regarding shuffle.partitions being 30k, I don't know. I inherited the workload from an engineer who is no longer around and am trying to make sense of things in general. On Thu, Nov 15, 2018 at 7:26 PM Vitaliy Pisarev <vitaliy.pisa...@biocatch.com> wrote: > The quest is dual: > -

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
The quest is dual: - Increase utilisation, because cores cost money and I want to make sure that I fully utilise what I pay for. This is very blunt of course, because there is always I/O and at least some degree of skew. Bottom line: do the same thing over the same time but with

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Thakrar, Jayesh
For that little data, I find spark.sql.shuffle.partitions = 30k to be very high. Any reason for that high value? Do you have a baseline observation with the default value? Also, enabling the job group and job info through the API and observing through the UI will help you understand the code
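
For reference, a minimal sketch of that job-group suggestion, assuming an existing SparkContext named sc and hypothetical labels:

    // Tag subsequent jobs so they are identifiable on the Spark UI's Jobs page.
    sc.setJobGroup("report-x-join", "join + aggregate for report X")
    // ... run the stage under investigation ...
    sc.clearJobGroup()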

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
I am working with Parquet files and the metadata reading there is quite fast, as there are at most 16 files (a couple of gigs each). I find it very hard to answer the question "how many partitions do you have?": many Spark operations do not preserve partitioning, and I have a lot of filtering and
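
One small way to answer that empirically (a sketch; the input path and column are hypothetical) is to print getNumPartitions at the stages of interest:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partition-count").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("/path/to/input.parquet") // hypothetical input
    val filtered = df.filter($"some_column" > 0)          // hypothetical filtering stage

    println(s"partitions after read:   ${df.rdd.getNumPartitions}")
    println(s"partitions after filter: ${filtered.rdd.getNumPartitions}")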

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Thakrar, Jayesh
Can you shed more light on what kind of processing you are doing? One common pattern that I have seen, where active core/executor utilization drops to zero, is while reading ORC data, when the driver seems (I think) to be doing schema validation. In my case I would have hundreds of thousands of

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Brandon Geise
I recently came across this (haven’t tried it out yet) but maybe it can help guide you to identify the root cause. https://github.com/groupon/sparklint From: Vitaliy Pisarev Date: Thursday, November 15, 2018 at 10:08 AM To: user Cc: David Markovitz Subject: How to address seemingly low

How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
I have a workload that runs on a cluster of 300 cores. Below is a plot of the number of active tasks over time during the execution of this workload: [image: image.png] What I deduce is that there are substantial intervals where the cores are heavily under-utilised. What actions can I take to:

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
That is precisely my question: what kind of leads can I look at to get a hint of where the inefficiencies lie? On Thu, Nov 15, 2018 at 4:56 PM David Markovitz <dudu.markov...@microsoft.com> wrote: > It seems it is almost fully utilized – when it is active. > > What happens in the gaps, where

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-11-15 Thread Holden Karau
If folks are interested, while it's not on Amazon, I've got a live stream of getting client mode with a Jupyter notebook to work on GCP/GKE: https://www.youtube.com/watch?v=eMj0Pv1-Nfo=3=PLRLebp9QyZtZflexn4Yf9xsocrR_aSryx On Wed, Oct 31, 2018 at 5:55 PM Zhang, Yuqi wrote: > Hi Li, > Thank

[Spark SQL] [Spark 2.4.0] v1 -> struct(v1.e) fails

2018-11-15 Thread François Sarradin
Hi, I have this JSON document: { "b": [ { "e": 1 } ] } When I do: df.select(expr("transform(b, v1 -> struct(v1.e))")) I get this error: cannot resolve 'named_struct(NamePlaceholder(), namedlambdavariable().e)' due to data type mismatch: Only foldable string expressions are allowed to
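
A hedged workaround sketch (not verified against this exact 2.4.0 case): give the struct field an explicit, literal name so the analyzer does not have to resolve a NamePlaceholder from the lambda variable:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.expr

    val spark = SparkSession.builder().appName("transform-struct").getOrCreate()
    import spark.implicits._

    val df = spark.read.json(Seq("""{ "b": [ { "e": 1 } ] }""").toDS())

    // named_struct with a literal field name instead of struct(v1.e)
    val out = df.select(expr("transform(b, v1 -> named_struct('e', v1.e))"))
    out.show(false)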

Measure Serialization / De-serialization Time

2018-11-15 Thread Jack Kolokasis
Hello all, I am running a simple Word Count application using storage level MEMORY_ONLY in one case and OFF_HEAP in the other. I see that the execution time when I run my application off-heap is higher than on-heap, so I am looking at where this time goes. One thought at first is that
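
One hedged way to see where that time goes (a sketch using Spark's public TaskMetrics fields; it does not by itself explain the on-heap vs off-heap gap): register a SparkListener and print per-task serialization and deserialization times:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ser-deser-timing").getOrCreate()
    val sc = spark.sparkContext

    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) {
          println(s"task ${taskEnd.taskInfo.taskId}: " +
            s"executorDeserializeTime=${m.executorDeserializeTime} ms, " +
            s"resultSerializationTime=${m.resultSerializationTime} ms")
        }
      }
    })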