Re: Can reduced parallelism lead to no shuffle spill?

2019-11-07 Thread Alexander Czech
Why don't you just repartition the dataset? If the partitions are really that unevenly sized you should probably do that first. That potentially also saves a lot of trouble later on. On Thu, Nov 7, 2019 at 5:14 PM V0lleyBallJunki3 wrote: > Consider an example where I have a cluster with 5 nodes and each
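
A minimal PySpark sketch of that suggestion (the DataFrame, input path and partition count are illustrative assumptions, not taken from the thread):

    # 'spark' is assumed to be the active SparkSession; paths are hypothetical.
    df = spark.read.parquet("s3a://bucket/input/")
    df = df.repartition(200)   # 200 is an arbitrary illustrative value; evens out partition sizes
    # Or repartition by a key column so related rows end up in the same partition:
    # df = df.repartition(200, "some_key_column")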

How to Load a Graphx Graph from a parquet file?

2019-08-29 Thread Alexander Czech
Hey all, I want to load a parquet file containing my edges into a Graph. My code so far looks like this: val edgesDF = spark.read.parquet("/path/to/edges/parquet/") val edgesRDD = edgesDF.rdd val graph = Graph.fromEdgeTuples(edgesRDD, 1) But this simply produces an error: [error] found : org.apac

GraphFrames cluster stalling with a large dataset and pyspark

2019-08-26 Thread Alexander Czech
Hey All, I'm trying to run PageRank with GraphFrames on a large graph (about 90 billion edges and 1.4TB total disk size) and I'm running into some issues. The code is very simple: it just loads dataframes from S3 and puts them into the GraphFrames pagerank function. But when I run it the clust
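
For reference, a minimal PySpark sketch of the kind of job described here; the paths, column names and parameters are assumptions, and it presumes the graphframes package is available on the cluster:

    from graphframes import GraphFrame

    # 'spark' is the active SparkSession; paths are hypothetical.
    vertices = spark.read.parquet("s3a://bucket/vertices/")   # must contain an 'id' column
    edges = spark.read.parquet("s3a://bucket/edges/")         # must contain 'src' and 'dst' columns
    g = GraphFrame(vertices, edges)
    results = g.pageRank(resetProbability=0.15, maxIter=10)   # adds a 'pagerank' column to the vertices
    results.vertices.write.parquet("s3a://bucket/pagerank/")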

How to use HDFS >3.1.1 with spark 2.3.3 to output parquet files to S3?

2019-07-14 Thread Alexander Czech
As the subject suggests, I want to output a parquet file to S3. I know this was rather troublesome in the past because S3 has no move operation, so a rename needed to be done as a copy+delete. This issue has been discussed before, see: http://apache-spark-user-list.1001560.n3.nabble.com/Writing-files-to-s3-with-out-tempor
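
For context, a hedged sketch of the kind of configuration such setups use. Whether these committer classes are available depends on how Spark was built (they come from the optional spark-hadoop-cloud module plus a Hadoop 3.1+ S3A committer), so treat the keys, class names and paths below as illustrative rather than a confirmed recipe:

    from pyspark.sql import SparkSession

    # Illustrative settings only -- check the versions of Spark and Hadoop actually deployed.
    spark = (SparkSession.builder
        .config("spark.hadoop.fs.s3a.committer.name", "directory")
        .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        .getOrCreate())

    spark.read.parquet("s3a://bucket/in/").write.parquet("s3a://bucket/out/")  # hypothetical paths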

How to use the Graphframe PageRank method with dangling edges?

2018-11-05 Thread Alexander Czech
I have a graph that has a couple of dangling edges. I use pyspark and work with spark 2.2.0. It kind of looks like this:

g.vertices.show()
+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
+---+

g.edges.show()
+---+----+
|src| dst|
+---+----+
|  1|   2|
|  2|   3|
|  3|   4|
|  4|   1|
|  4|null|
+---+----+
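
One possible workaround (an assumption, not something stated in the thread) is to drop or re-point edges whose destination is null before running PageRank; a minimal PySpark sketch:

    from graphframes import GraphFrame
    from pyspark.sql.functions import col

    # 'g' is the GraphFrame from the message above.
    clean_edges = g.edges.filter(col("dst").isNotNull())      # drop the dangling edge 4 -> null
    g_clean = GraphFrame(g.vertices, clean_edges)
    ranks = g_clean.pageRank(resetProbability=0.15, maxIter=10)
    ranks.vertices.show()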

Re: Repartition not working on a csv file

2018-07-01 Thread Alexander Czech
You could try to force a repartition right at that point by producing a cached version of the DF with .cache() if memory allows it. On Sun, Jul 1, 2018 at 5:04 AM, Abdeali Kothari wrote: > I've tried that too - it doesn't work. It does a repartition, but not right > after the broadcast join - it do
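
A short sketch of that suggestion; the DataFrame names, join key and partition count are placeholders:

    from pyspark.sql.functions import broadcast

    joined = big_df.join(broadcast(small_df), "key")   # the broadcast join discussed in the thread
    joined = joined.repartition(400).cache()           # 400 is arbitrary; cache pins the repartitioned data
    joined.count()                                     # an action so the cached, repartitioned DF materializes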

Re: Loading a large parquet file how much memory do I need

2017-11-27 Thread Alexander Czech
> are not careful it can end up burning you financially). > > Regards, > Gourav Sengupta > > On Mon, Nov 27, 2017 at 12:58 PM, Alexander Czech < > alexander.cz...@googlemail.com> wrote: > >> I don't use EMR I spin my clusters up using flintrock (being a stud

Re: Loading a large parquet file how much memory do I need

2017-11-27 Thread Alexander Czech
n that > you are using. > > > On another note, also mention the AWS Region you are in. If Redshift > Spectrum is available, or you can use Athena, or you can use Presto, then > running massive aggregates over huge data sets at a fraction of the cost and at > least 10x speed may be

Re: Loading a large parquet file how much memory do I need

2017-11-27 Thread Alexander Czech
,vector,text Basically there is one very big shuffle going on; the rest is not that heavy. The CPU-intensive lifting was done before that. On Mon, Nov 27, 2017 at 12:03 PM, Alexander Czech < alexander.cz...@googlemail.com> wrote: > I have a temporary result file ( the 10TB one) that looks like

Loading a large parquet file how much memory do I need

2017-11-27 Thread Alexander Czech
I want to load a 10TB parquet file from S3 and I'm trying to decide what EC2 instances to use. Should I go for instances that in total have more memory than 10TB? Or is it enough that they have enough SSD storage in total so that everything can be spilled to disk? Thanks
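
Spark can spill shuffle and sort data to local disk, so the cluster does not strictly need 10TB of RAM; what matters is enough memory per executor plus fast local scratch space. A sketch of the knobs involved (values are illustrative assumptions, not sizing advice):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.executor.memory", "40g")              # per-executor heap, assumed value
        .config("spark.memory.fraction", "0.6")              # share of the heap for execution/storage
        .config("spark.local.dir", "/mnt/ssd1,/mnt/ssd2")    # SSD scratch dirs used when Spark spills
        .getOrCreate())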

Re: HDFS or NFS as a cache?

2017-09-29 Thread Alexander Czech
: > How many files do you produce? I believe it spends a lot of time on renaming > the files because of the output committer. > Also instead of 5x c3.2xlarge try using 2x c3.8xlarge instead because they > have 10GbE and you can get good throughput for S3. > > On Fri, Sep 29, 2017 at 9:15 A
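
If the bottleneck really is per-file commit and rename overhead, one common mitigation (a sketch under that assumption, with placeholder names and numbers) is to cut the number of output files before writing:

    # Check how many partitions (and therefore output files) the job would write:
    print(df.rdd.getNumPartitions())
    # Collapse them so the committer has fewer renames/copies to do:
    df.coalesce(100).write.parquet("s3a://bucket/output/")   # 100 is an arbitrary illustrative value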

Re: More instances = slower Spark job

2017-09-29 Thread Alexander Czech
Does each gzip file look like this: {json1} {json2} {json3} meaning that each line is a separate json object? I process a similarly large file batch and what I do is this: input.txt # each line in input.txt represents a path to a gzip file, each containing one json object per line my_rdd = sc.par
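
A runnable sketch of one variant of this pattern (an assumption about the intent, not the author's exact code, which is cut off above):

    import json

    # input.txt: one path to a gzip'd JSON-lines file per line, as described above.
    with open("input.txt") as f:
        paths = [line.strip() for line in f if line.strip()]

    # 'sc' is the active SparkContext. textFile accepts a comma-separated list of
    # paths and decompresses gzip automatically; each record is one JSON line.
    my_rdd = sc.textFile(",".join(paths)).map(json.loads)
    print(my_rdd.take(1))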

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Alexander Czech
Yes, you need to store the file at a location where it is equally retrievable ("same path") for the master and all nodes in the cluster. A simple solution (apart from HDFS) that does not scale too well, but might be OK with only 3 nodes like in your configuration, is network-accessible storage (a
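
For small side files there is also a built-in option that avoids shared storage entirely; a sketch with a hypothetical file path:

    from pyspark import SparkFiles

    # 'sc' is the active SparkContext. The file is shipped from the driver to every executor.
    sc.addFile("/home/user/lookup.csv")          # hypothetical local file on the driver
    local_copy = SparkFiles.get("lookup.csv")    # resolves the node-local copy on driver or executors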

HDFS or NFS as a cache?

2017-09-29 Thread Alexander Czech
I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write parquet files to S3. But the S3 performance is bad for various reasons when I access S3 through the parquet write method: df.write.parquet('s3a://bucket/parquet') Now I want to set up a small cache for the parquet output. One o
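
One way such a cache is often set up (a sketch of an assumed approach, not necessarily what this thread settled on) is to write the parquet output to the cluster's HDFS first and copy it to S3 afterwards:

    # 'df' is a placeholder for the DataFrame being written; paths are hypothetical.
    df.write.parquet("hdfs:///tmp/parquet_out")   # HDFS handles the committer's renames cheaply
    # Then push the finished files to S3 out-of-band, e.g.:
    #   hadoop distcp hdfs:///tmp/parquet_out s3a://bucket/parquet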

Re: for loops in pyspark

2017-09-21 Thread Alexander Czech
ot; wrote: Spark manages memory allocation and release automatically. Can you post the complete program, which would help with checking where it goes wrong? On Wed, Sep 20, 2017 at 8:12 PM, Alexander Czech < alexander.cz...@googlemail.com> wrote: > Hello all, > > I'm running a pyspark script

for loops in pyspark

2017-09-20 Thread Alexander Czech
Hello all, I'm running a pyspark script that makes use of a for loop to create smaller chunks of my main dataset. Some example code: for chunk in chunks: my_rdd = sc.parallelize(chunk).flatMap(somefunc) # do some stuff with my_rdd my_df = make_df(my_rdd) # do some stuff with my_df
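
A slightly fleshed-out sketch of that loop; somefunc, make_df and chunks come from the message above, while the caching, unpersisting and per-chunk output path are assumptions about what such a loop typically needs so memory does not accumulate across iterations:

    for chunk in chunks:
        my_rdd = sc.parallelize(chunk).flatMap(somefunc)
        my_rdd.cache()                                            # keep the chunk while it is reused
        my_df = make_df(my_rdd)
        my_df.write.parquet("/tmp/chunk_output", mode="append")   # hypothetical per-chunk output
        my_rdd.unpersist()                                        # free the cached chunk before the next one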

Restarting the SparkContext in pyspark

2017-08-24 Thread Alexander Czech
I'm running a Jupyter-Spark setup and I want to benchmark my cluster with different input parameters. To make sure the environment stays the same I'm trying to reset (restart) the SparkContext. Here is some code: temp_result_parquet = os.path.normpath('/home/spark_tmp_parquet')
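
A minimal sketch of restarting the context between benchmark runs (the helper name and configuration values are hypothetical):

    from pyspark import SparkContext, SparkConf

    def restart_context(sc, conf_overrides=None):
        """Stop the running SparkContext and start a fresh one with new settings."""
        sc.stop()
        conf = SparkConf()
        for key, value in (conf_overrides or {}).items():
            conf.set(key, value)
        return SparkContext(conf=conf)

    # e.g. between benchmark runs:
    # sc = restart_context(sc, {"spark.executor.memory": "4g"})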