Why don't you just repartition the dataset? If the partitions are really that
unevenly sized, you should probably do that first. That potentially also
saves a lot of trouble later on.
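A minimal sketch of what that up-front repartition could look like (the path, input size, and the 128 MB-per-partition target are assumptions for illustration, not values from this thread):

```python
# Even out skewed partitions right after loading, before the heavy work.
def choose_partitions(total_bytes, target_bytes=128 * 1024 * 1024):
    """Pick a partition count aiming at roughly 128 MB per partition."""
    return max(1, (total_bytes + target_bytes - 1) // target_bytes)

def main():  # not invoked here; needs a running Spark session
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3a://my-bucket/data/")        # assumed path
    df = df.repartition(choose_partitions(500 * 1024**3))   # e.g. ~500 GB input
```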
On Thu, Nov 7, 2019 at 5:14 PM V0lleyBallJunki3
wrote:
> Consider an example where I have a cluster with 5 nodes and each
Hey all,
I want to load a parquet file containing my edges into a Graph. My code so far
looks like this:
val edgesDF = spark.read.parquet("/path/to/edges/parquet/")
val edgesRDD = edgesDF.rdd
val graph = Graph.fromEdgeTuples(edgesRDD, 1)
But this simply produces an error:
[error] found :
org.apac
Hey All,
I'm trying to run a pagerank with GraphFrames on a large graph (about 90
billion edges and 1.4TB total disk size) and I'm running into some issues.
The code is very simplistic: it just loads dataframes from S3 and puts them
into the GraphFrames pagerank function. But when I run it the clust
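For reference, a minimal sketch of that shape (the S3 paths are hypothetical placeholders; GraphFrames' `pageRank` takes `resetProbability` plus exactly one of `maxIter` or `tol`):

```python
# Sketch: load vertex/edge DataFrames from S3 and run GraphFrames PageRank.
def pagerank_kwargs(reset_prob=0.15, tol=None, max_iter=None):
    """Build kwargs for GraphFrame.pageRank: exactly one of tol / max_iter."""
    if (tol is None) == (max_iter is None):
        raise ValueError("set exactly one of tol or max_iter")
    kw = {"resetProbability": reset_prob}
    if tol is not None:
        kw["tol"] = tol
    else:
        kw["maxIter"] = max_iter
    return kw

def main():  # not invoked here; needs a Spark cluster with graphframes
    from pyspark.sql import SparkSession
    from graphframes import GraphFrame
    spark = SparkSession.builder.appName("pagerank").getOrCreate()
    vertices = spark.read.parquet("s3a://my-bucket/vertices/")  # assumed path
    edges = spark.read.parquet("s3a://my-bucket/edges/")        # assumed path
    g = GraphFrame(vertices, edges)
    ranks = g.pageRank(**pagerank_kwargs(tol=0.01))
    ranks.vertices.select("id", "pagerank").show()
```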
As the subject suggests, I want to output a parquet file to S3. I know this was
rather troublesome in the past because S3 has no move operation and instead
needs a copy+delete.
This issue has been discussed before, see:
http://apache-spark-user-list.1001560.n3.nabble.com/Writing-files-to-s3-with-out-tempor
I have a graph that has a couple of dangling edges. I use pyspark and work
with spark 2.2.0. It kind of looks like this:
g.vertices.show()
+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
+---+
g.edges.show()
+---+----+
|src| dst|
+---+----+
|  1|   2|
|  2|   3|
|  3|   4|
|  4|   1|
|  4|null|
+---+----+
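One way to handle such a dangling edge (a sketch, assuming the goal is simply to drop edges whose endpoints are not real vertices) is to filter the edge DataFrame against the vertex ids before building the GraphFrame:

```python
# Keep only edges whose src and dst both appear in the vertex set.
def drop_dangling(edges, vertex_ids):
    """Plain-Python version of the filter, for illustration."""
    return [(s, d) for (s, d) in edges if s in vertex_ids and d in vertex_ids]

def main():  # not invoked here; needs pyspark + graphframes
    from pyspark.sql import SparkSession
    from graphframes import GraphFrame
    spark = SparkSession.builder.getOrCreate()
    v = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["id"])
    e = spark.createDataFrame(
        [(1, 2), (2, 3), (3, 4), (4, 1), (4, None)], ["src", "dst"])
    # Same idea in Spark: an inner join against the vertex ids drops the
    # (4, null) row, because null never matches a join key.
    e_clean = e.join(v.withColumnRenamed("id", "dst"), on="dst", how="inner")
    g = GraphFrame(v, e_clean)
```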
You could try to force a repartition right at that point by producing a
cached version of the DF with .cache(), if memory allows it.
On Sun, Jul 1, 2018 at 5:04 AM, Abdeali Kothari
wrote:
> I've tried that too - it doesn't work. It does a repartition, but not right
> after the broadcast join - it do
> are not careful it can end up burning you financially).
>
> Regards,
> Gourav Sengupta
>
> On Mon, Nov 27, 2017 at 12:58 PM, Alexander Czech <
> alexander.cz...@googlemail.com> wrote:
>
>> I don't use EMR, I spin my clusters up using flintrock (being a stud
n that
> you are using.
>
>
> On another note, also mention the AWS Region you are in. If Redshift
> Spectrum is available, or you can use Athena, or you can use Presto, then
> running massive aggregates over huge data sets at a fraction of the cost and
> at least 10x the speed may be
Basically there is one very big shuffle going on; the rest is not that
heavy. The CPU-intensive lifting was done before that.
On Mon, Nov 27, 2017 at 12:03 PM, Alexander Czech <
alexander.cz...@googlemail.com> wrote:
> I have a temporary result file ( the 10TB one) that looks like
I want to load a 10TB parquet file from S3 and I'm trying to decide which
EC2 instances to use.
Should I go for instances that in total have a larger memory size than
10TB? Or is it enough that they have in total enough SSD storage so that
everything can be spilled to disk?
thanks
> How many files do you produce? I believe it spends a lot of time on renaming
> the files because of the output committer.
> Also instead of 5x c3.2xlarge try using 2x c3.8xlarge instead because they
> have 10GbE and you can get good throughput for S3.
>
> On Fri, Sep 29, 2017 at 9:15 A
Does each gzip file look like this:
{json1}
{json2}
{json3}
meaning that each line is a separate json object?
I process a similar large file batch and what I do is this:
input.txt # each line in input.txt represents a path to a gzip file, each
containing a JSON object on every line
my_rdd = sc.par
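The approach described above can be sketched roughly like this (a hedged sketch under the stated assumption that each gzip file holds one JSON object per line; `input.txt` is the path list from the post):

```python
import gzip
import json

def parse_json_lines(lines):
    """Parse an iterable of JSON-lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

def read_gzip_json(path):
    """Open one gzip file and parse every line as a JSON object."""
    with gzip.open(path, "rt") as f:
        return parse_json_lines(f)

def main():  # not invoked here; needs a running SparkContext
    from pyspark import SparkContext
    sc = SparkContext()
    with open("input.txt") as f:          # one gzip path per line
        paths = [p.strip() for p in f if p.strip()]
    my_rdd = sc.parallelize(paths).flatMap(read_gzip_json)
```

Parallelizing the list of paths (rather than the file contents) keeps the driver light: each task opens and parses its own gzip file on an executor.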
Yes, you need to store the file at a location where it is equally
retrievable ("same path") for the master and all nodes in the cluster. A
simple solution (apart from HDFS) that does not scale too well but might
be OK with only 3 nodes like in your configuration is network-accessible
storage (a
I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write
parquet files to S3. But the S3 performance, for various reasons, is bad when
I access S3 through the parquet write method:
df.write.parquet('s3a://bucket/parquet')
Now I want to setup a small cache for the parquet output. One o
Spark manages memory allocation and release automatically. Can you post the
complete program, which would help with checking what is wrong?
On Wed, Sep 20, 2017 at 8:12 PM, Alexander Czech <
alexander.cz...@googlemail.com> wrote:
> Hello all,
>
> I'm running a pyspark script
Hello all,
I'm running a pyspark script that makes use of a for loop to create smaller
chunks of my main dataset.
some example code:
for chunk in chunks:
    my_rdd = sc.parallelize(chunk).flatMap(somefunc)
    # do some stuff with my_rdd
    my_df = make_df(my_rdd)
    # do some stuff with my_df
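A fuller sketch of that loop (`somefunc` and `make_df` are the names from the post; the chunking helper and the `unpersist()` call are my assumptions about what could help release memory between iterations):

```python
def make_chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def main(sc, somefunc, make_df, data):  # not invoked here; needs pyspark
    for chunk in make_chunks(data, 1000):
        my_rdd = sc.parallelize(chunk).flatMap(somefunc)
        my_df = make_df(my_rdd)
        # ... do some stuff with my_df ...
        my_df.unpersist()  # drop any cached data before the next iteration
```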
I'm running a Jupyter-Spark setup and I want to benchmark my cluster with
different input parameters. To make sure the environment stays the same I'm
trying to reset (restart) the SparkContext, here is some code:
temp_result_parquet = os.path.normpath('/home/spark_tmp_parquet')
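A sketch of one way to get a fresh context per benchmark run (the conf keys and values shown are assumptions for illustration):

```python
def conf_pairs(app_name, settings):
    """List the (key, value) pairs for a SparkConf, app name first."""
    return [("spark.app.name", app_name)] + sorted(settings.items())

def restart_context(old_sc, app_name, settings):  # needs pyspark; not run here
    from pyspark import SparkConf, SparkContext
    old_sc.stop()  # tear down the notebook's existing context
    conf = SparkConf()
    for key, value in conf_pairs(app_name, settings):
        conf = conf.set(key, value)
    return SparkContext(conf=conf)
```

In the notebook that would be used as e.g. `sc = restart_context(sc, "bench", {"spark.executor.memory": "4g"})` before each run.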