Apologies for the long answer.
Understanding partitioning at each stage of the RDD graph/lineage is
important for efficient parallelism and a balanced load. This applies
to working with any source, streaming or static.
You have a tricky situation here of one source (Kafka) with 9
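As an aside, the load-balancing point can be illustrated outside Spark: the default HashPartitioner assigns each record to partition hash(key) mod numPartitions, so a skewed key distribution produces skewed partitions. A plain-Python sketch of that rule, with made-up integer keys:

```python
from collections import Counter

def assign_partition(key, num_partitions):
    # Mimics Spark's HashPartitioner: partition = hash(key) mod numPartitions.
    return hash(key) % num_partitions

# A skewed distribution: one hot key (0) plus ten rare keys.
keys = [0] * 50 + list(range(1, 11))
sizes = Counter(assign_partition(k, 4) for k in keys)
# The hot key concentrates most records in a single partition,
# so one task ends up doing far more work than the others.
```

The same effect shows up at every stage of the lineage, which is why checking partition counts and key skew after each shuffle matters.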
Hello,
I was trying to optimize my Spark cluster. I did it to some extent by making
some changes in the yarn-site.xml and spark-defaults.conf files. Before the
changes, the MapReduce import job was running fine alongside a slow Thrift
server.
After the changes, I have to kill the Thrift server to execute my
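For context, changes of this kind are usually per-application resource caps in spark-defaults.conf, so one long-running app (like the Thrift server) cannot hold the whole cluster. The values below are purely illustrative, not the poster's actual settings:

```
# spark-defaults.conf -- illustrative caps, not the poster's actual values
spark.executor.memory              2g
spark.executor.cores               2
spark.dynamicAllocation.enabled    true
spark.shuffle.service.enabled      true
```

With dynamic allocation enabled, idle executors are released back to YARN, which is typically what lets two applications coexist.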
When using the --jars option, we have to include it every time we submit a job.
It seems that adding the jars to the classpath on every slave node is the only
way to "install" Spark packages.
-- Original --
From: Nicholas Hakobian
So the question becomes: does GraphFrames natively support bidirectional
relationships with only one edge?
-- Original --
From: Felix Cheung
Date: Tue, Feb 20, 2018 10:01 AM
To: xiaobo, user@spark.apache.org
Hi,
I get the errors below when trying to run the Spark unit tests:
zipPartitions(test.org.apache.spark.Java8RDDAPISuite) Time elapsed: 2.212
> sec <<< ERROR!
> java.lang.IllegalStateException: failed to create a child event loop
> at
Generally, that would be the approach.
But since you have effectively doubled the number of edges, this will likely
affect the scale at which your job will run.
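A minimal sketch of the doubling approach, in plain Python tuples standing in for a GraphFrames edge DataFrame (the src/dst naming follows GraphFrames' convention; the edge values are made up):

```python
# Represent each undirected relationship as two directed edges,
# so motif queries can follow the link in either direction.
edges = [("a", "b"), ("b", "c")]  # directed (src, dst) pairs
bidirectional = edges + [(dst, src) for (src, dst) in edges]
```

This is why the edge count, and with it shuffle volume, doubles: every row gains a mirrored counterpart.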
From: xiaobo
Sent: Monday, February 19, 2018 3:22:02 AM
To: user@spark.apache.org
Subject:
I can't speak for committers, but my guess is it's more likely for
DStreams in general to stop being supported before that particular
integration is removed.
On Sun, Feb 18, 2018 at 9:34 PM, naresh Goud wrote:
> Thanks Ted.
>
> I see createDirectStream is
If you copy the jar file and all of its dependencies to the machines, you
can manually add them to the classpath. If you are using YARN and HDFS, you
can alternatively use --jars and point it at the HDFS locations of the jar
files, and it will (in most cases) distribute them to the worker nodes at
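Concretely, the two options look something like this (paths and jar names are hypothetical; only the flags and the spark.executor.extraClassPath property are real Spark options):

```
# Jars already copied to every node: extend the executor classpath manually
spark-submit --conf spark.executor.extraClassPath=/opt/libs/graphframes.jar app.py

# Jars uploaded to HDFS: let YARN localize them to the workers
spark-submit --master yarn --jars hdfs:///libs/graphframes.jar app.py
```

The HDFS route avoids having to keep every node's local filesystem in sync by hand.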
Hi Vijay,
Thank you very much for your reply. Setting the number of partitions
explicitly in the join, and the influence of memory pressure on partitioning,
were definitely very good insights.
In the end, we avoided the issue of uneven load balancing completely by doing
the following two things:
a) Reducing the
Using Spark 1.6.2, I want to understand what "Duration" really means (and why
it is slow).
Running a simple SELECT COUNT against a Parquet file stored in HDFS:
NODE_LOCAL 1 / DATA02 2018/02/19 09:54:27 5 s 30 ms 8.8 MB (hadoop) / 3010830 8 ms 77.2 KB / 1666
This means "took 5 seconds to