Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-19 Thread vijay.bvp
Apologies for the long answer. Understanding partitioning at each stage of the RDD graph/lineage is important for efficient parallelism and balanced load. This applies to any source, streaming or static. You have a tricky situation here: one Kafka source with 9

sqoop import job not working when spark thrift server is running.

2018-02-19 Thread akshay naidu
Hello, I was trying to optimize my Spark cluster. I managed it to some extent by making some changes in the yarn-site.xml and spark-defaults.conf files. Before the changes, the MapReduce import job was running fine alongside a slow Thrift server. After the changes, I have to kill the Thrift server to execute my

Re: Does Pyspark Support Graphx?

2018-02-19 Thread xiaobo
When using the --jars option, we have to include it every time we submit a job. It seems that adding the jars to the classpath on every slave node is the only way to "install" Spark packages. -- Original -- From: Nicholas Hakobian

Re: [graphframes]how Graphframes Deal With Bidirectional Relationships

2018-02-19 Thread xiaobo
So the question becomes: does GraphFrames support bidirectional relationships natively with only one edge? -- Original -- From: Felix Cheung Date: Tue, Feb 20, 2018 10:01 AM To: xiaobo, user@spark.apache.org

Errors when running unit tests

2018-02-19 Thread karuppayya
Hi, I get errors like the ones below when trying to run the Spark unit tests: zipPartitions(test.org.apache.spark.Java8RDDAPISuite) Time elapsed: 2.212 sec <<< ERROR! java.lang.IllegalStateException: failed to create a child event loop at

Re: [graphframes]how Graphframes Deal With Bidirectional Relationships

2018-02-19 Thread Felix Cheung
Generally that would be the approach. But since you have effectively doubled the number of edges, this will likely affect the scale at which your job can run. From: xiaobo Sent: Monday, February 19, 2018 3:22:02 AM To: user@spark.apache.org Subject:

Re: KafkaUtils.createStream(..) is removed for API

2018-02-19 Thread Cody Koeninger
I can't speak for committers, but my guess is it's more likely for DStreams in general to stop being supported before that particular integration is removed. On Sun, Feb 18, 2018 at 9:34 PM, naresh Goud wrote: > Thanks Ted. > > I see createDirectStream is

Re: Does Pyspark Support Graphx?

2018-02-19 Thread Nicholas Hakobian
If you copy the Jar file and all of the dependencies to the machines, you can manually add them to the classpath. If you are using Yarn and HDFS you can alternatively use --jars and point it to the hdfs locations of the jar files and it will (in most cases) distribute them to the worker nodes at
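As a rough illustration of the --jars approach described above, the following Python sketch assembles a spark-submit command line that passes a comma-separated list of jar locations on every submission. The helper name build_spark_submit, the application file name and the HDFS paths are hypothetical placeholders, not taken from the thread; only the spark-submit options --master and --jars are real.

```python
import shlex


def build_spark_submit(app, jars, master="yarn"):
    """Return the argv list for a spark-submit call (hypothetical helper).

    --jars takes a single comma-separated list of jar locations.
    """
    cmd = ["spark-submit", "--master", master]
    if jars:
        cmd += ["--jars", ",".join(jars)]
    cmd.append(app)
    return cmd


# Hypothetical jar locations on HDFS; replace with your own paths.
jars = ["hdfs:///libs/graphframes.jar", "hdfs:///libs/deps.jar"]
cmd = build_spark_submit("my_job.py", jars)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

Wrapping the command this way is one option for not having to retype the jar list by hand on each submission; setting the spark.jars property in spark-defaults.conf is another.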

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-19 Thread Aleksandar Vitorovic
Hi Vijay, Thank you very much for your reply. Setting the number of partitions explicitly in the join, and the influence of memory pressure on partitioning, were definitely very good insights. In the end, we avoid the issue of uneven load balancing completely by doing the following two things: a) Reducing the
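To make the point about explicit partition counts more concrete, here is a small pure-Python sketch of hash partitioning. It only mimics the general idea (hash of the key modulo the number of partitions) and does not reproduce Spark's actual hash function; the function names and the sample keys are assumptions made for illustration. In PySpark the partition count of a join can be given through the numPartitions argument of RDD.join.

```python
from collections import Counter
import zlib


def partition_of(key, num_partitions):
    # Stable, non-negative hash modulo the partition count.
    # Illustrative only: Spark uses its own hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    # Count how many records land in each partition index.
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical record keys.
keys = [f"user-{i}" for i in range(1000)]

for n in (4, 9, 36):
    sizes = partition_sizes(keys, n)
    # The gap between the smallest and largest partition hints at skew.
    print(n, min(sizes), max(sizes))
```

Comparing the smallest and largest partition for several partition counts gives a quick feel for how evenly a given key distribution would spread before running the real job.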

Understand task timing

2018-02-19 Thread Thomas Decaux
Using Spark 1.6.2, I want to understand what « Duration » really means (and why it is slow). Running a simple SELECT COUNT against a Parquet file stored in HDFS: NODE_LOCAL 1 / DATA02 2018/02/19 09:54:27 5 s 30 ms 8.8 MB (hadoop) / 3010830 8 ms 77.2 KB / 1666 This means "took 5 seconds to

Unsubscribe

2018-02-19 Thread Ryan Myer
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org