Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
Hi Jasbir, Yes, you are right. Do you have any idea about my question? Thanks, Fei On Mon, Jan 16, 2017 at 12:37 AM, wrote: > Hi, > > > > Coalesce is used to decrease the number of partitions. If you give the > value of numPartitions greater than the current

RE: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread jasbir.sing
Hi, Coalesce is used to decrease the number of partitions. If you give a value of numPartitions greater than the current number of partitions, I don’t think the number of RDD partitions will be increased. Thanks, Jasbir From: Fei Hu [mailto:hufe...@gmail.com] Sent: Sunday, January 15, 2017 10:10 PM To:

log4j2 support in Spark

2017-01-15 Thread Appu K
Wondering whether it’ll be possible to do structured logging in Spark. Adding "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.6.2" makes it complain about multiple bindings for slf4j. Cheers, Appu
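A minimal sbt sketch of one common workaround (this is not from the thread, and the version numbers are placeholders): the "multiple bindings" warning usually means both the log4j 1.x binding (slf4j-log4j12, pulled in transitively by Spark) and the log4j 2.x binding (log4j-slf4j-impl) end up on the classpath, so one of them has to be excluded.

    // build.sbt sketch -- assumed versions, adjust to your build
    libraryDependencies ++= Seq(
      // exclude the log4j 1.x SLF4J binding that spark-core brings in transitively
      ("org.apache.spark" %% "spark-core" % "2.1.0" % "provided")
        .exclude("org.slf4j", "slf4j-log4j12"),
      "org.apache.logging.log4j" % "log4j-api"        % "2.6.2",
      "org.apache.logging.log4j" % "log4j-core"       % "2.6.2",
      "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.6.2"
    )

Note that a full Spark distribution still ships its own log4j 1.x binding, so this mainly cleans up local and test runs.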

Re: Spark SQL query plan contains all the partitions from Hive table even though filtering of partitions is provided

2017-01-15 Thread Raju Bairishetti
Waiting for suggestions/help on this... On Wed, Jan 11, 2017 at 12:14 PM, Raju Bairishetti wrote: > Hello, > > Spark SQL is generating the query plan with all partitions' information even > though we apply filters on partitions in the query. Due to this, spark > driver/hive
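A rough sketch of one knob that is sometimes related to this symptom (not from the thread; the database, table, and partition column names are made up): spark.sql.hive.metastorePartitionPruning asks the Hive metastore to return only the matching partitions instead of listing all of them.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("PartitionPruningCheck")
      .enableHiveSupport()
      // push partition predicates down to the Hive metastore
      .config("spark.sql.hive.metastorePartitionPruning", "true")
      .getOrCreate()

    // `dt` is a hypothetical partition column; with pruning in place the filter
    // should appear as a partition predicate in the plan instead of a full listing.
    spark.sql("SELECT * FROM some_db.some_table WHERE dt = '2017-01-15'").explain(true)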

Re: Spark Log information

2017-01-15 Thread KhajaAsmath Mohammed
Thanks Raju. On Sun, Jan 15, 2017 at 9:49 PM, Raju Bairishetti wrote: > Total number of tasks in the stage is: 21428 > Number of tasks completed so far: 44 > Number of tasks running now: 48 > > On Mon, Jan 16, 2017 at 11:41 AM, KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com>

Re: Spark Log information

2017-01-15 Thread Raju Bairishetti
Total number of tasks in the stage: 21428. Number of tasks completed so far: 44. Number of tasks running now: 48. On Mon, Jan 16, 2017 at 11:41 AM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > when running the spark jobs, I see the numbers in stages. can anyone tell > what

Spark Log information

2017-01-15 Thread KhajaAsmath Mohammed
Hi, when running the Spark jobs, I see the numbers in stages. Can anyone tell what these numbers indicate in the below case? [Stage 2:>(44 + 48) / 21428] i.e., 44 + 48 and 21428. Thanks, Asmath

Re: TDD in Spark

2017-01-15 Thread Miguel Morales
I've also written a small blog post that may help you out: https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941#.ia6stbl6n On Sun, Jan 15, 2017 at 12:13 PM, Silvio Fiorito wrote: > You should check out Holden’s excellent

Re: Old version of Spark [v1.2.0]

2017-01-15 Thread ayan guha
No worries. I also faced the issue a while back and good people in the community helped me :) On Mon, Jan 16, 2017 at 9:55 AM, Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > Hi Ayan, > > Thanks a million. > > Regards, > _ > *Md. Rezaul Karim*,

Re: Old version of Spark [v1.2.0]

2017-01-15 Thread Md. Rezaul Karim
Hi Ayan, Thanks a million. Regards, _ *Md. Rezaul Karim*, BSc, MSc PhD Researcher, INSIGHT Centre for Data Analytics National University of Ireland, Galway IDA Business Park, Dangan, Galway, Ireland Web: http://www.reza-analytics.eu/index.html

Re: Old version of Spark [v1.2.0]

2017-01-15 Thread ayan guha
archive.apache.org will always have all the releases: http://archive.apache.org/dist/spark/ @Spark users: it may be a good idea to have a "To download older versions, click here" link on the Spark download page? On Mon, Jan 16, 2017 at 8:16 AM, Md. Rezaul Karim < rezaul.ka...@insight-centre.org>

Old version of Spark [v1.2.0]

2017-01-15 Thread Md. Rezaul Karim
Hi, I am looking for the Spark 1.2.0 version. I tried to download it from the Spark website but it's no longer available. Any suggestions? Regards, _ *Md. Rezaul Karim*, BSc, MSc PhD Researcher, INSIGHT Centre for Data Analytics National University of Ireland, Galway

Re: TDD in Spark

2017-01-15 Thread Silvio Fiorito
You should check out Holden’s excellent spark-testing-base package: https://github.com/holdenk/spark-testing-base From: A Shaikh Date: Sunday, January 15, 2017 at 1:14 PM To: User Subject: TDD in Spark Whats the most popular Testing approach for
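A minimal sketch of what a test built on spark-testing-base can look like (not from the thread; the version string is a guess and should be matched to your Spark version):

    // build.sbt: libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "2.0.2_0.4.7" % "test"
    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    // SharedSparkContext provides a local SparkContext `sc` reused across tests
    class WordCountSpec extends FunSuite with SharedSparkContext {
      test("counts words") {
        val counts = sc.parallelize(Seq("a", "b", "a"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .collectAsMap()
        assert(counts("a") === 2)
        assert(counts("b") === 1)
      }
    }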

Re: Running Spark on EMR

2017-01-15 Thread Andrew Holway
use yarn :) "spark-submit --master yarn" On Sun, Jan 15, 2017 at 7:55 PM, Darren Govoni wrote: > So what was the answer? > > > > Sent from my Verizon, Samsung Galaxy smartphone > > Original message > From: Andrew Holway >

Re: Running Spark on EMR

2017-01-15 Thread Darren Govoni
So what was the answer? Sent from my Verizon, Samsung Galaxy smartphone Original message From: Andrew Holway Date: 1/15/17 11:37 AM (GMT-05:00) To: Marco Mistroni Cc: Neil Jonkers , User

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
Hi Anastasios, Thanks for your information. I will look into the CoalescedRDD code. Thanks, Fei On Sun, Jan 15, 2017 at 12:21 PM, Anastasios Zouzias wrote: > Hi Fei, > > I looked at the code of CoalescedRDD and probably what I suggested will > not work. > > Speaking of

TDD in Spark

2017-01-15 Thread A Shaikh
What's the most popular testing approach for Spark apps? I am looking for something along the lines of TDD.

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Anastasios Zouzias
Hi Fei, I looked at the code of CoalescedRDD and probably what I suggested will not work. Speaking of which, CoalescedRDD is private[spark]. If this was not the case, you could set balanceSlack to 1, and get what you requested, see

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
Hi Anastasios, Thanks for your reply. If I just double numPartitions, how does coalesce(numPartitions: Int, shuffle: Boolean = false) keep the data locality? Do I need to define my own Partitioner? Thanks, Fei On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias
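For the "own Partitioner" part of the question, a rough sketch (not from the thread): tag every record with its current partition id and route each parent partition to two child partitions. Note that partitionBy implies a shuffle, so this alone does not settle the data-locality concern raised above.

    import org.apache.spark.Partitioner

    // Hypothetical partitioner: records keyed by (parentPartitionId, index)
    // go to child partitions 2*i and 2*i + 1.
    class SplitInTwoPartitioner(parentPartitions: Int) extends Partitioner {
      override def numPartitions: Int = parentPartitions * 2
      override def getPartition(key: Any): Int = key match {
        case (parentId: Int, index: Int) => parentId * 2 + (index % 2)
        case _                           => 0
      }
    }

    // Usage sketch: double the partition count of an existing RDD `rdd`
    val split = rdd
      .mapPartitionsWithIndex { (idx, it) =>
        it.zipWithIndex.map { case (value, i) => ((idx, i), value) }
      }
      .partitionBy(new SplitInTwoPartitioner(rdd.getNumPartitions))
      .values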

Re: Running Spark on EMR

2017-01-15 Thread Andrew Holway
Darn. I didn't respond to the list. Sorry. On Sun, Jan 15, 2017 at 5:29 PM, Marco Mistroni wrote: > thanks Neil. I followed original suggestion from Andrw and everything is > working fine now > kr > > On Sun, Jan 15, 2017 at 4:27 PM, Neil Jonkers

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Fei Hu
Hi Rishi, Thanks for your reply! The RDD has 24 partitions, and the cluster has a master node + 24 computing nodes (12 cores per node). Each node will have a partition, and I want to split each partition into two sub-partitions on the same node to improve the parallelism and achieve high data

Re: Running Spark on EMR

2017-01-15 Thread Marco Mistroni
Thanks Neil. I followed the original suggestion from Andrew and everything is working fine now. kr On Sun, Jan 15, 2017 at 4:27 PM, Neil Jonkers wrote: > Hello, > > Can you drop the url: > > spark://master:7077 > > The url is used when running Spark in standalone mode. > >

Re: Running Spark on EMR

2017-01-15 Thread Neil Jonkers
Hello, Can you drop the url: spark://master:7077 The url is used when running Spark in standalone mode. Regards Original message From: Marco Mistroni Date: 15/01/2017 16:34 (GMT+02:00) To: User Subject: Running Spark on EMR

Running Spark on EMR

2017-01-15 Thread Marco Mistroni
Hi all, could anyone assist here? I am trying to run Spark 2.0.0 on an EMR cluster, but I am having issues connecting to the master node. So, below is a snippet of what I am doing: sc = SparkSession.builder.master(sparkHost).appName("DataProcess").getOrCreate() sparkHost is passed as input
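A minimal sketch of the fix the thread converges on (the original snippet is PySpark; the same idea is shown here in Scala, and the jar name is a placeholder): do not hardcode a spark://... master URL in code on EMR, and let spark-submit and YARN supply it instead.

    import org.apache.spark.sql.SparkSession

    // no .master(...) here -- the master is provided by spark-submit / the cluster
    val spark = SparkSession.builder()
      .appName("DataProcess")
      .getOrCreate()

    // submitted from the EMR master node with, e.g.:
    //   spark-submit --master yarn --deploy-mode cluster data-process.jar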

Re: What can Mesos or YARN do that Spark standalone cannot do?

2017-01-15 Thread Sean Owen
The biggest thing that any resource manager besides Spark's standalone resource manager can do is manage other application resources. In a cluster where you are running other workloads, you can't use Spark standalone to arbitrate resource requirements across apps. On Sun, Jan 15, 2017 at 1:55 PM

What can Mesos or YARN do that Spark standalone cannot do?

2017-01-15 Thread kant kodali
Hi, What can Mesos or YARN do that Spark standalone cannot do? Thanks!

Re: Equally split an RDD partition into two partitions at the same node

2017-01-15 Thread Anastasios Zouzias
Hi Fei, Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)? https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395 coalesce is mostly used for reducing the number of partitions before writing to HDFS, but it might still be a
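A small sketch of how the shuffle flag changes things (not from the thread): without shuffle, coalesce can only merge partitions, so asking for more partitions than the RDD already has is a no-op; with shuffle = true (or equivalently repartition) the partition count does grow, at the cost of a shuffle.

    val rdd = sc.parallelize(1 to 1000, 24)

    rdd.coalesce(48).getNumPartitions                  // still 24: without a shuffle the count cannot grow
    rdd.coalesce(48, shuffle = true).getNumPartitions  // 48, but the data is reshuffled
    rdd.repartition(48).getNumPartitions               // 48, shorthand for the line above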