Re: Spark 1.6.0 running jobs in yarn shows negative no of tasks in executor

2016-02-25 Thread Umesh Kacha
Hi, I am using Hadoop 2.4.0. It is not frequent; it happens only sometimes. I don't think my Spark logic has any problem; if the logic were wrong it would be failing every day. I mostly see YARN killing executors, so I see executor lost in my driver logs. On Thu, Feb 25, 2016 at 10:30 PM, Yin Yang

Re: Spark Streaming with Druid?

2016-02-08 Thread Umesh Kacha
Hi Hemant, thanks much. Can we use SnappyData on YARN? My Spark jobs run in yarn-client mode. Please guide. On Mon, Feb 8, 2016 at 9:46 AM, Hemant Bhanawat wrote: > You may want to have a look at spark druid project already in progress: >

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-12 Thread Umesh Kacha
running in any > NodeManager machine as a container. > > YARN RM UI running jobs will have the host details where executor is > running. Login to that NodeManager machine and jps -l will list all java > processes, jstack -l will give the stack trace. > > > Thanks, > Prabhu

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-11 Thread Umesh Kacha
gt; > Use jstack -l or kill -3 , where pid is the process id of the > executor process. Take jstack stack trace for every 2 seconds and total 1 > minute. This will help to identify the code where threads are spending lot > of time and then try to tune. > > Thanks, > Prabhu Jos

Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-08 Thread Umesh Kacha
Hi, for a 30 GB executor how much off-heap memory should I give along with the YARN memory overhead? Is that OK? On Thu, Jan 7, 2016 at 4:24 AM, Ted Yu wrote: > Turns out that I should have specified -i to my former grep command :-) > > Thanks Marcelo > > But does this mean that
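
For reference, the settings discussed here go through SparkConf; a minimal sketch, assuming Spark 1.6 property names, with purely illustrative sizes rather than a recommendation for a 30 GB executor:

```scala
import org.apache.spark.SparkConf

// Off-heap memory must be enabled explicitly, and spark.memory.offHeap.size is given
// in bytes; whatever is set here also has to fit under the YARN container limit
// (executor memory plus spark.yarn.executor.memoryOverhead).
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", (4L * 1024 * 1024 * 1024).toString) // 4 GB, in bytes
```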

Re: Do we need to enabled Tungsten sort in Spark 1.6?

2016-01-08 Thread Umesh Kacha
OK, thanks. So will it always be enabled by default? If yes, then why does the documentation mention sort as the default shuffle manager? On Sat, Jan 9, 2016 at 1:55 AM, Ted Yu wrote: > From sql/core/src/main/scala/org/apache/spark/sql/execution/commands.scala > : > > case

Re: Why is this job running since one hour?

2016-01-07 Thread Umesh Kacha
Hi, thanks for the response. Each job processes around 5 GB of skewed data, does a group by on multiple fields, aggregates, calls coalesce(1) and saves a CSV file in gzip format. I think coalesce is causing the problem, but the data is not that huge; I don't understand why it keeps on running for an
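
A sketch of the pipeline described above with repartition(1) in place of coalesce(1), so the group by and aggregation keep their parallelism and only the final write runs as a single task; df, the column names and the output path are assumptions:

```scala
import org.apache.spark.sql.functions.sum
import org.apache.hadoop.io.compress.GzipCodec

// coalesce(1) can collapse the whole upstream stage into one task; repartition(1)
// shuffles only the (small) aggregated result into a single partition for the write.
val aggregated = df.groupBy("field1", "field2").agg(sum("amount").as("total"))
aggregated.repartition(1)
  .rdd.map(_.mkString(","))
  .saveAsTextFile("/tmp/output_csv", classOf[GzipCodec])
```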

Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Umesh Kacha
Hi Yin, thanks much; your answer solved my problem. Really appreciate it! Regards On Fri, Jan 8, 2016 at 1:26 AM, Yin Huai wrote: > Hi, we made the change because the partitioning discovery logic was too > flexible and it introduced problems that were very confusing to
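
For readers hitting the same Spark 1.6 behavior change: one workaround discussed at the time is the basePath option when loading a single partition directory, so the partition column is still inferred. A sketch with hypothetical paths, assuming ORC data in a Hive-style partition layout:

```scala
// Read only the dt=2016-01-07 partition while keeping the dt column available;
// both paths are illustrative.
val partitionDF = sqlContext.read
  .option("basePath", "/user/hive/warehouse/mytable")
  .format("orc")
  .load("/user/hive/warehouse/mytable/dt=2016-01-07")
```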

RE: Spark on Apache Ignite?

2016-01-05 Thread Umesh Kacha
Hi Nate, thanks much. I have the exact same use cases you mentioned. My Spark job does heavy writing involving group by and huge data shuffling. Can you please provide any pointers on how I can take my existing Spark job, which runs on YARN, and make it run on Ignite? Please guide. Thanks again. On

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Umesh Kacha
Hi, DataFrame has no boolean option for coalesce; that exists only for RDD, I believe. sourceFrame.coalesce(1,true) // gives a compilation error On Wed, Jan 6, 2016 at 1:38 AM, Alexander Pivovarov wrote: > try coalesce(1, true). > > On Tue, Jan 5, 2016 at 11:58 AM, unk1102
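
For reference, in the 1.x API the shuffle flag exists only on RDD.coalesce; a small sketch of the equivalents, assuming sourceFrame is the DataFrame from the snippet above:

```scala
// DataFrame.coalesce takes only the target partition count (no shuffle flag).
val narrowed = sourceFrame.coalesce(1)
// To force a shuffle the way RDD coalesce(1, true) does, use repartition on the
// DataFrame, or drop down to the underlying RDD.
val shuffled = sourceFrame.repartition(1)
val viaRdd   = sourceFrame.rdd.coalesce(1, shuffle = true)
```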

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread Umesh Kacha
Hi, thanks, I did that and I have attached thread dump images. That was the intention of my question: asking for help to identify which waiting thread is the culprit. Regards, Umesh On Sat, Jan 2, 2016 at 8:38 AM, Prabhu Joseph wrote: > Take thread dump of Executor process

Re: Spark DataFrame callUdf does not compile?

2015-12-28 Thread Umesh Kacha
Hi, thanks, but you understood the question incorrectly. First of all, I am passing the UDF name as a String, and if you look at the callUDF arguments it does not take a string as the first argument; if I use callUDF it throws an exception saying the percentile_approx function was not found. And another thing I mentioned is
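
A minimal sketch of the call under discussion, assuming a HiveContext (so Hive's percentile_approx is resolvable by name) and Spark 1.5.x; df and "mycol" are placeholders:

```scala
import org.apache.spark.sql.functions.{callUDF, col, lit}

// Invoke the Hive UDAF percentile_approx by name; the literal 0.25 is the quantile.
val quartiles = df.select(callUDF("percentile_approx", col("mycol"), lit(0.25)))
quartiles.show()
```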

Re: Spark DataFrame callUdf does not compile?

2015-12-28 Thread Umesh Kacha
park.sql._ > > val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value") > val sqlContext = df.sqlContext > sqlContext.udf.register("simpleUDF", (v: Int) => v * v) > df.select($"id", callUD

Re: How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread Umesh Kacha
Hi Benyi, thanks for the reply. Yes, I process each Hive partition / HDFS directory in its own thread so that I can make it faster; if I don't use threads the job is even slower. Like I mentioned, I have to process 2000 Hive partitions, so 2000 HDFS directories containing ORC files, right? If I don't use
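
A rough sketch of the threading pattern described here, submitting one Spark job per Hive partition directory from a bounded pool; partitionPaths, the grouping key and the output path are assumptions, not the poster's actual code:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// One Future (and hence one concurrent Spark job) per partition directory, at most 8 at a time.
implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

val jobs = partitionPaths.map { path =>
  Future {
    hiveContext.read.format("orc").load(path)
      .groupBy("someKey").count()
      .write.mode("append").format("orc").save("/tmp/aggregated")
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)
```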

Re: callUdf("percentile_approx",col("mycol"),lit(0.25)) does not compile spark 1.5.1 source but it does work in spark 1.5.1 bin

2015-11-02 Thread Umesh Kacha
Please guide. On Mon, Oct 19, 2015 at 4:30 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > Hi Ted thanks much for your help really appreciate it. I tried to use > maven dependencies you mentioned but still callUdf is not compiling please > find snap shot of my intellij editor. I a

Re: callUdf("percentile_approx",col("mycol"),lit(0.25)) does not compile spark 1.5.1 source but it does work in spark 1.5.1 bin

2015-10-18 Thread Umesh Kacha
ut I dont understand why this function call works in Spark >> 1.5.1 spark-shell/bin. Please guide. >> >> -- Forwarded message -- >> From: "Ted Yu" <yuzhih...@gmail.com> >> Date: Oct 14, 2015 3:26 AM >> Subject: Re: How to calculate pe

Re: How to calculate percentile of a column of DataFrame?

2015-10-14 Thread Umesh Kacha
reeNode$$anonfun$3.apply(TreeNode.scala:227) > > SPARK-10671 is included. > For 1.5.1, I guess the absence of SPARK-10671 means that SparkSQL > treats percentile_approx as normal UDF. > > Experts can correct me, if there is any misunderstanding. > > Cheers > > On Tue, Oct

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
> Cheers > > On Oct 13, 2015, at 12:21 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > > Hi Ted, thanks much I tried using percentile_approx in Spark-shell like > you mentioned it works using 1.5.1 but it doesn't compile in Java using > 1.5.1 maven libraries it still complai

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
>
> scala> df.select(callUDF("percentile_approx",col("value"),
> lit(0.25))).show()
> +------------------------------+
> |'percentile_approx(value,0.25)|
> +------------------------------+
> |                           1.0|
> +------------------------------+
>

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
wrote: >> >>> Looks like the fix went in after 1.5.1 was released. >>> >>> You may verify using master branch build. >>> >>> Cheers >>> >>> On Oct 13, 2015, at 12:21 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote: >>>

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
core spark sql spark hive version 1.5.1 On Oct 13, 2015 18:21, "Ted Yu" <yuzhih...@gmail.com> wrote: > Can you pastebin your Java code and the command you used to compile ? > > Thanks > > On Oct 13, 2015, at 1:42 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
Hi Ted, sorry for asking again. Did you get a chance to look at the compilation issue? Thanks much. Regards. On Oct 13, 2015 18:39, "Umesh Kacha" <umesh.ka...@gmail.com> wrote: > Hi Ted I am using the following line of code I can't paste entire code > sorry but the following onl

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
Hi, if you can help it would be great as I am stuck; I don't know how to remove the compilation error in callUdf when we pass three parameters: the function name as a string, the column name as col, and the lit function. Please guide. On Oct 11, 2015 1:05 AM, "Umesh Kacha" <umesh.ka...@gmail.com> wrote: >

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
hout the lit() parameter ? > > Cheers > > On Mon, Oct 12, 2015 at 6:27 AM, Umesh Kacha <umesh.ka...@gmail.com> > wrote: > >> Hi if you can help it would be great as I am stuck don't know how to >> remove compilation error in callUdf when we pass three param

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
Sorry, I forgot to mention that I am using Spark 1.4.1, as callUdf is available since Spark 1.4.0 as per the Javadocs. On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > Hi Ted thanks much for the detailed answer and appreciate your efforts. Do > we need to regist

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
id", callUDF("simpleUDF", $"value", lit(25))).show()
> +---+--------------------+
> | id|'simpleUDF(value,25)|
> +---+--------------------+
> |id1|                  26|
> |id2|                  41|
> |id3|                  50|
> +---+--------------------+
>
> Wh

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
---+
> |'percentile_approx(value,0.25)|
> +------------------------------+
> |                           1.0|
> +------------------------------+
>
> Can you upgrade to 1.5.1 ?
>
> Cheers
>
> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <umesh.ka...@gmail.com>
> wr

Re: How to calculate percentile of a column of DataFrame?

2015-10-10 Thread Umesh Kacha
Hi, any idea how I call percentile_approx using callUdf()? Please guide. On Sat, Oct 10, 2015 at 1:39 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > I have a doubt Michael I tried to use callUDF in the following code it > does not work. > > sourceFrame.agg(callUdf("

Re: Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-10 Thread Umesh Kacha
Hi Alex, thanks for the response. I am using 40 executors with 30 GB each, including 5 GB memoryOverhead, and 4 cores. My cluster has around 100 nodes with 30 GB and 8 cores each. On Oct 11, 2015 06:54, "Alex Rovner" wrote: > How many executors are you running with? How many nodes

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Umesh Kacha
I found it in the 1.3 documentation; lit says something else, not percent: public static Column lit(Object literal) Creates a Column of literal

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Umesh Kacha
her curious. How is def lit(literal:Any) --> >> becomes a percentile function lit(25) >> >> >> >> Thanks for clarification >> >> Saif >> >> >> >> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com] >> *Sent:* Friday, October 09,

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Umesh Kacha
lease guide. On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > thanks much Michael let me try. > > On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <mich...@databricks.com> > wrote: > >> This is confusing because I made a typo... >

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Umesh Kacha
Hi Lan, thanks for the response. Yes, I know, and I have confirmed in the UI that it has only 12 partitions because of 12 HDFS blocks; the Hive ORC file stripe size is 33554432. On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang wrote: > The partition number should be the same as the HDFS
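
For readers with the same issue: once the DataFrame is loaded, parallelism can be raised past the block count with an explicit repartition; a minimal sketch, where 96 is an arbitrary illustrative target:

```scala
// The ORC load gives 12 partitions (one per HDFS block); repartition shuffles the
// data into more partitions so downstream stages can run more tasks in parallel.
val widened = df.repartition(96)
```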

Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Umesh Kacha
will be shuffle. How large is your ORC file? Have you used > NameNode UI to check how many HDFS blocks each ORC file has? > > Lan > > > On Oct 8, 2015, at 2:08 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > > Hi Lan, thanks for the response yes I know and I have confir

Re: ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread Umesh Kacha
artitioned tables it that hive does not support dynamic discovery unless > you manually run the repair command. > > On Tue, Oct 6, 2015 at 9:33 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > >> Hi Ted thanks I know I solved that by using dataframe for both reading >> and
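
The "repair command" referred to above is Hive's MSCK REPAIR TABLE; a sketch of issuing it through the HiveContext after writing partition directories outside of Hive's own INSERT path (the table name is hypothetical):

```scala
// Ask the Hive metastore to discover partition directories that were written
// directly to HDFS rather than through Hive.
hiveContext.sql("MSCK REPAIR TABLE my_orc_table")
```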

Re: ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread Umesh Kacha
Hi Ted, thanks, I know; I solved that by using a DataFrame for both reading and writing. I am running into a different problem now: if Spark can read Hive ORC files, why can't Hive read ORC files created by Spark? On Oct 6, 2015 9:28 PM, "Ted Yu" wrote: > See this thread: >

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-05 Thread Umesh Kacha
b > fails? Without the error message(s), it's hard to even suggest anything. > > *Alex Rovner* > *Director, Data Engineering * > *o:* 646.759.0052 > > * <http://www.magnetic.com/>* > > On Sat, Oct 3, 2015 at 9:50 AM, Umesh Kacha <umesh.ka...@gmail.com> wrot

Re: Store DStreams into Hive using Hive Streaming

2015-10-05 Thread Umesh Kacha
Hi, no, I didn't find any solution; I still need that feature of Hive streaming using Spark, so please let me know if you find something. An alternative solution is to use Storm for the Hive processing. I would like to stick to Spark, so I am still searching. On Oct 5, 2015 2:51 PM, "Krzysztof Zarzycki"

Re: Hive ORC Malformed while loading into spark data frame

2015-10-04 Thread Umesh Kacha
version, and put it to SqlContext. I think you can open a JIRA to tracking > this upgrade. > > BTW, my name is Zhan Zhang instead of Zang. > > Thanks. > > Zhan Zhang > > On Oct 3, 2015, at 2:18 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > > Hi Zang any idea

Re: How to use registered Hive UDF in Spark DataFrame?

2015-10-04 Thread Umesh Kacha
:36 AM, Michael Armbrust <mich...@databricks.com> wrote: > callUDF("MyUDF", col("col1").as("name") > > or > > callUDF("MyUDF", col("col1").alias("name") > > On Fri, Oct 2, 2015 at 3:29 PM, Umesh Kacha <umesh.ka...@gmail

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread Umesh Kacha
Hi Alex, thanks much for the reply. Please read the following for more details about my problem: http://stackoverflow.com/questions/32317285/spark-executor-oom-issue-on-yarn Each of my containers has 8 cores and 30 GB max memory. So I am using yarn-client mode with 40 executors with 27GB/2 cores. If

Re: Hive ORC Malformed while loading into spark data frame

2015-10-03 Thread Umesh Kacha
Hi Zang, any idea why this is happening? I can load ORC files created by a Hive table but I can't load ORC files created by Spark itself. It looks like a bug. On Wed, Sep 30, 2015 at 12:03 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > Hi Zang thanks much please find the code below >

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread Umesh Kacha
o > submit your job? > > *Alex Rovner* > *Director, Data Engineering * > *o:* 646.759.0052 > > * <http://www.magnetic.com/>* > > On Sat, Oct 3, 2015 at 9:07 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > >> Hi Alex thanks much for the re

Re: How to use registered Hive UDF in Spark DataFrame?

2015-10-02 Thread Umesh Kacha
Hi Michael, thanks much. How do we give an alias name to the resultant columns? For example, when using hiveContext.sql("select MyUDF("test") as mytest from myTable"), how do we do that with the DataFrame callUDF: callUDF("MyUDF", col("col1"))? On Fri, Oct 2, 2015 at 8:23 PM, Michael Armbrust
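
The alias asked about here goes on the result of callUDF rather than on the input column; a small sketch, assuming "MyUDF" is already registered with the HiveContext and df has a col1 column:

```scala
import org.apache.spark.sql.functions.{callUDF, col}

// Equivalent of: select MyUDF(col1) as mytest from myTable
val result = df.select(callUDF("MyUDF", col("col1")).as("mytest"))
```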

Re: Hive ORC Malformed while loading into spark data frame

2015-09-30 Thread Umesh Kacha
ty (which is not available in > hive-0.12). > Do you mind post the code that works and not works for you? > > Thanks. > > Zhan Zhang > > On Sep 29, 2015, at 10:05 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > > Hi I can read/load orc data create

Re: Hive ORC Malformed while loading into spark data frame

2015-09-29 Thread Umesh Kacha
Hi Zang, thanks for the response. The table is created using Spark hiveContext.sql and the data is inserted into the table also using hiveContext.sql, as an insert into a partitioned table. When I try to load the ORC data into a DataFrame I am loading particular partition data stored in a path, say

Re: Hive ORC Malformed while loading into spark data frame

2015-09-29 Thread Umesh Kacha
frame for both read and write > > Thanks > > Zhan Zhang > > > Sent from my iPhone > > On Sep 29, 2015, at 1:56 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > > Hi Zang, thanks for the response. Table is created using Spark > hiveContext.sql and data inserted i
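
A sketch of the "DataFrame for both read and write" approach this thread converges on, assuming sqlContext is a HiveContext (the ORC source needs one in 1.x) and an illustrative path:

```scala
// Write ORC with the DataFrame writer and read it back the same way, so the files
// Spark produces are exactly the ones it later loads.
df.write.format("orc").mode("overwrite").save("/data/out/orc")
val reloaded = sqlContext.read.format("orc").load("/data/out/orc")
```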

Re: Why my Spark job is slow and it throws OOM which leads YARN killing executors?

2015-09-12 Thread Umesh Kacha
Hi Richard, thanks much for the reply. If I don't create threads the job runs too slowly, since I have a thousand jobs, i.e. a thousand Hive partition directories, to process. hiveContext.sql(...) runs fine and creates the output I expected; do I really need to call any action method? The job works fine as expected. I am

Re: How to enable Tungsten in Spark 1.5 for Spark SQL?

2015-09-10 Thread Umesh Kacha
Nice, Ted, thanks much; the highest performance without any configuration changes, amazing! Looking forward to running Spark 1.5 on my 2 TB of skewed data, which involves group by, union, etc. Any other tips you know of for Spark 1.5? On Sep 10, 2015 8:12 PM, "Ted Yu" wrote: > Please see

Re: Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-09 Thread Umesh Kacha
Hi Richard, thanks for the response. My use case is weird: I need to process data row by row for one partition and update the required rows. The percentage of updated rows would be 30%. As per the suggestions in the stackoverflow.com answer above, I refactored the code to use mapPartitionsWithIndex: JavaRDD indexedRdd =
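
A Scala sketch of the mapPartitionsWithIndex refactoring described above (the poster's code uses the Java API); sourceRdd, needsUpdate and updateRow are hypothetical stand-ins for the row-by-row logic:

```scala
// Scan each partition once, rewriting the ~30% of rows that need it and passing the
// rest through unchanged.
val updatedRdd = sourceRdd.mapPartitionsWithIndex { (partitionIndex, rows) =>
  rows.map(row => if (needsUpdate(row)) updateRow(row) else row)
}
```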

Re: NPE while reading ORC file using Spark 1.4 API

2015-09-08 Thread Umesh Kacha
Hi Zhan, thanks for the reply. Yes, the schema should be the same; I am actually reading Hive table partitions in ORC format into Spark, so I believe it should be the same. I am new to Hive, so I don't know if the schema can be different in a Hive partitioned table. On Wed, Sep 9, 2015 at 12:16 AM, Zhan Zhang

Re: Spark executor OOM issue on YARN

2015-08-31 Thread Umesh Kacha
Hi Ted, thanks, I know that by default spark.sql.shuffle.partitions is 200. It would be great if you could help me solve the OOM issue. On Mon, Aug 31, 2015 at 11:43 PM, Ted Yu wrote: > Please see this thread w.r.t. spark.sql.shuffle.partitions : >

Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-20 Thread Umesh Kacha
Hi Hemant, sorry for the confusion; I meant the final output part files in the final HDFS directory, I never meant intermediate files. Thanks. My goal is to reduce that many files, because of my use case explained in the first email with the calculations. On Aug 20, 2015 5:59 PM, Hemant Bhanawat

Re: How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-20 Thread Umesh Kacha
spark.yarn.executor.memoryOverhead? YARN may be killing your executors for using too much off-heap space. You can see whether this is happening by looking in the Spark AM or YARN NodeManager logs. -Sandy On Thu, Aug 20, 2015 at 7:39 AM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi thanks much
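
A sketch of raising the overhead Sandy refers to; the value is in megabytes, the 1.x default is roughly max(384 MB, 10% of the executor memory), and 4096 is purely illustrative:

```scala
import org.apache.spark.SparkConf

// Extra non-heap headroom inside the YARN container (netty buffers, thread stacks,
// etc.) so YARN stops killing the executor for exceeding its memory limit.
val conf = new SparkConf()
  .set("spark.executor.memory", "30g")
  .set("spark.yarn.executor.memoryOverhead", "4096") // MB
```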

Re: How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread Umesh Kacha
Hi Michael, thanks for the reply. I know that I can create a DataFrame using a JavaBean or StructType; I want to know how I can create a DataFrame from the above code, which is a custom Hadoop format. On Tue, Aug 11, 2015 at 12:04 AM, Michael Armbrust mich...@databricks.com wrote: You can't create a
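
The general route, not spelled out in the thread itself, is to load the custom format as an RDD and then attach a schema. A sketch in Scala, where MyInputFormat, MyRecordWritable, the NullWritable key and the single "record" column are all assumptions:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Read with the old (mapred) InputFormat, convert each Writable to plain data right
// away (Hadoop reuses Writable instances), and attach an explicit schema.
val raw = sc.hadoopFile[NullWritable, MyRecordWritable, MyInputFormat]("/data/in/custom")
val rows = raw.map { case (_, value) => Row(value.toString) }
val schema = StructType(Seq(StructField("record", StringType)))
val customDF = sqlContext.createDataFrame(rows, schema)
```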

Re: How to create DataFrame from a binary file?

2015-08-10 Thread Umesh Kacha
data schema. The key idea is to show the flexibility to deal with any format of data by using your own schema. Sorry if I did not make you fully understand. Anyway, let us know once you figure out your problem. On Sun, Aug 9, 2015 at 11:10 AM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi Bo

Re: How to create DataFrame from a binary file?

2015-08-09 Thread Umesh Kacha
Hi Bo, I know how to create a DataFrame; my question is how to create a DataFrame from binary files, whereas in your blog it is raw text JSON files. Please read my question properly, thanks. On Sun, Aug 9, 2015 at 11:21 PM, bo yang bobyan...@gmail.com wrote: You can create your own data schema

Re: How to control Spark Executors from getting Lost when using YARN client mode?

2015-08-03 Thread Umesh Kacha
Hi all, any help will be much appreciated. My Spark job runs fine, but in the middle it starts losing executors because of a MetadataFetchFailedException saying the shuffle was not found at the location, since the executor is lost. On Jul 31, 2015 11:41 PM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi thanks

Re: How to create Spark DataFrame using custom Hadoop InputFormat?

2015-07-31 Thread Umesh Kacha
a Writable. BTW in the future, capture text output instead of image. Thanks On Fri, Jul 31, 2015 at 12:35 PM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi Ted thanks My key is always Void because my custom format file is non splittable so key is Void and values is MyRecordWritable which

Re: How to control Spark Executors from getting Lost when using YARN client mode?

2015-07-31 Thread Umesh Kacha
Hi, thanks for the response. It looks like the YARN container is getting killed, but I don't know why; I see a shuffle MetadataFetchFailedException as mentioned in the following SO link. I have enough memory: 8 nodes with 8 cores and 30 GB memory each. And because of this exception YARN is killing the container running

Re: How to create Spark DataFrame using custom Hadoop InputFormat?

2015-07-31 Thread Umesh Kacha
Hi Ted, thanks much for the reply. I can't share the code on a public forum. I have created a custom format by extending the Hadoop mapred InputFormat class, and likewise the RecordReader class. If you can help me with how to use these in a DataFrame it would be very helpful. On Sat, Aug 1, 2015 at 12:12 AM, Ted Yu

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread Umesh Kacha
, ideally the relevant output of kafka-topics.sh --describe as well On Wed, Jul 29, 2015 at 11:39 PM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi thanks for the response. Like I already mentioned in the question kafka topic is valid and it has data I can see data in it using another kafka consumer

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Umesh Kacha
Hi, thanks for the response. Like I already mentioned in the question, the Kafka topic is valid and it has data; I can see data in it using another Kafka consumer. On Jul 30, 2015 7:31 AM, Cody Koeninger c...@koeninger.org wrote: The last time someone brought this up on the mailing list, the issue

Re: How do we control output part files created by Spark job?

2015-07-11 Thread Umesh Kacha
in your sql? Have a look at spark.sql.shuffle.partitions? Srikanth On Wed, Jul 8, 2015 at 1:29 AM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi Srikant thanks for the response. I have the following code: hiveContext.sql(insert into... ).coalesce(6) Above code does not create 6 part files
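
A sketch of the alternative the thread points toward: run the query as a DataFrame, repartition it, and write it out yourself instead of appending coalesce(6) to an INSERT statement (query, table and path are illustrative):

```scala
// coalesce(6) on the DataFrame returned by an INSERT statement only repartitions the
// command's (empty) result, not the data being written; repartition the query result
// and control the write directly instead.
val result = hiveContext.sql("SELECT key, count(*) AS cnt FROM source_table GROUP BY key")
result.repartition(6).write.mode("append").format("orc").save("/data/out/six_files")
```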

Re: How do we control output part files created by Spark job?

2015-07-07 Thread Umesh Kacha
Hi, I tried both approaches, using df.repartition(6) and df.coalesce(6); they don't reduce the part-x files. Even after calling the above methods I still see around 200 small part files of 20 MB each, which are again ORC files. On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu

Re: How do we control output part files created by Spark job?

2015-07-07 Thread Umesh Kacha
yourRdd.coalesce(6).saveAsTextFile() or yourRdd.coalesce(6) yourRdd.saveAsTextFile() ? Srikanth On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi I tried both approach using df. repartition(6) and df.coalesce(6) it doesn't