Re: Logical Plan

2016-06-30 Thread Darshan Singh
> Perhaps the real "fix" is to figure out why logical plan creation is so
> slow for 700 columns.
>
> On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh <darshan.m...@gmail.com> wrote:
>> Is there a way I can use the same logical plan for a query. Everything will

Logical Plan

2016-06-30 Thread Darshan Singh
Is there a way I can use the same logical plan for a query? Everything will be the same except the underlying file will be different. The issue is that my query has around 700 columns, and generating the logical plan takes 20 seconds. This happens every 2 minutes, but each time the underlying file is different. I do
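One common way to cut per-read overhead on a very wide table is to build the schema once and hand it to the reader, so it is not re-derived for every new file. A minimal sketch, assuming the Spark 1.x Scala API (the column names, types, and loadLatest helper are illustrative):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types.{StructType, StructField, DoubleType}

    // Build the 700-column schema once and reuse it for every new file.
    val wideSchema = StructType((1 to 700).map(i => StructField(s"col$i", DoubleType)))

    def loadLatest(sqlContext: SQLContext, path: String) =
      sqlContext.read.schema(wideSchema).parquet(path) // same query shape, new file

This avoids repeated schema inference; whether it removes the full 20-second cost depends on where the time is actually being spent during analysis.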

Re: Partitioning in spark

2016-06-24 Thread Darshan Singh
fle.partitions',10)
> sc = SparkContext(conf=conf)
>
> On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh <darshan.m...@gmail.com> wrote:
>> Hi,
>>
>> My default parallelism is 100. Now I join 2 dataframes with 20 partitions
>> each, joined d

Partitioning in spark

2016-06-23 Thread Darshan Singh
Hi, My default parallelism is 100. Now I join 2 dataframes with 20 partitions each; the joined dataframe has 100 partitions. I want to know what the way is to keep it at 20 (other than repartition and coalesce). Also, when I join these 2 dataframes I am using 4 columns as join columns. The dataframes
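The 100 partitions after the join come from the SQL shuffle setting rather than from the inputs: an equi-join in Spark SQL shuffles both sides into spark.sql.shuffle.partitions output partitions. A minimal sketch of pinning it to 20 before the join (Spark 1.x Scala API; df1, df2, and the column names c1..c4 are illustrative):

    // The shuffle that implements the join always produces
    // spark.sql.shuffle.partitions output partitions.
    sqlContext.setConf("spark.sql.shuffle.partitions", "20")
    val joined = df1.join(df2,
      df1("c1") === df2("c1") && df1("c2") === df2("c2") &&
      df1("c3") === df2("c3") && df1("c4") === df2("c4"))
    // joined.rdd.partitions.length is now 20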

Confusion about spark.shuffle.memoryFraction and spark.storage.memoryFraction

2016-06-23 Thread Darshan Singh
These are 2 parameters, and the default values for them are 0.6 and 0.2, which together account for around 80%. I am wondering where the remaining 0.2 (20%) goes. Is it left for the JVM's other memory requirements? If yes, then what is spark.memory.fraction used for? My understanding is that if we have 10GB of memory per executor
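For reference, these settings belong to two different memory models, so a sketch of both may help (values shown are the documented defaults; the two legacy fractions apply before 1.6, the unified fraction from 1.6 on):

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
    // Legacy (pre-1.6) model: two separate pools carved out of the heap.
    conf.set("spark.storage.memoryFraction", "0.6") // cached blocks
    conf.set("spark.shuffle.memoryFraction", "0.2") // shuffle/aggregation buffers
    // The remaining ~0.2 of the heap is left for user data structures,
    // internal metadata, and general JVM overhead.

    // Unified (1.6+) model: one pool shared by storage and execution,
    // replacing the two fractions above.
    conf.set("spark.memory.fraction", "0.75")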

spark standalone High availability issues

2016-06-14 Thread Darshan Singh
Hi, I am using a standalone spark cluster with a zookeeper cluster for high availability. I sometimes get an error when I start the master. The error is related to leader election in Curator and says that no method was found (getProcess), and the master doesn't get started. Just wondering what could
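For context, the documented way to enable ZooKeeper-based recovery for a standalone master is via daemon options, e.g. in conf/spark-env.sh (the ZooKeeper host names here are illustrative):

    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

A NoSuchMethod-style error during leader election often points at mismatched Curator/ZooKeeper jar versions on the master's classpath, though that is only a guess from the description.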

SortWithinPartitions on DataFrame

2016-05-05 Thread Darshan Singh
Hi, I have a dataframe df1, which I partitioned by col1, col2 and persisted. Then I created a new dataframe df2:

val df2 = df1.sortWithinPartitions("col1", "col2", "col3")
df1.persist()
df2.persist()
df1.count()
df2.count()

Now I expect that any group by statement using the "col1","col2","col3"
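A quick way to check whether a later aggregation actually exploits the existing partitioning and sort order is to inspect the physical plan; a minimal check, assuming the same API as above:

    // If the plan still shows an Exchange/Sort below the aggregate,
    // the pre-sorted layout of df2 is not being reused.
    df2.groupBy("col1", "col2", "col3").count().explain()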

large scheduler delay

2016-04-18 Thread Darshan Singh
Hi, I have an application which uses 3 parquet files, 2 of which are large and one of which is small. These files are in HDFS and are partitioned by the column "col1". Now I create 3 data-frames, one for each parquet file, but I pass the col1 value so that each reads only the relevant data. I always read from
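When the parquet data is laid out in col1=... partition directories, filtering on col1 at read time lets Spark scan only the matching directories. A minimal sketch (the path and the value 42 are illustrative; Spark 1.5 Scala API):

    val df = sqlContext.read.parquet("hdfs:///data/table1").where("col1 = 42")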

Max number of dataframes in Union

2016-04-11 Thread Darshan Singh
Hi, I would like to know if there is any max limit on a union of data-frames. How will a union of 1 data frames perform in spark when all the data is in cache? The other option is 1 partitions of a single dataframe. Thanks
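In the DataFrame API a large union is typically built by folding; a minimal sketch, assuming Spark 1.x (unionAll) and an illustrative Seq[DataFrame] named dfs whose members share a schema:

    // Each unionAll adds a node to the logical plan, so a chain of N unions
    // grows the plan (and its analysis cost) linearly with N.
    val combined = dfs.reduce(_ unionAll _)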

Re: Plan issue with spark 1.5.2

2016-04-06 Thread Darshan Singh
spark.apache.org
> Thanks for the information. When I mention map side join, I meant that
> each partition from 1 DF joins with the partition with the same key of DF 2
> on the worker node without shuffling the data. In other words, do as much
> work as possible within the worker node before shuffling the

Re: Plan issue with spark 1.5.2

2016-04-06 Thread Darshan Singh
Thanks for the information. When I mention map side join, I meant that each partition from 1 DF joins with the partition with the same key of DF 2 on the worker node, without shuffling the data. In other words, do as much work as possible within the worker node before shuffling the data. Thanks, Darshan Singh. On Wed
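For the case where one side is small, Spark does offer a shuffle-free (map-side) join via broadcasting; a minimal sketch for the Spark 1.5 Scala API, with illustrative names bigDf/smallDf/key, though it does not cover the two-large-inputs case described here:

    import org.apache.spark.sql.functions.broadcast

    // smallDf is shipped whole to every worker; bigDf is joined in place
    // without shuffling its partitions.
    val joined = bigDf.join(broadcast(smallDf), bigDf("key") === smallDf("key"))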

Re: Plan issue with spark 1.5.2

2016-04-06 Thread Darshan Singh
Thanks a lot for this. I was thinking of using cogrouped RDDs. We will try to move to 1.6, as there are other issues as well in 1.5.2. The same code is much faster in 1.6.1, but plan-wise I do not see much difference. Why is it still partitioning and then sorting and then joining? I expect it to sort

Re: Plan issue with spark 1.5.2

2016-04-06 Thread Darshan Singh
I used 1.5.2. I have used the movies data to reproduce the issue. Below is the physical plan. I am not sure why it is hash partitioning the data and then sorting and then joining. I expect the data to be joined first and then sent for further processing. I sort of expect a common partitioner which will work
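As a hedged reading of that plan shape: in Spark 1.5 an equi-join of two plain (unbucketed) dataframes is planned as a sort-merge join, which is exactly Exchange (hash partition on the join keys), then Sort, then SortMergeJoin, regardless of how the inputs were partitioned beforehand. A sketch whose plan would show those operators ("movie" being the join column from the movies example in this thread):

    df1.join(df2, df1("movie") === df2("movie")).explain()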

Re: Partition pruning in spark 1.5.2

2016-04-06 Thread Darshan Singh
:41 PM, Darshan Singh <darshan.m...@gmail.com> wrote:
> Thanks a lot. I will try this one as well.
>
> On Tue, Apr 5, 2016 at 9:28 PM, Michael Armbrust <mich...@databricks.com> wrote:
>> The following should ensure partition pruning happens:

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Darshan Singh
d.load("/path/to/data").where("country = 'UK'") > > On Tue, Apr 5, 2016 at 1:13 PM, Darshan Singh <darshan.m...@gmail.com> > wrote: > >> Thanks for the reply. >> >> Now I saved the part_movies as parquet file. >> >> Then created new

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Darshan Singh
> On Tue, Apr 5, 2016 at 12:14 PM, Darshan Singh <darshan.m...@gmail.com> wrote:
>> Thanks. It is not my exact scenario but I have tried to reproduce it. I
>> have used 1.5.2.
>>
>> I have a part-movies data-frame which has 20 partitions, 1 each for a

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Darshan Singh
Thanks. It is not my exact scenario but I have tried to reproduce it. I have used 1.5.2. I have a part-movies data-frame which has 20 partitions, 1 each for a movie. I created the following query:

val part_sql = sqlContext.sql("select * from part_movies where movie = 10")
part_sql.count()

I expect
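Following the suggestion quoted upthread, going through the datasource reader and filtering with where() is what lets the planner prune partitions; a minimal sketch (the path is illustrative):

    val pruned = sqlContext.read.format("parquet")
      .load("/path/to/part_movies")
      .where("movie = 10")
    pruned.count()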