I am using 1.5.2.

I have a data-frame with 10 column and then I pivot 1 column and generate
the 700 columns.

it is like

val df1 = sqlContext.read.parquet("file1")
df1.registerTempTable("df1")
val df2= sqlContext.sql("select col1, col2, sum(case when col3 = 1 then
col4 else 0.0 end ) as col4_1,....,sum(case when col3 = 700 then col4 else
0.0 end ) as col4_700 from df1 group by col1, col2")

Now this last statement takes around 20-30 seconds. I run this a number of
times only difference is that for df1 file is different. Everything else is
same.

The actual statement takes 2-3 seconds so it is bit frustrating that just
generating plan for df2 is taking too much time.Worse thing is that this
run on driver so it is not palatalized.

I have similar issue in another query where from these 700 columns we
generate more columns by adding or subtracting these and it again takes
lots of time.

Not sure what could be done here.

Thanks

On Thu, Jun 30, 2016 at 10:10 PM, Reynold Xin <r...@databricks.com> wrote:

> Which version are you using here? If the underlying files change,
> technically we should go through optimization again.
>
> Perhaps the real "fix" is to figure out why is logical plan creation so
> slow for 700 columns.
>
>
> On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh <darshan.m...@gmail.com>
> wrote:
>
>> Is there a way I can use same Logical plan for a query. Everything will
>> be same except underlying file will be different.
>>
>> Issue is that my query has around 700 columns and Generating logical plan
>> takes 20 seconds and it happens every 2 minutes but every time underlying
>> file is different.
>>
>> I do not know these files in advance so I cant create the table on
>> directory level. These files are created and then used in the final query.
>>
>> Thanks
>>
>
>

Reply via email to