One option is to batch up the columns and run the batches in sequence.
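A rough sketch of that idea with the 1.3 DataFrame API (untested; df,
columns and batchSize below are placeholder names, not from your code):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{avg, count, lit, max, min}

// Compute max/min/avg per column for a small batch of columns at a time,
// so the analyzer never sees all 26k expressions in a single plan.
def statsInBatches(df: DataFrame,
                   columns: Seq[String],
                   batchSize: Int = 100): Seq[Row] = {
  columns.grouped(batchSize).flatMap { batch =>
    val aggs = batch.flatMap { name =>
      val c = df(name).cast("double")
      Seq(max(c).as(name + "_max"),
          min(c).as(name + "_min"),
          avg(c).as(name + "_avg"))
    } :+ count(lit(1)).as("count")
    // agg(expr, exprs: _*) builds one small aggregation per batch
    df.agg(aggs.head, aggs.tail: _*).collect()
  }.toSeq
}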
On 20 May 2015 00:20, "madhu phatak" <phatak....@gmail.com> wrote:

> Hi,
> Another update: when run on more than 1000 columns I am getting
>
> Could not write class
> __wrapper$1$40255d281a0d4eacab06bcad6cf89b0d/__wrapper$1$40255d281a0d4eacab06bcad6cf89b0d$$anonfun$wrapper$1$$anon$1
> because it exceeds JVM code size limits. Method apply's code too large!
>
> Regards,
> Madhukara Phatak
> http://datamantra.io/
>
> On Tue, May 19, 2015 at 6:23 PM, madhu phatak <phatak....@gmail.com>
> wrote:
>
>> Hi,
>> Tested with HiveContext also. It also takes a similar amount of time.
>>
>> To make things clear, the following is the select clause for a given
>> column:
>>
>> aggregateStats("$columnName", max(cast($columnName as double)),
>>   min(cast($columnName as double)), avg(cast($columnName as double)),
>>   count(*))
>>
>> aggregateStats is a UDF that generates a case class to hold the values.
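>>
>> Roughly, the UDF looks something like this (a simplified sketch, not the
>> exact code):
>>
>> case class ColumnStats(name: String, max: Double, min: Double,
>>                        avg: Double, count: Long)
>>
>> // Registered on the SQLContext so it can be called from the select
>> // clause above; the case class comes back as a struct column.
>> sqlContext.udf.register("aggregateStats",
>>   (name: String, max: Double, min: Double, avg: Double, count: Long) =>
>>     ColumnStats(name, max, min, avg, count))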
>>
>> Regards,
>> Madhukara Phatak
>> http://datamantra.io/
>>
>> On Tue, May 19, 2015 at 5:57 PM, madhu phatak <phatak....@gmail.com>
>> wrote:
>>
>>> Hi,
>>> Tested with calculating values for 300 columns. The analyzer takes
>>> around 4 minutes to generate the plan. Is this normal?
>>>
>>> Regards,
>>> Madhukara Phatak
>>> http://datamantra.io/
>>>
>>> On Tue, May 19, 2015 at 4:35 PM, madhu phatak <phatak....@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>> I am using spark 1.3.1
>>>>
>>>> Regards,
>>>> Madhukara Phatak
>>>> http://datamantra.io/
>>>>
>>>> On Tue, May 19, 2015 at 4:34 PM, Wangfei (X) <wangf...@huawei.com>
>>>> wrote:
>>>>
>>>>> And which version are you using?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On 19 May 2015, at 18:29, "ayan guha" <guha.a...@gmail.com> wrote:
>>>>>
>>>>> Can you kindly share your code?
>>>>>
>>>>> On Tue, May 19, 2015 at 8:04 PM, madhu phatak <phatak....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am trying to run a Spark SQL aggregation on a file with 26k columns.
>>>>>> The number of rows is very small. I am running into an issue where
>>>>>> Spark takes a huge amount of time to parse the SQL and create the
>>>>>> logical plan. Even if I have just one row, it takes more than 1 hour
>>>>>> just to get past the parsing. Any idea how to optimize in these kinds
>>>>>> of scenarios?
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Madhukara Phatak
>>>>>> http://datamantra.io/
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>>
>>>>
>>>
>>
>
