Hi Jeff,

This is only part of the actual code.

My questions are in the comments next to the code.

SALES <- SparkR::sql(hiveContext, "select * from sales")
PRICING <- SparkR::sql(hiveContext, "select * from pricing")


## renaming of columns ##
#sales file#

# Is this right? Do we have to create a new DF for every column addition
# to the original DF?

# And if we do, will the older DFs also keep taking up memory?

names(SALES)[which(names(SALES)=="div_no")]<-"DIV_NO"
names(SALES)[which(names(SALES)=="store_no")]<-"STORE_NO"

#pricing file#
names(PRICING)[which(names(PRICING)=="price_type_cd")]<-"PRICE_TYPE"
names(PRICING)[which(names(PRICING)=="price_amt")]<-"PRICE_AMT"
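
For what it's worth, a more idiomatic way to rename columns in SparkR is
withColumnRenamed(), which returns a new DataFrame with the one column
renamed. A sketch assuming the same SALES/PRICING DataFrames and the
Spark 1.5 API:

```r
# Each call returns a new DataFrame; reassigning to the same variable
# drops the old reference. A DataFrame is just a lazy query plan, so the
# superseded object holds no data, only a small plan on the driver.
SALES   <- withColumnRenamed(SALES, "div_no", "DIV_NO")
SALES   <- withColumnRenamed(SALES, "store_no", "STORE_NO")
PRICING <- withColumnRenamed(PRICING, "price_type_cd", "PRICE_TYPE")
PRICING <- withColumnRenamed(PRICING, "price_amt", "PRICE_AMT")
```

That is also the answer to the memory comment: the old DataFrames are only
discarded plans, not copies of the data.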

registerTempTable(SALES,"sales")
registerTempTable(PRICING,"pricing")
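
On the memory question in the comments above: an un-cached SparkR DataFrame
costs almost nothing, since it is only a query plan. Cluster memory is pinned
only when a table is explicitly cached. A hedged sketch, assuming the
Spark 1.5-era API:

```r
cacheTable(hiveContext, "sales")    # pin the table in executor memory
# ... run repeated queries against "sales" ...
uncacheTable(hiveContext, "sales")  # release the cached blocks
```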

#merging sales and pricing file#
merg_sales_pricing <- SparkR::sql(hiveContext, "select .....................")

head(merg_sales_pricing)
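
Since the actual select is elided above, here is only a hypothetical shape of
the merge using SparkR's join() instead of SQL. The join keys are assumptions
for illustration; adjust them to your schema:

```r
# hypothetical inner equi-join on DIV_NO and STORE_NO
merged <- join(SALES, PRICING,
               SALES$DIV_NO == PRICING$DIV_NO &
                 SALES$STORE_NO == PRICING$STORE_NO,
               "inner")
head(merged)
```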


Thanks,
Vipul


On 23 November 2015 at 14:52, Jeff Zhang <zjf...@gmail.com> wrote:

> If possible, could you share your code? What kind of operation are you
> doing on the dataframe?
>
> On Mon, Nov 23, 2015 at 5:10 PM, Vipul Rai <vipulrai8...@gmail.com> wrote:
>
>> Hi Jeff,
>>
>> Thanks for the reply, but could you tell me why it is taking so much time?
>> What could be wrong? Also, when I remove the DataFrame using rm(), the
>> object is deleted but the memory is not freed.
>>
>> Also, what about the R functions which are not supported in SparkR,
>> like ddply?
>>
>> How do I access the nth row of a SparkR DataFrame?
>>
>> Regards,
>> Vipul
>>
>> On 23 November 2015 at 14:25, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> >>> Do I need to create a new DataFrame for every update to the
>>> DataFrame like
>>> addition of new column or  need to update the original sales DataFrame.
>>>
>>> Yes, DataFrame is immutable, and every mutation of DataFrame will
>>> produce a new DataFrame.
>>>
>>>
>>>
>>> On Mon, Nov 23, 2015 at 4:44 PM, Vipul Rai <vipulrai8...@gmail.com>
>>> wrote:
>>>
>>>> Hello Rui,
>>>>
>>>> Sorry, what I meant was that adding a new column to the original
>>>> DataFrame produces a new DataFrame.
>>>>
>>>> Please check this for more
>>>>
>>>> https://spark.apache.org/docs/1.5.1/api/R/index.html
>>>>
>>>> Check for
>>>> withColumn
>>>>
>>>>
>>>> Thanks,
>>>> Vipul
>>>>
>>>>
>>>> On 23 November 2015 at 12:42, Sun, Rui <rui....@intel.com> wrote:
>>>>
>>>>> Vipul,
>>>>>
>>>>> Not sure if I understand your question. DataFrame is immutable. You
>>>>> can't update a DataFrame.
>>>>>
>>>>> Could you paste some log info for the OOM error?
>>>>>
>>>>> -----Original Message-----
>>>>> From: vipulrai [mailto:vipulrai8...@gmail.com]
>>>>> Sent: Friday, November 20, 2015 12:11 PM
>>>>> To: user@spark.apache.org
>>>>> Subject: SparkR DataFrame , Out of memory exception for very small
>>>>> file.
>>>>>
>>>>> Hi Users,
>>>>>
>>>>> I have a general question regarding DataFrames in SparkR.
>>>>>
>>>>> I am trying to read a file from Hive and it gets created as a DataFrame.
>>>>>
>>>>> sqlContext <- sparkRHive.init(sc)
>>>>>
>>>>> #DF
>>>>> sales <- read.df(sqlContext, "hdfs://sample.csv", header = 'true',
>>>>>                  source = "com.databricks.spark.csv",
>>>>>                  inferSchema = 'true')
>>>>>
>>>>> registerTempTable(sales,"Sales")
>>>>>
>>>>> Do I need to create a new DataFrame for every update to the DataFrame,
>>>>> like adding a new column, or do I need to update the original sales
>>>>> DataFrame?
>>>>>
>>>>> sales1 <- SparkR::sql(sqlContext, "Select a.*, 607 as C1 from Sales as
>>>>> a")
>>>>>
>>>>>
>>>>> Please help me with this, as the original file is only 20MB but it
>>>>> throws an out-of-memory exception on a cluster with a 4GB master and
>>>>> two 4GB workers.
>>>>>
>>>>> Also, what is the correct pattern with DataFrames: do I need to
>>>>> register and drop the tempTable after every update?
>>>>>
>>>>> Thanks,
>>>>> Vipul
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-DataFrame-Out-of-memory-exception-for-very-small-file-tp25435.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
>>>>> additional commands, e-mail: user-h...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vipul Rai
>>>> www.vipulrai.me
>>>> +91-8892598819
>>>> <http://in.linkedin.com/in/vipulrai/>
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>> Regards,
>> Vipul Rai
>> www.vipulrai.me
>> +91-8892598819
>> <http://in.linkedin.com/in/vipulrai/>
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Regards,
Vipul Rai
www.vipulrai.me
+91-8892598819
<http://in.linkedin.com/in/vipulrai/>
