Re: SparkR DataFrame, Out of memory exception for very small file.

2015-11-23 Thread Vipul Rai
Hi Jeff,

This is only part of the actual code.

My questions are mentioned in comments near the code.

SALES<- SparkR::sql(hiveContext, "select * from sales")
PRICING<- SparkR::sql(hiveContext, "select * from pricing")


## renaming of columns ##
#sales file#

# Is this right? Do we have to create a new DF for every column addition
# to the original DF? (See the withColumnRenamed sketch after this block.)

# And if we do that, will the older DFs also keep taking memory?

names(SALES)[which(names(SALES)=="div_no")]<-"DIV_NO"
names(SALES)[which(names(SALES)=="store_no")]<-"STORE_NO"

#pricing file#
names(PRICING)[which(names(PRICING)=="price_type_cd")]<-"PRICE_TYPE"
names(PRICING)[which(names(PRICING)=="price_amt")]<-"PRICE_AMT"

registerTempTable(SALES,"sales")
registerTempTable(PRICING,"pricing")

#merging sales and pricing file#
merg_sales_pricing<- SparkR::sql(hiveContext,"select .")

head(merg_sales_pricing)
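
For reference, a rough sketch of how the renaming could be done with the
DataFrame API itself, using withColumnRenamed from the SparkR 1.5 docs (same
column names as above; each call returns a new DataFrame, so reassigning the
same variable simply drops the old R-side reference):

# withColumnRenamed returns a new DataFrame; reassigning SALES/PRICING means
# the previous objects are no longer referenced on the R side
SALES <- withColumnRenamed(SALES, "div_no", "DIV_NO")
SALES <- withColumnRenamed(SALES, "store_no", "STORE_NO")

PRICING <- withColumnRenamed(PRICING, "price_type_cd", "PRICE_TYPE")
PRICING <- withColumnRenamed(PRICING, "price_amt", "PRICE_AMT")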


Thanks,
Vipul


On 23 November 2015 at 14:52, Jeff Zhang <zjf...@gmail.com> wrote:

> If possible, could you share your code ? What kind of operation are you
> doing on the dataframe ?
>
> On Mon, Nov 23, 2015 at 5:10 PM, Vipul Rai <vipulrai8...@gmail.com> wrote:
>
>> Hi Jeff,
>>
>> Thanks for the reply, but could you tell me why it is taking so much time?
>> What could be wrong? Also, when I remove the DataFrame from memory using
>> rm(), the memory is not cleared even though the object is deleted.
>>
>> Also, what about the R functions that are not supported in SparkR,
>> like ddply?
>>
>> How can I access the nth row of a SparkR DataFrame?
>>
>> Regards,
>> Vipul
>>
>> On 23 November 2015 at 14:25, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> >>> Do I need to create a new DataFrame for every update to the
>>> DataFrame, like the addition of a new column, or do I need to update the
>>> original sales DataFrame?
>>>
>>> Yes, DataFrame is immutable, and every mutation of DataFrame will
>>> produce a new DataFrame.
>>>
>>>
>>>
>>> On Mon, Nov 23, 2015 at 4:44 PM, Vipul Rai <vipulrai8...@gmail.com>
>>> wrote:
>>>
>>>> Hello Rui,
>>>>
>>>> Sorry, what I meant was that adding a new column to the original
>>>> dataframe gives a new DataFrame.
>>>>
>>>> Please check this for more details:
>>>>
>>>> https://spark.apache.org/docs/1.5.1/api/R/index.html
>>>>
>>>> Check for
>>>> withColumn.
>>>>
>>>>
>>>> Thanks,
>>>> Vipul
>>>>
>>>>
>>>> On 23 November 2015 at 12:42, Sun, Rui <rui@intel.com> wrote:
>>>>
>>>>> Vipul,
>>>>>
>>>>> Not sure if I understand your question. DataFrame is immutable. You
>>>>> can't update a DataFrame.
>>>>>
>>>>> Could you paste some log info for the OOM error?
>>>>>
>>>>> -----Original Message-----
>>>>> From: vipulrai [mailto:vipulrai8...@gmail.com]
>>>>> Sent: Friday, November 20, 2015 12:11 PM
>>>>> To: user@spark.apache.org
>>>>> Subject: SparkR DataFrame , Out of memory exception for very small
>>>>> file.
>>>>>
>>>>> Hi Users,
>>>>>
>>>>> I have a general doubt regarding DataFrames in SparkR.
>>>>>
>>>>> I am trying to read a file from Hive and it gets created as a DataFrame.
>>>>>
>>>>> sqlContext <- sparkRHive.init(sc)
>>>>>
>>>>> #DF
>>>>> sales <- read.df(sqlContext, "hdfs://sample.csv", header ='true',
>>>>>  source = "com.databricks.spark.csv",
>>>>> inferSchema='true')
>>>>>
>>>>> registerTempTable(sales,"Sales")
>>>>>
>>>>> Do I need to create a new DataFrame for every update to the DataFrame,
>>>>> like the addition of a new column, or do I need to update the original
>>>>> sales DataFrame?
>>>>>
>>>>> sales1<- SparkR::sql(sqlContext,"Select a.* , 607 as C1 from Sales as
>>>>> a")
>>>>>
>>>>>
>>>>> Please help me with this, as the original file is only 20MB but it
>>>>> throws an out-of-memory exception on a cluster with a 4GB master and two
>>>>> workers of 4GB each.
>>>>>
>>>>> Also, what is the logic with DataFrames? Do I need to register and drop
>>>>> the tempTable after every update?
>>>>>
>>>>> Thanks,
>>>>> Vipul
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-DataFrame-Out-of-memory-exception-for-very-small-file-tp25435.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vipul Rai
>>>> www.vipulrai.me
>>>> +91-8892598819
>>>> <http://in.linkedin.com/in/vipulrai/>
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>> Regards,
>> Vipul Rai
>> www.vipulrai.me
>> +91-8892598819
>> <http://in.linkedin.com/in/vipulrai/>
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Regards,
Vipul Rai
www.vipulrai.me
+91-8892598819
<http://in.linkedin.com/in/vipulrai/>


Re: SparkR DataFrame, Out of memory exception for very small file.

2015-11-23 Thread Vipul Rai
Hello Rui,

Sorry, what I meant was that adding a new column to the original dataframe
gives a new DataFrame.

Please check this for more details:

https://spark.apache.org/docs/1.5.1/api/R/index.html

Check for
withColumn.
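
A rough sketch of what that call would look like in the SparkR 1.5 API,
doing the same thing as the "607 as C1" SQL in the quoted message below
(I am assuming lit() is available in this SparkR build for building a
literal Column; the result is a new DataFrame and the original is untouched):

# withColumn returns a new DataFrame with the extra constant column;
# 'sales' itself is not modified
sales1 <- withColumn(sales, "C1", lit(607))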


Thanks,
Vipul


On 23 November 2015 at 12:42, Sun, Rui <rui@intel.com> wrote:

> Vipul,
>
> Not sure if I understand your question. DataFrame is immutable. You can't
> update a DataFrame.
>
> Could you paste some log info for the OOM error?
>
> -----Original Message-----
> From: vipulrai [mailto:vipulrai8...@gmail.com]
> Sent: Friday, November 20, 2015 12:11 PM
> To: user@spark.apache.org
> Subject: SparkR DataFrame , Out of memory exception for very small file.
>
> Hi Users,
>
> I have a general doubt regarding DataFrames in SparkR.
>
> I am trying to read a file from Hive and it gets created as a DataFrame.
>
> sqlContext <- sparkRHive.init(sc)
>
> #DF
> sales <- read.df(sqlContext, "hdfs://sample.csv", header ='true',
>  source = "com.databricks.spark.csv", inferSchema='true')
>
> registerTempTable(sales,"Sales")
>
> Do I need to create a new DataFrame for every update to the DataFrame,
> like the addition of a new column, or do I need to update the original
> sales DataFrame?
>
> sales1<- SparkR::sql(sqlContext,"Select a.* , 607 as C1 from Sales as a")
>
>
> Please help me with this, as the original file is only 20MB but it throws
> an out-of-memory exception on a cluster with a 4GB master and two workers
> of 4GB each.
>
> Also, what is the logic with DataFrames? Do I need to register and drop
> the tempTable after every update?
>
> Thanks,
> Vipul
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-DataFrame-Out-of-memory-exception-for-very-small-file-tp25435.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


-- 
Regards,
Vipul Rai
www.vipulrai.me
+91-8892598819
<http://in.linkedin.com/in/vipulrai/>


Re: SparkR DataFrame, Out of memory exception for very small file.

2015-11-23 Thread Vipul Rai
Hi Jeff,

Thanks for the reply, but could you tell me why it is taking so much time?
What could be wrong? Also, when I remove the DataFrame from memory using
rm(), the memory is not cleared even though the object is deleted.
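
For concreteness, a sketch of the cleanup I would expect to work, using the
sales DataFrame from my original post (the unpersist() call is an assumption
on my part and should only matter if the DataFrame was explicitly cached or
persisted):

# release any cached blocks on the executors (only relevant if cache()/persist() was used)
unpersist(sales)
# drop the R-side reference and let R's garbage collector clean up the backing handle
rm(sales)
gc()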

Also, what about the R functions that are not supported in SparkR,
like ddply?

How can I access the nth row of a SparkR DataFrame?
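
A rough sketch of the substitutes I have in mind from the SparkR 1.5 API
(store_no and amt are placeholder column names, and this is my reading of
the docs rather than something I have verified): grouped aggregation instead
of ddply, and head() to bring the first n rows to the driver so the nth one
can be indexed locally.

# ddply-style summary: group and aggregate (returns a new DataFrame)
by_store <- summarize(groupBy(sales, sales$store_no),
                      total_amt = sum(sales$amt))

# "nth row": sort for a deterministic order, pull the first n rows to the
# driver as a local data.frame, then index the last one
n <- 10
nth_row <- head(arrange(sales, sales$store_no), n)[n, ]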

Regards,
Vipul

On 23 November 2015 at 14:25, Jeff Zhang <zjf...@gmail.com> wrote:

> >>> Do I need to create a new DataFrame for every update to the DataFrame,
> like the addition of a new column, or do I need to update the original
> sales DataFrame?
>
> Yes, DataFrame is immutable, and every mutation of DataFrame will produce
> a new DataFrame.
>
>
>
> On Mon, Nov 23, 2015 at 4:44 PM, Vipul Rai <vipulrai8...@gmail.com> wrote:
>
>> Hello Rui,
>>
>> Sorry, what I meant was that adding a new column to the original
>> dataframe gives a new DataFrame.
>>
>> Please check this for more details:
>>
>> https://spark.apache.org/docs/1.5.1/api/R/index.html
>>
>> Check for
>> withColumn.
>>
>>
>> Thanks,
>> Vipul
>>
>>
>> On 23 November 2015 at 12:42, Sun, Rui <rui@intel.com> wrote:
>>
>>> Vipul,
>>>
>>> Not sure if I understand your question. DataFrame is immutable. You
>>> can't update a DataFrame.
>>>
>>> Could you paste some log info for the OOM error?
>>>
>>> -----Original Message-----
>>> From: vipulrai [mailto:vipulrai8...@gmail.com]
>>> Sent: Friday, November 20, 2015 12:11 PM
>>> To: user@spark.apache.org
>>> Subject: SparkR DataFrame , Out of memory exception for very small file.
>>>
>>> Hi Users,
>>>
>>> I have a general doubt regarding DataFrames in SparkR.
>>>
>>> I am trying to read a file from Hive and it gets created as a DataFrame.
>>>
>>> sqlContext <- sparkRHive.init(sc)
>>>
>>> #DF
>>> sales <- read.df(sqlContext, "hdfs://sample.csv", header ='true',
>>>  source = "com.databricks.spark.csv", inferSchema='true')
>>>
>>> registerTempTable(sales,"Sales")
>>>
>>> Do I need to create a new DataFrame for every update to the DataFrame,
>>> like the addition of a new column, or do I need to update the original
>>> sales DataFrame?
>>>
>>> sales1<- SparkR::sql(sqlContext,"Select a.* , 607 as C1 from Sales as a")
>>>
>>>
>>> Please help me with this, as the original file is only 20MB but it
>>> throws an out-of-memory exception on a cluster with a 4GB master and two
>>> workers of 4GB each.
>>>
>>> Also, what is the logic with DataFrames? Do I need to register and drop
>>> the tempTable after every update?
>>>
>>> Thanks,
>>> Vipul
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-DataFrame-Out-of-memory-exception-for-very-small-file-tp25435.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>>
>>
>>
>> --
>> Regards,
>> Vipul Rai
>> www.vipulrai.me
>> +91-8892598819
>> <http://in.linkedin.com/in/vipulrai/>
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Regards,
Vipul Rai
www.vipulrai.me
+91-8892598819
<http://in.linkedin.com/in/vipulrai/>


Re: Spark Job is getting killed after certain hours

2015-11-16 Thread Vipul Rai
Hi Nikhil,
It seems you have a Kerberos-enabled cluster and the job is unable to
authenticate using the ticket.
Please check the Kerberos settings; it could also be caused by a Kerberos
version mismatch across the nodes.

Thanks,
Vipul

On Tue 17 Nov, 2015 07:31 Nikhil Gs  wrote:

> Hello Team,
>
> Below is the error which we are facing in our cluster after 14 hours of
> starting the spark-submit job. We are not able to understand the issue or
> why it hits the error below after a certain amount of time.
>
> If any of you have faced the same scenario, or if you have any idea, then
> please guide us. If you need any other info to identify the issue, please
> let me know. Thanks a lot in advance.
>
> *Log Error:*
>
> 15/11/16 04:54:48 ERROR ipc.AbstractRpcClient: SASL authentication failed.
> The most likely cause is missing or invalid credentials. Consider 'kinit'.
>
> javax.security.sasl.SaslException: *GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]*
>
> at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>
> at
> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:605)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:154)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:731)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:728)
>
> at java.security.AccessController.doPrivileged(Native
> Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:415)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:728)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:881)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:850)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1174)
>
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:31865)
>
> at
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1580)
>
> at
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1294)
>
> at
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1126)
>
> at
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)
>
> at
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)
>
> at
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
>
> at
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)
>
> at
> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1482)
>
> at
> org.apache.hadoop.hbase.client.HTable.put(HTable.java:1095)
>
> at
> com.suxk.bigdata.pulse.consumer.ModempollHbaseLoadHelper$1.run(ModempollHbaseLoadHelper.java:89)
>
> at java.security.AccessController.doPrivileged(Native
> Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:356)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1651)
>
> at
> com.suxk.bigdata.pulse.consumer.ModempollHbaseLoadHelper.loadToHbase(ModempollHbaseLoadHelper.java:48)
>
> at
> com.suxk.bigdata.pulse.consumer.ModempollSparkStreamingEngine$1.call(ModempollSparkStreamingEngine.java:52)
>
> at
> com.suxk.bigdata.pulse.consumer.ModempollSparkStreamingEngine$1.call(ModempollSparkStreamingEngine.java:48)
>
> at
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:999)
>
> at
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
> at
> scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>
>

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-18 Thread Vipul Rai
Hi Nick/Igor,

Is there any solution for this? I am having the same issue, and copying the
jar to each executor is not feasible if we use a lot of jars.

Thanks,
Vipul


Creating fat jar with all resources.(Spark-Java-Maven)

2015-09-14 Thread Vipul Rai
Hi All,

I have a Spark app written in Java, which parses the incoming logs using
headers that are in .xml files. (There are many headers, and the logs come
from 15-20 devices in various formats with different separators.)

I am able to run it in local mode after specifying all the resources and
passing them as parameters.

I tried creating a fat jar using Maven. It was created successfully, but
when I run it on YARN in cluster mode it throws a *resource not found* error
for the .xml files.

Can someone please shed some light on this? Any links or tutorials would
also help.

Thanks,
Vipul