Re: SparkR DataFrame, Out of memory exception for very small file.
Hi Jeff,

This is only part of the actual code. My questions are in the comments near the code.

SALES <- SparkR::sql(hiveContext, "select * from sales")
PRICING <- SparkR::sql(hiveContext, "select * from pricing")

## Renaming of columns ##

# Sales file #
# Is this right? Do we have to create a new DF for every column addition
# to the original DF? And if we do that, what about the older DFs,
# will they also take memory?
names(SALES)[which(names(SALES) == "div_no")] <- "DIV_NO"
names(SALES)[which(names(SALES) == "store_no")] <- "STORE_NO"

# Pricing file #
names(PRICING)[which(names(PRICING) == "price_type_cd")] <- "PRICE_TYPE"
names(PRICING)[which(names(PRICING) == "price_amt")] <- "PRICE_AMT"

registerTempTable(SALES, "sales")
registerTempTable(PRICING, "pricing")

# Merging sales and pricing files #
merg_sales_pricing <- SparkR::sql(hiveContext, "select .")
head(merg_sales_pricing)

Thanks,
Vipul

On 23 November 2015 at 14:52, Jeff Zhang <zjf...@gmail.com> wrote:

> If possible, could you share your code? What kind of operation are you
> doing on the dataframe?

[The earlier messages quoted below this appear in full later in this thread.]

--
Regards,
Vipul Rai
www.vipulrai.me
+91-8892598819
<http://in.linkedin.com/in/vipulrai/>
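[Editorial note: as the thread states, every DataFrame operation in SparkR returns a new DataFrame; the R-side objects are only lightweight handles to lazily evaluated plans, so keeping several of them does not duplicate the data on the cluster. A rough sketch of the renaming step using the DataFrame API, based on the column names above; `withColumnRenamed` is in the SparkR 1.5 API, but this sketch has not been run against a live cluster:]

```r
# Each call returns a new DataFrame handle; the underlying data is not copied.
SALES <- withColumnRenamed(SALES, "div_no", "DIV_NO")
SALES <- withColumnRenamed(SALES, "store_no", "STORE_NO")
PRICING <- withColumnRenamed(PRICING, "price_type_cd", "PRICE_TYPE")
PRICING <- withColumnRenamed(PRICING, "price_amt", "PRICE_AMT")

# Rebinding the same name (SALES) drops the old handle automatically, and
# rm() likewise removes only the R-side handle. Cluster memory is held only
# for DataFrames you explicitly cache(), and is released with unpersist().
registerTempTable(SALES, "sales")
registerTempTable(PRICING, "pricing")
```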
Re: SparkR DataFrame, Out of memory exception for very small file.
Hello Rui,

Sorry, what I meant was that adding a new column to the original DataFrame produces a new DataFrame as the result.

Please check this for more:

https://spark.apache.org/docs/1.5.1/api/R/index.html

Check for withColumn.

Thanks,
Vipul

On 23 November 2015 at 12:42, Sun, Rui <rui@intel.com> wrote:

> Vipul,
>
> Not sure if I understand your question. DataFrame is immutable. You can't
> update a DataFrame.
>
> Could you paste some log info for the OOM error?
>
> -----Original Message-----
> From: vipulrai [mailto:vipulrai8...@gmail.com]
> Sent: Friday, November 20, 2015 12:11 PM
> To: user@spark.apache.org
> Subject: SparkR DataFrame, Out of memory exception for very small file.
>
> Hi Users,
>
> I have a general doubt regarding DataFrames in SparkR.
>
> I am trying to read a file from Hive and it gets created as a DataFrame.
>
> sqlContext <- sparkRHive.init(sc)
>
> #DF
> sales <- read.df(sqlContext, "hdfs://sample.csv", header = 'true',
>                  source = "com.databricks.spark.csv", inferSchema = 'true')
>
> registerTempTable(sales, "Sales")
>
> Do I need to create a new DataFrame for every update to the DataFrame,
> like addition of a new column, or do I need to update the original sales
> DataFrame?
>
> sales1 <- SparkR::sql(sqlContext, "Select a.*, 607 as C1 from Sales as a")
>
> Please help me with this, as the original file is only 20MB but it throws
> an out of memory exception on a cluster with a 4GB master and two workers
> of 4GB each.
>
> Also, what is the logic with DataFrames: do I need to register and drop
> the tempTable after every update?
>
> Thanks,
> Vipul
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-DataFrame-Out-of-memory-exception-for-very-small-file-tp25435.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org

--
Regards,
Vipul Rai
www.vipulrai.me
+91-8892598819
<http://in.linkedin.com/in/vipulrai/>
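[Editorial note: on the register/drop question in the quoted post, re-registering the same temp table name simply points it at the new DataFrame, so an explicit drop per update is not required. A hedged SparkR sketch using the names from the post, not run against a cluster:]

```r
sales1 <- SparkR::sql(sqlContext, "Select a.*, 607 as C1 from Sales as a")

# Re-registering under the same name replaces the previous mapping;
# no drop is needed between updates.
registerTempTable(sales1, "Sales")

# Drop only when the table is no longer needed at all:
# dropTempTable(sqlContext, "Sales")
```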
Re: SparkR DataFrame, Out of memory exception for very small file.
Hi Jeff,

Thanks for the reply, but could you tell me why it is taking so much time? What could be wrong? Also, when I remove the DataFrame from memory using rm(), it does not clear the memory, although the object is deleted.

Also, what about the R functions which are not supported in SparkR, like ddply?

How do I access the nth row of a SparkR DataFrame?

Regards,
Vipul

On 23 November 2015 at 14:25, Jeff Zhang <zjf...@gmail.com> wrote:

> >>> Do I need to create a new DataFrame for every update to the DataFrame
> >>> like addition of new column or need to update the original sales
> >>> DataFrame.
>
> Yes, DataFrame is immutable, and every mutation of a DataFrame will
> produce a new DataFrame.

[The earlier messages quoted below this appear in full elsewhere in this thread.]

--
Regards,
Vipul Rai
www.vipulrai.me
+91-8892598819
<http://in.linkedin.com/in/vipulrai/>
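[Editorial note: on the two API questions above, a rough SparkR sketch. The column name is taken from the thread, the rest is illustrative and has not been run against a live cluster:]

```r
# ddply-style aggregation: use groupBy plus count/agg on the DataFrame.
per_store <- count(groupBy(sales, "store_no"))

# There is no direct "nth row" accessor, since a distributed DataFrame has
# no guaranteed ordering. Collect the first n rows and index locally:
first_rows <- take(sales, 10)   # returns a local R data.frame
nth <- first_rows[10, ]
```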
Re: Spark Job is getting killed after certain hours
Hi Nikhil,

It seems you have a Kerberos-enabled cluster and it is unable to authenticate using the ticket. Please check the Kerberos settings; it could also be because of a Kerberos version mismatch between nodes.

Thanks,
Vipul

On Tue, 17 Nov 2015 at 07:31, Nikhil Gs wrote:

> Hello Team,
>
> Below is the error which we are facing in our cluster after 14 hours of
> starting the spark submit job. We are not able to understand the issue and
> why it hits the below error after a certain time.
>
> If any of you have faced the same scenario, or if you have any idea, then
> please guide us. If you need any other info to identify the issue, please
> let me know. Thanks a lot in advance.
>
> *Log Error:*
>
> 15/11/16 04:54:48 ERROR ipc.AbstractRpcClient: SASL authentication failed.
> The most likely cause is missing or invalid credentials. Consider 'kinit'.
>
> javax.security.sasl.SaslException: *GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]*
>
>     at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>     at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:605)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:154)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:731)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:728)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:728)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:881)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:850)
>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1174)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:31865)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1580)
>     at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1294)
>     at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1126)
>     at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)
>     at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)
>     at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
>     at org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)
>     at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1482)
>     at org.apache.hadoop.hbase.client.HTable.put(HTable.java:1095)
>     at com.suxk.bigdata.pulse.consumer.ModempollHbaseLoadHelper$1.run(ModempollHbaseLoadHelper.java:89)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:356)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1651)
>     at com.suxk.bigdata.pulse.consumer.ModempollHbaseLoadHelper.loadToHbase(ModempollHbaseLoadHelper.java:48)
>     at com.suxk.bigdata.pulse.consumer.ModempollSparkStreamingEngine$1.call(ModempollSparkStreamingEngine.java:52)
>     at com.suxk.bigdata.pulse.consumer.ModempollSparkStreamingEngine$1.call(ModempollSparkStreamingEngine.java:48)
>     at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:999)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
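[Editorial note: a job that runs fine and then hits the GSSException above after roughly half a day to a day is consistent with an expired, unrenewed Kerberos ticket (TGT lifetimes are commonly 10 to 24 hours). A hedged command sketch; the principal and keytab path are hypothetical placeholders, and this assumes a Spark-on-YARN deployment:]

```
# Inspect the current ticket and its expiry time
klist

# Log in non-interactively from a keytab (placeholders for illustration):
kinit -kt /etc/security/keytabs/spark.service.keytab spark/host.example.com@EXAMPLE.COM

# For Spark on YARN (Spark 1.4+), passing the keytab to spark-submit lets
# Spark re-authenticate for long-running jobs itself:
spark-submit --master yarn-cluster \
  --principal spark/host.example.com@EXAMPLE.COM \
  --keytab /etc/security/keytabs/spark.service.keytab \
  ...
```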
Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar
Hi Nick/Igor,

Is there any solution for this? I am having the same issue, and copying the jar to each executor is not feasible if we use a lot of jars.

Thanks,
Vipul
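[Editorial note: the two usual ways to get dependency classes onto executor classpaths without manually copying jars to every node, sketched as hedged spark-submit invocations; the jar names and main class are placeholders:]

```
# Option 1: let Spark ship the dependency jars with the job; --jars
# distributes them to every executor's classpath.
spark-submit --master yarn-cluster \
  --jars libs/dep1.jar,libs/dep2.jar \
  --class com.example.Main app.jar

# Option 2: build a single fat/assembly jar (e.g. maven-shade-plugin or
# sbt-assembly) so all classes travel inside the application jar itself.
spark-submit --master yarn-cluster --class com.example.Main app-assembly.jar
```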
Creating fat jar with all resources (Spark-Java-Maven)
Hi All,

I have a Spark app written in Java which parses the incoming logs using headers that are in .xml files. (There are many headers, and the logs come from 15-20 devices in various formats with various separators.)

I am able to run it in local mode after specifying all the resources and passing them as parameters. I tried creating a fat jar using Maven; it was created successfully, but when I run it on YARN in cluster mode it throws a *resource not found* error for the .xml files.

Can someone please throw some light on this? Any links or tutorials would also help.

Thanks,
Vipul
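[Editorial note: a common cause of this error is that files packaged inside a fat jar are no longer filesystem paths; they must be opened as classpath resources via the class loader. A minimal sketch; the resource path "headers/device-a.xml" is a hypothetical placeholder for one of the header files:]

```java
import java.io.InputStream;

public class ResourceCheck {
    // Open a header definition from the classpath (i.e. from inside the fat
    // jar). Returns null when the resource is not packaged, instead of
    // throwing later with a confusing error.
    static InputStream openHeader(String resourcePath) {
        return Thread.currentThread()
                     .getContextClassLoader()
                     .getResourceAsStream(resourcePath);
    }

    public static void main(String[] args) {
        InputStream in = openHeader("headers/device-a.xml");
        if (in == null) {
            // On YARN this is typically what "resource not found" means:
            // the file was addressed by a local filesystem path, or was
            // never included in the assembled jar.
            System.out.println("missing");
        } else {
            System.out.println("found");
        }
    }
}
```

For this to work, the .xml files must sit under `src/main/resources` (or be added as a Maven resource directory) so the assembly/shade plugin packages them into the jar.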