SparkContext not created due to logger initialization error
Hi,

Sometimes I get this error when I submit a Spark job. It doesn't happen every time, but when it does, the SparkContext is not created:

16/12/08 08:02:18 INFO [akka.event.slf4j.Slf4jLogger] 80==> Slf4jLogger started
error while starting up loggers
akka.ConfigurationException: Logger specified in config can't be loaded [akka.event.slf4j.Slf4jLogger] due to [akka.event.Logging$LoggerInitializationException: Logger log1-Slf4jLogger did not respond with LoggerInitialized, sent instead [TIMEOUT]]
        at akka.event.LoggingBus$$anonfun$4$$anonfun$apply$1.applyOrElse(Logging.scala:116)
        at akka.event.LoggingBus$$anonfun$4$$anonfun$apply$1.applyOrElse(Logging.scala:115)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
        at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
        at scala.util.Try$.apply(Try.scala:161)
        at scala.util.Failure.recover(Try.scala:185)
        at akka.event.LoggingBus$$anonfun$4.apply(Logging.scala:115)
        at akka.event.LoggingBus$$anonfun$4.apply(Logging.scala:110)
        at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
        at akka.event.LoggingBus$class.startDefaultLoggers(Logging.scala:110)
        at akka.event.EventStream.startDefaultLoggers(EventStream.scala:26)
        at akka.actor.LocalActorRefProvider.init(ActorRefProvider.scala:623)
        at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:157)
        at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:620)
        at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:617)
        at akka.actor.ActorSystemImpl._start(ActorSystem.scala:617)
        at akka.actor.ActorSystemImpl.start(ActorSystem.scala:634)
        at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
        at akka.actor.ActorSystem$.apply(ActorSystem.scala:119)
        at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
        at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
        at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979)
        at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:266)
        at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
        at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:457)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2304)

Thanks,
Adnan Ahmed
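The [TIMEOUT] in the exception means the Slf4jLogger actor did not reply with LoggerInitialized within Akka's logger startup window, which can happen intermittently on a heavily loaded or GC-pausing driver host. One hedged mitigation, assuming you can place an application.conf on the driver's classpath, is to raise Akka's logger startup timeout (the default is a few seconds; 30s below is an illustrative value, not a recommendation):

```
# application.conf on the driver classpath (sketch)
akka {
  # give the Slf4jLogger actor more time to answer LoggerInitialized
  # on a busy or GC-pausing host
  logger-startup-timeout = 30s
}
```

Whether Spark's embedded Akka picks this file up depends on your Spark version and classpath layout, so treat this as a direction to investigate rather than a confirmed fix.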
Re: Storing Compressed data in HDFS into Spark
I believe spark.rdd.compress requires the data to be serialized. In my case, the data is already compressed, and it becomes decompressed as I try to cache it. I believe that even when I set spark.rdd.compress to true, Spark will still decompress the data, then serialize it, and then compress the serialized data. Although Parquet is an option, I believe it only makes sense when running Spark SQL. If I am using GraphX or MLlib, will it help?

Thanks,
Adnan Haider
B.S. Candidate, Computer Science
Illinois Institute of Technology

On Thu, Oct 22, 2015 at 7:15 AM, Igor Berman <igor.ber...@gmail.com> wrote:
> check spark.rdd.compress
>
> On 19 October 2015 at 21:13, ahaider3 <ahaid...@hawk.iit.edu> wrote:
>> Hi,
>> A lot of the data I have in HDFS is compressed. I noticed that when I load this data into Spark and cache it, Spark unrolls the data as usual but stores it uncompressed in memory. For example, suppose data is an RDD with compressed partitions on HDFS. I then cache the data. When I call data.count(), the data is rightly decompressed, since it needs to find the value of .count(). But the data that is cached is also decompressed. Can a partition be compressed in Spark? I know Spark allows data to be compressed after serialization. But what if I only want the partitions compressed?
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Storing-Compressed-data-in-HDFS-into-Spark-tp25123.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
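For reference, spark.rdd.compress only takes effect for partitions cached in a serialized storage level (e.g. StorageLevel.MEMORY_ONLY_SER passed to rdd.persist), not for the default deserialized MEMORY_ONLY that plain cache() uses. A hedged spark-defaults.conf sketch, with illustrative values:

```
# spark-defaults.conf (sketch) — compression of cached blocks only
# applies when the RDD is persisted with a serialized storage level
# such as MEMORY_ONLY_SER
spark.rdd.compress            true
spark.io.compression.codec    snappy
```

This still means decompress-then-reserialize-then-recompress on the caching path, as described above; it does not keep the original HDFS compression format in memory.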
Re: how to set spark.executor.memory and heap size
You need to use the proper URL format:

file://home/wxhsdp/spark/example/standalone/README.md

On Thu, Apr 24, 2014 at 1:29 PM, wxhsdp <wxh...@gmail.com> wrote:
> i think maybe it's a problem with reading a local file:
>
> val logFile = "/home/wxhsdp/spark/example/standalone/README.md"
> val logData = sc.textFile(logFile).cache()
>
> if i replace the above code with
>
> val logData = sc.parallelize(Array(1, 2, 3, 4)).cache()
>
> the job completes successfully. can't i read a file located on the local file system? anyone know the reason?
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4740.html
Re: how to set spark.executor.memory and heap size
Sorry, wrong format:

file:///home/wxhsdp/spark/example/standalone/README.md

An extra / is needed at the start.

On Thu, Apr 24, 2014 at 1:46 PM, Adnan Yaqoob <nsyaq...@gmail.com> wrote:
> You need to use the proper URL format:
>
> file://home/wxhsdp/spark/example/standalone/README.md
>
> On Thu, Apr 24, 2014 at 1:29 PM, wxhsdp <wxh...@gmail.com> wrote:
>> i think maybe it's a problem with reading a local file:
>>
>> val logFile = "/home/wxhsdp/spark/example/standalone/README.md"
>> val logData = sc.textFile(logFile).cache()
>>
>> if i replace the above code with
>>
>> val logData = sc.parallelize(Array(1, 2, 3, 4)).cache()
>>
>> the job completes successfully. can't i read a file located on the local file system? anyone know the reason?
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4740.html
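Why the extra slash matters is easy to see with java.net.URI: in file://home/..., the first path segment after // is parsed as the URI authority (a host name), so the path silently loses its first component. A small stand-alone check using the paths from this thread:

```scala
import java.net.URI

// Two slashes: "home" is parsed as the authority (host), and the
// path starts at /wxhsdp — not what was intended.
val bad = new URI("file://home/wxhsdp/spark/example/standalone/README.md")

// Three slashes: the authority is empty and the full path survives.
val good = new URI("file:///home/wxhsdp/spark/example/standalone/README.md")

println(bad.getAuthority) // home
println(bad.getPath)      // /wxhsdp/spark/example/standalone/README.md
println(good.getPath)     // /home/wxhsdp/spark/example/standalone/README.md
```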
Re: SparkPi performance-3 cluster standalone mode
Hi,

I am relatively new to Spark and have tried running the SparkPi example on a standalone three-machine cluster with 12 cores. What I'm failing to understand is that running this example with a single slice gives better performance than using 12 slices. The same was the case when I was using the parallelize function: the time scales almost linearly with each added slice. Please let me know if I'm doing anything wrong. The code snippet is given below:

Regards,
Ahsan Ijaz

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkPi-performance-3-cluster-standalone-mode-tp4530p4751.html
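The snippet itself did not survive the archive. Stripped of the RDD plumbing, SparkPi's Monte Carlo estimate amounts to the plain Scala below (constants illustrative; in Spark the count would run over sc.parallelize(1 to n, slices)). With so little work per element, each extra slice mostly adds task launch and scheduling overhead, which is one plausible reason a single slice looks faster here:

```scala
import scala.util.Random

// Monte Carlo estimate of pi: throw n random darts at the unit square
// and count how many land inside the unit circle.
val n = 100000
val rng = new Random(42) // fixed seed so the estimate is reproducible
val inside = (1 to n).count { _ =>
  val x = rng.nextDouble() * 2 - 1
  val y = rng.nextDouble() * 2 - 1
  x * x + y * y < 1
}
val pi = 4.0 * inside / n
println(f"Pi is roughly $pi%.3f")
```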
Re: how to set spark.executor.memory and heap size
I faced this issue when I was testing Spark. It is not related to a memory shortage; it happens because your configuration is not correct. Try passing your current jar to the SparkContext with SparkConf's setJars function and try again.

On Thu, Apr 24, 2014 at 8:38 AM, wxhsdp <wxh...@gmail.com> wrote:
> by the way, the code runs ok in the spark shell
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4720.html
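A hedged sketch of that setup (the master URL, app name, and jar path are placeholders; this needs a real cluster to run):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ship the application jar to the executors so worker-side closures can
// load your classes. Without it, tasks can fail in ways that look like
// memory problems, while the same code works in the spark shell (which
// handles class shipping for you).
val conf = new SparkConf()
  .setMaster("spark://master:7077")                      // placeholder
  .setAppName("MyApp")                                   // placeholder
  .setJars(Seq("target/scala-2.10/myapp-assembly.jar"))  // placeholder path
val sc = new SparkContext(conf)
```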
Re: Access Last Element of RDD
You can use the following code:

RDD.take(RDD.count())

On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna <ansaiprasa...@gmail.com> wrote:
> Hi All, some help!
>
> RDD.first or RDD.take(1) gives the first item. Is there a straightforward way to access the last element in a similar way? I couldn't find a tail/last method for RDD.
Re: Access Last Element of RDD
This function returns a Scala List; you can use the List's last function to get the last element. For example:

RDD.take(RDD.count()).last

On Thu, Apr 24, 2014 at 10:28 AM, Sai Prasanna <ansaiprasa...@gmail.com> wrote:
> Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD. I want only to access the last element.
>
> On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna <ansaiprasa...@gmail.com> wrote:
>> Oh ya, thanks Adnan.
>>
>> On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob <nsyaq...@gmail.com> wrote:
>>> You can use the following code:
>>>
>>> RDD.take(RDD.count())
>>>
>>> On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna <ansaiprasa...@gmail.com> wrote:
>>>> Hi All, some help!
>>>>
>>>> RDD.first or RDD.take(1) gives the first item. Is there a straightforward way to access the last element in a similar way? I couldn't find a tail/last method for RDD.
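Note that RDD.take(RDD.count()) ships the entire dataset to the driver just to read one value. A cheaper pattern, assuming partition order is meaningful for your RDD, is to keep only the last element of each partition — e.g. rdd.mapPartitions(it => if (it.hasNext) Iterator(it.toSeq.last) else Iterator.empty).collect().last — so only one element per partition crosses the wire. The same idea with plain Scala collections standing in for partitions (data is hypothetical):

```scala
// Each inner Seq stands in for one RDD partition.
val partitions = Seq(Seq(1, 2, 3), Seq(4, 5), Seq.empty[Int], Seq(6, 7, 8))

// take(count()).last equivalent: materialize everything, keep one value.
val viaTakeAll = partitions.flatten.last

// mapPartitions equivalent: one candidate per non-empty partition,
// then the last of those — far less data moves to the driver.
val viaLastPerPartition = partitions.filter(_.nonEmpty).map(_.last).last

println(viaTakeAll)          // 8
println(viaLastPerPartition) // 8
```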
Re: Executing spark jobs with predefined Hadoop user
You need to use a proper HDFS URI with saveAsTextFile. For example:

rdd.saveAsTextFile("hdfs://NameNode:Port/tmp/Iris/output.tmp")

Regards,
Adnan

Asaf Lahav wrote:
> Hi,
> We are using Spark with data files on HDFS. The files are stored as files for the predefined Hadoop user (hdfs). The folder is permitted with
> - read, write, and execute permission for the hdfs user
> - execute and read permission for users in the group
> - just read permission for all other users
>
> Now the Spark write operation fails, due to a mismatch between the Spark context's user and the Hadoop user permissions. Is there a way to start the Spark context with another user than the one configured on the local machine?
>
> Please see the technical details below. The permissions on the HDFS folder /tmp/Iris are as follows:
>
> drwxr-xr-x - hdfs hadoop 0 2014-04-10 14:12 /tmp/Iris
>
> The Spark context is initiated on my local machine, and according to the configured HDFS permissions rwxr-xr-x there is no problem in loading the HDFS file into an RDD:
>
> final JavaRDD<String> rdd = sparkContext.textFile(filePath);
>
> But saving the resulting RDD back to Hadoop:
>
> rdd.saveAsTextFile("/tmp/Iris/output");
>
> results in the following Hadoop security exception:
>
> org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=halbani, access=WRITE, inode=/tmp/Iris:hdfs:hadoop:drwxr-xr-x
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
>         at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1428)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:332)
>         at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1126)
>         at org.apache.hadoop.mapred.FileOutputCommitter.setupJob(FileOutputCommitter.java:52)
>         at org.apache.hadoop.mapred.SparkHadoopWriter.preSetup(SparkHadoopWriter.scala:65)
>         at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:713)
>         at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:686)
>         at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:572)
>         at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:894)
>         at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:355)
>         at org.apache.spark.api.java.JavaRDD.saveAsTextFile(JavaRDD.scala:27)
>         at org.apache.spark.reader.FileSpliter.split(FileSpliter.java:73)
>         at org.apache.spark.reader.FileReaderMain.main(FileReaderMain.java:17)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
> Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=halbani, access=WRITE, inode=/tmp/Iris:hdfs:hadoop:drwxr-xr-x
>         at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:225)
>         at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:205)
>         at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:151)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5951)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5924)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:2628)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:2593)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.mkdirs(NameNode.java:927)
>         at sun.reflect.GeneratedMethodAccessor132.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke
Re: Executing spark jobs with predefined Hadoop user
Then the problem is not on the Spark side. You have three options; choose any one of them:

1. Change the permissions on the /tmp/Iris folder from a shell on the NameNode with the hdfs dfs -chmod command.
2. Run your Hadoop service as the hdfs user.
3. Disable dfs.permissions in conf/hdfs-site.xml.

Regards,
Adnan

avito wrote:
> Thanks Adnan for the quick answer. You are absolutely right, we are indeed using the entire HDFS URI. Just for the post I removed the name node details.
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executing-spark-jobs-with-predefined-Hadoop-user-tp4059p4063.html
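For option 3, a hedged hdfs-site.xml fragment would look like the following (property name as used in the Hadoop version discussed in this thread; note this disables permission checking cluster-wide, so it is only reasonable on development clusters):

```xml
<!-- conf/hdfs-site.xml (sketch): turn off HDFS permission checks entirely -->
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
```

Option 1 (hdfs dfs -chmod on /tmp/Iris) is the narrowest change and usually the safest of the three.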
How to execute a function from a class in a distributed jar on each worker node?
Hello,

I am running a 4-node Cloudera cluster with 1 master and 3 slaves. I am connecting to the Spark master from Scala using SparkContext. I am trying to execute a simple Java function from the distributed jar on every Spark worker, but I haven't found a way to communicate with each worker or a Spark API function to do it. Can somebody help me with this or point me in the right direction?

Regards,
Adnan

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-execute-a-function-from-class-in-distributed-jar-on-each-worker-node-tp3870.html
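There is no public "run this once on every worker" API; the usual workaround is to spread a dummy RDD with one partition per worker and invoke the function inside foreachPartition. A hedged sketch (workerCount and the class/function names are placeholders, and the one-partition-per-worker assumption only holds if the scheduler actually spreads the tasks; this needs a live cluster to run):

```scala
// Spread one task per worker and call the jar's function from each task.
// The function runs inside the executor JVM, where the distributed jar
// is on the classpath.
val workerCount = 3 // placeholder: number of Spark workers
sc.parallelize(1 to workerCount, workerCount).foreachPartition { _ =>
  com.example.MyDistributedClass.myFunction() // hypothetical entry point
}
```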