RE: Exiting driver main() method...
No, you don't need to do anything special. Perhaps your application is getting stuck somewhere? If you can share your code, someone may be able to help. Mohammed From: James Carman [mailto:ja...@carmanconsulting.com] Sent: Friday, May 1, 2015 5:53 AM To: user@spark.apache.org Subject: Exiting driver main() method... In all the examples, it seems that the Spark application doesn't really do anything special in order to exit. When I run my application, however, the spark-submit script just hangs there at the end. Is there something special I need to do to get that thing to exit normally?
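One common cause of this symptom is a SparkContext (or StreamingContext) that is never stopped, or a stray non-daemon thread keeping the JVM alive. A minimal sketch of an explicit shutdown at the end of main() -- the names here are illustrative, not taken from James's application -- looks like this:

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("my-app")   // hypothetical app name
    val sc = new SparkContext(conf)
    try {
      // ... the actual work ...
    } finally {
      sc.stop()   // release executors and shut down the driver's services
    }
    // if a non-daemon thread still keeps the JVM alive, sys.exit(0) forces the exit
  }
}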
Re: real time Query engine Spark-SQL on Hbase
Hi, Thanks for the reply. Hbase cli takes less than 500 ms for the same query. I am running a simple query, i.e. Select * from Customers where c_id='123123'. Why would the same query which takes 500 ms at the Hbase cli end up taking around 8 secs via Spark-Sql? I am unable to understand this. Thanks, Siddharth From: ayan guha guha.a...@gmail.com Sent: 01 May 2015 04:38 To: Ted Yu Cc: user@spark.apache.org; Siddharth Ubale; matei.zaha...@gmail.com; Prakash Hosalli; Amit Kumar Subject: Re: real time Query engine Spark-SQL on Hbase And if I may ask, how long does it take in the hbase CLI? I would not expect spark to improve performance of hbase. At best spark will push down the filter to hbase. So I would try to optimise any additional overhead like bringing data into spark. On 1 May 2015 00:56, Ted Yu yuzhih...@gmail.com wrote: bq. a single query on one filter criteria Can you tell us more about your filter ? How selective is it ? Which hbase release are you using ? Cheers On Thu, Apr 30, 2015 at 7:23 AM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi, I want to use Spark as a query engine on HBase with sub-second latency. I am using Spark 1.3 version. And followed the steps below on an HBase table with around 3.5 lakh (350,000) rows: 1. Mapped the DataFrame to the HBase table. RDDCustomers maps to the HBase table which is used to create the DataFrame. DataFrame schemaCustomers = sqlInstance.createDataFrame(SparkContextImpl.getRddCustomers(), Customers.class); 2. Used registerTempTable, i.e. schemaCustomers.registerTempTable("customers"); 3. Running the query on the DataFrame using the SQLContext instance. What I am observing is that for a single query on one filter criterion the query is taking 7-8 seconds, and the time increases as I increase the number of rows in the HBase table. Also, there was one time when I was getting query responses under 1-2 seconds. Seems like strange behavior. Is this expected behavior from Spark or am I missing something here? Can somebody help me understand this scenario. Please assist. Thanks, Siddharth Ubale,
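For context, the three steps Siddharth describes amount to roughly the sketch below (Scala with hypothetical names -- the original code is Java, and the Customer case class here is a stand-in for the real Customers bean). Unless the underlying RDD or data source pushes the c_id predicate down to HBase, Spark has to pull all rows out of HBase and filter them itself, which would be consistent with a multi-second scan where the HBase CLI point-get takes milliseconds:

import org.apache.spark.sql.SQLContext
case class Customer(c_id: String, name: String)          // stand-in for the real Customers bean
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// 1. DataFrame over the RDD backed by the HBase table (rddCustomers: RDD[Customer] is assumed to exist)
val schemaCustomers = rddCustomers.toDF()
// 2. register it as a temporary table
schemaCustomers.registerTempTable("customers")
// 3. run the query; the filter is applied by Spark after the scan unless pushdown is in place
sqlContext.sql("SELECT * FROM customers WHERE c_id = '123123'").show()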
com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:
Hi, I am using Spark 1.2.0 with Kryo serialization, but I get the following exception. java.io.IOException: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1 I would appreciate it if anyone could tell me how I can resolve this. best, /Shahab
not getting any mail
Hi All I am not getting any mail from this community?
Remoting warning when submitting to cluster
Hello all!! We've been prototyping some spark applications to read messages from Kafka topics. The application is quite simple, we use KafkaUtils.createStream to receive a stream of CSV messages from a Kafka Topic. We parse the CSV and count the number of messages we get in each RDD. At a high-level (removing the abstractions of our application), it looks like this: val sc = new SparkConf() .setAppName(appName) .set("spark.executor.memory", "1024m") .set("spark.cores.max", "3") .set("spark.app.name", appName) .set("spark.ui.port", sparkUIPort) val ssc = new StreamingContext(sc, Milliseconds(emitInterval.toInt)) KafkaUtils .createStream(ssc, zookeeperQuorum, consumerGroup, topicMap) .map(_._2) .foreachRDD( (rdd: RDD, time: Time) => { println("Time %s: (%s total records)".format(time, rdd.count())) }) When I submit this with the spark master set to local[3] everything behaves as I'd expect. After some startup overhead, I'm seeing the count printed to be the same as the count I'm simulating (1 every second for example). When I submit this to a spark master using spark://master.host:7077, the behavior is different. The overhead to start receiving seems longer and on some runs I don't see anything for 30 seconds even though my simulator is sending messages to the topic. I also see the following error written to stderr by every executor assigned to the job: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/05/01 10:11:38 INFO SecurityManager: Changing view acls to: username 15/05/01 10:11:38 INFO SecurityManager: Changing modify acls to: username 15/05/01 10:11:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(javi4211); users with modify permissions: Set(username) 15/05/01 10:11:38 INFO Slf4jLogger: Slf4jLogger started 15/05/01 10:11:38 INFO Remoting: Starting remoting 15/05/01 10:11:39 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@master.host:56534] 15/05/01 10:11:39 INFO Utils: Successfully started service 'driverPropsFetcher' on port 56534. 15/05/01 10:11:40 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@driver.host:51837]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: no further information: driver.host/10.27.51.214:51837 15/05/01 10:12:09 ERROR UserGroupInformation: PriviledgedActionException as:username cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread main java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:128) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:224) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) ...
4 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) Is there something else I need to configure to ensure akka remoting will work correctly when running on a spark cluster? Or can I ignore this error? -Javier -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Remoting-warning-when-submitting-to-cluster-tp22733.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
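The gated-address warning above suggests the executors cannot open a connection back to the driver at driver.host:51837. One thing worth checking (an assumption on my part, not something confirmed in this thread) is whether the driver is advertising an address and port that the worker machines can actually resolve and reach; in Spark 1.x that is controlled by spark.driver.host and spark.driver.port. A minimal sketch, with placeholder values:

val sc = new SparkConf()
  .setAppName(appName)
  // advertise an address the workers can resolve and connect back to;
  // "driver.reachable.hostname" and 51837 are placeholders, not values from this thread
  .set("spark.driver.host", "driver.reachable.hostname")
  .set("spark.driver.port", "51837")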
Re: Spark - Hive Metastore MySQL driver
Can you try the patch from: [SPARK-6913][SQL] Fixed java.sql.SQLException: No suitable driver found Cheers On Sat, Mar 28, 2015 at 12:41 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: This is from my Hive installation -sh-4.1$ ls /apache/hive/lib | grep derby derby-10.10.1.1.jar derbyclient-10.10.1.1.jar derbynet-10.10.1.1.jar -sh-4.1$ ls /apache/hive/lib | grep datanucleus datanucleus-api-jdo-3.2.6.jar datanucleus-core-3.2.10.jar datanucleus-rdbms-3.2.9.jar -sh-4.1$ ls /apache/hive/lib | grep mysql mysql-connector-java-5.0.8-bin.jar -sh-4.1$ $ hive --version Hive 0.13.0.2.1.3.6-2 Subversion git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6 -r 87da9430050fb9cc429d79d95626d26ea382b96c $ On Sat, Mar 28, 2015 at 1:05 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I tried with a different version of driver but same error ./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar, */home/dvasthimal/spark1.3/mysql-connector-java-5.0.8-bin.jar* --files $SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2 On Sat, Mar 28, 2015 at 12:47 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: This is what am seeing ./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar --files $SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2 Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver (com.mysql.jdbc.Driver) was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver. 
./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,*/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar *--files $SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2 Caused by: java.sql.SQLException: No suitable driver found for jdbc:mysql://db_host_name.vip.ebay.com:3306/HDB at java.sql.DriverManager.getConnection(DriverManager.java:596) Looks like the driver jar that I got in is not correct. On Sat, Mar 28, 2015 at 12:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Could someone please share the spark-submit command that shows the mysql jar containing the driver class used to connect to the Hive MySQL metastore. Even after including it through
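A workaround that is often suggested for the "No suitable driver found" symptom (an assumption here -- it is not confirmed later in this thread) is to put the MySQL connector on the driver's classpath as well, since the metastore connection is opened by the driver-side Hive client and --jars alone may not place the jar where java.sql.DriverManager can see it. A sketch, reusing the paths from the command above:

./bin/spark-submit -v --master yarn-cluster \
  --driver-class-path /home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar:/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:... \
  --jars ...,/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar \
  --files $SPARK_HOME/conf/hive-site.xml \
  --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar ...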
Re: How to add a column to a spark RDD with many columns?
val newRdd = myRdd.map(row => row ++ Array((row(1).toLong * row(199).toLong).toString)) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22735.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Exiting driver main() method...
It used to exit without any problem for me. You can basically check in the driver UI (that runs on port 4040) and see what exactly it's doing. Thanks Best Regards On Fri, May 1, 2015 at 6:22 PM, James Carman ja...@carmanconsulting.com wrote: In all the examples, it seems that the spark application doesn't really do anything special in order to exit. When I run my application, however, the spark-submit script just hangs there at the end. Is there something special I need to do to get that thing to exit normally?
empty jdbc RDD in spark
Hi all! I am trying to read a hana database using the spark JdbcRDD; here is my code: def readFromHana() { val conf = new SparkConf() conf.setAppName("test").setMaster("local") val sc = new SparkContext(conf) val rdd = new JdbcRDD(sc, () => { Class.forName("com.sap.db.jdbc.Driver").newInstance() DriverManager.getConnection("jdbc:sap://54.69.200.113:30015/?currentschema=LIVE2", "mujadid", "786Xyz123") }, "SELECT * FROM MEMBERS LIMIT ? OFFSET ?", 0, 100, 1, (r: ResultSet) => convert(r) ) println(rdd.count()); sc.stop() } def convert(rs: ResultSet): String = { val rsmd = rs.getMetaData() val numberOfColumns = rsmd.getColumnCount() var i = 1 val row = new StringBuilder while (i <= numberOfColumns) { row.append(rs.getString(i) + ",") i += 1 } row.toString() } The resultant count is 0 Any suggestion? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/empty-jdbc-RDD-in-spark-tp22736.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
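A note on the query shape (a guess based on how JdbcRDD fills in its placeholders, not something confirmed in this thread): JdbcRDD substitutes the lowerBound and upperBound values -- here 0 and 100 -- into the two '?' placeholders of each partition's query, so "LIMIT ? OFFSET ?" can end up as "LIMIT 0 OFFSET 100", which returns no rows. The usual pattern is to range over a numeric key column; a minimal sketch, assuming MEMBERS has a numeric ID column:

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(
  sc,
  () => {
    Class.forName("com.sap.db.jdbc.Driver").newInstance()
    DriverManager.getConnection("jdbc:sap://54.69.200.113:30015/?currentschema=LIVE2", "mujadid", "786Xyz123")
  },
  "SELECT * FROM MEMBERS WHERE ID >= ? AND ID <= ?",   // ID is an assumed numeric key column
  1, 100, 1,                                           // lowerBound, upperBound, numPartitions
  (r: ResultSet) => convert(r)
)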
Re: Drop a column from the DataFrame.
Just use select() to create a new DataFrame with only the columns you want -- sort of the opposite of a drop, in that you select everything except the one column you don't want. You could even filter that column out on the fly: myDF.select(myDF.columns.filter(_ != "column_you_do_not_want").map(colname => new Column(colname)).toList : _* ) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Drop-a-column-from-the-DataFrame-tp22711p22737.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark.logConf with log4j.rootCategory=WARN
It could be. Thanks Best Regards On Fri, May 1, 2015 at 9:11 PM, roy rp...@njit.edu wrote: Hi, I have recently enabled log4j.rootCategory=WARN, console in the spark configuration, but after that spark.logConf=true has become ineffective. So I just want to confirm if this is because of log4j.rootCategory=WARN ? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-logConf-with-log4j-rootCategory-WARN-tp22731.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
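For what it's worth (an inference, not confirmed in this thread): spark.logConf makes the SparkContext print the configuration at INFO level, so a WARN root logger will suppress it. A sketch of a log4j.properties that keeps the root at WARN but lets that one logger through, assuming the message comes from the org.apache.spark.SparkContext logger:

log4j.rootCategory=WARN, console
# let the configuration dump (logged at INFO by SparkContext) still appear
log4j.logger.org.apache.spark.SparkContext=INFO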
Re: Spark Streaming Kafka Avro NPE on deserialization of payload
There was a similar discussion over here http://mail-archives.us.apache.org/mod_mbox/spark-user/201411.mbox/%3ccakz4c0s_cuo90q2jxudvx9wc4fwu033kx3-fjujytxxhr7p...@mail.gmail.com%3E Thanks Best Regards On Fri, May 1, 2015 at 7:12 PM, Todd Nist tsind...@gmail.com wrote: *Resending as I do not see that this made it to the mailing list; sorry if in fact it did and it is just not reflected online yet.* I'm very perplexed by the following. I have a set of AVRO generated objects that are sent to a SparkStreaming job via Kafka. The SparkStreaming job follows the receiver-based approach. I am encountering the below error when I attempt to deserialize the payload: 15/04/30 17:49:25 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 9 to sparkExecutor@192.168.1.3:61051 15/04/30 17:49:25 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 9 is 140 bytes 15/04/30 17:49:25 ERROR TaskResultGetter: Exception while getting task result com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: relations (com.opsdatastore.model.ObjectDetails) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:621) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:379) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NullPointerException at org.apache.avro.generic.GenericData$Array.add(GenericData.java:200) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) ... 17 more 15/04/30 17:49:25 INFO TaskSchedulerImpl: Removed TaskSet 20.0, whose tasks have all completed, from pool Basic code looks like this.
Register the class with Kryo as follows: val sc = new SparkConf(true) .set("spark.streaming.unpersist", "true") .setAppName("StreamingKafkaConsumer") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // register all related AVRO generated classes sc.registerKryoClasses(Array( classOf[ConfigurationProperty], classOf[Event], classOf[Identifier], classOf[Metric], classOf[ObjectDetails], classOf[Relation], classOf[RelationProperty] )) Use the receiver-based approach to consume messages from Kafka: val messages = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topics, storageLevel) Now process the received messages: val raw = messages.map(_._2) val dStream = raw.map( byte => { // Avro Decoder println("Byte length: " + byte.length) val decoder = new AvroDecoder[ObjectDetails](schema = ObjectDetails.getClassSchema) val message = decoder.fromBytes(byte) println(s"AvroMessage : Type : ${message.getType}, Payload : $message") message } ) When I look in the logs of the workers, in standard out I can see the messages being printed; in fact I'm even able to access the Type field without issue: Byte length: 315 AvroMessage : Type : Storage, Payload : {name: Storage 1, type: Storage, vendor: 6274g51cbkmkqisk, model: lk95hqk9m10btaot, timestamp: 1430428565141, identifiers: {ID: {name: ID, value: Storage-1}}, configuration: null, metrics: {Disk Space Usage (GB): {name: Disk Space Usage
Re: Spark worker error on standalone cluster
Thanks Akhil, I am trying to investigate this path. The spark is the same, but maybe there is a difference in Hadoop. On Sat, May 2, 2015 at 6:25 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Just make sure you are having the same version of spark in your cluster and the project's build file. Thanks Best Regards On Fri, May 1, 2015 at 2:43 PM, Michael Ryabtsev (Totango) mich...@totango.com wrote: Hi everyone, I have a spark application that works fine on a standalone Spark cluster that runs on my laptop (master and one worker), but fails when I try to run it on a standalone Spark cluster deployed on EC2 (master and worker are on different machines). The application structure goes in the following way: There is a java process ('message processor') that runs on the same machine as the Spark master. When it starts, it submits itself to the Spark master; then it listens on SQS and on each received message it should run a spark job to process a file from S3, whose address is configured in the message. It looks like all this fails at the point where the Spark driver tries to send the job to the Spark executor. Below is the code from the 'message processor' that configures the SparkContext, then the Spark driver log, and then the Spark executor log. The outputs of my code and some important points are marked in bold and I've simplified the code and logs in some places for the sake of readability. Would appreciate your help very much, because I've run out of ideas with this problem. 'message processor' code: === logger.info("Started Integration Hub SubmitDriver in test mode."); SparkConf sparkConf = new SparkConf() .setMaster(SPARK_MASTER_URI) .setAppName(APPLICATION_NAME) .setSparkHome(SPARK_LOCATION_ON_EC2_MACHINE); sparkConf.setJars(JavaSparkContext.jarOfClass(this.getClass())); // configure spark executor to use log4j properties located in the local spark conf dir sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j_integrationhub_sparkexecutor.properties"); sparkConf.set("spark.executor.memory", "1g"); sparkConf.set("spark.cores.max", "3"); // Spill shuffle to disk to avoid OutOfMemory, at cost of reduced performance sparkConf.set("spark.shuffle.spill", "true"); logger.info("Connecting Spark"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_KEY); sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET); logger.info("Spark connected"); === Driver log: === 2015-05-01 07:47:14 INFO ClassPathBeanDefinitionScanner:239 - JSR-330 'javax.inject.Named' annotation found and supported for component scanning 2015-05-01 07:47:14 INFO AnnotationConfigApplicationContext:510 - Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@5540b23b : startup date [Fri May 01 07:47:14 UTC 2015]; root of context hierarchy 2015-05-01 07:47:14 INFO AutowiredAnnotationBeanPostProcessor:140 - JSR-330 'javax.inject.Inject' annotation found and supported for autowiring 2015-05-01 07:47:14 INFO DefaultListableBeanFactory:596 - Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@13f948e : defining beans
[org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,integrationHubConfig,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,processorInlineDriver,s3Accessor,cdFetchUtil,httpUtil,cdPushUtil,submitDriver,databaseLogger,connectorUtil,totangoDataValidations,environmentConfig,sesUtil,processorExecutor,processorDriver]; root of factory hierarchy *2015-05-01 07:47:15 INFO SubmitDriver:69 - Started Integration Hub SubmitDriver in test mode. 2015-05-01 07:47:15 INFO SubmitDriver:101 - Connecting Spark *2015-05-01 07:47:15 INFO SparkContext:59 - Running Spark version 1.3.0 2015-05-01 07:47:16 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2015-05-01 07:47:16 INFO SecurityManager:59 - Changing view acls to: hadoop 2015-05-01 07:47:16 INFO SecurityManager:59 - Changing modify acls to: hadoop 2015-05-01 07:47:16 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls
spark filestream problem
it seems that on Spark Streaming 1.2 the filestream API may have a bug - it doesn't detect new files when moving or renaming them on HDFS - only when copying them, but that leads to a well-known problem with .tmp files which get removed and make the spark streaming filestream throw an exception
sparkR equivalent to SparkContext.newAPIHadoopRDD?
Hi gang, I'm giving SparkR a test drive and am bummed to discover that the SparkContext API in SparkR is only a subset of what's available in stock Spark. Specifically, I need to be able to pull data from Accumulo into SparkR. I can do it with stock Spark but can't figure out how to make the magic happen with SparkR. Anyone got any ideas? thanks! DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com www.AnnaiSystems.com
Problem in Standalone Mode
When I run my program with spark-submit everything is OK, but when I try to run it in standalone mode I get the following exception: ((This is with val df = sqlContext.jsonFile("./datos.json") )) java.io.EOFException [error] at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2744) This is the SparkConf: val sparkConf = new SparkConf().setAppName("myApp") .setMaster("spark://master:7077") .setSparkHome("/usr/local/spark/") .setJars(Seq("./target/scala-2.10/myApp.jar")) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Problem-in-Standalone-Mode-tp22741.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Help with publishing to Kafka from Spark Streaming?
Here is the pull request, you may refer to this: https://github.com/apache/spark/pull/2994 Thanks Jerry 2015-05-01 14:38 GMT+08:00 Pavan Sudheendra pavan0...@gmail.com: Link to the question: http://stackoverflow.com/questions/29974017/spark-kafka-producer-not-serializable-exception Thanks for any pointers.
Re: how to pass configuration properties from driver to executor?
In fact, sparkConf.set("spark.whateverPropertyYouWant", "Value") gets shipped to the executors. Thanks Best Regards On Fri, May 1, 2015 at 2:55 PM, Michael Ryabtsev mich...@totango.com wrote: Hi, We've had a similar problem, but with a log4j properties file. The only working way we've found was externally deploying the properties file on the worker machine to the spark conf folder and configuring the executor jvm options with: sparkConf.set("spark.executor.extraJavaOptions", "-Dlog4j.configuration=log4j_integrationhub_sparkexecutor.properties"); It is done in our case from the java code, but I think it can also be a parameter to spark-submit. Regards, Michael. On Fri, May 1, 2015 at 2:26 AM, Tian Zhang tzhang...@yahoo.com wrote: Hi, We have a scenario as below and would like your suggestion. We have an app.conf file with propX=A as default, built into the fat jar file that is provided to spark-submit. We have an env.conf file with propX=B that we would like spark-submit to take as input to overwrite the default and populate to both driver and executors. Note in the executor we are using a package that uses typesafe config to read configuration properties. How do we do that? Thanks. Tian -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-pass-configuration-properties-from-driver-to-executor-tp22728.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
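For the Typesafe Config case Tian describes, one approach (a sketch under assumptions, not a tested recipe) is to ship env.conf to the executors with --files and point the config library at it through -Dconfig.file, which Typesafe Config honors:

val sparkConf = new SparkConf()
  // env.conf is distributed to each executor's working directory via "--files env.conf" on spark-submit
  .set("spark.executor.extraJavaOptions", "-Dconfig.file=env.conf")
  // the driver reads its own local copy; the path here is a placeholder
  .set("spark.driver.extraJavaOptions", "-Dconfig.file=/local/path/env.conf")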
[PSA] Use Stack Overflow!
This mailing list sees a lot of traffic every day. With such a volume of mail, you may find it hard to find discussions you are interested in, and if you are the one starting discussions you may sometimes feel your mail is going into a black hole. We can't change the nature of this mailing list (it is required by the Apache foundation), but there is an alternative in Stack Overflow http://stackoverflow.com/. Stack Overflow has an active Apache Spark tag http://stackoverflow.com/questions/tagged/apache-spark, and Spark committers like Sean Owen http://stackoverflow.com/users/64174/sean-owen and Josh Rosen http://stackoverflow.com/users/590203/josh-rosen are active under that tag, along with several contributors and of course regular users. Try it out! I think when your question fits Stack Overflow's guidelines http://stackoverflow.com/help/on-topic, it is generally better to ask it there than on this mailing list. Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PSA-Use-Stack-Overflow-tp22732.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Generating version agnostic jar path value for --jars clause
I have a list of cloudera jars which I need to provide in the --jars clause, mainly for the HiveContext functionality I am using. However, many of these jars have a version number as part of their names. This leads to an issue that the names might change when I do a Cloudera upgrade. Just a note here, there are many jars which cloudera exposes as a symlink which is the link to the latest version of that jar (e.g. /opt/cloudera/parcels/CDH/lib/parquet/parquet-hadoop-bundle.jar -> parquet-hadoop-bundle-1.5.0-cdh5.3.2.jar), in which case it's good, but there are many jars which aren't. Is there a flexible way to avoid this situation? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Generating-version-agnostic-jar-path-value-for-jars-clause-tp22734.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
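One way around hard-coded versions (a sketch, assuming a parcel layout like the one above; the hive lib path is an example, not taken from the question) is to build the --jars value from a shell glob at submit time, so the versioned names are resolved on the current host:

JARS=$(ls /opt/cloudera/parcels/CDH/lib/parquet/parquet-hadoop-bundle*.jar \
          /opt/cloudera/parcels/CDH/lib/hive/lib/datanucleus-*.jar | paste -sd, -)
spark-submit --jars "$JARS" ...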
Re: Spark worker error on standalone cluster
Just make sure you are having the same version of spark in your cluster and the project's build file. Thanks Best Regards On Fri, May 1, 2015 at 2:43 PM, Michael Ryabtsev (Totango) mich...@totango.com wrote: Hi everyone, I have a spark application that works fine on a standalone Spark cluster that runs on my laptop (master and one worker), but fails when I try to run it on a standalone Spark cluster deployed on EC2 (master and worker are on different machines). The application structure goes in the following way: There is a java process ('message processor') that runs on the same machine as the Spark master. When it starts, it submits itself to the Spark master; then it listens on SQS and on each received message it should run a spark job to process a file from S3, whose address is configured in the message. It looks like all this fails at the point where the Spark driver tries to send the job to the Spark executor. Below is the code from the 'message processor' that configures the SparkContext, then the Spark driver log, and then the Spark executor log. The outputs of my code and some important points are marked in bold and I've simplified the code and logs in some places for the sake of readability. Would appreciate your help very much, because I've run out of ideas with this problem. 'message processor' code: === logger.info("Started Integration Hub SubmitDriver in test mode."); SparkConf sparkConf = new SparkConf() .setMaster(SPARK_MASTER_URI) .setAppName(APPLICATION_NAME) .setSparkHome(SPARK_LOCATION_ON_EC2_MACHINE); sparkConf.setJars(JavaSparkContext.jarOfClass(this.getClass())); // configure spark executor to use log4j properties located in the local spark conf dir sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j_integrationhub_sparkexecutor.properties"); sparkConf.set("spark.executor.memory", "1g"); sparkConf.set("spark.cores.max", "3"); // Spill shuffle to disk to avoid OutOfMemory, at cost of reduced performance sparkConf.set("spark.shuffle.spill", "true"); logger.info("Connecting Spark"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_KEY); sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET); logger.info("Spark connected"); === Driver log: === 2015-05-01 07:47:14 INFO ClassPathBeanDefinitionScanner:239 - JSR-330 'javax.inject.Named' annotation found and supported for component scanning 2015-05-01 07:47:14 INFO AnnotationConfigApplicationContext:510 - Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@5540b23b : startup date [Fri May 01 07:47:14 UTC 2015]; root of context hierarchy 2015-05-01 07:47:14 INFO AutowiredAnnotationBeanPostProcessor:140 - JSR-330 'javax.inject.Inject' annotation found and supported for autowiring 2015-05-01 07:47:14 INFO DefaultListableBeanFactory:596 - Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@13f948e : defining beans
[org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,integrationHubConfig,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,processorInlineDriver,s3Accessor,cdFetchUtil,httpUtil,cdPushUtil,submitDriver,databaseLogger,connectorUtil,totangoDataValidations,environmentConfig,sesUtil,processorExecutor,processorDriver]; root of factory hierarchy *2015-05-01 07:47:15 INFO SubmitDriver:69 - Started Integration Hub SubmitDriver in test mode. 2015-05-01 07:47:15 INFO SubmitDriver:101 - Connecting Spark *2015-05-01 07:47:15 INFO SparkContext:59 - Running Spark version 1.3.0 2015-05-01 07:47:16 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2015-05-01 07:47:16 INFO SecurityManager:59 - Changing view acls to: hadoop 2015-05-01 07:47:16 INFO SecurityManager:59 - Changing modify acls to: hadoop 2015-05-01 07:47:16 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop) 2015-05-01 07:47:18 INFO Slf4jLogger:80 - Slf4jLogger started 2015-05-01 07:47:18 INFO Remoting:74 -
spark filestream problem
it seems that on Spark Streaming 1.2 the filestream API may have a bug - it doesn't detect new files when moving or renaming them on HDFS - only when copying them, but that leads to a well-known problem with .tmp files which get removed and make the spark streaming filestream throw an exception -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-filestream-problem-tp22742.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Enabling Event Log
is it working now? On 1 May 2015 at 13:43, James King jakwebin...@gmail.com wrote: Oops! well spotted. Many thanks Shixiong. On Fri, May 1, 2015 at 1:25 AM, Shixiong Zhu zsxw...@gmail.com wrote: spark.history.fs.logDirectory is for the history server. For Spark applications, they should use spark.eventLog.dir. Since you commented out spark.eventLog.dir, it will be /tmp/spark-events. And this folder does not exist. Best Regards, Shixiong Zhu 2015-04-29 23:22 GMT-07:00 James King jakwebin...@gmail.com: I'm unclear why I'm getting this exception. It seems to have realized that I want to enable Event Logging but to be ignoring where I want it to log to, i.e. file:/opt/cb/tmp/spark-events, which does exist. spark-defaults.conf # Example: spark.master spark://master1:7077,master2:7077 spark.eventLog.enabled true spark.history.fs.logDirectory file:/opt/cb/tmp/spark-events # spark.eventLog.dir hdfs://namenode:8021/directory # spark.serializer org.apache.spark.serializer.KryoSerializer # spark.driver.memory 5g # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers=one two three Exception following job submission: spark.eventLog.enabled=true spark.history.fs.logDirectory=file:/opt/cb/tmp/spark-events spark.jars=file:/opt/cb/scripts/spark-streamer/cb-spark-streamer-1.0-SNAPSHOT.jar spark.master=spark://master1:7077,master2:7077 Exception in thread main java.lang.IllegalArgumentException: Log directory /tmp/spark-events does not exist. at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:99) at org.apache.spark.SparkContext.init(SparkContext.scala:399) at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:642) at org.apache.spark.streaming.StreamingContext.init(StreamingContext.scala:75) at org.apache.spark.streaming.api.java.JavaStreamingContext.init(JavaStreamingContext.scala:132) Many Thanks jk
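Putting Shixiong's point back into the file from the first message, the relevant spark-defaults.conf lines would look something like the sketch below (the paths are the ones from this thread; keeping spark.history.fs.logDirectory as well only matters if a history server reads the same directory):

spark.eventLog.enabled           true
spark.eventLog.dir               file:/opt/cb/tmp/spark-events
spark.history.fs.logDirectory    file:/opt/cb/tmp/spark-events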
to split an RDD to multiple ones?
Hi, I have an RDD srdd containing (unordered-)data like this: s1_0, s3_0, s2_1, s2_2, s3_1, s1_3, s1_2, … What I want is (it will be much better if they could be in ascending order): srdd_s1: s1_0, s1_1, s1_2, …, s1_n srdd_s2: s2_0, s2_1, s2_2, …, s2_n srdd_s3: s3_0, s3_1, s3_2, …, s3_n … … Have any idea? Thanks in advance! :) Best, Yifan LI
Re: DataFrame filter referencing error
First of all, thank you for your replies. I was previously doing this via a normal jdbc connection and it worked without problems. Then I liked the idea that sparksql could take care of opening/closing the connection. I tried also with single quotes, since that was my first guess, but it didn't work. I fear I will have to look at the spark code but I'm the only one with this issue. BTW I'm testing with spark 1.3.0 Best, Francesco On Fri, May 1, 2015, 00:54 ayan guha guha.a...@gmail.com wrote: I think you need to specify new in single quotes. My guess is the query showing up in the DB is like ...where status=new. In either case mysql assumes new is a column. What you need is the form below ...where status='new' You need to provide your quotes accordingly. Easiest way would be to do it in a separate jdbc conn to mysql using a simple standalone programme, not in spark. On 1 May 2015 07:47, Burak Yavuz brk...@gmail.com wrote: Is new a reserved word for MySQL? On Thu, Apr 30, 2015 at 2:41 PM, Francesco Bigarella francesco.bigare...@gmail.com wrote: Do you know how I can check that? I googled a bit but couldn't find a clear explanation about it. I also tried to use explain() but it doesn't really help. I still find it unusual that I have this issue only for the equality operator but not for the others. Thank you, F On Wed, Apr 29, 2015 at 3:03 PM ayan guha guha.a...@gmail.com wrote: Looks like your DF is based on a MySQL DB using jdbc, and the error is thrown from mySQL. Can you see what SQL is finally getting fired in MySQL? Spark is pushing down the predicate to mysql so it's not a spark problem per se On Wed, Apr 29, 2015 at 9:56 PM, Francesco Bigarella francesco.bigare...@gmail.com wrote: Hi all, I was testing the DataFrame filter functionality and I found what I think is a strange behaviour. My dataframe testDF, obtained loading a MySQL table via jdbc, has the following schema: root |-- id: long (nullable = false) |-- title: string (nullable = true) |-- value: string (nullable = false) |-- status: string (nullable = false) What I want to do is filter my dataset to obtain all rows that have a status = "new". scala> testDF.filter(testDF("id") === 1234).first() works fine (also with the integer value within double quotes), however if I try to use the same statement to filter on the status column (also with changes in the syntax - see below), suddenly the program breaks.
Any of the following: scala> testDF.filter(testDF("status") === "new") scala> testDF.filter("status = 'new'") scala> testDF.filter($"status" === "new") generates the error: INFO scheduler.DAGScheduler: Job 3 failed: runJob at SparkPlan.scala:121, took 0.277907 s org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 12, node name): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'new' in 'where clause' at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at com.mysql.jdbc.Util.handleNewInstance(Util.java:411) at com.mysql.jdbc.Util.getInstance(Util.java:386) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1052) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3597) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3529) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1990) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2151) at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2625) at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2119) at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2283) at org.apache.spark.sql.jdbc.JDBCRDD$anon$1.init(JDBCRDD.scala:328) at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at
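One thing that may be worth trying (an assumption on my part, not something anyone in the thread confirmed) is building the literal explicitly with lit(), so it cannot end up in the generated SQL as a bare identifier:

import org.apache.spark.sql.functions.lit
testDF.filter(testDF("status") === lit("new")).first()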
Re: empty jdbc RDD in spark
bq. "SELECT * FROM MEMBERS LIMIT ? OFFSET ?", Have you tried dropping the LIMIT and OFFSET clause from the above query ? Cheers On Fri, May 1, 2015 at 1:56 PM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi all! I am trying to read a hana database using the spark JdbcRDD; here is my code: def readFromHana() { val conf = new SparkConf() conf.setAppName("test").setMaster("local") val sc = new SparkContext(conf) val rdd = new JdbcRDD(sc, () => { Class.forName("com.sap.db.jdbc.Driver").newInstance() DriverManager.getConnection("jdbc:sap://54.69.200.113:30015/?currentschema=LIVE2", "mujadid", "786Xyz123") }, "SELECT * FROM MEMBERS LIMIT ? OFFSET ?", 0, 100, 1, (r: ResultSet) => convert(r) ) println(rdd.count()); sc.stop() } def convert(rs: ResultSet): String = { val rsmd = rs.getMetaData() val numberOfColumns = rsmd.getColumnCount() var i = 1 val row = new StringBuilder while (i <= numberOfColumns) { row.append(rs.getString(i) + ",") i += 1 } row.toString() } The resultant count is 0 Any suggestion? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/empty-jdbc-RDD-in-spark-tp22736.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Number of input partitions in SparkContext.sequenceFile
Hi, How did you check the number of splits in your file? Did you run your MR job or calculate it? The formula for split size is max(minSize, min(maxSize, blockSize)). Can you check if it explains your case? Thanks Regards, Archit Thakur. On Saturday, April 25, 2015, Wenlei Xie wenlei@gmail.com wrote: Hi, I checked the number of partitions by System.out.println("INFO: RDD with " + rdd.partitions().size() + " partitions created."); Each single split is about 100MB. I am currently loading the data from the local file system, would this explain this observation? Thank you! Best, Wenlei On Tue, Apr 21, 2015 at 6:28 AM, Archit Thakur archit279tha...@gmail.com wrote: Hi, It should generate the same number of partitions as the number of splits. How did you check the number of partitions? Also please paste your file size and hdfs-site.xml and mapred-site.xml here. Thanks and Regards, Archit Thakur. On Sat, Apr 18, 2015 at 6:20 PM, Wenlei Xie wenlei@gmail.com wrote: Hi, I am wondering about the mechanism that determines the number of partitions created by SparkContext.sequenceFile ? For example, although my file has only 4 splits, Spark would create 16 partitions for it. Is it determined by the file size? Is there any way to control it? (Looks like I can only tune minPartitions but not maxPartitions) Thank you! Best, Wenlei -- Wenlei Xie (谢文磊) Ph.D. Candidate Department of Computer Science 456 Gates Hall, Cornell University Ithaca, NY 14853, USA Email: wenlei@gmail.com
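If the goal is to cap the number of partitions, one knob (an assumption following from the split-size formula above, not something verified against Wenlei's setup) is to raise the minimum split size seen by the old-API Hadoop input format before calling sequenceFile, so FileInputFormat produces fewer, larger splits:

// 256 MB is an arbitrary example value; key/value types and the path are placeholders
sc.hadoopConfiguration.set("mapred.min.split.size", (256L * 1024 * 1024).toString)
val rdd = sc.sequenceFile[String, String]("/path/to/data")
println(rdd.partitions.size)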
Re: How to add a column to a spark RDD with many columns?
Thanks for your reply! It is what I am after. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22740.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: not getting any mail
Looks like there were delays across Apache project mailing lists. Emails are coming through now. On May 2, 2015, at 9:14 AM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All I am not getting any mail from this community? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: spark filestream problem
I have figured it out in the meantime - simply, when moving a file on HDFS it preserves its time stamp, and on the other hand the spark filestream adapter seems to care not so much about filenames as about timestamps - hence NEW files with OLD time stamps will NOT be processed - yuk. The hack you can use is to a) copy the required file to a temp location and then b) move it from there to the dir monitored by spark filestream - this will ensure it has a recent timestamp -Original Message- From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Saturday, May 2, 2015 5:09 PM To: user@spark.apache.org Subject: spark filestream problem it seems that on Spark Streaming 1.2 the filestream API may have a bug - it doesn't detect new files when moving or renaming them on HDFS - only when copying them but that leads to a well-known problem with .tmp files which get removed and make the spark streaming filestream throw an exception -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-filestream-problem -tp22743.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
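A concrete form of that copy-then-move hack might look like the following (the paths are placeholders; the copy stamps the file with a fresh modification time, and the subsequent move/rename within HDFS preserves it):

hdfs dfs -cp /data/incoming/events.csv /tmp/staging/events.csv
hdfs dfs -mv /tmp/staging/events.csv /watched/dir/events.csv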
Submit & Kill Spark Application programmatically from another application
Hi, I am wondering if it is possible to submit, monitor & kill spark applications from another service. I have written a service that does this: parse user commands; translate them into understandable arguments to an already prepared Spark-SQL application; submit the application along with arguments to the Spark cluster using spark-submit from ProcessBuilder; run the generated application's driver in cluster mode. The above 4 steps have been finished, but I have difficulties with these two: query the application's status, for example, the percentage of completion; kill queries accordingly. What I find in the spark standalone documentation suggests killing an application using: ./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID> and that I should find the driver ID through the standalone Master web UI at http://<master url>:8080. Are there any programmatic methods I could use to get the driver ID submitted by my `ProcessBuilder` and query the status of that query? Any Suggestions? — Best Regards! Yijie Shen
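One option that may fit (hedged — this depends on the Spark version and on the standalone REST submission gateway being available) is the --status and --kill flags of spark-submit against the master's REST port, using the submission ID returned when the driver was submitted in cluster mode:

./bin/spark-submit --master spark://master-host:6066 --status driver-20150501123456-0001
./bin/spark-submit --master spark://master-host:6066 --kill driver-20150501123456-0001

The host and submission/driver ID here are made-up examples; the service would need to capture the real ID from the submission output of ProcessBuilder.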
Re: Can I group elements in RDD into different groups and let each group share some elements?
Did you look at the cogroup transformation or the cartesian transformation ? Regards, Olivier. On Sat, 2 May 2015 at 22:01, Franz Chien franzj...@gmail.com wrote: Hi all, Can I group elements in an RDD into different groups and let each group share elements? For example, I have 10,000 elements in an RDD from e1 to e1, and I want to group and aggregate them by another mapping with size of 2000, ex: ( (e1,e42), (e1,e554), (e3, e554)…… (2000th group)). My first approach was to filter the RDD with mapping rules 2000 times, and then union them together. However, it ran forever. Does SPARK provide a way to group elements in an RDD like this please? Thanks, Franz
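To make the join/cogroup suggestion concrete, here is a rough sketch (the structure of the group mapping is assumed to be (element id, group id) pairs, since the original post doesn't spell it out). It replaces the 2000 separate filter passes with a single join, and an element listed under several groups is naturally duplicated into each of them:

// elements: RDD[(String, Element)] keyed by element id, e.g. ("e42", e42)
// mapping:  RDD[(String, String)] of (element id, group id), e.g. ("e42", "g1")
val grouped = elements
  .join(mapping)                                        // (elementId, (element, groupId))
  .map { case (_, (elem, groupId)) => (groupId, elem) } // re-key by group
  .groupByKey()                                         // (groupId, Iterable[Element])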
Re: real time Query engine Spark-SQL on Hbase
In the upcoming 1.4.0 release, SPARK-3468 should give you a better clue. Cheers On Fri, May 1, 2015 at 12:30 PM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi, Thanks for the reply. Hbase cli takes less than 500 ms for the same query. I am running a simple query, i.e. Select * from Customers where c_id='123123'. Why would the same query which takes 500 ms at the Hbase cli end up taking around 8 secs via Spark-Sql? I am unable to understand this. Thanks, Siddharth -- *From:* ayan guha guha.a...@gmail.com *Sent:* 01 May 2015 04:38 *To:* Ted Yu *Cc:* user@spark.apache.org; Siddharth Ubale; matei.zaha...@gmail.com; Prakash Hosalli; Amit Kumar *Subject:* Re: real time Query engine Spark-SQL on Hbase And if I may ask, how long does it take in the hbase CLI? I would not expect spark to improve performance of hbase. At best spark will push down the filter to hbase. So I would try to optimise any additional overhead like bringing data into spark. On 1 May 2015 00:56, Ted Yu yuzhih...@gmail.com wrote: bq. a single query on one filter criteria Can you tell us more about your filter ? How selective is it ? Which hbase release are you using ? Cheers On Thu, Apr 30, 2015 at 7:23 AM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi, I want to use Spark as a query engine on HBase with sub-second latency. I am using Spark 1.3 version. And followed the steps below on an HBase table with around 3.5 lakh (350,000) rows: 1. Mapped the DataFrame to the HBase table. RDDCustomers maps to the HBase table which is used to create the DataFrame. "DataFrame schemaCustomers = sqlInstance.createDataFrame(SparkContextImpl.getRddCustomers(), Customers.class);" 2. Used registerTempTable, i.e. schemaCustomers.registerTempTable("customers"); 3. Running the query on the DataFrame using the SQLContext instance. What I am observing is that for a single query on one filter criterion the query is taking 7-8 seconds, and the time increases as I increase the number of rows in the HBase table. Also, there was one time when I was getting query responses under 1-2 seconds. Seems like strange behavior. Is this expected behavior from Spark or am I missing something here? Can somebody help me understand this scenario. Please assist. Thanks, Siddharth Ubale,
Re: ClassNotFoundException for Kryo serialization
Now I am running up against some other problem while trying to schedule tasks: 15/05/01 22:32:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalStateException: unread block data at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2419) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1380) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) I verified that the same configuration works without using Kryo serialization. On Fri, May 1, 2015 at 9:44 AM, Akshat Aranya aara...@gmail.com wrote: I cherry-picked the fix for SPARK-5470 and the problem has gone away. On Fri, May 1, 2015 at 9:15 AM, Akshat Aranya aara...@gmail.com wrote: Yes, this class is present in the jar that was loaded in the classpath of the executor Java process -- it wasn't even lazily added as a part of the task execution. Schema$MyRow is a protobuf-generated class. After doing some digging around, I think I might be hitting up against SPARK-5470, the fix for which hasn't been merged into 1.2, as far as I can tell. On Fri, May 1, 2015 at 9:05 AM, Ted Yu yuzhih...@gmail.com wrote: bq. Caused by: java.lang.ClassNotFoundException: com.example.Schema$MyRow So the above class is in the jar which was in the classpath ? Can you tell us a bit more about Schema$MyRow ? 
On Fri, May 1, 2015 at 8:05 AM, Akshat Aranya aara...@gmail.com wrote: Hi, I'm getting a ClassNotFoundException at the executor when trying to register a class for Kryo serialization: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:243) at org.apache.spark.SparkEnv$.instantiateClassFromConf$1(SparkEnv.scala:254) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:257) at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:182) at org.apache.spark.executor.Executor.init(Executor.scala:87) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receiveWithLogging$1.applyOrElse(CoarseGrainedExecutorBackend.scala:61) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53) at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.executor.CoarseGrainedExecutorBackend.aroundReceive(CoarseGrainedExecutorBackend.scala:36) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Failed to load class to register with Kryo at
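For reference, a minimal sketch (not from this thread) of enabling Kryo and registering an application class by name through SparkConf; com.example.Schema$MyRow simply mirrors the class mentioned above and is illustrative only, and the jar containing it still has to reach the executors (e.g. via --jars / spark.jars) for the registration to succeed there.

import org.apache.spark.{SparkConf, SparkContext}

// Enable Kryo and register the class up front; the class name is a placeholder.
val conf = new SparkConf()
  .setAppName("kryo-registration-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.classesToRegister", "com.example.Schema$MyRow")

val sc = new SparkContext(conf)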
Re: to split an RDD to multiple ones?
I guess:

val srdd_s1 = srdd.filter(_.startsWith("s1_")).sortBy(identity)
val srdd_s2 = srdd.filter(_.startsWith("s2_")).sortBy(identity)
val srdd_s3 = srdd.filter(_.startsWith("s3_")).sortBy(identity)

Regards, Olivier.

On Sat, 2 May 2015 at 22:53, Yifan LI iamyifa...@gmail.com wrote: Hi, I have an RDD srdd containing (unordered) data like this: s1_0, s3_0, s2_1, s2_2, s3_1, s1_3, s1_2, … What I want is (it will be much better if they could be in ascending order): srdd_s1: s1_0, s1_1, s1_2, …, s1_n srdd_s2: s2_0, s2_1, s2_2, …, s2_n srdd_s3: s3_0, s3_1, s3_2, …, s3_n … … Any idea? Thanks in advance! :) Best, Yifan LI
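For completeness, a self-contained sketch of that filter-per-prefix approach, assuming a local SparkContext and elements shaped like "prefix_number"; sorting on the numeric suffix keeps s1_2 ahead of s1_10.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("split-rdd-sketch").setMaster("local[*]"))
val srdd = sc.parallelize(Seq("s1_0", "s3_0", "s2_1", "s2_2", "s3_1", "s1_3", "s1_2"))

// One filtered RDD per prefix, sorted ascending by the numeric suffix.
def slice(prefix: String) =
  srdd.filter(_.startsWith(prefix + "_")).sortBy(_.split("_")(1).toInt)

val srdd_s1 = slice("s1")
val srdd_s2 = slice("s2")
val srdd_s3 = slice("s3")

srdd_s1.collect().foreach(println) // s1_0, s1_2, s1_3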
Re: Drop a column from the DataFrame.
Sounds like a patch for a drop method...

On Sat, 2 May 2015 at 21:03, dsgriffin dsgrif...@gmail.com wrote: Just use select() to create a new DataFrame with only the columns you want -- sort of the opposite of what you asked for, but you can select all of the columns minus the one you don't want. You could even use a filter to drop just that one column on the fly:

myDF.select(myDF.columns.filter(_ != "column_you_do_not_want").map(colname => new Column(colname)).toList: _*)

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Drop-a-column-from-the-DataFrame-tp22711p22737.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:
Can you post your code? Otherwise there's not much we can do. Regards, Olivier.

On Sat, 2 May 2015 at 21:15, shahab shahab.mok...@gmail.com wrote: Hi, I am using Spark 1.2.0 with Kryo serialization, but I get the following exception: java.io.IOException: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1 I would appreciate it if anyone could tell me how I can resolve this. best, /Shahab
Re: Drop a column from the DataFrame.
This is coming in 1.4.0 https://issues.apache.org/jira/browse/SPARK-7280 On May 2, 2015, at 2:27 PM, Olivier Girardot ssab...@gmail.com wrote: Sounds like a patch for a drop method...
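Putting the two replies together, a hedged sketch for Spark 1.3.x, where myDF and the column name are placeholders; on 1.4.0+ the same thing should reduce to myDF.drop("colToRemove") per SPARK-7280.

import org.apache.spark.sql.Column

// Select every column except the unwanted one -- the 1.3.x workaround described above.
val trimmed = myDF.select(
  myDF.columns
    .filter(_ != "colToRemove")
    .map(name => new Column(name)): _*)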
Re: to split an RDD to multiple ones?
Thanks, Olivier and Franz. :) Best, Yifan LI
Re: Spark - Timeout Issues - OutOfMemoryError
You could try repartitioning your listings RDD. Also, doing a collectAsMap would basically bring all your data to the driver; in that case you might want to set the storage level to memory-and-disk, though I'm not sure that will help on the driver side. Thanks Best Regards

On Thu, Apr 30, 2015 at 11:10 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Full exception:

15/04/30 09:59:49 INFO scheduler.DAGScheduler: Stage 1 (collectAsMap at VISummaryDataProvider.scala:37) failed in 884.087 s 15/04/30 09:59:49 INFO scheduler.DAGScheduler: Job 0 failed: collectAsMap at VISummaryDataProvider.scala:37, took 1093.418249 s 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw exception: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error sending message [message = GetLocations(taskresult_112)] org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error sending message [message = GetLocations(taskresult_112)] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 15/04/30 09:59:49 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error sending message [message = GetLocations(taskresult_112)])

Code at line 37:

val lstgItemMap = listings.map { lstg => (lstg.getItemId().toLong, lstg) }.collectAsMap

The listing data set size is 26G (10 files) and my driver memory is 12G (I can't go beyond it). The reason I do collectAsMap is to broadcast it and do a map-side join instead of a regular join. Please suggest?
On Thu, Apr 30, 2015 at 10:52 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: My Spark job is failing and I see:

15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw exception: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error sending message [message = GetLocations(taskresult_112)] org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error sending message [message = GetLocations(taskresult_112)] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

I see multiple of these: Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

And finally I see this: java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.network.BlockTransferService$$anon$1.onBlockFetchSuccess(BlockTransferService.scala:95) at
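For what it's worth, a minimal sketch (not from this thread) of the broadcast map-side join being attempted; listings and getItemId come from the code above, while events and its itemId field are made-up placeholders. Note that collectAsMap still has to fit the whole map into the 12G driver, which lines up with the OutOfMemoryError quoted here.

// Broadcast the collected map and look items up on the map side instead of shuffling.
val lstgItemMap = listings
  .map(lstg => (lstg.getItemId().toLong, lstg))
  .collectAsMap() // everything lands on the driver -- must fit in driver memory

val lstgBroadcast = sc.broadcast(lstgItemMap)

val joined = events.map { ev =>
  (ev, lstgBroadcast.value.get(ev.itemId)) // map-side lookup, no shuffle join
}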
Can I group elements in RDD into different groups and let each group share some elements?
Hi all, Can I group elements in an RDD into different groups and let each group share elements? For example, I have 10,000 elements in an RDD, from e1 to e10000, and I want to group and aggregate them by another mapping of size 2,000, e.g. ((e1,e42), (e1,e554), (e3,e554) …… (2000th group)). My first approach was to filter the RDD with the mapping rules 2,000 times and then union the results together. However, it ran forever. Does Spark provide a way to group elements in an RDD like this, please? Thanks, Franz
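One way to avoid 2,000 separate filter-and-union jobs is to broadcast the mapping and emit (groupId, element) pairs in a single pass. The following is only a hedged sketch with made-up group and element names, not something taken from the thread.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("group-by-mapping-sketch").setMaster("local[*]"))

// Hypothetical mapping: group id -> the element ids it contains (an element may belong to several groups).
val groups: Map[Int, Set[String]] =
  Map(1 -> Set("e1", "e42"), 2 -> Set("e1", "e554"), 3 -> Set("e3", "e554"))

// Invert the mapping once on the driver (element -> group ids) and broadcast it.
val elementToGroups = sc.broadcast(
  groups.toSeq
    .flatMap { case (g, es) => es.map(e => (e, g)) }
    .groupBy(_._1)
    .map { case (e, pairs) => (e, pairs.map(_._2)) })

val elements = sc.parallelize(Seq("e1", "e3", "e42", "e554"))

// Single pass: tag every element with its group ids, then gather the members of each group.
val grouped = elements
  .flatMap(e => elementToGroups.value.getOrElse(e, Seq.empty).map(g => (g, e)))
  .groupByKey()

grouped.collect().foreach { case (g, es) => println(s"group $g: ${es.mkString(", ")}") }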
Re: directory loader in windows
Thanks for the answer. I am now trying to set HADOOP_HOME but the issue still persists. Also, I can see only windows-utils.exe in my HADOOP_HOME, but no WINUTILS.EXE. I do not have Hadoop installed on my system, as I am not using HDFS, but I am using Spark 1.3.1 prebuilt with Hadoop 2.6. Am I missing something? Best Ayan

On Tue, Apr 28, 2015 at 12:45 AM, Steve Loughran ste...@hortonworks.com wrote: This is a Hadoop-side stack trace: it looks like the code is trying to get the filesystem permissions by running %HADOOP_HOME%\bin\WINUTILS.EXE ls -F and something is triggering a null pointer exception. There isn't any HADOOP JIRA with this specific stack trace in it, so it's not a known/fixed problem. At a guess, your HADOOP_HOME environment variable isn't pointing to the right place. If that's the case there should have been a warning in the logs.

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.lang.NullPointerException at java.lang.ProcessBuilder.start(Unknown Source) at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) at org.apache.hadoop.util.Shell.execCommand(Shell.java:808) at org.apache.hadoop.util.Shell.execCommand(Shell.java:791) at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:557) at org.apache.hadoop.fs.LocatedFileStatus.init(LocatedFileStatus.java:42) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1699) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512) at org.apache.spark.rdd.RDD.collect(RDD.scala:813) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:374) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at 
java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Unknown Source) -- Best Regards, Ayan Guha
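As an aside, a hedged Scala sketch of the workaround Steve describes (the C:\hadoop path is an assumption): Hadoop's Shell utility reads the hadoop.home.dir system property, falling back to the HADOOP_HOME environment variable, and expects bin\winutils.exe underneath it, so pointing it at a directory that actually contains winutils.exe before any filesystem access avoids the NullPointerException above. From PySpark the equivalent is making sure HADOOP_HOME points at such a directory before launching spark-submit.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: C:\hadoop\bin\winutils.exe must exist; the path is illustrative.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val sc = new SparkContext(new SparkConf().setAppName("windows-dir-loader").setMaster("local[*]"))
val lines = sc.textFile("C:\\data\\logs") // the directory load that previously hit the NPE in Shell
println(lines.count())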