Cannot create parquet with snappy output for hive external table
Hi Group, I am not able to load Snappy-compressed data into an external Hive table that is partitioned. Trace:

1. Non-partitioned table works:

create external table test(id int, name string) stored as parquet location 'hdfs://testcluster/user/abc/test' tblproperties ('PARQUET.COMPRESS'='SNAPPY');

2. Spark code:

val spark = SparkSession.builder().enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()
spark.sql("use default").show
val rdd = sc.parallelize(Seq((1, "one"), (2, "two")))
val df = spark.createDataFrame(rdd).toDF("id", "name")
df.write.mode(SaveMode.Overwrite).insertInto("test")

3. I can see a few snappy.parquet files.

4. The partitioned version does not work:

create external table test(id int) partitioned by (name string) stored as parquet location 'hdfs://testcluster/user/abc/test' tblproperties ('PARQUET.COMPRESS'='SNAPPY');

5. Same Spark code as above:

val spark = SparkSession.builder().enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()
spark.sql("use default").show
val rdd = sc.parallelize(Seq((1, "one"), (2, "two")))
val df = spark.createDataFrame(rdd).toDF("id", "name")
df.write.mode(SaveMode.Overwrite).insertInto("test")

6. This time I see uncompressed files without the snappy.parquet extension. parquet-tools.jar also confirms that these are uncompressed Parquet files.

7. I tried the following option as well, but no luck:

df.write.mode(SaveMode.Overwrite).format("parquet").option("compression", "snappy").insertInto("test")

Thanks in advance.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-create-parquet-with-snappy-output-for-hive-external-table-tp28687.html
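One thing that may be worth trying (a minimal sketch, not a confirmed fix: it assumes a Spark 2.x SparkSession, and which property actually takes effect depends on spark.sql.hive.convertMetastoreParquet and the Spark version) is to set the codec at the session level instead of relying on the table property:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  // Spark-side codec, used when Spark writes the Parquet files itself
  .config("spark.sql.parquet.compression.codec", "snappy")
  .getOrCreate()

// Hive-side codec, used when the insert goes through the Hive SerDe path
spark.sql("SET parquet.compression=SNAPPY")

val df = spark.createDataFrame(Seq((1, "one"), (2, "two"))).toDF("id", "name")
df.write.mode(SaveMode.Overwrite).insertInto("test")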
Re: problems with checkpoint and spark sql
Hi David, did you find any solution for this?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problems-with-checkpoint-and-spark-sql-tp26080p27773.html
Re: Bad Digest error while doing aws s3 put
Hi, I am getting the following error while reading a large dataset from S3 and, after processing, writing the data back to S3. Did you find any solution for this?

16/02/07 07:41:59 WARN scheduler.TaskSetManager: Lost task 144.2 in stage 3.0 (TID 169, ip-172-31-7-26.us-west-2.compute.internal): java.io.IOException: exception in uploadSinglePart
        at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadSinglePart(MultipartUploadOutputStream.java:248)
        at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.close(MultipartUploadOutputStream.java:469)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
        at org.apache.hadoop.io.compress.CompressorStream.close(CompressorStream.java:106)
        at java.io.FilterOutputStream.close(FilterOutputStream.java:160)
        at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:109)
        at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1080)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: exception in putObject
        at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:149)
        at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
        at com.sun.proxy.$Proxy26.storeFile(Unknown Source)
        at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadSinglePart(MultipartUploadOutputStream.java:245)
        ... 15 more
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified did not match what we received. (Service: Amazon S3; Status Code: 400; Error Code: BadDigest; Request ID: 5918216A5901FCC8), S3 Extended Request ID: QSxtYln/yXqHYpdr4BWosin/TAFsGlK1FlKfE5PcuJkNrgoblGzTNt74kEhuNcrJCRZ3mXq0oUo=
        at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
        at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3796)
        at com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1482)
        at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:140)
        ... 22 more

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Bad-Digest-error-while-doing-aws-s3-put-tp10036p26167.html
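One mitigation that is sometimes suggested for BadDigest/Content-MD5 errors on S3 writes is to disable speculative execution, so that two attempts of the same task never upload the same object concurrently. A minimal sketch, assuming a plain RDD job writing through the EMR s3n filesystem; the bucket paths and the map step below are placeholders, not from the original post:

import org.apache.spark.{SparkConf, SparkContext}

// Disabling speculation avoids duplicate task attempts re-uploading the same
// S3 object, which can surface as a Content-MD5 / BadDigest mismatch.
val conf = new SparkConf()
  .setAppName("s3-write-job")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)

// Placeholder processing step; replace with the real pipeline.
val processed = sc.textFile("s3n://my-bucket/input/").map(_.toUpperCase)
processed.saveAsTextFile("s3n://my-bucket/output/")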
Explanation streaming-cep-engine with example
Hi, can someone explain how the Spark streaming CEP engine works, and how to use it with a sample example? http://spark-packages.org/package/Stratio/streaming-cep-engine

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Explanation-streaming-cep-engine-with-example-tp22218.html
Error while Insert data into hive table via spark
Hi, I have configured Apache Spark 1.3.0 with Hive 1.0.0 and Hadoop 2.6.0. I am able to create tables and retrieve data from Hive tables via the following commands, but I am not able to insert data into a table.

scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS newtable (key INT)");
scala> sqlContext.sql("select * from newtable").collect;
15/03/19 02:10:20 INFO parse.ParseDriver: Parsing command: select * from newtable
15/03/19 02:10:20 INFO parse.ParseDriver: Parse Completed
15/03/19 02:10:35 INFO scheduler.DAGScheduler: Job 0 finished: collect at SparkPlan.scala:83, took 13.826402 s
res2: Array[org.apache.spark.sql.Row] = Array([1])

But I am not able to insert data into this table via the Spark shell. The same command runs perfectly fine from the Hive shell.

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@294fa094
// scala> sqlContext.sql("INSERT INTO TABLE newtable SELECT 1");
scala> sqlContext.sql("INSERT INTO TABLE newtable values(1)");
15/03/19 02:03:14 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/03/19 02:03:14 INFO metastore.ObjectStore: ObjectStore, initialize called
15/03/19 02:03:14 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/03/19 02:03:14 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/03/19 02:03:14 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/03/19 02:03:15 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/03/19 02:03:16 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/03/19 02:03:18 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/03/19 02:03:18 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/03/19 02:03:18 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/03/19 02:03:18 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/03/19 02:03:18 INFO DataNucleus.Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/03/19 02:03:18 INFO metastore.ObjectStore: Initialized ObjectStore
15/03/19 02:03:19 INFO metastore.HiveMetaStore: Added admin role in metastore
15/03/19 02:03:19 INFO metastore.HiveMetaStore: Added public role in metastore
15/03/19 02:03:19 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
15/03/19 02:03:20 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/03/19 02:03:20 INFO parse.ParseDriver: Parsing command: INSERT INTO TABLE newtable values(1)
NoViableAltException(26@[])
        at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:742)
        at org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:40171)
        at org.apache.hadoop.hive.ql.parse.HiveParser.singleSelectStatement(HiveParser.java:38048)
        at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:37754)
        at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:37654)
        at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:36898)
        at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:36774)
        at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1338)
        at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
        at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:227)
        at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:241)
        at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
        at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
        at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
        at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
        at scala.util.parsing.combinator.Parsers$Parser$$anonfu
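For context, the NoViableAltException above comes from the Hive grammar bundled with Spark 1.3, which does not accept the INSERT INTO ... VALUES(...) syntax. A minimal sketch of a workaround (assuming Spark 1.3 with a working HiveContext and that newtable already exists) is to build a DataFrame and insert that instead:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

// Wrap the values in Tuple1 so the implicit RDD-to-DataFrame conversion applies.
val df = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("key")

// Append the rows into the existing Hive table.
df.insertInto("newtable")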
Re: No suitable driver found error, Create table in hive from spark sql
Found the solution in one of the posts on the internet. I updated spark/bin/compute-classpath.sh and added the database connector jar to the classpath:

CLASSPATH="$CLASSPATH:/data/mysql-connector-java-5.1.14-bin.jar"

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-suitable-driver-found-error-Create-table-in-hive-from-spark-sql-tp21714p21715.html
No suitable driver found error, Create table in hive from spark sql
No suitable driver found error when creating a table in Hive from Spark SQL. I am trying to execute the following example:

SPARKGIT: spark/examples/src/main/scala/org/apache/spark/examples/sql/hive/HiveFromSpark.scala

My setup: hadoop 1.6, spark 1.2, hive 1.0, mysql server (installed via yum install mysql55w mysql55w-server).

I can create tables in hive from the hive command prompt:

hive> select * from person_parquet;
OK
Barack Obama M
BillClinton M
Hillary Clinton F
Time taken: 1.945 seconds, Fetched: 3 row(s)

I am starting the spark shell via the following command:

./spark-1.2.0-bin-hadoop2.4/bin/spark-shell --master spark://sparkmaster.company.com:7077 --jars /data/mysql-connector-java-5.1.14-bin.jar

scala> Class.forName("com.mysql.jdbc.Driver")
res0: Class[_] = class com.mysql.jdbc.Driver
scala> Class.forName("com.mysql.jdbc.Driver").newInstance
res1: Any = com.mysql.jdbc.Driver@2dec8e27
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@32ecf100
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
15/02/18 22:23:01 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING)
15/02/18 22:23:02 INFO parse.ParseDriver: Parse Completed
15/02/18 22:23:02 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/18 22:23:02 INFO metastore.ObjectStore: ObjectStore, initialize called
15/02/18 22:23:02 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/02/18 22:23:02 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/18 22:23:02 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/18 22:23:02 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/02/18 22:23:02 ERROR Datastore.Schema: Failed initialising database.
No suitable driver found for jdbc:mysql://sparkmaster.company.com:3306/hive
org.datanucleus.exceptions.NucleusDataStoreException: No suitable driver found for jdbc:mysql://sparkmaster.company.com:3306/hive
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
        at org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
        at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
        at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
        at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
        at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
        at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
        at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
        at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
        at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
        at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
        at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310)
        at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339)
        at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248)
        at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
        at org.apache.hadoop.hive.metastore.R
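A quick check that may help narrow this down (a sketch run from the same spark-shell session; the credentials are placeholders, not from the original post): if the plain JDBC call below also fails with "No suitable driver", the MySQL connector jar is not visible to the driver JVM's DriverManager, and it needs to be on the driver classpath (for example via spark.driver.extraClassPath) in addition to --jars:

import java.sql.DriverManager

Class.forName("com.mysql.jdbc.Driver")
// Placeholder credentials; use the values from hive-site.xml
// (javax.jdo.option.ConnectionUserName / ConnectionPassword).
val conn = DriverManager.getConnection(
  "jdbc:mysql://sparkmaster.company.com:3306/hive", "hiveuser", "hivepassword")
println(conn.getMetaData.getDatabaseProductVersion)
conn.close()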
Re: ExecutorLostFailure (executor lost)
I also received the same error message quite a few times while saving an RDD to HDFS. I am using Spark 1.1.0 with Hadoop 2.5 in yarn mode. If you look at the logs, you might find entries like the following:

14/10/10 14:20:21 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(6, sparkmaster.company.com, 44906, 0) with no recent heart beats: 71967ms exceeds 45000ms
14/10/10 14:31:15 INFO scheduler.TaskSetManager: Finished task 46.0 in stage 4.0 (TID 544) in 734650 ms on sparknode1.company.com (46/50)
14/10/10 14:55:31 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(sparkmaster.company.com,44906)
14/10/10 14:55:31 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(sparkmaster.company.com,44906)
14/10/10 14:55:31 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(sparkmaster.company.com,44906)
14/10/10 14:55:31 INFO cluster.YarnClusterSchedulerBackend: Executor 6 disconnected, so removing it
14/10/10 14:55:31 ERROR cluster.YarnClusterScheduler: Lost executor 6 on sparkmaster.company.com: remote Akka client disassociated
14/10/10 14:55:31 INFO scheduler.TaskSetManager: Re-queueing tasks for 6 from TaskSet 4.0
14/10/10 14:55:31 WARN scheduler.TaskSetManager: Lost task 45.0 in stage 4.0 (TID 543, sparkmaster.company.com): ExecutorLostFailure (executor lost)
14/10/10 14:55:31 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 10)
14/10/10 14:55:31 INFO storage.BlockManagerMasterActor: Trying to remove executor 6 from BlockManagerMaster.
14/10/10 14:55:31 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
14/10/10 14:55:31 INFO scheduler.TaskSetManager: Starting task 45.1 in stage 4.0 (TID 548, sparknode1.company.com, PROCESS_LOCAL, 948 bytes)

If you are using yarn, it will reschedule the task and continue processing. You can also try updating the following attributes in spark-defaults.conf:

spark.core.connection.ack.wait.timeout 3600
spark.core.connection.auth.wait.timeout 3600

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ExecutorLostFailure-executor-lost-tp14117p16126.html
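The same settings can also be passed programmatically when building the context; a small sketch (property names as above, values in seconds):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("long-running-save")
  .set("spark.core.connection.ack.wait.timeout", "3600")
  .set("spark.core.connection.auth.wait.timeout", "3600")
val sc = new SparkContext(conf)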
Change number of workers and memory
I have a Spark cluster where some nodes have high-performance hardware and others have commodity (lower) specs. When I configure worker memory and instances in spark-env.sh, the settings apply to all the nodes. Can I change the SPARK_WORKER_MEMORY and SPARK_WORKER_INSTANCES properties on a per-node/machine basis? I am using Spark 1.1.0.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Change-number-of-workers-and-memory-tp14866.html
Re: error: type mismatch while Union
Thank you Aaron for pointing out the problem. This only happens when I run this code in spark-shell, not when I submit the job.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13677.html
Re: error: type mismatch while Union
I am using Spark version 1.0.2.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13618.html
error: type mismatch while Union
Hi, I am getting a type mismatch error on a union operation. Can someone suggest a solution?

case class MyNumber(no: Int, secondVal: String) extends Serializable with Ordered[MyNumber] {
  override def toString(): String = this.no.toString + " " + this.secondVal
  override def compare(that: MyNumber): Int = this.no compare that.no
  override def compareTo(that: MyNumber): Int = this.no compare that.no
  def Equals(that: MyNumber): Boolean = {
    (this.no == that.no) && (that match {
      case MyNumber(n1, n2) => n1 == no && n2 == secondVal
      case _ => false
    })
  }
}

val numbers = sc.parallelize(1 to 20, 10)
val firstRdd = numbers.map(new MyNumber(_, "A"))
val secondRDD = numbers.map(new MyNumber(_, "B"))
val numberRdd = firstRdd.union(secondRDD)

:24: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[MyNumber]
 required: org.apache.spark.rdd.RDD[MyNumber]
       val numberRdd = firstRdd.union(secondRDD)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547.html
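Since the mismatch was only observed in spark-shell (see the reply above), here is a sketch of the same union as a small compiled application, with MyNumber simplified to a case class; because the class is compiled once into the application jar, both RDDs see the same type and the union type-checks:

import org.apache.spark.{SparkConf, SparkContext}

object UnionExample {
  case class MyNumber(no: Int, secondVal: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("union-example"))
    val numbers = sc.parallelize(1 to 20, 10)
    val firstRdd = numbers.map(MyNumber(_, "A"))
    val secondRdd = numbers.map(MyNumber(_, "B"))
    val numberRdd = firstRdd.union(secondRdd)
    numberRdd.collect().foreach(println)
    sc.stop()
  }
}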
Re: Multiple spark shell sessions
Thanks Yana, I am able to execute the application and commands via another session; I also received another port for the UI application.
Thanks, Dhimant

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-spark-shell-sessions-tp13441p13459.html
Multiple spark shell sessions
Hi, I am receiving the following error while connecting to the Spark server via the shell when one shell is already open. How can I open multiple sessions? Also, does anyone know of a workflow engine/job server for Spark, like Apache Oozie?

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
14/09/04 15:07:46 INFO spark.SecurityManager: Changing view acls to: root
14/09/04 15:07:46 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root)
14/09/04 15:07:46 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/09/04 15:07:47 INFO Remoting: Starting remoting
14/09/04 15:07:47 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sp...@sparkmaster.guavus.com:42236]
14/09/04 15:07:47 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sp...@sparkmaster.guavus.com:42236]
14/09/04 15:07:47 INFO spark.SparkEnv: Registering MapOutputTracker
14/09/04 15:07:47 INFO spark.SparkEnv: Registering BlockManagerMaster
14/09/04 15:07:47 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140904150747-4dcd
14/09/04 15:07:47 INFO storage.MemoryStore: MemoryStore started with capacity 294.9 MB.
14/09/04 15:07:47 INFO network.ConnectionManager: Bound socket to port 54453 with id = ConnectionManagerId(sparkmaster.guavus.com,54453)
14/09/04 15:07:47 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/09/04 15:07:47 INFO storage.BlockManagerInfo: Registering block manager sparkmaster.guavus.com:54453 with 294.9 MB RAM
14/09/04 15:07:47 INFO storage.BlockManagerMaster: Registered BlockManager
14/09/04 15:07:47 INFO spark.HttpServer: Starting HTTP Server
14/09/04 15:07:47 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/04 15:07:47 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:48977
14/09/04 15:07:47 INFO broadcast.HttpBroadcast: Broadcast server started at http://192.168.1.21:48977
14/09/04 15:07:47 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-0e45759a-2c58-439a-8e96-95b0bc1d6136
14/09/04 15:07:47 INFO spark.HttpServer: Starting HTTP Server
14/09/04 15:07:47 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/04 15:07:47 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:39962
14/09/04 15:07:48 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/09/04 15:07:48 WARN component.AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use
        at sun.nio.ch.Net.bind0(Native Method)
        at sun.nio.ch.Net.bind(Net.java:444)
        at sun.nio.ch.Net.bind(Net.java:436)
        at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
        at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
        at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
        at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.eclipse.jetty.server.Server.doStart(Server.java:293)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
        at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
        at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
        at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
        at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
        at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
        at org.apache.spark.SparkContext.(SparkContext.scala:223)
        at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
        at $line3.$read$$iwC$$iwC.(:8)
        at $line3.$read$$iwC.(:14)
        at $line3.$read.(:16)
        at $line3.$read$.(:20)
        at $line3.$read$.()
        at $line3.$eval$.(:7)
        at $line3.$eval$.()
        at $line3.$eval.$print()
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala
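On the port warning itself: the 4040 BindException is only the web UI failing to bind because the first shell already owns that port, and a second session will normally retry on the next port. A sketch of making this explicit, and of capping the cores each session takes so a second application can actually get resources on a standalone master (assumed setup; the master URL and values below are placeholders to adjust):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("second-session")
  .setMaster("spark://sparkmaster.guavus.com:7077")
  .set("spark.cores.max", "2")   // leave cores free for other concurrent applications
  .set("spark.ui.port", "4041")  // avoid contending with the first shell's UI on 4040
val sc = new SparkContext(conf)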
error: type mismatch while assigning RDD to RDD val object
I am receiving the following error in the Spark shell while executing the code below.

class LogRecrod(logLine: String) extends Serializable {
  val splitvals = logLine.split(",");
  val strIp: String = splitvals(0)
  val hostname: String = splitvals(1)
  val server_name: String = splitvals(2)
}

var logRecordRdd: org.apache.spark.rdd.RDD[LogRecrod] = _

val sourceFile = sc.textFile("hdfs://192.168.1.30:9000/Data/Log_1406794333258.log", 2)
14/09/04 12:08:28 INFO storage.MemoryStore: ensureFreeSpace(179585) called with curMem=0, maxMem=309225062
14/09/04 12:08:28 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 175.4 KB, free 294.7 MB)
sourceFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at :12

scala> logRecordRdd = sourceFile.map(line => new LogRecrod(line))
:18: error: type mismatch;
 found   : LogRecrod
 required: LogRecrod
       logRecordRdd = sourceFile.map(line => new LogRecrod(line))

Any suggestions to resolve this problem?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-assigning-RDD-to-RDD-val-object-tp13429.html
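A sketch of one way around this, on the assumption that the mismatch is the spark-shell quirk where a class defined on one REPL line and a var typed against it on another end up as two different generated classes: keep the class definition and every use of it together (for example in a single :paste block), and skip the pre-declared var:

// Assumed workaround: define the class and build the RDD in one go so the
// shell compiles a single LogRecrod class (in spark-shell, wrap this in one
// :paste block; in a compiled application this is a non-issue).
class LogRecrod(logLine: String) extends Serializable {
  private val splitvals = logLine.split(",")
  val strIp: String = splitvals(0)
  val hostname: String = splitvals(1)
  val server_name: String = splitvals(2)
}

val sourceFile = sc.textFile("hdfs://192.168.1.30:9000/Data/Log_1406794333258.log", 2)
val logRecordRdd: org.apache.spark.rdd.RDD[LogRecrod] =
  sourceFile.map(line => new LogRecrod(line))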
java.io.NotSerializableException exception - custom Accumulator
Hi, I am getting a java.io.NotSerializableException while executing the following program.

import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.AccumulatorParam

object App {
  class Vector (val data: Array[Double]) {}

  implicit object VectorAP extends AccumulatorParam[Vector] {
    def zero(v: Vector): Vector = new Vector(new Array(v.data.size))
    def addInPlace(v1: Vector, v2: Vector): Vector = {
      for (i <- 0 to v1.data.size - 1) v1.data(i) += v2.data(i)
      return v1
    }
  }

  def main(sc: SparkContext) {
    val vectorAcc = sc.accumulator(new Vector(Array(0, 0)))
    val accum = sc.accumulator(0)
    val file = sc.textFile("/user/root/data/SourceFiles/a.txt", 10)
    file.foreach(line => { println(line); accum += 1; vectorAcc.add(new Vector(Array(1, 1))) })
    println(accum.value)
    println(vectorAcc.value.data)
    println("=")
  }
}

scala> App.main(sc)
14/04/09 01:02:05 INFO storage.MemoryStore: ensureFreeSpace(130760) called with curMem=0, maxMem=308713881
14/04/09 01:02:05 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 127.7 KB, free 294.3 MB)
14/04/09 01:02:07 INFO mapred.FileInputFormat: Total input paths to process : 1
14/04/09 01:02:07 INFO spark.SparkContext: Starting job: foreach at :30
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Got job 0 (foreach at :30) with 11 output partitions (allowLocal=false)
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Final stage: Stage 0 (foreach at :30)
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[1] at textFile at :29), which has no missing parents
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Failed to run foreach at :30
org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$App$
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:794)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:737)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:569)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-io-NotSerializableException-exception-custom-Accumulator-tp3971.html
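A sketch of one way to avoid the exception, on the assumption (suggested by the $iwC...$App$ name in the message) that the task closure is dragging in the enclosing, non-serializable App object: make Vector serializable and keep the accumulator logic in standalone serializable objects, so the closure no longer captures an outer instance.

import org.apache.spark.SparkContext
import org.apache.spark.AccumulatorParam

// Serializable element type: accumulator values travel between driver and executors.
class Vector(val data: Array[Double]) extends Serializable

object VectorAP extends AccumulatorParam[Vector] with Serializable {
  def zero(v: Vector): Vector = new Vector(new Array[Double](v.data.length))
  def addInPlace(v1: Vector, v2: Vector): Vector = {
    for (i <- v1.data.indices) v1.data(i) += v2.data(i)
    v1
  }
}

object App {
  def main(sc: SparkContext): Unit = {
    val vectorAcc = sc.accumulator(new Vector(Array(0.0, 0.0)))(VectorAP)
    val accum = sc.accumulator(0)
    val file = sc.textFile("/user/root/data/SourceFiles/a.txt", 10)
    file.foreach { line =>
      accum += 1
      vectorAcc.add(new Vector(Array(1.0, 1.0)))
    }
    println(accum.value)
    println(vectorAcc.value.data.mkString(","))
  }
}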