RE: Spark Mesos Dispatcher
Spark 1.3 does not have the MesosDispatcher and does not support Mesos cluster mode. Is it still possible to create a dispatcher using 1.4 and run 1.3 drivers through that dispatcher?

From: Jerry Lam [chiling...@gmail.com]
Sent: Monday, July 20, 2015 8:27 AM
To: Jahagirdar, Madhu
Cc: user; d...@spark.apache.org
Subject: Re: Spark Mesos Dispatcher

Yes.

Sent from my iPhone

On 19 Jul, 2015, at 10:52 pm, "Jahagirdar, Madhu" <madhu.jahagir...@philips.com> wrote:

All, can we run different versions of Spark using the same Mesos Dispatcher? For example, can we run drivers with Spark 1.3 and Spark 1.4 at the same time?

Regards,
Madhu Jahagirdar
Spark Mesos Dispatcher
All, can we run different versions of Spark using the same Mesos Dispatcher? For example, can we run drivers with Spark 1.3 and Spark 1.4 at the same time?

Regards,
Madhu Jahagirdar
Spark Drill 1.2.1 - error
All, we are getting the error below when using the Drill JDBC driver with Spark. Please let us know what the issue could be.

java.lang.IllegalAccessError: class io.netty.buffer.UnsafeDirectLittleEndian cannot access its superclass io.netty.buffer.WrappedByteBuf
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at org.apache.drill.exec.memory.TopLevelAllocator.<init>(TopLevelAllocator.java:43)
at org.apache.drill.exec.memory.TopLevelAllocator.<init>(TopLevelAllocator.java:68)
at org.apache.drill.jdbc.DrillConnectionImpl.<init>(DrillConnectionImpl.java:91)
at org.apache.drill.jdbc.DrillJdbc41Factory$DrillJdbc41Connection.<init>(DrillJdbc41Factory.java:88)
at org.apache.drill.jdbc.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:57)
at org.apache.drill.jdbc.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:43)
at org.apache.drill.jdbc.DrillFactory.newConnection(DrillFactory.java:51)
at net.hydromatic.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:126)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:233)
at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:17)
at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:17)
at org.apache.spark.rdd.JdbcRDD$$anon$1.<init>(JdbcRDD.scala:76)
at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:73)
at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:53)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/02/26 10:16:03 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalAccessError: class io.netty.buffer.UnsafeDirectLittleEndian cannot access its superclass io.netty.buffer.WrappedByteBuf
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at org.apache.drill.exec.memory.TopLevelAllocator.<init>(TopLevelAllocator.java:43)
at org.apache.drill.exec.memory.TopLevelAllocator.<init>(TopLevelAllocator.java:68)
at org.apache.drill.jdbc.DrillConnectionImpl.<init>(DrillConnectionImpl.java:91)
at org.apache.drill.jdbc.DrillJdbc41Factory$DrillJdbc41Connection.<init>(DrillJdbc41Factory.java:88)
at org.apache.drill.jdbc.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:57)
at org.apache.drill.jdbc.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:43)
at org.apache.drill.jdbc.DrillFactory.newConnection(DrillFactory.java:51)
at net.hydromatic.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:126)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:233)
at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:17)
at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:17)
at org.apache.spark.rdd.JdbcRDD$$anon$1.<init>(JdbcRDD.scala:76)
at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:73)
at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:53)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.scheduler.

Regards,
Madhu Jahagirdar
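[Editor's note] This IllegalAccessError usually indicates two different Netty builds on the executor classpath (Spark ships its own Netty, and the Drill JDBC driver bundles another), so the io.netty.buffer classes end up loaded from mismatched jars. A minimal sketch of one possible workaround, assuming the Drill JDBC jar is shipped with the application: prefer the user-supplied jars on the executor classpath. The exact property name depends on the Spark release (spark.files.userClassPathFirst on 1.2, spark.executor.userClassPathFirst on 1.3+), so treat this as an assumption to verify, not a confirmed fix.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: ask executors to load user jars (including the Drill JDBC driver)
// before Spark's own classpath, so a single, consistent Netty version is used.
// Property names differ by Spark release -- verify against the deployed version.
val conf = new SparkConf()
  .setAppName("DrillJdbcExample")
  .set("spark.files.userClassPathFirst", "true")    // Spark 1.2.x (experimental)
  .set("spark.executor.userClassPathFirst", "true") // Spark 1.3+ equivalent
val sc = new SparkContext(conf)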
Re: Can we say 1 RDD is generated every batch interval?
foreachRDD iterates through the partitions in the RDD and executes the operations for each partition, I guess.

> On 29-Dec-2014, at 10:19 pm, SamyaMaiti wrote:
>
> Hi All,
>
> Please clarify.
>
> Can we say 1 RDD is generated every batch interval?
>
> If the above is true, then is the foreachRDD() operator executed one and only once for each batch processed?
>
> Regards,
> Sam
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-say-1-RDD-is-generated-every-batch-interval-tp20885.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
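[Editor's note] To make the per-batch behaviour concrete, here is a minimal sketch (the socket source and names are assumptions, not from the thread): each batch interval produces one RDD, foreachRDD runs once per batch on the driver with that RDD, and foreachPartition then runs on the executors for each of its partitions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIntervalExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchIntervalExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // one RDD is produced per 10-second batch

    val lines = ssc.socketTextStream("localhost", 9999) // assumed test source

    lines.foreachRDD { rdd =>
      // Runs once per batch interval, on the driver, with that batch's RDD.
      println(s"Batch RDD with ${rdd.partitions.length} partitions")
      rdd.foreachPartition { iter =>
        // Runs on the executors, once per partition of the batch RDD.
        iter.foreach(line => println(line))
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}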
java.lang.ArithmeticException while create Parquet
All, the code below causes the following error. However, if I replace JavaHiveContext with HiveContext (see the commented-out line in the code) and the Java bean class with a Scala case class, the same code works fine. Any idea why this could be happening?

java.lang.ArithmeticException: / by zero
at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:99)
at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:92)
at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.api.java.JavaHiveContext
import org.apache.spark.streaming.{Duration, StreamingContext}
import org.apache.spark.{Logging, SparkConf}

object SparkStreamingToParquet extends Logging {

  /**
   * @param args
   * @throws Exception
   */
  def main(args: Array[String]) {
    //if (args.length < 6) {
    //  logInfo("Please provide valid parameters: ")
    //  logInfo("make sure you give the full folder path with '/' at the end i.e /user/hdfs/abc/")
    //  System.exit(1)
    //}
    val CHECKPOINT_DIR = "hdfs://127.0.0.1:/user/checkpoint/" //args(6)
    val jssc: StreamingContext = StreamingContext.getOrCreate(CHECKPOINT_DIR, () => {
      createContext(args)
    })
    jssc.start
    jssc.awaitTermination
  }

  def createContext(args: Array[String]): StreamingContext = {
    val CHECKPOINT_DIR = "hdfs://127.0.0.1:/user/checkpoint" //args(6)
    val sparkConf: SparkConf = new SparkConf()
    val HDFS_URI = "hdfs://127.0.0.1:" //sparkConf.get("spark.philips.hdfsuri")
    val HDFS_FILE_LOC = HDFS_URI + "/user/logs/" //args(0); folder to stream from
    val IMPALA_TABLE_LOC = HDFS_URI + "/user/impala/" //args(1); Impala table location
    val TEMP_TABLE_NAME = "temp_json" //args(2); temp table name for the hive context
    val BEAN_CLASS_NAME = "Person" //args(3)
    val SPARK_APP_NAME = "Monitor" //args(4)
    sparkConf.setAppName(SPARK_APP_NAME).setMaster("local[2]")
    var noOfCores = "3"
    //if (args.length >= 6) {
    //  noOfCores = args(5)
    //}
    sparkConf.set("spark.cores.max", noOfCores)
    val jssc: StreamingContext = new StreamingContext(sparkConf, new Duration(3))
    val stream = jssc.textFileStream(HDFS_FILE_LOC)
    stream.foreachRDD(rdd => {
      if (rdd != null && rdd.count() > 0) {
        val hcontext = new JavaHiveContext(rdd.sparkContext)
        hcontext.createParquetFile(Class.forName(BEAN_CLASS_NAME), IMPALA_TABLE_LOC, true, org.apache.spark.deploy.SparkHadoopUtil.get.conf).registerTempTable(TEMP_TABLE_NAME)
        //hcontext.createParquetFile[Person](IMPALA_TABLE_LOC, true, org.apache.spark.deploy.SparkHadoopUtil.get.conf).registerTempTable(TEMP_TABLE_NAME)
        val schRdd = hcontext.jsonRDD(rdd)
        schRdd.insertInto(TEMP_TABLE_NAME)
      }
    })
    jssc.checkpoint(CHECKPOINT_DIR)
    jssc
  }
}
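[Editor's note] For reference, a minimal sketch of the variant the message describes as working (HiveContext plus a Scala case class instead of JavaHiveContext plus a Java bean), based on the Spark 1.1-era createParquetFile API already used in this thread; the paths, table name, and Person fields are assumptions:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

case class Person(name: String, age: String)

object ParquetWithCaseClass {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetWithCaseClass").setMaster("local[2]"))
    val hiveContext = new HiveContext(sc)

    // Spark 1.1-era API: create the Parquet-backed table from the case class schema,
    // then insert a schema RDD inferred from JSON into it. Path is a placeholder.
    hiveContext
      .createParquetFile[Person]("hdfs://namenode:8020/tmp/impala/person", true,
        org.apache.spark.deploy.SparkHadoopUtil.get.conf)
      .registerTempTable("temp_json")

    val jsonRdd = sc.parallelize(Seq("""{"name":"a","age":"30"}"""))
    hiveContext.jsonRDD(jsonRdd).insertInto("temp_json")
  }
}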
RE: Dynamically InferSchema From Hive and Create parquet file
Any idea on this?

From: Jahagirdar, Madhu
Sent: Thursday, November 06, 2014 12:28 PM
To: Michael Armbrust
Cc: u...@spark.incubator.apache.org
Subject: RE: Dynamically InferSchema From Hive and Create parquet file

When I create a Hive table with Parquet format, it does not create any metadata until data is inserted. So the data needs to be there before I can infer the schema; otherwise it throws an error. Is there any workaround for this?

From: Michael Armbrust [mich...@databricks.com]
Sent: Thursday, November 06, 2014 12:27 AM
To: Jahagirdar, Madhu
Cc: u...@spark.incubator.apache.org
Subject: Re: Dynamically InferSchema From Hive and Create parquet file

That method is for creating a new directory to hold Parquet data when there is no Hive metastore available, which is why you have to specify the schema. If you've already created the table in the metastore you can just query it using the sql method:

javahiveContext.sql("SELECT * FROM parquetTable");

You can also load the data as a SchemaRDD without using the metastore, since Parquet is self-describing:

javahiveContext.parquetFile(".../path/to/parquetFiles").registerTempTable("parquetData")

On Wed, Nov 5, 2014 at 2:15 AM, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

Currently the createParquetFile method needs a bean class as one of the parameters:

javahiveContext.createParquetFile(XBean.class, IMPALA_TABLE_LOC, true, new Configuration()).registerTempTable(TEMP_TABLE_NAME);

Is it possible to dynamically infer the schema from Hive using the hive context and the table name, and then give that schema?

Regards,
Madhu Jahagirdar
RE: CheckPoint Issue with JsonRDD
Michael, any idea on this?

From: Jahagirdar, Madhu
Sent: Thursday, November 06, 2014 2:36 PM
To: mich...@databricks.com; user
Subject: CheckPoint Issue with JsonRDD

When we enable checkpointing and use jsonRDD we get the following error. Is this a bug?

Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.rdd.RDD.<init>(RDD.scala:125)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
at org.apache.spark.sql.SQLContext.applySchema(SQLContext.scala:132)
at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:194)
at SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:69)
at SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:63)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.types.{StructType, StructField, StringType}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{Logging, SparkConf}
import org.apache.spark.sql.api.java.JavaSchemaRDD
import org.apache.spark.sql.hive.api.java.JavaHiveContext
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

object SparkStreamingToParquet extends Logging {

  /**
   * @param args
   * @throws Exception
   */
  def main(args: Array[String]) {
    if (args.length < 3) {
      logInfo("Please provide valid parameters: ")
      logInfo("make sure you give the full folder path with '/' at the end i.e /user/hdfs/abc/")
      System.exit(1)
    }
    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)
    val jssc: StreamingContext = StreamingContext.getOrCreate(CHECKPOINT_DIR, () => {
      createContext(args)
    })
    jssc.start
    jssc.awaitTermination
  }

  def createContext(args: Array[String]): StreamingContext = {
    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)
    val sparkConf: SparkConf = new SparkConf().setAppName("Json to Parquet").set("spark.cores.max", "3")
    val jssc: StreamingContext = new StreamingContext(sparkConf, new Duration(3))
    val hivecontext: HiveContext = new HiveContext(jssc.sparkContext)
    hivecontext.createParquetFile[Person](IMPALA_TABLE_LOC, true, org.apache.spark.deploy.SparkHadoopUtil.get.conf).registerTempTable(TEMP_TABLE_NAME)
    val schemaString = "name age"
    val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    val textFileStream = jssc.textFileStream(HDFS_FILE_LOC)
    textFileStream.foreachRDD(rdd => {
      if (rdd != null && rdd.count() > 0) {
        val schRdd = hivecontext.jsonRDD(rdd, schema)
        logInfo("inserting into table: " + TEMP_TABLE_NAME)
        schRdd.insertInto(TEMP_TABLE_NAME)
      }
    })
    jssc.checkpoint(CHECKPOINT_DIR)
    jssc
  }
}

case class Person(name: String, age: String) extends Serializable

Regards,
Madhu Jahagirdar
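[Editor's note] One pattern often suggested for this kind of NullPointerException (a sketch, not a confirmed fix for this exact report): do not capture a HiveContext created in createContext inside the foreachRDD closure, because after recovery from a checkpoint that captured context is not re-initialized. Instead, create or look up the context lazily from the batch RDD's SparkContext. Names below are assumptions:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Lazily-initialized, per-JVM HiveContext so the same instance is reused across
// batches and re-created cleanly after recovery from a checkpoint.
object HiveContextSingleton {
  @transient private var instance: HiveContext = _

  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) {
      instance = new HiveContext(sc)
    }
    instance
  }
}

// Inside createContext, the foreachRDD body would then look roughly like:
//   textFileStream.foreachRDD { rdd =>
//     if (rdd.count() > 0) {
//       val hivecontext = HiveContextSingleton.getInstance(rdd.sparkContext)
//       hivecontext.jsonRDD(rdd, schema).insertInto(TEMP_TABLE_NAME)
//     }
//   }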
CheckPoint Issue with JsonRDD
When we enable checkpointing and use jsonRDD we get the following error. Is this a bug?

Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.rdd.RDD.<init>(RDD.scala:125)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
at org.apache.spark.sql.SQLContext.applySchema(SQLContext.scala:132)
at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:194)
at SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:69)
at SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:63)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.types.{StructType, StructField, StringType}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{Logging, SparkConf}
import org.apache.spark.sql.api.java.JavaSchemaRDD
import org.apache.spark.sql.hive.api.java.JavaHiveContext
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

object SparkStreamingToParquet extends Logging {

  /**
   * @param args
   * @throws Exception
   */
  def main(args: Array[String]) {
    if (args.length < 3) {
      logInfo("Please provide valid parameters: ")
      logInfo("make sure you give the full folder path with '/' at the end i.e /user/hdfs/abc/")
      System.exit(1)
    }
    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)
    val jssc: StreamingContext = StreamingContext.getOrCreate(CHECKPOINT_DIR, () => {
      createContext(args)
    })
    jssc.start
    jssc.awaitTermination
  }

  def createContext(args: Array[String]): StreamingContext = {
    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)
    val sparkConf: SparkConf = new SparkConf().setAppName("Json to Parquet").set("spark.cores.max", "3")
    val jssc: StreamingContext = new StreamingContext(sparkConf, new Duration(3))
    val hivecontext: HiveContext = new HiveContext(jssc.sparkContext)
    hivecontext.createParquetFile[Person](IMPALA_TABLE_LOC, true, org.apache.spark.deploy.SparkHadoopUtil.get.conf).registerTempTable(TEMP_TABLE_NAME)
    val schemaString = "name age"
    val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    val textFileStream = jssc.textFileStream(HDFS_FILE_LOC)
    textFileStream.foreachRDD(rdd => {
      if (rdd != null && rdd.count() > 0) {
        val schRdd = hivecontext.jsonRDD(rdd, schema)
        logInfo("inserting into table: " + TEMP_TABLE_NAME)
        schRdd.insertInto(TEMP_TABLE_NAME)
      }
    })
    jssc.checkpoint(CHECKPOINT_DIR)
    jssc
  }
}

case class Person(name: String, age: String) extends Serializable

Regards,
Madhu Jahagirdar
RE: Dynamically InferSchema From Hive and Create parquet file
When I create a Hive table with Parquet format, it does not create any metadata until data is inserted. So the data needs to be there before I can infer the schema; otherwise it throws an error. Is there any workaround for this?

From: Michael Armbrust [mich...@databricks.com]
Sent: Thursday, November 06, 2014 12:27 AM
To: Jahagirdar, Madhu
Cc: u...@spark.incubator.apache.org
Subject: Re: Dynamically InferSchema From Hive and Create parquet file

That method is for creating a new directory to hold Parquet data when there is no Hive metastore available, which is why you have to specify the schema. If you've already created the table in the metastore you can just query it using the sql method:

javahiveContext.sql("SELECT * FROM parquetTable");

You can also load the data as a SchemaRDD without using the metastore, since Parquet is self-describing:

javahiveContext.parquetFile(".../path/to/parquetFiles").registerTempTable("parquetData")

On Wed, Nov 5, 2014 at 2:15 AM, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

Currently the createParquetFile method needs a bean class as one of the parameters:

javahiveContext.createParquetFile(XBean.class, IMPALA_TABLE_LOC, true, new Configuration()).registerTempTable(TEMP_TABLE_NAME);

Is it possible to dynamically infer the schema from Hive using the hive context and the table name, and then give that schema?

Regards,
Madhu Jahagirdar
Dynamically InferSchema From Hive and Create parquet file
Currently the createParquetFile method needs a bean class as one of the parameters:

javahiveContext.createParquetFile(XBean.class, IMPALA_TABLE_LOC, true, new Configuration()).registerTempTable(TEMP_TABLE_NAME);

Is it possible to dynamically infer the schema from Hive using the hive context and the table name, and then give that schema?

Regards,
Madhu Jahagirdar
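[Editor's note] A small sketch of the idea being asked about, assuming the table already exists in the metastore with data (the table name, path handling, and field names are assumptions): query the table once through the hive context, take the schema from the resulting SchemaRDD, and reuse it, for example with jsonRDD, instead of hard-coding a bean class.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object InferSchemaFromHive {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("InferSchemaFromHive"))
    val hiveContext = new HiveContext(sc)

    // Pull the schema of an existing metastore table (the table name is hypothetical).
    val existing = hiveContext.sql("SELECT * FROM parquet_table LIMIT 0")
    val inferredSchema = existing.schema

    // Reuse that schema when parsing a JSON RDD.
    val jsonRdd = sc.parallelize(Seq("""{"name":"a","age":"30"}"""))
    hiveContext.jsonRDD(jsonRdd, inferredSchema).registerTempTable("json_with_hive_schema")
  }
}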
Issue with Spark Twitter Streaming
All, we are using Spark Streaming to receive data from the Twitter stream. This is running behind a proxy. We have done the following configuration inside Spark Streaming for twitter4j to work behind the proxy:

def main(args: Array[String]) {
  val filters = Array("Modi")
  System.setProperty("twitter4j.oauth.consumerKey", "*")
  System.setProperty("twitter4j.oauth.consumerSecret", "*")
  System.setProperty("twitter4j.oauth.accessToken", "*")
  System.setProperty("twitter4j.oauth.accessTokenSecret", "*")
  System.setProperty("twitter4j.http.proxyHost", "X.X.X.X")
  System.setProperty("twitter4j.http.proxyPort", "")
  System.setProperty("twitter4j.http.useSSL", "true")
  val conf = new SparkConf().setAppName("TwitterPopularTags")
  val ssc = new StreamingContext(conf, Seconds(60))
  val stream = TwitterUtils.createStream(ssc, None, filters)
  stream.print()
  ssc.start()
  ssc.awaitTermination()
}

Jars used:
spark-streaming-twitter_2.10-1.1.0
twitter4j-core-3.0.3.jar
twitter4j-stream-3.0.3.jar

When the Spark job is run with local[2] on a single node (not on a cluster), with the same settings above, it is able to pull the data and works like a charm behind the proxy. The same code, when run on a cluster (below) on the same network with the above settings, throws the error below. Not sure what is going wrong; any help is appreciated. We checked the environment variables of the executors, and all the above system properties are set.

bin/spark-submit --class SparkTwitter2Kafka --master spark://IPADDRESS:7077 spark-twitter.jar

14/10/13 14:00:10 ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Error receiving tweets - connect timed out
Relevant discussions can be found on the Internet at:
http://www.google.co.jp/search?q=944a924a or http://www.google.co.jp/search?q=24fd66dc
TwitterException{exceptionCode=[944a924a-24fd66dc 944a924a-24fd66b2], statusCode=-1, message=null, code=-1, retryAfter=-1, rateLimitStatus=null, version=3.0.5}
at twitter4j.internal.http.HttpClientImpl.request(HttpClientImpl.java:177)
at twitter4j.internal.http.HttpClientWrapper.request(HttpClientWrapper.java:61)
at twitter4j.internal.http.HttpClientWrapper.post(HttpClientWrapper.java:98)
at twitter4j.TwitterStreamImpl.getFilterStream(TwitterStreamImpl.java:304)
at twitter4j.TwitterStreamImpl$7.getStream(TwitterStreamImpl.java:292)
at twitter4j.TwitterStreamImpl$TwitterStreamConsumer.run(TwitterStreamImpl.java:462)
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:618)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:275)
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:371)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:250)
at twitter4j.internal.http.HttpClientImpl.request(HttpClientImpl.java:135)
... 5 more
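[Editor's note] One likely explanation (an assumption, not confirmed in the thread): in cluster mode the Twitter receiver runs inside an executor JVM, where System.setProperty calls made in the driver's main() never take effect, so twitter4j on the executor has no proxy configured. A sketch of forwarding the proxy settings to the executor JVMs instead, with placeholder host and port values; the OAuth properties would need to be forwarded the same way:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterBehindProxy {
  def main(args: Array[String]): Unit = {
    // Hypothetical proxy host/port; OAuth keys would also be passed as -D options here.
    val proxyOpts = "-Dtwitter4j.http.proxyHost=proxy.example.com " +
      "-Dtwitter4j.http.proxyPort=8080 -Dtwitter4j.http.useSSL=true"

    val conf = new SparkConf()
      .setAppName("TwitterPopularTags")
      // Forward the twitter4j properties to every executor JVM, where the receiver runs.
      .set("spark.executor.extraJavaOptions", proxyOpts)

    val ssc = new StreamingContext(conf, Seconds(60))
    val stream = TwitterUtils.createStream(ssc, None, Array("Modi"))
    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}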
RE: Dstream Transformations
Doesn't Spark keep track of the DAG lineage and restart from where it stopped? Does it always have to start from the beginning of the lineage when the job fails?

From: Massimiliano Tomassi [max.toma...@gmail.com]
Sent: Monday, October 06, 2014 2:40 PM
To: Jahagirdar, Madhu
Cc: Akhil Das; user
Subject: Re: Dstream Transformations

From the Spark Streaming Programming Guide (http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node): ...output operations (like foreachRDD) have at-least once semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure.

I think that when a worker fails, the entire graph of transformations/actions will be reapplied to that RDD. This means that, in your case, both storing operations will be executed again. For this reason, in a video I watched on YouTube, they suggest making all output operations idempotent. Unfortunately that is not always possible: e.g. you are building an analytics system and you need to increment counters. This is what I've got so far; anyone have a different point of view?

On 6 October 2014 08:59, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

Given that I have multiple worker nodes, when Spark schedules the job again on the worker nodes that are alive, does it then again store the data in Elasticsearch and then Flume, or does it only run the functions that store to Flume?

Regards,
Madhu Jahagirdar

From: Akhil Das [ak...@sigmoidanalytics.com]
Sent: Monday, October 06, 2014 1:20 PM
To: Jahagirdar, Madhu
Cc: user
Subject: Re: Dstream Transformations

AFAIK Spark doesn't restart worker nodes itself. You can have multiple worker nodes, and in that case if one worker node goes down, Spark will try to recompute the lost RDDs again on the workers that are still alive.

Thanks
Best Regards

On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

In my Spark Streaming program I have created Kafka utils to receive data and store it in Elasticsearch and in Flume. Both storing functions are applied on the same DStream. My question: what is the behavior of Spark if, after storing the data in Elasticsearch, the worker node dies before storing it in Flume? Does it restart the worker and then again store the data in Elasticsearch and then Flume, or does it only run the functions that store to Flume?

Regards
Madhu Jahagirdar

--
Massimiliano Tomassi
web: http://about.me/maxtomassi
e-mail: max.toma...@gmail.com
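[Editor's note] To illustrate the idempotency suggestion above, a minimal sketch (the upsert call and the key scheme are placeholders, not from the thread): give every record a stable, deterministic key so that a replayed batch overwrites the earlier write instead of creating a duplicate.

import org.apache.spark.streaming.dstream.DStream

case class Event(id: String, payload: String)

// Sketch of an idempotent output operation: each record carries a stable key, so
// writing it twice (after a worker failure and replay) leaves the sink in the same
// state as writing it once. `upsert` stands in for any key-based write, e.g.
// indexing by document id in Elasticsearch; it is a placeholder, not a real API.
def saveIdempotently(events: DStream[Event], upsert: (String, String) => Unit): Unit = {
  events.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      partition.foreach(e => upsert(e.id, e.payload)) // replay-safe: same key overwrites
    }
  }
}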
RE: Dstream Transformations
Given that I have multiple worker nodes, when Spark schedules the job again on the worker nodes that are alive, does it then again store the data in Elasticsearch and then Flume, or does it only run the functions that store to Flume?

Regards,
Madhu Jahagirdar

From: Akhil Das [ak...@sigmoidanalytics.com]
Sent: Monday, October 06, 2014 1:20 PM
To: Jahagirdar, Madhu
Cc: user
Subject: Re: Dstream Transformations

AFAIK Spark doesn't restart worker nodes itself. You can have multiple worker nodes, and in that case if one worker node goes down, Spark will try to recompute the lost RDDs again on the workers that are still alive.

Thanks
Best Regards

On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

In my Spark Streaming program I have created Kafka utils to receive data and store it in Elasticsearch and in Flume. Both storing functions are applied on the same DStream. My question: what is the behavior of Spark if, after storing the data in Elasticsearch, the worker node dies before storing it in Flume? Does it restart the worker and then again store the data in Elasticsearch and then Flume, or does it only run the functions that store to Flume?

Regards
Madhu Jahagirdar
Dstream Transformations
In my Spark Streaming program I have created Kafka utils to receive data and store it in Elasticsearch and in Flume. Both storing functions are applied on the same DStream. My question: what is the behavior of Spark if, after storing the data in Elasticsearch, the worker node dies before storing it in Flume? Does it restart the worker and then again store the data in Elasticsearch and then Flume, or does it only run the functions that store to Flume?

Regards
Madhu Jahagirdar
Spark RDD Disk Persistance
Should I use disk-based persistence for RDDs? If the machine goes down during program execution, will the data be intact and not lost the next time I rerun the program?

Regards,
Madhu Jahagirdar
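[Editor's note] A short sketch of the distinction involved here, under the assumption (worth verifying for your release) that rdd.persist(StorageLevel.DISK_ONLY) only caches blocks for the lifetime of the running application, while data written to HDFS (or checkpointed) is what survives a rerun; paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DiskPersistVsSave {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DiskPersistVsSave"))
    val data = sc.textFile("hdfs:///data/input") // placeholder path

    // Cached to local disk on the executors: avoids recomputation within THIS run,
    // but the blocks are gone once the application (or the machine) goes down.
    val cached = data.map(_.toUpperCase).persist(StorageLevel.DISK_ONLY)
    cached.count()

    // Materialized to HDFS: a later, separate run can read this back even after a crash.
    cached.saveAsTextFile("hdfs:///data/output-run1") // placeholder path
  }
}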
hdfs short circuit
Can I enable Spark to use the "dfs.client.read.shortcircuit" property to improve performance and read natively on local nodes instead of going through the HDFS API?
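[Editor's note] Short-circuit local reads are an HDFS client/DataNode feature, so they are enabled through the Hadoop configuration that Spark's HDFS client picks up rather than through a Spark-specific setting. A sketch of setting it programmatically, assuming the DataNodes are also configured for short-circuit reads; the domain socket path below is a placeholder that must match the cluster's hdfs-site.xml:

import org.apache.spark.{SparkConf, SparkContext}

object ShortCircuitReadExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ShortCircuitReadExample"))

    // Standard HDFS client settings (normally placed in hdfs-site.xml on the classpath);
    // the socket path is a placeholder and must match the DataNode configuration.
    sc.hadoopConfiguration.set("dfs.client.read.shortcircuit", "true")
    sc.hadoopConfiguration.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket")

    val rdd = sc.textFile("hdfs:///data/input") // placeholder path
    println(rdd.count())
  }
}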