RE: Spark Mesos Dispatcher
Spark 1.3 does not have the MesosDispatcher and does not support Mesos cluster mode. Is it still possible to create a dispatcher using 1.4 and run 1.3 drivers through that dispatcher?

From: Jerry Lam [chiling...@gmail.com]
Sent: Monday, July 20, 2015 8:27 AM
To: Jahagirdar, Madhu
Cc: user; d...@spark.apache.org
Subject: Re: Spark Mesos Dispatcher

Yes.

Sent from my iPhone

On 19 Jul, 2015, at 10:52 pm, "Jahagirdar, Madhu" <madhu.jahagir...@philips.com> wrote:

All, can we run different versions of Spark using the same Mesos Dispatcher? For example, can we run drivers with Spark 1.3 and Spark 1.4 at the same time?

Regards,
Madhu Jahagirdar
Spark Mesos Dispatcher
All, can we run different versions of Spark using the same Mesos Dispatcher? For example, can we run drivers with Spark 1.3 and Spark 1.4 at the same time?

Regards,
Madhu Jahagirdar
Spark Drill 1.2.1 - error
All, we are getting the error below when using the Drill JDBC driver with Spark. Please let us know what the issue could be.

java.lang.IllegalAccessError: class io.netty.buffer.UnsafeDirectLittleEndian cannot access its superclass io.netty.buffer.WrappedByteBuf
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at org.apache.drill.exec.memory.TopLevelAllocator.<init>(TopLevelAllocator.java:43)
at org.apache.drill.exec.memory.TopLevelAllocator.<init>(TopLevelAllocator.java:68)
at org.apache.drill.jdbc.DrillConnectionImpl.<init>(DrillConnectionImpl.java:91)
at org.apache.drill.jdbc.DrillJdbc41Factory$DrillJdbc41Connection.<init>(DrillJdbc41Factory.java:88)
at org.apache.drill.jdbc.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:57)
at org.apache.drill.jdbc.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:43)
at org.apache.drill.jdbc.DrillFactory.newConnection(DrillFactory.java:51)
at net.hydromatic.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:126)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:233)
at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:17)
at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:17)
at org.apache.spark.rdd.JdbcRDD$$anon$1.<init>(JdbcRDD.scala:76)
at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:73)
at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:53)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/02/26 10:16:03 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalAccessError: class io.netty.buffer.UnsafeDirectLittleEndian cannot access its superclass io.netty.buffer.WrappedByteBuf
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at org.apache.drill.exec.memory.TopLevelAllocator.<init>(TopLevelAllocator.java:43)
at org.apache.drill.exec.memory.TopLevelAllocator.<init>(TopLevelAllocator.java:68)
at org.apache.drill.jdbc.DrillConnectionImpl.<init>(DrillConnectionImpl.java:91)
at org.apache.drill.jdbc.DrillJdbc41Factory$DrillJdbc41Connection.<init>(DrillJdbc41Factory.java:88)
at org.apache.drill.jdbc.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:57)
at org.apache.drill.jdbc.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:43)
at org.apache.drill.jdbc.DrillFactory.newConnection(DrillFactory.java:51)
at net.hydromatic.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:126)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:233)
at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:17)
at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:17)
at org.apache.spark.rdd.JdbcRDD$$anon$1.<init>(JdbcRDD.scala:76)
at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:73)
at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:53)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.scheduler.

Regards,
Madhu Jahagirdar
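[Editor's note] This IllegalAccessError usually indicates two different Netty builds on the executor classpath (Spark ships its own Netty, and the Drill JDBC driver bundles another), so the io.netty.buffer classes end up loaded from mismatched jars. A minimal sketch of one possible workaround, assuming the Drill JDBC jar is shipped with the application: prefer the user-supplied jars on the executor classpath. The exact property name depends on the Spark release (spark.files.userClassPathFirst on 1.2, spark.executor.userClassPathFirst on 1.3+), so treat this as an assumption to verify, not a confirmed fix.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: ask executors to load user jars (including the Drill JDBC driver)
// before Spark's own classpath, so a single, consistent Netty version is used.
// Property names differ by Spark release -- verify against the deployed version.
val conf = new SparkConf()
  .setAppName("DrillJdbcExample")
  .set("spark.files.userClassPathFirst", "true")    // Spark 1.2.x (experimental)
  .set("spark.executor.userClassPathFirst", "true") // Spark 1.3+ equivalent
val sc = new SparkContext(conf)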
Re: Can we say 1 RDD is generated every batch interval?
foreachRDD iterates through the partitions in the RDD and executes the operations for each partition, I guess.

> On 29-Dec-2014, at 10:19 pm, SamyaMaiti wrote:
>
> Hi All,
>
> Please clarify.
>
> Can we say 1 RDD is generated every batch interval?
>
> If the above is true, then is the foreachRDD() operator executed one and only once for each batch processed?
>
> Regards,
> Sam
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-say-1-RDD-is-generated-every-batch-interval-tp20885.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
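[Editor's note] To make the per-batch behaviour concrete, here is a minimal sketch (the socket source and names are assumptions, not from the thread): each batch interval produces one RDD, foreachRDD runs once per batch on the driver with that RDD, and foreachPartition then runs on the executors for each of its partitions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIntervalExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BatchIntervalExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // one RDD is produced per 10-second batch

    val lines = ssc.socketTextStream("localhost", 9999) // assumed test source

    lines.foreachRDD { rdd =>
      // Runs once per batch interval, on the driver, with that batch's RDD.
      println(s"Batch RDD with ${rdd.partitions.length} partitions")
      rdd.foreachPartition { iter =>
        // Runs on the executors, once per partition of the batch RDD.
        iter.foreach(line => println(line))
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}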
java.lang.ArithmeticException while create Parquet
All, the code below causes the following error. However, if I replace JavaHiveContext with HiveContext (see the commented-out line in the code) and the Java bean class with a Scala case class, the same code works fine. Any idea why this could be happening?

java.lang.ArithmeticException: / by zero
at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:99)
at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:92)
at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.api.java.JavaHiveContext
import org.apache.spark.streaming.{Duration, StreamingContext}
import org.apache.spark.{Logging, SparkConf}

object SparkStreamingToParquet extends Logging {

  /**
   * @param args
   * @throws Exception
   */
  def main(args: Array[String]) {
    //if (args.length < 6) {
    //  logInfo("Please provide valid parameters: ")
    //  logInfo("make sure you give the full folder path with '/' at the end i.e /user/hdfs/abc/")
    //  System.exit(1)
    //}
    val CHECKPOINT_DIR = "hdfs://127.0.0.1:/user/checkpoint/" //args(6)
    val jssc: StreamingContext = StreamingContext.getOrCreate(CHECKPOINT_DIR, () => {
      createContext(args)
    })
    jssc.start
    jssc.awaitTermination
  }

  def createContext(args: Array[String]): StreamingContext = {
    val CHECKPOINT_DIR = "hdfs://127.0.0.1:/user/checkpoint" //args(6)
    val sparkConf: SparkConf = new SparkConf()
    val HDFS_URI = "hdfs://127.0.0.1:" //sparkConf.get("spark.philips.hdfsuri")
    val HDFS_FILE_LOC = HDFS_URI + "/user/logs/" //args(0); folder to stream from
    val IMPALA_TABLE_LOC = HDFS_URI + "/user/impala/" //args(1); Impala table location
    val TEMP_TABLE_NAME = "temp_json" //args(2); temp table name for the hive context
    val BEAN_CLASS_NAME = "Person" //args(3)
    val SPARK_APP_NAME = "Monitor" //args(4)
    sparkConf.setAppName(SPARK_APP_NAME).setMaster("local[2]")
    var noOfCores = "3"
    //if (args.length >= 6) {
    //  noOfCores = args(5)
    //}
    sparkConf.set("spark.cores.max", noOfCores)
    val jssc: StreamingContext = new StreamingContext(sparkConf, new Duration(3))
    val stream = jssc.textFileStream(HDFS_FILE_LOC)
    stream.foreachRDD(rdd => {
      if (rdd != null && rdd.count() > 0) {
        val hcontext = new JavaHiveContext(rdd.sparkContext)
        hcontext.createParquetFile(Class.forName(BEAN_CLASS_NAME), IMPALA_TABLE_LOC, true, org.apache.spark.deploy.SparkHadoopUtil.get.conf).registerTempTable(TEMP_TABLE_NAME)
        //hcontext.createParquetFile[Person](IMPALA_TABLE_LOC, true, org.apache.spark.deploy.SparkHadoopUtil.get.conf).registerTempTable(TEMP_TABLE_NAME)
        val schRdd = hcontext.jsonRDD(rdd)
        schRdd.insertInto(TEMP_TABLE_NAME)
      }
    })
    jssc.checkpoint(CHECKPOINT_DIR)
    jssc
  }
}
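[Editor's note] For reference, a minimal sketch of the variant the message describes as working (HiveContext plus a Scala case class instead of JavaHiveContext plus a Java bean), based on the Spark 1.1-era createParquetFile API already used in this thread; the paths, table name, and Person fields are assumptions:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

case class Person(name: String, age: String)

object ParquetWithCaseClass {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetWithCaseClass").setMaster("local[2]"))
    val hiveContext = new HiveContext(sc)

    // Spark 1.1-era API: create the Parquet-backed table from the case class schema,
    // then insert a schema RDD inferred from JSON into it. Path is a placeholder.
    hiveContext
      .createParquetFile[Person]("hdfs://namenode:8020/tmp/impala/person", true,
        org.apache.spark.deploy.SparkHadoopUtil.get.conf)
      .registerTempTable("temp_json")

    val jsonRdd = sc.parallelize(Seq("""{"name":"a","age":"30"}"""))
    hiveContext.jsonRDD(jsonRdd).insertInto("temp_json")
  }
}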
RE: Dynamically InferSchema From Hive and Create parquet file
Any idea on this?

From: Jahagirdar, Madhu
Sent: Thursday, November 06, 2014 12:28 PM
To: Michael Armbrust
Cc: u...@spark.incubator.apache.org
Subject: RE: Dynamically InferSchema From Hive and Create parquet file

When I create a Hive table with Parquet format, it does not create any metadata until data is inserted. So the data needs to be there before I can infer the schema; otherwise it throws an error. Is there any workaround for this?

From: Michael Armbrust [mich...@databricks.com]
Sent: Thursday, November 06, 2014 12:27 AM
To: Jahagirdar, Madhu
Cc: u...@spark.incubator.apache.org
Subject: Re: Dynamically InferSchema From Hive and Create parquet file

That method is for creating a new directory to hold Parquet data when there is no Hive metastore available, which is why you have to specify the schema. If you've already created the table in the metastore you can just query it using the sql method:

javahiveContext.sql("SELECT * FROM parquetTable");

You can also load the data as a SchemaRDD without using the metastore, since Parquet is self-describing:

javahiveContext.parquetFile(".../path/to/parquetFiles").registerTempTable("parquetData")

On Wed, Nov 5, 2014 at 2:15 AM, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

Currently the createParquetFile method needs a bean class as one of the parameters:

javahiveContext.createParquetFile(XBean.class, IMPALA_TABLE_LOC, true, new Configuration()).registerTempTable(TEMP_TABLE_NAME);

Is it possible to dynamically infer the schema from Hive using the hive context and the table name, and then give that schema?

Regards,
Madhu Jahagirdar
RE: CheckPoint Issue with JsonRDD
Michael, any idea on this?

From: Jahagirdar, Madhu
Sent: Thursday, November 06, 2014 2:36 PM
To: mich...@databricks.com; user
Subject: CheckPoint Issue with JsonRDD

When we enable checkpointing and use jsonRDD we get the following error. Is this a bug?

Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.rdd.RDD.<init>(RDD.scala:125)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
at org.apache.spark.sql.SQLContext.applySchema(SQLContext.scala:132)
at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:194)
at SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:69)
at SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:63)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.types.{StructType, StructField, StringType}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{Logging, SparkConf}
import org.apache.spark.sql.api.java.JavaSchemaRDD
import org.apache.spark.sql.hive.api.java.JavaHiveContext
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

object SparkStreamingToParquet extends Logging {

  /**
   * @param args
   * @throws Exception
   */
  def main(args: Array[String]) {
    if (args.length < 3) {
      logInfo("Please provide valid parameters: ")
      logInfo("make sure you give the full folder path with '/' at the end i.e /user/hdfs/abc/")
      System.exit(1)
    }
    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)
    val jssc: StreamingContext = StreamingContext.getOrCreate(CHECKPOINT_DIR, () => {
      createContext(args)
    })
    jssc.start
    jssc.awaitTermination
  }

  def createContext(args: Array[String]): StreamingContext = {
    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)
    val sparkConf: SparkConf = new SparkConf().setAppName("Json to Parquet").set("spark.cores.max", "3")
    val jssc: StreamingContext = new StreamingContext(sparkConf, new Duration(3))
    val hivecontext: HiveContext = new HiveContext(jssc.sparkContext)
    hivecontext.createParquetFile[Person](IMPALA_TABLE_LOC, true, org.apache.spark.deploy.SparkHadoopUtil.get.conf).registerTempTable(TEMP_TABLE_NAME)
    val schemaString = "name age"
    val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    val textFileStream = jssc.textFileStream(HDFS_FILE_LOC)
    textFileStream.foreachRDD(rdd => {
      if (rdd != null && rdd.count() > 0) {
        val schRdd = hivecontext.jsonRDD(rdd, schema)
        logInfo("inserting into table: " + TEMP_TABLE_NAME)
        schRdd.insertInto(TEMP_TABLE_NAME)
      }
    })
    jssc.checkpoint(CHECKPOINT_DIR)
    jssc
  }
}

case class Person(name: String, age: String) extends Serializable

Regards,
Madhu Jahagirdar
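[Editor's note] One pattern often suggested for this kind of NullPointerException (a sketch, not a confirmed fix for this exact report): do not capture a HiveContext created in createContext inside the foreachRDD closure, because after recovery from a checkpoint that captured context is not re-initialized. Instead, create or look up the context lazily from the batch RDD's SparkContext. Names below are assumptions:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Lazily-initialized, per-JVM HiveContext so the same instance is reused across
// batches and re-created cleanly after recovery from a checkpoint.
object HiveContextSingleton {
  @transient private var instance: HiveContext = _

  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) {
      instance = new HiveContext(sc)
    }
    instance
  }
}

// Inside createContext, the foreachRDD body would then look roughly like:
//   textFileStream.foreachRDD { rdd =>
//     if (rdd.count() > 0) {
//       val hivecontext = HiveContextSingleton.getInstance(rdd.sparkContext)
//       hivecontext.jsonRDD(rdd, schema).insertInto(TEMP_TABLE_NAME)
//     }
//   }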
CheckPoint Issue with JsonRDD
When we enable checkpointing and use jsonRDD we get the following error. Is this a bug?

Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.rdd.RDD.<init>(RDD.scala:125)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
at org.apache.spark.sql.SQLContext.applySchema(SQLContext.scala:132)
at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:194)
at SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:69)
at SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:63)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.types.{StructType, StructField, StringType}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{Logging, SparkConf}
import org.apache.spark.sql.api.java.JavaSchemaRDD
import org.apache.spark.sql.hive.api.java.JavaHiveContext
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

object SparkStreamingToParquet extends Logging {

  /**
   * @param args
   * @throws Exception
   */
  def main(args: Array[String]) {
    if (args.length < 3) {
      logInfo("Please provide valid parameters: ")
      logInfo("make sure you give the full folder path with '/' at the end i.e /user/hdfs/abc/")
      System.exit(1)
    }
    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)
    val jssc: StreamingContext = StreamingContext.getOrCreate(CHECKPOINT_DIR, () => {
      createContext(args)
    })
    jssc.start
    jssc.awaitTermination
  }

  def createContext(args: Array[String]): StreamingContext = {
    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)
    val sparkConf: SparkConf = new SparkConf().setAppName("Json to Parquet").set("spark.cores.max", "3")
    val jssc: StreamingContext = new StreamingContext(sparkConf, new Duration(3))
    val hivecontext: HiveContext = new HiveContext(jssc.sparkContext)
    hivecontext.createParquetFile[Person](IMPALA_TABLE_LOC, true, org.apache.spark.deploy.SparkHadoopUtil.get.conf).registerTempTable(TEMP_TABLE_NAME)
    val schemaString = "name age"
    val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    val textFileStream = jssc.textFileStream(HDFS_FILE_LOC)
    textFileStream.foreachRDD(rdd => {
      if (rdd != null && rdd.count() > 0) {
        val schRdd = hivecontext.jsonRDD(rdd, schema)
        logInfo("inserting into table: " + TEMP_TABLE_NAME)
        schRdd.insertInto(TEMP_TABLE_NAME)
      }
    })
    jssc.checkpoint(CHECKPOINT_DIR)
    jssc
  }
}

case class Person(name: String, age: String) extends Serializable

Regards,
Madhu Jahagirdar
RE: Dynamically InferSchema From Hive and Create parquet file
When I create a Hive table with Parquet format, it does not create any metadata until data is inserted. So the data needs to be there before I can infer the schema; otherwise it throws an error. Is there any workaround for this?

From: Michael Armbrust [mich...@databricks.com]
Sent: Thursday, November 06, 2014 12:27 AM
To: Jahagirdar, Madhu
Cc: u...@spark.incubator.apache.org
Subject: Re: Dynamically InferSchema From Hive and Create parquet file

That method is for creating a new directory to hold Parquet data when there is no Hive metastore available, which is why you have to specify the schema. If you've already created the table in the metastore you can just query it using the sql method:

javahiveContext.sql("SELECT * FROM parquetTable");

You can also load the data as a SchemaRDD without using the metastore, since Parquet is self-describing:

javahiveContext.parquetFile(".../path/to/parquetFiles").registerTempTable("parquetData")

On Wed, Nov 5, 2014 at 2:15 AM, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

Currently the createParquetFile method needs a bean class as one of the parameters:

javahiveContext.createParquetFile(XBean.class, IMPALA_TABLE_LOC, true, new Configuration()).registerTempTable(TEMP_TABLE_NAME);

Is it possible to dynamically infer the schema from Hive using the hive context and the table name, and then give that schema?

Regards,
Madhu Jahagirdar
Dynamically InferSchema From Hive and Create parquet file
Currently the createParquetFile method needs a bean class as one of the parameters:

javahiveContext.createParquetFile(XBean.class, IMPALA_TABLE_LOC, true, new Configuration()).registerTempTable(TEMP_TABLE_NAME);

Is it possible to dynamically infer the schema from Hive using the hive context and the table name, and then give that schema?

Regards,
Madhu Jahagirdar
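[Editor's note] A small sketch of the idea being asked about, assuming the table already exists in the metastore with data (the table name, path handling, and field names are assumptions): query the table once through the hive context, take the schema from the resulting SchemaRDD, and reuse it, for example with jsonRDD, instead of hard-coding a bean class.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object InferSchemaFromHive {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("InferSchemaFromHive"))
    val hiveContext = new HiveContext(sc)

    // Pull the schema of an existing metastore table (the table name is hypothetical).
    val existing = hiveContext.sql("SELECT * FROM parquet_table LIMIT 0")
    val inferredSchema = existing.schema

    // Reuse that schema when parsing a JSON RDD.
    val jsonRdd = sc.parallelize(Seq("""{"name":"a","age":"30"}"""))
    hiveContext.jsonRDD(jsonRdd, inferredSchema).registerTempTable("json_with_hive_schema")
  }
}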
Issue with Spark Twitter Streaming
All, we are using Spark Streaming to receive data from the Twitter stream. This is running behind a proxy. We have done the following configuration inside Spark Streaming for twitter4j to work behind the proxy:

def main(args: Array[String]) {
  val filters = Array("Modi")
  System.setProperty("twitter4j.oauth.consumerKey", "*")
  System.setProperty("twitter4j.oauth.consumerSecret", "*")
  System.setProperty("twitter4j.oauth.accessToken", "*")
  System.setProperty("twitter4j.oauth.accessTokenSecret", "*")
  System.setProperty("twitter4j.http.proxyHost", "X.X.X.X")
  System.setProperty("twitter4j.http.proxyPort", "")
  System.setProperty("twitter4j.http.useSSL", "true")
  val conf = new SparkConf().setAppName("TwitterPopularTags")
  val ssc = new StreamingContext(conf, Seconds(60))
  val stream = TwitterUtils.createStream(ssc, None, filters)
  stream.print()
  ssc.start()
  ssc.awaitTermination()
}

Jars used:
spark-streaming-twitter_2.10-1.1.0
twitter4j-core-3.0.3.jar
twitter4j-stream-3.0.3.jar

When the Spark job is run with local[2] on a single node (not on a cluster), with the same settings above, it is able to pull the data and works like a charm behind the proxy. The same code, when run on a cluster (below) on the same network with the above settings, throws the error below. Not sure what is going wrong; any help is appreciated. We checked the environment variables of the executors, and all the above system properties are set.

bin/spark-submit --class SparkTwitter2Kafka --master spark://IPADDRESS:7077 spark-twitter.jar

14/10/13 14:00:10 ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Error receiving tweets - connect timed out
Relevant discussions can be found on the Internet at:
http://www.google.co.jp/search?q=944a924a or http://www.google.co.jp/search?q=24fd66dc
TwitterException{exceptionCode=[944a924a-24fd66dc 944a924a-24fd66b2], statusCode=-1, message=null, code=-1, retryAfter=-1, rateLimitStatus=null, version=3.0.5}
at twitter4j.internal.http.HttpClientImpl.request(HttpClientImpl.java:177)
at twitter4j.internal.http.HttpClientWrapper.request(HttpClientWrapper.java:61)
at twitter4j.internal.http.HttpClientWrapper.post(HttpClientWrapper.java:98)
at twitter4j.TwitterStreamImpl.getFilterStream(TwitterStreamImpl.java:304)
at twitter4j.TwitterStreamImpl$7.getStream(TwitterStreamImpl.java:292)
at twitter4j.TwitterStreamImpl$TwitterStreamConsumer.run(TwitterStreamImpl.java:462)
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:618)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:275)
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:371)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:250)
at twitter4j.internal.http.HttpClientImpl.request(HttpClientImpl.java:135)
... 5 more
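[Editor's note] One likely explanation (an assumption, not confirmed in the thread): in cluster mode the Twitter receiver runs inside an executor JVM, where System.setProperty calls made in the driver's main() never take effect, so twitter4j on the executor has no proxy configured. A sketch of forwarding the proxy settings to the executor JVMs instead, with placeholder host and port values; the OAuth properties would need to be forwarded the same way:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterBehindProxy {
  def main(args: Array[String]): Unit = {
    // Hypothetical proxy host/port; OAuth keys would also be passed as -D options here.
    val proxyOpts = "-Dtwitter4j.http.proxyHost=proxy.example.com " +
      "-Dtwitter4j.http.proxyPort=8080 -Dtwitter4j.http.useSSL=true"

    val conf = new SparkConf()
      .setAppName("TwitterPopularTags")
      // Forward the twitter4j properties to every executor JVM, where the receiver runs.
      .set("spark.executor.extraJavaOptions", proxyOpts)

    val ssc = new StreamingContext(conf, Seconds(60))
    val stream = TwitterUtils.createStream(ssc, None, Array("Modi"))
    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}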
RE: Dstream Transformations
Doesn't Spark keep track of the DAG lineage and restart from where it stopped? Does it always have to start from the beginning of the lineage when the job fails?

From: Massimiliano Tomassi [max.toma...@gmail.com]
Sent: Monday, October 06, 2014 2:40 PM
To: Jahagirdar, Madhu
Cc: Akhil Das; user
Subject: Re: Dstream Transformations

From the Spark Streaming Programming Guide (http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node): ...output operations (like foreachRDD) have at-least once semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure.

I think that when a worker fails, the entire graph of transformations/actions will be reapplied to that RDD. This means that, in your case, both storing operations will be executed again. For this reason, in a video I watched on YouTube, they suggest making all output operations idempotent. Unfortunately that is not always possible: e.g. you are building an analytics system and you need to increment counters. This is what I've got so far; anyone have a different point of view?

On 6 October 2014 08:59, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

Given that I have multiple worker nodes, when Spark schedules the job again on the worker nodes that are alive, does it then again store the data in Elasticsearch and then Flume, or does it only run the functions that store to Flume?

Regards,
Madhu Jahagirdar

From: Akhil Das [ak...@sigmoidanalytics.com]
Sent: Monday, October 06, 2014 1:20 PM
To: Jahagirdar, Madhu
Cc: user
Subject: Re: Dstream Transformations

AFAIK Spark doesn't restart worker nodes itself. You can have multiple worker nodes, and in that case if one worker node goes down, Spark will try to recompute the lost RDDs again on the workers that are still alive.

Thanks
Best Regards

On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

In my Spark Streaming program I have created Kafka utils to receive data and store it in Elasticsearch and in Flume. Both storing functions are applied on the same DStream. My question: what is the behavior of Spark if, after storing the data in Elasticsearch, the worker node dies before storing it in Flume? Does it restart the worker and then again store the data in Elasticsearch and then Flume, or does it only run the functions that store to Flume?

Regards
Madhu Jahagirdar

--
Massimiliano Tomassi
web: http://about.me/maxtomassi
e-mail: max.toma...@gmail.com
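[Editor's note] To illustrate the idempotency suggestion above, a minimal sketch (the upsert call and the key scheme are placeholders, not from the thread): give every record a stable, deterministic key so that a replayed batch overwrites the earlier write instead of creating a duplicate.

import org.apache.spark.streaming.dstream.DStream

case class Event(id: String, payload: String)

// Sketch of an idempotent output operation: each record carries a stable key, so
// writing it twice (after a worker failure and replay) leaves the sink in the same
// state as writing it once. `upsert` stands in for any key-based write, e.g.
// indexing by document id in Elasticsearch; it is a placeholder, not a real API.
def saveIdempotently(events: DStream[Event], upsert: (String, String) => Unit): Unit = {
  events.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      partition.foreach(e => upsert(e.id, e.payload)) // replay-safe: same key overwrites
    }
  }
}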
RE: Dstream Transformations
Given that I have multiple worker nodes, when Spark schedules the job again on the worker nodes that are alive, does it then again store the data in Elasticsearch and then Flume, or does it only run the functions that store to Flume?

Regards,
Madhu Jahagirdar

From: Akhil Das [ak...@sigmoidanalytics.com]
Sent: Monday, October 06, 2014 1:20 PM
To: Jahagirdar, Madhu
Cc: user
Subject: Re: Dstream Transformations

AFAIK Spark doesn't restart worker nodes itself. You can have multiple worker nodes, and in that case if one worker node goes down, Spark will try to recompute the lost RDDs again on the workers that are still alive.

Thanks
Best Regards

On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu <madhu.jahagir...@philips.com> wrote:

In my Spark Streaming program I have created Kafka utils to receive data and store it in Elasticsearch and in Flume. Both storing functions are applied on the same DStream. My question: what is the behavior of Spark if, after storing the data in Elasticsearch, the worker node dies before storing it in Flume? Does it restart the worker and then again store the data in Elasticsearch and then Flume, or does it only run the functions that store to Flume?

Regards
Madhu Jahagirdar
Dstream Transformations
In my Spark Streaming program I have created Kafka utils to receive data and store it in Elasticsearch and in Flume. Both storing functions are applied on the same DStream. My question: what is the behavior of Spark if, after storing the data in Elasticsearch, the worker node dies before storing it in Flume? Does it restart the worker and then again store the data in Elasticsearch and then Flume, or does it only run the functions that store to Flume?

Regards
Madhu Jahagirdar
Spark RDD Disk Persistance
Should I use disk-based persistence for RDDs? If the machine goes down during program execution, will the data be intact and not lost the next time I rerun the program?

Regards,
Madhu Jahagirdar
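[Editor's note] A short sketch of the distinction involved here, under the assumption (worth verifying for your release) that rdd.persist(StorageLevel.DISK_ONLY) only caches blocks for the lifetime of the running application, while data written to HDFS (or checkpointed) is what survives a rerun; paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DiskPersistVsSave {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DiskPersistVsSave"))
    val data = sc.textFile("hdfs:///data/input") // placeholder path

    // Cached to local disk on the executors: avoids recomputation within THIS run,
    // but the blocks are gone once the application (or the machine) goes down.
    val cached = data.map(_.toUpperCase).persist(StorageLevel.DISK_ONLY)
    cached.count()

    // Materialized to HDFS: a later, separate run can read this back even after a crash.
    cached.saveAsTextFile("hdfs:///data/output-run1") // placeholder path
  }
}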
hdfs short circuit
Can I enable Spark to use the "dfs.client.read.shortcircuit" property to improve performance and read natively on local nodes instead of going through the HDFS API?
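[Editor's note] Short-circuit local reads are an HDFS client/DataNode feature, so they are enabled through the Hadoop configuration that Spark's HDFS client picks up rather than through a Spark-specific setting. A sketch of setting it programmatically, assuming the DataNodes are also configured for short-circuit reads; the domain socket path below is a placeholder that must match the cluster's hdfs-site.xml:

import org.apache.spark.{SparkConf, SparkContext}

object ShortCircuitReadExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ShortCircuitReadExample"))

    // Standard HDFS client settings (normally placed in hdfs-site.xml on the classpath);
    // the socket path is a placeholder and must match the DataNode configuration.
    sc.hadoopConfiguration.set("dfs.client.read.shortcircuit", "true")
    sc.hadoopConfiguration.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket")

    val rdd = sc.textFile("hdfs:///data/input") // placeholder path
    println(rdd.count())
  }
}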