Yes, but the question here is why the context objects are marked serializable when they are not meant to be sent somewhere as bytes. I tried to answer that apparent inconsistency below.
On Wed, Oct 26, 2016, 10:21 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> Sorry for asking this rather naïve question.
>
> I am trying to understand the notion of serialisation in Spark and what
> can or cannot be serialised. Does this generally refer to serialisation
> in the sense of data storage?
>
> In the context of RDD operations, for example, is it the process of
> translating object state into a format that can be stored in and
> retrieved from a memory buffer?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> On 26 October 2016 at 09:06, Sean Owen <so...@cloudera.com> wrote:
>
> It is the driver that has the info needed to schedule and manage
> distributed jobs, and that is by design.
>
> This is narrowly about using the HiveContext or SparkContext directly.
> Of course SQL operations are distributed.
>
> On Wed, Oct 26, 2016, 10:03 Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Hi Sean,
>
> Your point:
>
> "You can't use the HiveContext or SparkContext in a distributed
> operation..."
>
> Is this because of a design issue?
>
> Case in point: if I create a DF from an RDD and register it as a
> tempTable, does this imply that any SQL call on that table is localised
> and not distributed among the executors?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote:
>
> Sean, thank you for making it clear. It was helpful.
>
> Regards,
> Ajay
>
> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>
> This usage is fine, because you are only using the HiveContext locally
> on the driver. It's applied in a function that's used on a Scala
> collection.
>
> You can't use the HiveContext or SparkContext in a distributed
> operation. It has nothing to do with for loops.
>
> The fact that they're serializable is misleading. It's there, I believe,
> because these objects may be inadvertently referenced in the closure of
> a function that executes remotely, yet doesn't use the context. The
> closure cleaner can't always remove this reference, and without the
> Serializable marker such a task would fail to serialize even though it
> never uses the context. You will find these objects serialize but then
> don't work if used remotely.
>
> The NPE you see is an unrelated cosmetic problem that was fixed in
> 2.0.1, IIRC.
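To make that inadvertent-capture case concrete, here is a minimal sketch of
the scenario Sean describes. The Enricher class, its field, and the queries
are hypothetical, written against the 1.6-era API:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.hive.HiveContext

    class Enricher(val hiveContext: HiveContext) extends Serializable {
      val threshold = 10

      // Driver-side use of the context is fine.
      def tableCount(table: String): Long =
        hiveContext.sql(s"SELECT * FROM $table").count()

      // Referencing the field `threshold` makes this closure capture
      // `this`, which drags the hiveContext field into the serialized
      // task even though the context is never used on the executors.
      // The closure cleaner can't strip fields of an arbitrary user
      // class, so only the Serializable marker on HiveContext keeps
      // this task from dying with a NotSerializableException. Actually
      // calling the context on an executor is what fails.
      def countAbove(rdd: RDD[Int]): Long =
        rdd.filter(_ > threshold).count()
    }

That, I believe, is the whole reason for the marker: it lets harmless
captures like this one serialize, not an endorsement of using the context
remotely.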
> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>
> Hi Everyone,
>
> I was wondering whether I can use hiveContext inside foreach, like below:
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
>
> object Test {
>   def main(args: Array[String]): Unit = {
>
>     val conf = new SparkConf()
>     val sc = new SparkContext(conf)
>     val hiveContext = new HiveContext(sc)
>
>     val dataElementsFile = args(0)
>     val deDF = hiveContext.read.text(dataElementsFile)
>       .toDF("DataElement").coalesce(1).distinct().cache()
>
>     def calculate(de: Row): Unit = {
>       val dataElement = de.getAs[String]("DataElement").trim
>       val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" +
>         dataElement + "' as data_elm, " + dataElement +
>         " as data_elm_val FROM TEST_DB.TEST_TABLE1 ")
>       df1.write.insertInto("TEST_DB.TEST_TABLE1")
>     }
>
>     deDF.collect().foreach(calculate)
>   }
> }
>
> I looked at
> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
> and I see that it extends SQLContext, which extends Logging with
> Serializable.
>
> Can anyone tell me if this is the right way to use it? Thanks for your
> time.
>
> Regards,
> Ajay
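For contrast, a short sketch of the same pattern with the driver/executor
distinction spelled out. It reuses the deDF and calculate definitions from
Ajay's program above; the second call is the one Sean's answer rules out:

    // Driver-side: collect() materialises the rows on the driver, so
    // calculate (and the hiveContext call inside it) never leaves the
    // driver. This is the usage the thread confirms is fine.
    deDF.collect().foreach(calculate)

    // Executor-side: this would ship calculate to the executors, where
    // the deserialized context does not work, and the job would fail
    // at runtime.
    // deDF.foreach(calculate)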