Yes, but the question here is why the context objects are marked serializable when they are not meant to be sent somewhere as bytes. I tried to answer that apparent inconsistency below.
On Wed, Oct 26, 2016, 10:21 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> Sorry for asking this rather naïve question.
>
> I am trying to understand the notion of serialisation in Spark and what
> can or cannot be serialised. Does this generally refer to serialisation
> in the sense of data storage?
>
> In the context of RDD operations, for example, is it the process of
> translating object state into a format that can be stored in and
> retrieved from a memory buffer?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> On 26 October 2016 at 09:06, Sean Owen <so...@cloudera.com> wrote:
>
> It is the driver that has the info needed to schedule and manage
> distributed jobs, and that is by design.
>
> This is narrowly about using the HiveContext or SparkContext directly.
> Of course SQL operations are distributed.
>
> On Wed, Oct 26, 2016, 10:03 Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Hi Sean,
>
> Your point:
>
> "You can't use the HiveContext or SparkContext in a distributed
> operation..."
>
> Is this because of a design issue?
>
> Case in point: if I create a DF from an RDD and register it as a
> tempTable, does this imply that any SQL call on that table is localised
> and not distributed among the executors?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> On 26 October 2016 at 06:43, Ajay Chander <itsche...@gmail.com> wrote:
>
> Sean, thank you for making it clear. It was helpful.
>
> Regards,
> Ajay
>
> On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:
>
> This usage is fine, because you are only using the HiveContext locally
> on the driver. It's applied in a function that's used on a Scala
> collection.
>
> You can't use the HiveContext or SparkContext in a distributed
> operation. It has nothing to do with for loops.
>
> The fact that they're serializable is misleading. It's there, I believe,
> because these objects may be inadvertently referenced in the closure of
> a function that executes remotely, yet doesn't use the context. The
> closure cleaner can't always remove this reference, and without the
> Serializable marker such a task would fail to serialize even though it
> never uses the context. You will find these objects serialize but then
> don't work if used remotely.
>
> The NPE you see is an unrelated cosmetic problem that was fixed in
> 2.0.1, IIRC.
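To make that inadvertent-capture case concrete, here is a minimal sketch of
the scenario Sean describes. The Enricher class, its field, and the queries
are hypothetical, written against the 1.6-era API:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.hive.HiveContext

    class Enricher(val hiveContext: HiveContext) extends Serializable {
      val threshold = 10

      // Driver-side use of the context is fine.
      def tableCount(table: String): Long =
        hiveContext.sql(s"SELECT * FROM $table").count()

      // Referencing the field `threshold` makes this closure capture
      // `this`, which drags the hiveContext field into the serialized
      // task even though the context is never used on the executors.
      // The closure cleaner can't strip fields of an arbitrary user
      // class, so only the Serializable marker on HiveContext keeps
      // this task from dying with a NotSerializableException. Actually
      // calling the context on an executor is what fails.
      def countAbove(rdd: RDD[Int]): Long =
        rdd.filter(_ > threshold).count()
    }

That, I believe, is the whole reason for the marker: it lets harmless
captures like this one serialize, not an endorsement of using the context
remotely.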
> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>
> Hi Everyone,
>
> I was wondering whether I can use hiveContext inside foreach, like below:
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
>
> object Test {
>   def main(args: Array[String]): Unit = {
>
>     val conf = new SparkConf()
>     val sc = new SparkContext(conf)
>     val hiveContext = new HiveContext(sc)
>
>     val dataElementsFile = args(0)
>     val deDF = hiveContext.read.text(dataElementsFile)
>       .toDF("DataElement").coalesce(1).distinct().cache()
>
>     def calculate(de: Row): Unit = {
>       val dataElement = de.getAs[String]("DataElement").trim
>       val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" +
>         dataElement + "' as data_elm, " + dataElement +
>         " as data_elm_val FROM TEST_DB.TEST_TABLE1 ")
>       df1.write.insertInto("TEST_DB.TEST_TABLE1")
>     }
>
>     deDF.collect().foreach(calculate)
>   }
> }
>
> I looked at
> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
> and I see that it extends SQLContext, which extends Logging with
> Serializable.
>
> Can anyone tell me if this is the right way to use it? Thanks for your
> time.
>
> Regards,
> Ajay
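For contrast, a short sketch of the same pattern with the driver/executor
distinction spelled out. It reuses the deDF and calculate definitions from
Ajay's program above; the second call is the one Sean's answer rules out:

    // Driver-side: collect() materialises the rows on the driver, so
    // calculate (and the hiveContext call inside it) never leaves the
    // driver. This is the usage the thread confirms is fine.
    deDF.collect().foreach(calculate)

    // Executor-side: this would ship calculate to the executors, where
    // the deserialized context does not work, and the job would fail
    // at runtime.
    // deDF.foreach(calculate)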