The only reason I can think of right now is that you might want to change a config parameter to change the behavior of the optimizer and regenerate the plan. However, maybe that's not a strong enough reason to regenerate the RDD every time.
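The behavior discussed in this thread, that `DataFrame.rdd` is a `def` and so returns a fresh RDD (with a fresh id) on every access, can be illustrated without Spark. The sketch below uses a hypothetical `FakeDF` class as a stand-in for `DataFrame`; it is not real Spark API, just a demonstration of why a `def` member makes a poor cache key while a `lazy val` would give a stable identity.

```scala
// Minimal, Spark-free sketch of the `def` vs `lazy val` distinction:
// a `def` member builds a new object on every access (like DataFrame.rdd),
// so identity-based cache keys change each call; a `lazy val` is computed
// once and keeps a stable identity. `FakeDF` is hypothetical, not Spark API.
class FakeDF {
  def rddDef: AnyRef = new Object       // fresh object per call
  lazy val rddLazy: AnyRef = new Object // computed once, stable identity
}

val df = new FakeDF
assert(df.rddDef ne df.rddDef)   // a different object on each access
assert(df.rddLazy eq df.rddLazy) // the same object on every access
```

This is why caching keyed on `dataDF.rdd.id` keeps missing, and why keying on something computed once per DataFrame (such as its logical plan) works.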
On Mon, Mar 30, 2015 at 5:38 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> This is because unlike SchemaRDD, DataFrame itself is no longer an RDD
> now. Meanwhile, DataFrame.rdd is a function, which always returns a
> new RDD. I think you may use DataFrame.queryExecution.logical (the
> logical plan) as an ID. Maybe we should make it a "lazy val" rather than a
> "def". Personally I don't see a good reason for it to be a "def", but
> maybe I'm missing something here.
>
> Filed a JIRA ticket and PR for this:
>
> - https://issues.apache.org/jira/browse/SPARK-6608
> - https://github.com/apache/spark/pull/5265
>
> Cheng
>
>
> On 3/30/15 8:02 PM, Peter Rudenko wrote:
>
>> Hi, I have some custom caching logic in my application. I need to somehow
>> identify a DataFrame, to check whether I have seen it previously. Here's the
>> problem:
>>
>> scala> val data = sc.parallelize(1 to 1000)
>> data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
>>
>> scala> data.id
>> res0: Int = 0
>>
>> scala> data.id
>> res1: Int = 0
>>
>> scala> val dataDF = data.toDF
>> dataDF: org.apache.spark.sql.DataFrame = [_1: int]
>>
>> scala> dataDF.rdd.id
>> res3: Int = 2
>>
>> scala> dataDF.rdd.id
>> res4: Int = 3
>>
>> For some reason it generates a new ID on each call. With SchemaRDD I was
>> able to call SchemaRDD.id.
>>
>> Thanks,
>> Peter Rudenko
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org