Re: Dataset API Question

2017-10-25 Thread Wenchen Fan
It's because of a difference in API design.

*RDD.checkpoint* returns void, which means it mutates the RDD's state, so you
need an *RDD.isCheckpointed* method to check whether this RDD is checkpointed.

*Dataset.checkpoint* returns a new Dataset, which means there is no
isCheckpointed state on a Dataset, and thus we don't need a
*Dataset.isCheckpointed* method.
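
A minimal sketch of the contrast (assuming a spark-shell session, i.e. a
SparkSession named *spark*; the checkpoint path is hypothetical):

spark.sparkContext.setCheckpointDir("/tmp/checkpoints") // hypothetical path

val rdd = spark.sparkContext.parallelize(1 to 10)
rdd.checkpoint()   // returns Unit: it mutates this RDD's own state
rdd.count()        // first job materializes the checkpoint
rdd.isCheckpointed // true: the state lives on the RDD itself

val ds = spark.range(10)
val cp = ds.checkpoint() // returns a NEW Dataset backed by the checkpoint files
// ds itself is unchanged, so there is no per-Dataset state left to query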




Re: Dataset API Question

2017-10-25 Thread Bernard Jesop
Actually, I realized that keeping the info would not be enough, as I also need
to track down the checkpoint files to delete them :/
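
The bluntest workaround I can think of is to give Spark a dedicated checkpoint
directory that nothing else writes to, and delete it wholesale when done. A
sketch (the path is hypothetical, and this removes every checkpoint under it):

import org.apache.hadoop.fs.{FileSystem, Path}

// the same path previously passed to spark.sparkContext.setCheckpointDir
val checkpointDir = new Path("/tmp/my-app-checkpoints") // hypothetical
val fs = checkpointDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
if (fs.exists(checkpointDir)) fs.delete(checkpointDir, true) // recursive delete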



Re: Dataset API Question

2017-10-25 Thread Bernard Jesop
As far as I understand, Dataset.rdd is not the same as the internal RDD.
It is just another RDD representation of the same Dataset, created on
demand (it is a lazy val) when Dataset.rdd is called.
This fully explains the observed behavior.

But how would it be possible to know that a Dataset has been
checkpointed?
Should I manually keep track of that info?



Re: Dataset API Question

2017-10-25 Thread Reynold Xin
It is a bit more than syntactic sugar, but not much more:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L533

BTW, this basically writes all the data out and then creates a new
Dataset to load it back in.
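
Paraphrasing the linked code from memory (a sketch of the flow, not the exact
source):

// 1. take the query's internal RDD of InternalRow and checkpoint THAT:
//      val internalRdd = queryExecution.toRdd.map(_.copy())
//      internalRdd.checkpoint()
// 2. when eager (the default), run a job right away to materialize it:
//      internalRdd.count()
// 3. return a brand-new Dataset whose logical plan is just a scan over the
//    checkpointed RDD, i.e. a Dataset with truncated lineage.
//
// This also explains why ds.rdd.isCheckpointed comes back false: ds.rdd builds
// a separate RDD from the plan; it is not the internal RDD checkpointed above.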




Dataset API Question

2017-10-25 Thread Bernard Jesop
Hello everyone,

I have a question about checkpointing on Dataset.

It seems that in 2.1.0 there is a Dataset.checkpoint(); however, unlike RDD,
there is no Dataset.isCheckpointed().

I wonder if Dataset.checkpoint is syntactic sugar for
Dataset.rdd.checkpoint.
When I do:

Dataset.checkpoint; Dataset.count
Dataset.rdd.isCheckpointed // result: false

However, when I explicitly do:
Dataset.rdd.checkpoint; Dataset.rdd.count
Dataset.rdd.isCheckpointed // result: true
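
(For reference, a self-contained version of the above, assuming a spark-shell
session with a SparkSession named spark and a hypothetical checkpoint path:)

spark.sparkContext.setCheckpointDir("/tmp/checkpoints") // hypothetical path

val ds = spark.range(10)
ds.checkpoint(); ds.count()
ds.rdd.isCheckpointed // result: false

val rdd = ds.rdd // a lazy val, so the same RDD instance on each access
rdd.checkpoint(); rdd.count()
rdd.isCheckpointed // result: true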

Could someone explain this behavior to me, or provide some references?

Best regards,
Bernard


Re: Kicking off the process around Spark 2.2.1

2017-10-25 Thread Sean Owen
It would be reasonably consistent with the timing of other x.y.1 releases,
and having more release managers sounds useful, yeah.

Note also that in theory the code freeze for 2.3.0 starts in about 2 weeks.



Kicking off the process around Spark 2.2.1

2017-10-25 Thread Holden Karau
Now that Spark 2.1.2 is out, it seems like now is a good time to get started
on the Spark 2.2.1 release. There are some streaming fixes I’m aware of
that would be good to get into a release; is there anything else people are
working on for 2.2.1 that we should be tracking?

To switch it up I’d like to suggest Felix as the RM for this, since there
are also likely some R packaging changes to be included in the release.
This also gives us a chance to see if my updated release documentation is
enough for a new RM to get started from.

What do folks think?
-- 
Twitter: https://twitter.com/holdenkarau


Re: CRAN SparkR package removed?

2017-10-25 Thread Holden Karau
Ok, so I’ll say it’s available in the CRAN “archive” and we hope to have it
fully available in future releases.

--
Twitter: https://twitter.com/holdenkarau


Re: CRAN SparkR package removed?

2017-10-25 Thread Felix Cheung
Yes - unfortunately something was found after it was published and made 
available publicly.

We have a JIRA on this and are working on the best course of action.






CRAN SparkR package removed?

2017-10-25 Thread Holden Karau
Looking at https://cran.r-project.org/web/packages/SparkR/ it seems like
the package has been removed. Any ideas what's up?

(Just asking since I'm working on the release e-mail and it was also
mentioned in the keynote just now).

-- 
Twitter: https://twitter.com/holdenkarau


Using Spark 2.2.0 SparkSession extensions to optimize file filtering

2017-10-25 Thread Chris Luby
I have an external catalog with additional information on my Parquet files
that I want to match up with the parsed filters from the plan, to prune the
list of files included in the scan. I’m looking at doing this using the Spark
2.2.0 SparkSession extensions, similar to the built-in partition pruning:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala

and at this other project, which is along the lines of what I want but hasn’t
been updated for 2.2.0:

https://github.com/lightcopy/parquet-index/blob/master/src/main/scala/org/apache/spark/sql/execution/datasources/IndexSourceStrategy.scala

I’m struggling to understand what type of extension I would use to do
something like the above:

https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.SparkSessionExtensions

and whether this is an appropriate strategy for it.

Are there any examples out there for using the new extension hooks to alter the 
files included in the plan?
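
For concreteness, here is the shape of what I have in mind: a minimal sketch
that injects a (hypothetical, currently no-op) optimizer rule through the
2.2.0 hooks. Whether injectOptimizerRule is even the right hook for rewriting
the file listing, versus e.g. injectPlannerStrategy, is part of what I’m
unsure about:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: the real version would match file-source relations and
// swap in a file listing filtered by my external catalog, along the lines of
// PruneFileSourcePartitions.
case class ExternalCatalogFilePruning(session: SparkSession)
    extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan // placeholder no-op
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions { extensions =>
    extensions.injectOptimizerRule(session => ExternalCatalogFilePruning(session))
  }
  .getOrCreate()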

Thanks.