Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-08-17 Thread Jungtaek Lim
Bump again.

Unlike the file stream sink, which has lots of limitations and for which many
of us have been suggesting alternatives, the file stream source is the only
way for end users to read data from files. There is no alternative unless
they introduce another ETL pipeline & storage (probably Kafka).

On Fri, Jul 31, 2020 at 3:06 PM Jungtaek Lim 
wrote:

> Hi German,
>
> Option 1 isn't about "deleting" the old files, as your input directory may
> be accessed by multiple queries. Kafka centralizes the maintenance of the
> input data, hence it can apply retention without problems.
> Option 1 is more about "hiding" the old files from being read, so that end
> users "may" be able to delete the files once they ensure that "all queries
> accessing the input directory" no longer see the old files.
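>
> To make the "hiding" semantics concrete, here's a minimal sketch (the
> 7-day retention value and the helper are illustrative only, not the actual
> implementation):
>
> import java.io.File
>
> // Hypothetical 7-day retention: files older than this stop being listed.
> val retentionMs = 7L * 24 * 60 * 60 * 1000
>
> // Old files stay on disk untouched; they simply become invisible to the
> // source, so end users can delete them once every query has moved past them.
> def visibleFiles(dir: String): Seq[File] = {
>   val now = System.currentTimeMillis()
>   Option(new File(dir).listFiles()).toSeq.flatten
>     .filter(f => f.isFile && now - f.lastModified() <= retentionMs)
> }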
>
> On Fri, Jul 31, 2020 at 2:57 PM German Schiavon 
> wrote:
>
>> Hi Jungtaek,
>>
>> I have a question: aren't both approaches compatible?
>>
>> The way I see it, it would be interesting to have a retention period to
>> delete old files and/or the possibility of indicating an offset
>> (timestamp). It would be very "similar" to how we do it with Kafka.
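>>
>> For comparison, the Kafka source already supports starting from a
>> timestamp per topic/partition via "startingOffsetsByTimestamp" (Spark
>> 3.0+). A sketch, assuming an active SparkSession `spark`:
>>
>> // Start reading topic1 from a given epoch-millis timestamp per partition.
>> val df = spark.readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", "host1:9092")
>>   .option("subscribe", "topic1")
>>   .option("startingOffsetsByTimestamp",
>>     """{"topic1": {"0": 1596153600000, "1": 1596153600000}}""")
>>   .load()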
>>
>> WDYT?
>>
>> On Thu, 30 Jul 2020 at 23:51, Jungtaek Lim 
>> wrote:
>>
>>> (I'd like to keep this discussion thread focused on the specific topic -
>>> let's start separate discussion threads for different topics.)
>>>
>>> Thanks for the input. I'd like to emphasize that the point under
>>> discussion is the "latestFirst" option - the rationale starts from the
>>> growing metadata log issue. I guess your input favors option 2, but could
>>> you please make clear whether you are OK with "replacing" the
>>> "latestFirst" option with "starting from timestamp"?
>>>
>>>
>>> On Thu, Jul 30, 2020 at 4:48 PM vikram agrawal 
>>> wrote:
>>>
 If we compare the file stream source with other streaming sources such as
 Kafka, the current behavior is indeed incomplete. Starting the stream from
 a custom offset/particular point in time is something that is missing.
 Typically file stream sources don't have auto-deletion of older
 data/files. In Kafka we can define a retention period, so even if we use
 "earliest" we won't end up reading from the time the Kafka topic was
 created. A file source's input directory, on the other hand, can hold very
 old files. It's a very valid use case to read the bulk of the old files
 with a batch job, up to a particular timestamp, and then use a streaming
 job for real-time updates.

 So having support where we can specify a timestamp, and only consider
 files created after that timestamp, would be useful.
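
 Something like this, as a sketch (the option name "startingTimestamp" is
 purely hypothetical, and `spark`/`eventSchema` are assumed to exist):

 // Hypothetical: only pick up files created after the given timestamp;
 // everything older is covered by a one-off batch backfill job instead.
 val stream = spark.readStream
   .format("json")
   .schema(eventSchema)  // file streams need an explicit schema
   .option("startingTimestamp", "2020-07-01T00:00:00Z")  // hypothetical
   .load("s3://bucket/events")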

 Another concern we need to consider is the listing cost. Is there any way
 we can avoid listing the entire base directory and then filtering out the
 new files? If the data is organized into partitions by date, would it help
 to list only those partitions where new files were added?
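
 One possible mitigation, as a sketch (assumes a base/date=YYYY-MM-DD layout
 and Hadoop-style {a,b} globs; `spark` and `eventSchema` as above):

 import java.time.LocalDate

 // List only the date partitions that can still receive new files (here the
 // last 3 days) instead of walking the entire base directory every batch.
 val base = "s3://bucket/events"
 val recentDates = (0 to 2).map(d => LocalDate.now().minusDays(d))
 val globPath = s"$base/date={${recentDates.mkString(",")}}"

 // Caveat: the glob is fixed when the query starts, so a long-running query
 // would need periodic restarts to roll the window forward.
 val stream = spark.readStream
   .format("json")
   .schema(eventSchema)
   .load(globPath)  // e.g. .../date={2020-08-17,2020-08-16,2020-08-15}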


 On Thu, Jul 30, 2020 at 11:22 AM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> Bump - is there any interest in this topic?
>
> On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> (Just to add context: you can refer to the original mail thread on the
>> dev@ list to see the efforts on addressing problems in the file stream
>> source / sink -
>> https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E
>> )
>>
>> On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi devs,
>>>
>>> As I have been going through the various issues around metadata log
>>> growth, it's not only an issue for the sink but also for the source.
>>> Unlike the sink metadata log, whose entries have to be available to
>>> readers, the source metadata log only serves the streaming query
>>> restarting from the checkpoint; in theory it only needs to memorize the
>>> minimal set of entries that prevents processing the same file multiple
>>> times.
>>>
>>> This is not the case for the file stream source, and I think that's
>>> because of the "latestFirst" option, which I haven't seen in any other
>>> source. The option makes the source read files in "backward" order, which
>>> means Spark can read the oldest file and the latest file together in one
>>> micro-batch, and it therefore ends up having to memorize all files
>>> previously read. The option can also be changed on query restart, so even
>>> if the query is started with "latestFirst" being false, it's not safe to
>>> apply the logic of minimizing the memorized entries, as the option can be
>>> changed to true and then we'd read old files again.
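>>>
>>> For reference, this is how the option is used today ("latestFirst" and
>>> "maxFilesPerTrigger" are documented file source options; assuming an
>>> active SparkSession `spark`):
>>>
>>> // Process the newest files first, e.g. to catch up on fresh data after
>>> // a long downtime. Across restarts this can mix very old and very new
>>> // files in micro-batches, so the source has to remember every file it
>>> // has ever read.
>>> val df = spark.readStream
>>>   .format("text")
>>>   .option("latestFirst", "true")
>>>   .option("maxFilesPerTrigger", "20")
>>>   .load("/data/incoming")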
>>>
>>> I'm seeing two approaches here:
>>>
>>> 1) apply "retention" - unlike "maxFileAge", the option would apply
>>> to latestFirst as 

Re: [VOTE] Release Spark 2.4.7 (RC1)

2020-08-17 Thread Xiao Li
https://issues.apache.org/jira/browse/SPARK-32609 got merged. It fixes a
correctness bug in DSv2 of Spark 2.4. Please include it in the upcoming
Spark 2.4.7 release.

Thanks,

Xiao

On Sun, Aug 9, 2020 at 10:26 PM Prashant Sharma 
wrote:

> Thanks for letting us know. So this vote is cancelled in favor of RC2.
>
>
>
> On Sun, Aug 9, 2020 at 8:31 AM Takeshi Yamamuro 
> wrote:
>
>> Thanks for letting us know about the two issues above, Dongjoon.
>>
>> 
>> I've checked the release materials (signatures, tag, ...) and they look
>> fine, too.
>> Also, I ran the tests on my local Mac (Java 1.8.0) with the options
>> `-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pkubernetes
>> -Psparkr`
>> and they passed.
>>
>> Bests,
>> Takeshi
>>
>>
>>
>> On Sun, Aug 9, 2020 at 11:06 AM Dongjoon Hyun 
>> wrote:
>>
>>> Another instance is SPARK-31703, which was filed on May 13th; the PR
>>> arrived two days ago.
>>>
>>> [SPARK-31703][SQL] Parquet RLE float/double are read incorrectly on
>>> big endian platforms
>>> https://github.com/apache/spark/pull/29383
>>>
>>> It seems that the patch is already ready in this case.
>>> I raised the priority of SPARK-31703 to `Blocker` for both Apache Spark
>>> 2.4.7 and 3.0.1.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Sat, Aug 8, 2020 at 6:10 AM Holden Karau 
>>> wrote:
>>>
 I'm going to go ahead and vote -0 based on that, then.

 On Fri, Aug 7, 2020 at 11:36 PM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> Unfortunately, there is an ongoing discussion about the new decimal
> correctness issue.
>
> Although we fixed one correctness issue in master and backported it
> partially to 3.0/2.4, it turns out that more patches are needed to make it
> complete.
>
> Please see https://github.com/apache/spark/pull/29125 for the ongoing
> discussion covering both 3.0/2.4.
>
> [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with
> overflowed value
>
> I also confirmed that 2.4.7 RC1 is affected.
>
> Bests,
> Dongjoon.
>
>
> On Thu, Aug 6, 2020 at 2:48 PM Sean Owen  wrote:
>
>> +1 from me. Same as usual: licenses and sigs look OK, and it builds and
>> passes tests on a standard selection of profiles.
>>
>> On Thu, Aug 6, 2020 at 7:07 AM Prashant Sharma 
>> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> > version 2.4.7.
>> >
>> > The vote is open until Aug 9th at 9AM PST and passes if a majority of
>> > +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.4.7
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see
>> > http://spark.apache.org/
>> >
>> > There are currently no issues targeting 2.4.7 (try project = SPARK
>> > AND "Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In
>> > Progress"))
>> >
>> > The tag to be voted on is v2.4.7-rc1 (commit
>> > dc04bf53fe821b7a07f817966c6c173f3b3788c6):
>> > https://github.com/apache/spark/tree/v2.4.7-rc1
>> >
>> > The release files, including signatures, digests, etc. can be found
>> > at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> >
>> > https://repository.apache.org/content/repositories/orgapachespark-1352/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-docs/
>> >
>> > The list of bug fixes going into 2.4.7 can be found at the
>> > following URL:
>> > https://s.apache.org/spark-v2.4.7-rc1
>> >
>> > This release is using the release script of the tag v2.4.7-rc1.
>> >
>> > FAQ
>> >
>> >
>> > =========================
>> > How can I help test this release?
>> > =========================
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running it on this release candidate,
>> > then reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks; in Java/Scala
>> > you can add the staging repository to your project's resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===========================================================
>> > What should happen to JIRA tickets still targeting 2.4.7?
>> > ===========================================================