It is always possible that there will be extra jobs from failed batches.
However, for the file sink, only one set of files will make it into the
_spark_metadata directory log. This is how we get atomic commits even when
there are files in more than one directory. When reading the files with
Spark, we'll detect this directory and use it instead of listStatus to find
the list of valid files.
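
To make that concrete, here's a minimal sketch (bucket names, paths, and the
schema are made up for illustration) of a file sink query plus a later batch
read that picks up the _spark_metadata log rather than doing a raw directory
listing:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("file-sink-sketch").getOrCreate()
val mySchema = new StructType().add("id", StringType).add("value", LongType)

// Streaming query writing to a file sink: each successful batch records its
// files under <output path>/_spark_metadata, so files left behind by a failed
// or retried batch never show up in the log.
val query = spark.readStream
  .schema(mySchema)
  .json("s3a://my-bucket/incoming/")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/ingest/")
  .start("s3a://my-bucket/output/")

// A batch read of the same path detects _spark_metadata and reads only the
// committed files, instead of relying on a plain listStatus of the directory.
val committed = spark.read.parquet("s3a://my-bucket/output/")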

On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> On another note, when it comes to checkpointing on Structured Streaming:
>
> I noticed that if I have a stream running off S3 and I kill the process, the
> next time the process starts running it duplicates the last record
> inserted. Is that normal?
>
> So say I have streaming enabled on one folder "test" which only has two
> files, "update1" and "update2", then I kill the Spark job using Ctrl+C.
> When I rerun the stream it picks up "update2" again.
>
> Is this normal? Isn't Ctrl+C a failure?
>
> I would expect checkpointing to know that "update2" was already processed.
>
> Regards
> Sam
>
> On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin <hussam.ela...@gmail.com>
> wrote:
>
>> Thanks Michael!
>>
>>
>>
>> On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-19497
>>>
>>> We should add this soon.
>>>
>>> On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>>> Hi All
>>>>
>>>> When reading a stream off S3, if I try to drop duplicates I get
>>>> the following error:
>>>>
>>>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>>> Append output mode not supported when there are streaming aggregations on
>>>> streaming DataFrames/DataSets;;
>>>>
>>>>
>>>> What's strange is that if I use the batch "spark.read.json", it works.
>>>>
>>>> Can I assume you can't drop duplicates in Structured Streaming?
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>
>>>
>>
>
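
For the dropDuplicates question quoted above, here is roughly the shape of the
code involved (paths, the column name, and the schema are made up): the batch
version works today, while the streaming version is currently planned as a
streaming aggregation, which append output mode rejects with the
AnalysisException Sam quoted. SPARK-19497 tracks adding proper streaming
deduplication.

// Reusing the spark session from the sketch above.
// Batch read + dropDuplicates works:
val batch = spark.read.json("s3a://my-bucket/test/")
batch.dropDuplicates("id").show()

// The streaming equivalent currently plans dropDuplicates as a streaming
// aggregation, so starting the query in append mode throws the
// AnalysisException quoted above:
val stream = spark.readStream.schema(batch.schema).json("s3a://my-bucket/test/")
stream.dropDuplicates("id")
  .writeStream
  .outputMode("append")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/dedup/")
  .format("json")
  .start("s3a://my-bucket/dedup-out/")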
