I understand each row has a topic column, but can we write one row to multiple topics?
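One possible way to do that, sketched under the assumption that the query output already carries "key" and "value" columns (the DataFrame name, topic names and broker address below are placeholders, not from the thread): duplicate each row once per destination topic and let the Kafka sink route on the per-row "topic" column that TD describes further down the thread.

    import org.apache.spark.sql.functions.{array, explode, lit}

    // resultDf: hypothetical streaming DataFrame that already has "key" and "value" columns.
    // explode(array(...)) emits one copy of each row per listed topic.
    val fannedOut = resultDf
      .withColumn("topic", explode(array(lit("topic_a"), lit("topic_b"))))

    val query = fannedOut
      .selectExpr("topic", "CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("checkpointLocation", "/tmp/checkpoints/fanout")
      .start()

With no "topic" option set on the sink, each row goes to whatever topic its own "topic" column names, so a row duplicated N times lands on N topics.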
On Thu, Jul 12, 2018 at 11:00 AM, Arun Mahadevan <ar...@apache.org> wrote:
> What I meant was that the number of partitions cannot be varied with
> ForeachWriter, vs. writing to each sink using independent queries. Maybe
> this is obvious.
>
> I am not sure about the difference you highlight regarding the performance
> part. The commit happens once per micro batch and "close(null)" is invoked.
> You can batch your writes in process and/or in close. I guess the writes
> can still be atomic, decided by whether "close" returns successfully or
> throws an exception.
>
> Thanks,
> Arun
>
> From: chandan prakash <chandanbaran...@gmail.com>
> Date: Thursday, July 12, 2018 at 10:37 AM
> To: Arun Iyer <ar...@apache.org>
> Cc: Tathagata Das <tathagata.das1...@gmail.com>, "ymaha...@snappydata.io"
> <ymaha...@snappydata.io>, "priy...@asperasoft.com" <priy...@asperasoft.com>,
> "user @spark" <user@spark.apache.org>
> Subject: Re: [Structured Streaming] Avoiding multiple streaming queries
>
> Thanks a lot Arun for your response.
> I got your point that existing sink plugins like Kafka, etc. cannot be used.
> However, I could not get the part: "you cannot scale the partitions for
> the sinks independently".
> Can you please rephrase that part?
>
> Also, I guess using ForeachWriter for multiple sinks will affect
> performance, because writes will happen to a sink on a per-record basis
> (after deciding which particular sink a record belongs to), whereas in the
> current implementation all data under an RDD partition gets committed to
> the sink atomically in one go. Please correct me if I am wrong here.
>
> Regards,
> Chandan
>
> On Thu, Jul 12, 2018 at 10:53 PM Arun Mahadevan <ar...@apache.org> wrote:
>
>> Yes, ForeachWriter [1] could be an option if you want to write to
>> different sinks. You can put in your custom logic to split the data into
>> different sinks.
>>
>> The drawback here is that you cannot plug in existing sinks like Kafka,
>> you need to write the custom logic yourself, and you cannot scale the
>> partitions for the sinks independently.
>>
>> [1] https://spark.apache.org/docs/2.1.2/api/java/org/apache/spark/sql/ForeachWriter.html
>>
>> From: chandan prakash <chandanbaran...@gmail.com>
>> Date: Thursday, July 12, 2018 at 2:38 AM
>> To: Tathagata Das <tathagata.das1...@gmail.com>, "ymaha...@snappydata.io"
>> <ymaha...@snappydata.io>, "priy...@asperasoft.com" <priy...@asperasoft.com>,
>> "user @spark" <user@spark.apache.org>
>> Subject: Re: [Structured Streaming] Avoiding multiple streaming queries
>>
>> Hi,
>> Has anyone thought about writing a custom foreach sink writer that can
>> decide which record should go to which sink (based on some marker in the
>> record, which we could annotate during transformation) and then write to
>> that specific sink accordingly?
>> This would mean that:
>> 1. every custom sink writer will have connections to as many sinks as
>> there are types of sink the records can go to;
>> 2. every record will be read only once in a single query but can be
>> written to multiple sinks.
>>
>> Do you see any drawbacks in this approach?
>> One drawback, of course, is that a sink is supposed to write records as
>> they are, but here we are putting some intelligence into the sink.
>> Apart from that, do you see any other issues with this approach?
>>
>> Regards,
>> Chandan
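For illustration, a rough, untested sketch of the router-style ForeachWriter Chandan describes above, assuming Spark 2.x. The "sinkTag" marker column and the in-memory buffers are hypothetical stand-ins; a real writer would open connections to the actual sinks in open() and commit in close(), which is where Arun's point about atomicity applies.

    import org.apache.spark.sql.{ForeachWriter, Row}
    import scala.collection.mutable.ArrayBuffer

    class RoutingWriter extends ForeachWriter[Row] {
      // Stand-ins for per-partition connections to the two target sinks.
      @transient private var sinkA: ArrayBuffer[Row] = _
      @transient private var sinkB: ArrayBuffer[Row] = _

      override def open(partitionId: Long, version: Long): Boolean = {
        sinkA = ArrayBuffer.empty[Row]
        sinkB = ArrayBuffer.empty[Row]
        true // true => process this partition for this version/epoch
      }

      // Route each record on the marker column annotated during transformation.
      override def process(row: Row): Unit =
        if (row.getAs[String]("sinkTag") == "A") sinkA += row else sinkB += row

      override def close(errorOrNull: Throwable): Unit = {
        // Flush the buffered writes to the real sinks here; throwing marks the
        // epoch as failed, which is what makes the batched writes effectively atomic.
        if (errorOrNull == null) {
          // flush sinkA to sink A, sinkB to sink B
        }
      }
    }

    // Usage: df.writeStream.foreach(new RoutingWriter).start()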
>> On Thu, Feb 15, 2018 at 7:41 AM Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> Of course, you can write to multiple Kafka topics from a single query.
>>> If the dataframe you want to write has a column named "topic" (along with
>>> "key" and "value" columns), each row will be written to the topic named in
>>> that row. This works automatically, so the only thing you need to figure
>>> out is how to generate the value of that column.
>>>
>>> This is documented:
>>> https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
>>>
>>> Or am I misunderstanding the problem?
>>>
>>> TD
>>>
>>> On Tue, Feb 13, 2018 at 10:45 AM, Yogesh Mahajan <ymaha...@snappydata.io> wrote:
>>>
>>>> I had a similar issue, and I think that is where the Structured
>>>> Streaming design is lacking.
>>>> Question #2 in your email seems like a viable workaround for you.
>>>>
>>>> In my case, I have a custom Sink backed by an efficient in-memory
>>>> column store suited for fast ingestion.
>>>>
>>>> I have a Kafka stream coming from one topic, and I need to classify the
>>>> stream based on schema.
>>>> For example, a Kafka topic can have three different types of schema
>>>> messages, and I would like to ingest them into three different column
>>>> tables (having different schemas) using my custom Sink implementation.
>>>>
>>>> Right now the only (?) option I have is to create three streaming
>>>> queries reading the same topic and ingesting into the respective column
>>>> tables using their Sink implementations.
>>>> These three streaming queries create three underlying
>>>> IncrementalExecutions and three KafkaSources, i.e. three queries reading
>>>> the same data from the same Kafka topic.
>>>> Even with CachedKafkaConsumers at the partition level, this is not an
>>>> efficient way to handle a simple streaming use case.
>>>>
>>>> One workaround to overcome this limitation is to have the same schema
>>>> for all the messages in a Kafka partition; unfortunately this is not in
>>>> our control, and customers cannot change it due to their dependencies on
>>>> other subsystems.
>>>>
>>>> Thanks,
>>>> http://www.snappydata.io/blog <http://snappydata.io>
>>>>
>>>> On Mon, Feb 12, 2018 at 5:54 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>>>>
>>>>> I have a structured streaming query which sinks to Kafka. This query
>>>>> has complex aggregation logic.
>>>>>
>>>>> I would like to sink the output DF of this query to multiple Kafka
>>>>> topics, each partitioned on a different 'key' column. I don't want to
>>>>> have a separate Kafka sink for each of the different Kafka topics,
>>>>> because that would mean running multiple streaming queries, one for
>>>>> each Kafka topic, especially since my aggregation logic is complex.
>>>>>
>>>>> Questions:
>>>>>
>>>>> 1. Is there a way to output the results of a structured streaming
>>>>> query to multiple Kafka topics, each with a different key column, but
>>>>> without having to execute multiple streaming queries?
>>>>>
>>>>> 2. If not, would it be efficient to cascade the multiple queries, such
>>>>> that the first query does the complex aggregation and writes its output
>>>>> to Kafka, and then the other queries just read the output of the first
>>>>> query and write their topics to Kafka, thus avoiding doing the complex
>>>>> aggregation again?
>>>>>
>>>>> Thanks in advance for any help.
>>>>>
>>>>> Priyank
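And a sketch of the cascade Priyank asks about in question 2 above, again with all names (input, spark, topic names, checkpoint paths) as placeholders: the first query does the aggregation once and writes to an intermediate topic, and cheap follow-up queries re-read that topic and fan out to the final topics.

    // Query 1: the expensive aggregation, done once, written to an intermediate topic.
    val aggregated = input.groupBy("key").count()
    aggregated
      .selectExpr("CAST(key AS STRING) AS key", "CAST(count AS STRING) AS value")
      .writeStream
      .format("kafka")
      .outputMode("update")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("topic", "intermediate")
      .option("checkpointLocation", "/tmp/checkpoints/agg")
      .start()

    // Queries 2..n: cheap pass-throughs from the intermediate topic to each final topic.
    val fromIntermediate = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "intermediate")
      .load()

    Seq("topic_a", "topic_b").foreach { finalTopic =>
      fromIntermediate
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:9092")
        .option("topic", finalTopic)
        .option("checkpointLocation", s"/tmp/checkpoints/$finalTopic")
        .start()
    }

The aggregation then runs only once, at the cost of an extra hop through Kafka and the added end-to-end latency that implies.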
>>
>> --
>> Chandan Prakash
>
> --
> Chandan Prakash