Tomorrow afternoon @ 3pm Pacific I'll be doing some dev tools poking for
Beam and Spark - https://www.youtube.com/watch?v=6cTmC_fP9B0 for
mention-bot.
On Friday I'll be doing my normal code reviews -
https://www.youtube.com/watch?v=O4rRx-3PTiM
On Monday July 30th @ 9:30am I'll be doing some more
On Thu, Jul 12, 2018 at 10:23 AM Arun Mahadevan wrote:
> Yes, ForeachWriter [1] could be an option if you want to write to different
> sinks. You can put your custom logic in it to split the data into different
> sinks.
>
> The drawback here is that you cannot plug in existing sinks like Kafka and
> you
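A minimal sketch of the ForeachWriter approach Arun describes, assuming df is
the streaming DataFrame in question and that a "category" column drives the
routing; the println calls are hypothetical stand-ins for real sink clients:

import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical sketch: route each row to one of two destinations per row.
val splittingWriter = new ForeachWriter[Row] {
  override def open(partitionId: Long, epochId: Long): Boolean = true // open sink clients here

  override def process(row: Row): Unit = {
    // Custom split logic: inspect a (hypothetical) "category" column to pick a sink.
    if (row.getAs[String]("category") == "a")
      println(s"sink A: $row") // stand-in for a real write to sink A
    else
      println(s"sink B: $row") // stand-in for a real write to sink B
  }

  override def close(errorCause: Throwable): Unit = () // close sink clients here
}

// Every row of the stream passes through the writer.
df.writeStream.foreach(splittingWriter).start()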
I'm trying to read JSON messages from Kafka and store them in HDFS with Spark
Structured Streaming.
I followed the example here:
https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html
and when my code looks like this:
df = spark \
  .read \
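For context, a completed version of that pipeline might look roughly like the
Scala sketch below (note the snippet above uses read, while a streaming query
uses readStream). The schema, broker address, topic, and HDFS paths are all
placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()
import spark.implicits._

// Hypothetical schema for the incoming JSON messages.
val schema = new StructType().add("id", StringType).add("payload", StringType)

val df = spark
  .readStream                                      // streaming read, not a one-off batch
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092") // placeholder broker
  .option("subscribe", "topic1")                   // placeholder topic
  .load()
  .select(from_json($"value".cast("string"), schema).as("data")) // parse the JSON payload
  .select("data.*")

df.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/out")              // placeholder HDFS output path
  .option("checkpointLocation", "hdfs:///data/chk")// required for file sinks
  .start()
  .awaitTermination()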
I am not seeing what else I would be increasing in this case if I were to
write a set of rows to two different topics. It means that for every row
there will be two I/Os, one to each topic.
If foreachBatch caches, great! But again, I would try to cache as little as
possible. If I duplicate rows
Any way you go, you'll increase something... ☺
Even with foreachBatch you would have to cache the DataFrame before writing
each batch to each topic, to avoid recomputing it (see
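A rough sketch of Silvio's foreachBatch-plus-cache idea (foreachBatch arrived
in Spark 2.4), assuming df is the streaming DataFrame and already has the
"value" column the Kafka sink expects; broker, topics, and checkpoint path are
placeholders:

import org.apache.spark.sql.DataFrame

df.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Cache once so the two Kafka writes don't each recompute the batch.
    batchDF.persist()
    Seq("topicA", "topicB").foreach { topic =>           // placeholder topics
      batchDF.write
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:9092") // placeholder broker
        .option("topic", topic)
        .save()
    }
    batchDF.unpersist()
  }
  .option("checkpointLocation", "hdfs:///data/chk-2topics") // placeholder
  .start()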
super duper :)
On Tue, Jul 24, 2018 at 7:11 PM, Patrick McCarthy <
pmccar...@dstillery.com.invalid> wrote:
> Thanks Bryan. I think it was ultimately groupings that were too large -
> after setting spark.sql.shuffle.partitions to a much higher number I was
> able to get the UDF to execute.
>
> On
Thanks Bryan. I think it was ultimately groupings that were too large -
after setting spark.sql.shuffle.partitions to a much higher number I was
able to get the UDF to execute.
On Fri, Jul 20, 2018 at 12:45 AM, Bryan Cutler wrote:
> Hi Patrick,
>
> It looks like it's failing in Scala before it
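The fix Patrick describes amounts to raising the shuffle partition count so
each grouping handled by a single task is smaller. A minimal sketch, assuming
spark is the active SparkSession and "userId" is a hypothetical grouping
column (2000 is just an illustrative value; the default is 200):

// More shuffle partitions => smaller per-task groupings for the grouped UDF.
spark.conf.set("spark.sql.shuffle.partitions", 2000)

// Subsequent group-by operations now shuffle into 2000 partitions.
val counts = df.groupBy("userId").count()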
@Silvio I thought about duplicating rows but dropped the idea because of the
increased memory use. foreachBatch sounds interesting!
On Mon, Jul 23, 2018 at 6:51 AM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:
> Using the current Kafka sink that supports routing based on topic column,
> you could just
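A sketch of what Silvio is pointing at: when no "topic" option is set, the
Kafka sink routes each row by its "topic" column, so one write can fan out to
several topics. The "category" column, topic names, broker, and checkpoint
path below are placeholders:

import org.apache.spark.sql.functions.{col, lit, when}

df.withColumn("topic",
    when(col("category") === "a", lit("topicA")).otherwise(lit("topicB")))
  .selectExpr("topic", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")        // placeholder broker
  .option("checkpointLocation", "hdfs:///data/chk-route") // placeholder
  .start()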
Hi,
Can you see whether using the checkpointLocation option would work, given
that you are using Structured Streaming?
Regards,
Gourav Sengupta
On Tue, Jul 24, 2018 at 12:30 PM, John, Vishal (Agoda) <
vishal.j...@agoda.com.invalid> wrote:
>
> Hello all,
>
>
> I have to read data from Kafka
Hello all,
I have to read data from a Kafka topic at regular intervals. I create the
dataframe as shown below. I don’t want to start reading from the beginning on
each run. At the same time, I don’t want to miss the messages between run
intervals.
val queryDf = sqlContext
  .read
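One way to make Gourav's checkpointLocation suggestion cover the
"regular intervals" requirement is a Structured Streaming query fired with
Trigger.Once on each run: offsets are tracked in the checkpoint, so each
invocation resumes exactly where the previous one stopped and nothing in
between is missed. A sketch, with broker, topic, and paths as placeholders:

import org.apache.spark.sql.streaming.Trigger

val queryDf = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092") // placeholder broker
  .option("subscribe", "mytopic")                  // placeholder topic
  .load()

queryDf.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/out")              // placeholder output path
  .option("checkpointLocation", "hdfs:///data/chk")// offsets live here between runs
  .trigger(Trigger.Once())                         // process what's new, then stop
  .start()
  .awaitTermination()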