Got it, thanks!
2017-02-11 0:56 GMT-08:00 Sam Elamin :
Here's a link to the thread
http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-Dropping-Duplicates-td20884.html
On Sat, 11 Feb 2017 at 08:47, Sam Elamin wrote:
Hey Egor
You can use a ForeachWriter or you can write a custom sink. I personally
went with a custom sink since I get a dataframe per batch.
You can have a look at
https://github.com/samelamin/spark-bigquery/blob/master/src/main/scala/com/samelamin/spark/bigquery/streaming/BigQuerySink.scala
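A minimal sketch of that custom-sink approach, assuming Spark 2.x's developer APIs (Sink and StreamSinkProvider); the class names, the "table" option, and the insertInto call are illustrative guesses, not what the linked BigQuery sink actually does:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// addBatch is handed one DataFrame per micro-batch, which is what makes a
// custom sink convenient here.
class HiveTableSink(table: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Append the micro-batch to the target table; depending on the Spark
    // version you may need to materialize the DataFrame first.
    data.write.mode("append").insertInto(table)
  }
}

class HiveSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink =
    new HiveTableSink(parameters("table"))
}

It could then be wired in by the provider's fully qualified class name, e.g. events.writeStream.format("com.example.HiveSinkProvider").option("table", "events").option("checkpointLocation", "/tmp/cp").start() (again, names and paths are placeholders).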
"Something like that" I've never tried it out myself so I'm only
guessing having a brief look at the API.
Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
Jacek, so I create a cache in the ForeachWriter, write to it in every
"process()" call, and flush it on close? Something like that?
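A rough sketch of that idea, purely as a guess along the same lines (flushBuffer is a hypothetical helper, not a Spark API):

import org.apache.spark.sql.{ForeachWriter, Row}
import scala.collection.mutable.ArrayBuffer

class BufferingWriter extends ForeachWriter[Row] {
  private val buffer = ArrayBuffer.empty[Row]

  // Called once per partition and per batch; version is the batch/epoch id.
  override def open(partitionId: Long, version: Long): Boolean = {
    buffer.clear()
    true
  }

  // Called once per element: accumulate instead of writing row by row.
  override def process(row: Row): Unit = {
    buffer += row
  }

  // Called when the partition is done: flush everything in one go.
  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) flushBuffer(buffer.toSeq)
  }

  private def flushBuffer(rows: Seq[Row]): Unit = {
    // Hypothetical: write the buffered rows to the external table in one call.
  }
}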
2017-02-09 12:42 GMT-08:00 Jacek Laskowski :
Hi,
Yes, that's ForeachWriter.
Yes, it works element by element. You're looking for mapPartitions, and
ForeachWriter has a partitionId that you could use to implement a
similar thing.
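For comparison, this is the batch-side mapPartitions pattern being alluded to: each task gets an iterator over a whole partition, so a single bulk write per partition is natural (the example data and the commented write are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("mapPartitions-sketch").getOrCreate()
import spark.implicits._

val ds = spark.range(0, 100).map(_.toString)

// One iterator per partition; ForeachWriter's open/process/close plus the
// partitionId let you approximate the same thing on the streaming side.
val written = ds.mapPartitions { partition =>
  val rows = partition.toSeq
  // writePartition(rows)  // hypothetical bulk write of the whole partition
  Iterator(rows.size)
}
written.show()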
Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0
Jacek, you mean
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter
? I do not understand how to use it, since it passes every value
separately, not every partition. And adding values to the table one by one
would not work.
2017-02-07 12:10 GMT-08:00 Jacek Laskowski :
Hi,
Have you considered foreach sink?
Jacek
On 6 Feb 2017 8:39 p.m., "Egor Pahomov" wrote:
I presume you may be able to implement a custom sink and use
df.saveAsTable. The problem is that you will have to handle idempotence /
garbage collection yourself, in case your job fails while writing, etc.
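To make the idempotence concern concrete, a naive sketch of guarding addBatch against replayed batches (GuardedHiveSink and its table parameter are made-up names). Note that keeping the marker only in memory, as below, does not survive a driver restart, which is exactly the bookkeeping you would still have to do yourself:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class GuardedHiveSink(table: String) extends Sink {
  // A real implementation would persist this marker (e.g. next to the data or
  // in an external store) so replays after a restart are also skipped.
  @volatile private var lastCommittedBatchId = -1L

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (batchId <= lastCommittedBatchId) {
      return // this batch was already written before a failure; skip it
    }
    data.write.mode("append").saveAsTable(table)
    lastCommittedBatchId = batchId
  }
}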
On Mon, Feb 6, 2017 at 5:53 PM, Egor Pahomov wrote:
I have a stream of files on HDFS with JSON events. I need to convert them to
Parquet in real time, process some fields, and store the result in a simple
Hive table so people can query it. People might even want to query it with
Impala, so it's important that it be a real Hive metastore based table. How
can I do that?
Hi Egor,
Structured Streaming handles all of its metadata itself, which files are
actually valid, etc. You may use the "create table" syntax in SQL to treat
it like a Hive table, but it will handle all partitioning information in
its own metadata log. Is there a specific reason that you want to
Hi, I'm thinking of using Structured Streaming instead of the old streaming,
but I need to be able to save results to a Hive table. Documentation for the
file sink says (
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks):
"Supports writes to partitioned tables."