Re: [Structured Streaming] Metrics or logs of events that are ignored due to watermark

2018-07-02 Thread Jungtaek Lim
Hi, I have tried it via https://github.com/apache/spark/pull/21617 but soon realized that it is not an accurate count of late input rows, because Spark applies the watermark lazily and discards rows at the state operator(s), whose inputs are not necessarily the same as the original input rows (some already filtered

Re: Spark Druid Ingestion

2018-07-02 Thread gosoy
Hi Nayan, Were you able to resolve this issue? Is it because of some file/folder permission problems? -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

How to avoid duplicate column names after join with multiple conditions

2018-07-02 Thread Nirav Patel
The expression is `df1(a) === df2(a) and df1(b) === df2(c)`. How do I avoid the duplicate column 'a' in the result? I don't see any API that combines both conditions. Rename manually?
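One common workaround (a sketch, not from the thread) is to join with the full expression and then drop the duplicate column by referencing it through the right-hand DataFrame: `Dataset.drop` accepts a `Column`, so only `df2`'s copy of `a` is removed. The column names `a`, `b`, `c` below are placeholders standing in for the question's columns:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical inputs standing in for the df1/df2 from the question
val df1 = Seq((1, "x"), (2, "y")).toDF("a", "b")
val df2 = Seq((1, "x"), (3, "z")).toDF("a", "c")

val joined = df1
  .join(df2, df1("a") === df2("a") && df1("b") === df2("c"))
  .drop(df2("a")) // drops only df2's copy of "a", leaving df1("a") in place

joined.show()
```

Renaming `df2("a")` with `withColumnRenamed` before the join also works and keeps both values available if you need them.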

[Structured Streaming] Metrics or logs of events that are ignored due to watermark

2018-07-02 Thread subramgr
Hi all, do we have logs or metrics, recorded in log files or sent to a metrics sink, about the number of events that are ignored due to the watermark in Structured Streaming? Thanks
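There is no direct "dropped by watermark" counter surfaced in this Spark version (the PR mentioned earlier in this digest explains why an input-side count would be inaccurate), but the per-batch progress does expose the current watermark and min/max event times, from which late data can be inferred. A minimal sketch using the real `StreamingQueryListener` API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Logs each micro-batch's progress as JSON; the "eventTime" section contains
// the current watermark and the min/max event times seen in the batch.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(event.progress.json)
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})
```

The same information is available ad hoc from `query.lastProgress` on a running `StreamingQuery`.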

Question about Spark, Inner Join and Delegation to a Parquet Table

2018-07-02 Thread Mike Buck
I have a question about Spark and how it delegates filters to a Parquet-based table. I have two tables in Hive in Parquet format. Table1 has four columns of type double and table2 has two columns of type double. I am doing an INNER JOIN of the following: SELECT table1.name FROM table1
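The query in the thread is truncated, but filter delegation can be inspected directly: `explain(true)` prints the physical plan, where predicates handed down to Parquet appear as `PushedFilters` on the `FileScan` nodes. Table and column names below are placeholders, not the thread's actual schema:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").enableHiveSupport().getOrCreate()

// hypothetical join standing in for the truncated query above
val q = spark.sql(
  """SELECT table1.name
    |FROM table1
    |JOIN table2 ON table1.x = table2.x
    |WHERE table1.y > 1.0""".stripMargin)

// Look for "PushedFilters: [...]" under the FileScan parquet nodes to see
// which predicates were delegated to the Parquet reader.
q.explain(true)
```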

Error while doing stream-stream inner join (java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access)

2018-07-02 Thread kant kodali
Hi All, I get the below error quite often when I do a stream-stream inner join on two data frames. After running several experiments, stream-stream joins don't look stable enough for production yet. Any advice on this? Thanks! java.util.ConcurrentModificationException: KafkaConsumer is
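For reference, the minimal shape of a watermarked stream-stream inner join (as documented for Spark 2.3+) looks like the sketch below; it uses the built-in `rate` source instead of Kafka, so it does not reproduce the `KafkaConsumer` threading error itself, and the column names are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Two synthetic streams; each rate row has columns (timestamp, value).
val left = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .withColumnRenamed("value", "leftId")
  .withWatermark("timestamp", "10 seconds")

val right = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .withColumnRenamed("value", "rightId")
  .withWatermark("timestamp", "10 seconds")

// Inner join on the id columns; without an additional time-range condition
// in the join predicate, join state can grow unboundedly.
val joined = left.join(right, expr("leftId = rightId"))

joined.writeStream.format("console").start()
```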

union of multiple twitter streams [spark-streaming-twitter_2.11]

2018-07-02 Thread Imran Rajjad
Hello, has anybody tried to union two streams of Twitter Statuses? I am instantiating two Twitter streams through two different sets of credentials and passing them through a union function, but the console does not show any activity, nor are there any errors. --static function that returns
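A silent, error-free console with receiver-based streams is often a core-count problem: each Twitter receiver occupies one core, so with two receivers the application needs at least three cores (e.g. `local[3]` or more), or no tasks are left to process data. A sketch of the union itself, with credential values as obvious placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
import twitter4j.auth.OAuthAuthorization
import twitter4j.conf.ConfigurationBuilder

// hypothetical helper: build an OAuthAuthorization from one credential set
def makeAuth(key: String, secret: String, token: String, tokenSecret: String): OAuthAuthorization =
  new OAuthAuthorization(
    new ConfigurationBuilder()
      .setOAuthConsumerKey(key)
      .setOAuthConsumerSecret(secret)
      .setOAuthAccessToken(token)
      .setOAuthAccessTokenSecret(tokenSecret)
      .build())

// At least 2 receivers + 1 processing core are needed, hence local[3].
val conf = new SparkConf().setMaster("local[3]").setAppName("twitter-union")
val ssc  = new StreamingContext(conf, Seconds(10))

val s1 = TwitterUtils.createStream(ssc, Some(makeAuth("k1", "s1", "t1", "ts1")))
val s2 = TwitterUtils.createStream(ssc, Some(makeAuth("k2", "s2", "t2", "ts2")))

// Union the two DStreams before any output operation.
s1.union(s2).map(_.getText).print()

ssc.start()
ssc.awaitTermination()
```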

Re: Dataframe reader does not read microseconds, but TimestampType supports microseconds

2018-07-02 Thread Jörn Franke
How do you read the files? Do you have some source code? It could be related to the JSON data source. What Spark version do you use? > On 2 Jul 2018, at 09:03, Colin Williams wrote: > I'm confused as to why Spark's DataFrame reader does not support reading json or similar with

Dataframe reader does not read microseconds, but TimestampType supports microseconds

2018-07-02 Thread Colin Williams
I'm confused as to why Spark's DataFrame reader does not support reading JSON or similar with microsecond timestamps into microseconds, but instead reads them into milliseconds. This seems strange when TimestampType supports microseconds. For example, create a schema for a JSON object with a column of
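A workaround sketch (not from the thread): read the field as a string and cast it to `TimestampType` afterwards. `TimestampType` stores microseconds since the epoch internally, and the string-to-timestamp cast parses fractional seconds beyond millisecond precision, sidestepping the JSON reader's `SimpleDateFormat`-based parsing. The file path and field name are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// assumed input line: {"ts": "2018-07-02 09:03:00.123456"}
val schema = StructType(Seq(StructField("ts", StringType)))
val df = spark.read.schema(schema).json("events.json")

// Casting the string to TimestampType preserves the microsecond component
// in Spark's internal microseconds-since-epoch representation.
val withTs = df.withColumn("ts", col("ts").cast(TimestampType))
withTs.printSchema()
```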