Re: Why does join use rows that were sent after watermark of 20 seconds?

2018-12-10 Thread Jungtaek Lim
Please refer to the Structured Streaming programming guide, which is very clear about when a query will have unbounded state: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking Quoting the doc: In other words, you will

Re: Why does join use rows that were sent after watermark of 20 seconds?

2018-12-10 Thread Abhijeet Kumar
You mean to say that Spark will store all the data in memory forever :)
> On 10-Dec-2018, at 6:16 PM, Sandeep Katta wrote:
>
> Hi Abhijeet,
>
> You are using an inner join with unbounded state, which means every record in one stream will match with the other stream indefinitely.
> If you want the

Behavior of checkpointLocation from options vs setting conf spark.sql.streaming.checkpointLocation

2018-12-10 Thread Shubham Chaurasia
Hi, I would like to confirm checkpointing behavior. I have observed the following scenarios: *1)* When I set checkpointLocation from the streaming query, like: val query = rateDF.writeStream.format("console").outputMode("append").trigger(Trigger.ProcessingTime("1 seconds")).*option("checkpointLocation",
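The two ways of setting a checkpoint location being compared can be sketched as follows (a minimal sketch, assuming `rateDF` comes from the built-in rate source; the paths are placeholders). When both are set, the per-query `option("checkpointLocation", ...)` takes precedence for that query, while the session conf `spark.sql.streaming.checkpointLocation` acts as a default base directory under which each query gets its own subdirectory.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("checkpoint-demo")
  // 2) Session-wide default: queries without an explicit location
  //    checkpoint into subdirectories under this base path.
  .config("spark.sql.streaming.checkpointLocation", "/tmp/ckpt-base")
  .getOrCreate()

val rateDF = spark.readStream.format("rate").load()

// 1) Per-query option: overrides the session-wide conf for this query only.
val query = rateDF.writeStream
  .format("console")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("1 seconds"))
  .option("checkpointLocation", "/tmp/ckpt-query-1")
  .start()
```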

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-10 Thread Hyukjin Kwon
Ah, sorry. I missed it. It works correctly. Thanks.
On Tue, Dec 11, 2018 at 10:47 AM, Sean Owen wrote:
> Did you do the step where you sync your GitHub and ASF account? After an hour you should get an email and then you can.
>
> On Mon, Dec 10, 2018, 8:01 PM Hyukjin Kwon wrote:
>> BTW, should I be able to

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-10 Thread Sean Owen
Did you do the step where you sync your GitHub and ASF account? After an hour you should get an email and then you can.
On Mon, Dec 10, 2018, 8:01 PM Hyukjin Kwon wrote:
> BTW, should I be able to close PRs via the GitHub UI right now or is there another way to do it? It looks like I'm not seeing the close button.

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-10 Thread Hyukjin Kwon
BTW, should I be able to close PRs via the GitHub UI right now, or is there another way to do it? It looks like I'm not seeing the close button.
On Tue, Dec 11, 2018 at 1:51 AM, Sean Owen wrote:
> Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra noise.
>
> On Mon, Dec 10, 2018 at 11:37 AM

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-10 Thread Sean Owen
Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra noise.
On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin wrote:
> Hmm, it also seems that github comments are being sync'ed to jira. That's gonna get old very quickly, we should probably ask infra to disable that (if we

Re: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional

2018-12-10 Thread Ryan Blue
Anyone can attend the v2 sync; you just need to let me know what email address you'd like to have added. Sorry that it is invite-only; that's a limitation of the platform (Hangouts). The Spark community welcomes anyone who wants to participate. On Mon, Dec 10, 2018 at 1:00 AM JOAQUIN GUANTER

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-10 Thread Marcelo Vanzin
Hmm, it also seems that github comments are being sync'ed to jira. That's gonna get old very quickly, we should probably ask infra to disable that (if we can't do it ourselves).
On Mon, Dec 10, 2018 at 9:13 AM Sean Owen wrote:
> Update for committers: now that my user ID is synced, I can

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-10 Thread Sean Owen
Update for committers: now that my user ID is synced, I can successfully push to remote https://github.com/apache/spark directly. Use that as the 'apache' remote (if you like; gitbox also works). I confirmed the sync works both ways. As a bonus you can directly close pull requests when needed

Apache Spark git repo moved to gitbox.apache.org

2018-12-10 Thread Sean Owen
Per the thread last week, the Apache Spark repos have migrated from https://git-wip-us.apache.org/repos/asf to https://gitbox.apache.org/repos/asf Non-committers: This just means repointing any references to the old repository to the new one. It won't affect you if you were already referencing
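The repointing described above can be done by updating the remote URL in an existing clone; a minimal sketch (the remote name `origin` and the exact repository path under gitbox are assumptions based on the URLs in the thread):

```shell
# Point an existing clone at the new gitbox repository
git remote set-url origin https://gitbox.apache.org/repos/asf/spark.git

# Verify the change
git remote get-url origin
```

Committers who prefer pushing via GitHub can use `https://github.com/apache/spark.git` instead, as noted later in the thread, since the sync works both ways.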

Re: Why does join use rows that were sent after watermark of 20 seconds?

2018-12-10 Thread Sandeep Katta
Hi Abhijeet, You are using an inner join with unbounded state, which means every record in one stream will match with the other stream indefinitely. If you want the intended behaviour, you should add a timestamp condition or a window operator to the join condition. On Mon, 10 Dec 2018 at 5:23 PM, Abhijeet Kumar
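Sandeep's suggestion corresponds to the event-time range condition described in the Structured Streaming guide; a minimal sketch using the column names from the original question (`order_details`, `invoice_details`, `tstamp_trans`, `s_order_id`; the right-hand key column and the 20-second interval are illustrative assumptions):

```scala
import org.apache.spark.sql.functions.expr

// Watermark both streams so events more than 20 seconds late are dropped.
val order_wm   = order_details.withWatermark("tstamp_trans", "20 seconds")
val invoice_wm = invoice_details.withWatermark("tstamp_trans", "20 seconds")

// Bound the join in event time; without a time-range condition, the join
// state is unbounded and rows can keep matching indefinitely.
val join_df = order_wm.join(
  invoice_wm,
  order_wm.col("s_order_id") === invoice_wm.col("s_order_id") &&
    invoice_wm.col("tstamp_trans") >= order_wm.col("tstamp_trans") &&
    invoice_wm.col("tstamp_trans") <= order_wm.col("tstamp_trans") + expr("INTERVAL 20 SECONDS")
)
```

With both the watermarks and the time-range condition in place, Spark can compute how long each side's state must be retained and expire old rows instead of keeping them forever.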

Why does join use rows that were sent after watermark of 20 seconds?

2018-12-10 Thread Abhijeet Kumar
Hello, I’m using watermark to join two streams as you can see below:
val order_wm = order_details.withWatermark("tstamp_trans", "20 seconds")
val invoice_wm = invoice_details.withWatermark("tstamp_trans", "20 seconds")
val join_df = order_wm.join(invoice_wm, order_wm.col("s_order_id") ===

Re: Pushdown in DataSourceV2 question

2018-12-10 Thread Alessandro Solimando
I think you are generally right, but there are so many different scenarios that it might not always be the best option. Consider, for instance, a "fast" network between a single data source and Spark, lots of data, and an "expensive" expression with low selectivity, as Wenchen suggested. In such
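The tradeoff under discussion is ultimately about which filters a source agrees to push down. In the DataSourceV2 reader API current at the time of this thread (Spark 2.4), a reader can accept some filters and hand the rest back for Spark to evaluate after the scan; a minimal sketch (the trait name and the "cheap filter" heuristic are illustrative assumptions, not a real connector):

```scala
import org.apache.spark.sql.sources.{Filter, GreaterThan}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownFilters}

// Hypothetical reader mixin that pushes only simple, selective comparisons
// down to the source and returns expensive expressions to Spark.
trait CheapFilterPushdown extends DataSourceReader with SupportsPushDownFilters {
  private var accepted: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    // Accept only simple comparisons; everything else stays with Spark.
    val (cheap, expensive) = filters.partition {
      case _: GreaterThan => true
      case _              => false
    }
    accepted = cheap
    expensive // residual filters Spark must still apply after the scan
  }

  override def pushedFilters(): Array[Filter] = accepted
}
```

This is where the scenario above matters: over a fast network with a low-selectivity predicate, declining the pushdown and filtering in Spark may be cheaper than evaluating an expensive expression at the source.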

RE: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional

2018-12-10 Thread JOAQUIN GUANTER GONZALBEZ
Ah, yes, you are right. The DataSourceV2 APIs wouldn’t let an implementor mark a DataSet as “bucketed”. Is there any documentation about the upcoming table support for data source v2 or any way of getting invited to the DataSourceV2 community sync? Thanks! Ximo. De: Wenchen Fan Enviado el: