Re: Java 9

2017-02-07 Thread kant kodali
Well, and the module system! On Tue, Feb 7, 2017 at 4:03 AM, Timur Shenkao wrote: > If I'm not wrong, they got rid of *sun.misc.Unsafe* in Java 9. > > This class is still used by several libraries & frameworks. > >

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Ignore me, a bit more digging and I was able to find the file sink source. Following that pattern worked a treat! Thanks again

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Sorry, those are methods I wrote so you can ignore them :) So just adding a path parameter tells Spark that's where the update log is? Do I check for the unique id there and identify which batches were written and which weren't? Are there any examples of this out there? There aren't many connectors

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
The JSON log is only used by the file sink (which it doesn't seem like you are using). Though, I'm not sure exactly what is going on inside of setupGoogle or how tableReferenceSource is used. Typically you would run df.writeStream.option("path", "/my/path")... and then the transaction log would
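
A minimal sketch of the file-sink pattern described above, assuming a Parquet sink and purely illustrative schema, input, output, and checkpoint paths; the sink keeps its transaction log under <path>/_spark_metadata:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.getOrCreate()

    // Hypothetical file-based streaming source; any source works the same way here.
    val schema = new StructType().add("id", StringType).add("value", StringType)
    val df = spark.readStream.schema(schema).json("/my/input")

    // Writing with a "path" option uses the file sink, which records each
    // committed batch's output files under /my/path/_spark_metadata.
    val query = df.writeStream
      .format("parquet")
      .option("path", "/my/path")
      .option("checkpointLocation", "/my/checkpoints")
      .outputMode("append")
      .start()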

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Hi Michael, if that's the case for the below example, where should I be reading these JSON log files? I'm assuming sometime between df and query? val df = spark.readStream.option("tableReferenceSource", tableName).load() setUpGoogle(spark.sqlContext) val query = df

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
Read the JSON log of files that is in `/your/path/_spark_metadata` and only read files that are present in that log (ignore anything else). On Tue, Feb 7, 2017 at 1:16 PM, Sam Elamin wrote: > Ah I see ok so probably it's the retry that's causing it > > So when you say
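
A sketch of what reading that log back could look like for a downstream job that only wants committed files; the exact layout of the _spark_metadata entries (the version header line and the "path" field) is an internal detail of the sink, so treat those assumptions as illustrative rather than a stable format:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.get_json_object

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    // Each batch file in _spark_metadata holds a version header followed by one
    // JSON entry per committed output file.
    val committedPaths = spark.read
      .text("/your/path/_spark_metadata")
      .filter($"value".startsWith("{"))                        // skip version header lines
      .select(get_json_object($"value", "$.path").as("path"))  // assumed field name
      .as[String]
      .collect()

    // Read only the files listed in the log; ignore anything else in the directory.
    val committedData = spark.read.parquet(committedPaths: _*)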

Re: PSA: Java 8 unidoc build

2017-02-07 Thread Shixiong(Ryan) Zhu
@Sean, I'm using Java 8 but don't see these errors until I manually build the API docs. Hence I think dropping Java 7 support may not help. Right now we don't build docs in most builds since building docs takes a long time (e.g., https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/2889/

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Ah, I see. OK, so it's probably the retry that's causing it. So when you say I'll have to take this into account, how do I best do that? My sink will have to know what that extra file was. And I was under the impression Spark would automagically know this because of the checkpoint directory set when

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
Sorry, I think I was a little unclear. There are two things at play here. - Exactly-once semantics with file output: Spark writes out extra metadata on which files are valid to ensure that failures don't cause us to "double count" any of the input. Spark 2.0+ detects this info automatically

Re: PSA: Java 8 unidoc build

2017-02-07 Thread Felix Cheung
+1 for all the great work going into this, HyukjinKwon, and +1 on what Sean says about "Jenkins builds with Java 8"; we should catch these nasty javadoc 8 issues quickly. I think that would be a great first step to moving away from Java 7. From: Reynold Xin

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Hmm, OK, I understand that, but the job is running for a good few minutes before I kill it, so there should not be any jobs left; I can see in the log that it's now polling for new changes and the latest offset is the right one. After I kill it and relaunch, it picks up that same file? Sorry if I

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
It is always possible that there will be extra jobs from failed batches. However, for the file sink, only one set of files will make it into the _spark_metadata directory log. This is how we get atomic commits even when there are files in more than one directory. When reading the files with Spark,
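
For the read-back case mentioned above, a short sketch with a hypothetical output path: when Spark 2.0+ reads a directory that contains a _spark_metadata log, it lists files from that log, so leftovers from failed batches are skipped without extra work:

    val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()

    // Only files recorded in /sink/output/path/_spark_metadata are read.
    val committedOnly = spark.read.parquet("/sink/output/path")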

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
On another note, when it comes to checkpointing on Structured Streaming, I noticed that if I have a stream running off S3 and I kill the process, the next time the process starts running it duplicates the last record inserted. Is that normal? So say I have streaming enabled on one folder "test"

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Thanks Michael! On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust wrote: > Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-19497 > > We should add this soon. > > On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin > wrote: > >> Hi All >> >>

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-19497 We should add this soon. On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin wrote: > Hi All > > When trying to read a stream off S3 and I try and drop duplicates I get > the following error: > > Exception in thread

Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Hi All, when trying to read a stream off S3, if I try to drop duplicates I get the following error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets;; What's strange is, if I
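
For context, a minimal sketch of the pattern that triggers this error on Spark 2.1 (before SPARK-19497), where dropDuplicates on a stream is planned as a streaming aggregation; the schema and S3 paths are made up for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.getOrCreate()
    val schema = new StructType().add("id", StringType).add("value", StringType)

    val df = spark.readStream
      .schema(schema)
      .json("s3a://my-bucket/incoming/")   // hypothetical source location
      .dropDuplicates("id")                // planned as a streaming aggregation in 2.1

    // Append output mode plus a streaming aggregation raises the AnalysisException above.
    val query = df.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/out/")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
      .outputMode("append")
      .start()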

Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2017-02-07 Thread Reynold Xin
Bumping this. Given we see occasional build breaks with Java 8, we should reconsider this and do it for 2.2 or 2.3. By the time 2.2 is released, it will almost be a year since this thread started. On Sun, Jul 24, 2016 at 12:59 AM, Mark Hamstra wrote: > Sure,

Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2017-02-07 Thread Reynold Xin
BTW I created a JIRA ticket for tracking: https://issues.apache.org/jira/browse/SPARK-19493 We of course shouldn't do anything until we achieve consensus. On Tue, Feb 7, 2017 at 3:47 PM, Reynold Xin wrote: > Bumping this. > > Given we see occasional build breaks with

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-07 Thread StanZhai
From the thread dump page of the executor in the Web UI, I found that there are about 1300 threads named "DataStreamer for file /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet" in TIMED_WAITING state like

Re: PSA: Java 8 unidoc build

2017-02-07 Thread Reynold Xin
I don't know if this would help but I think we can also officially stop supporting Java 7 ... On Tue, Feb 7, 2017 at 1:06 PM, Sean Owen wrote: > I believe that if we ran the Jenkins builds with Java 8 we would catch > these? this doesn't require dropping Java 7 support or

Re: PSA: Java 8 unidoc build

2017-02-07 Thread Sean Owen
I believe that if we ran the Jenkins builds with Java 8 we would catch these? this doesn't require dropping Java 7 support or anything. @joshrosen I know we are just now talking about modifying the Jenkins jobs to remove old Hadoop configs. Is it possible to change the master jobs to use Java 8?

Re: Java 9

2017-02-07 Thread Timur Shenkao
If I'm not wrong, they got rid of *sun.misc.Unsafe* in Java 9. This class is still used by several libraries & frameworks. http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/ On Tue, Feb 7, 2017 at 12:51 PM, Pete Robbins wrote: > Yes, I agree but it
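
For reference, the reflective access to sun.misc.Unsafe that many libraries rely on looks roughly like the sketch below (written in Scala to match the rest of the thread); it is exactly this kind of access to internal JDK classes that Java 9's module system restricts:

    import java.lang.reflect.Field

    // "theUnsafe" is the conventional singleton field on sun.misc.Unsafe.
    val field: Field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    val unsafe = field.get(null).asInstanceOf[sun.misc.Unsafe]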

Re: Java 9

2017-02-07 Thread Pete Robbins
Yes, I agree but it may be worthwhile starting to look at this. I was just trying a build and it trips over some of the now defunct/inaccessible sun.misc classes. I was just interested in hearing if anyone has already gone through this to save me duplicating effort. Cheers, On Tue, 7 Feb 2017

Re: Java 9

2017-02-07 Thread Sean Owen
I don't think anyone's tried it. I think we'd first have to agree to drop Java 7 support before that could be seriously considered. The 8-9 difference is a bit more of a breaking change. On Tue, Feb 7, 2017 at 11:44 AM Pete Robbins wrote: > Is anyone working on support for

Java 9

2017-02-07 Thread Pete Robbins
Is anyone working on support for running Spark on Java 9? Is this in a roadmap anywhere? Cheers,