Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-17 Thread Sam Elamin
Well done! This is amazing news :) Congrats and really can't wait to spread the structured streaming love! On Mon, Jul 17, 2017 at 5:25 PM, kant kodali wrote: > +1 > > On Tue, Jul 11, 2017 at 3:56 PM, Jean Georges Perrin wrote: > >> Awesome! Congrats! Can't

Re: Updating

2017-04-13 Thread Sam Elamin
perfect, thanks Sean! On Thu, Apr 13, 2017 at 4:21 PM, Sean Owen <so...@cloudera.com> wrote: > The site source is at https://github.com/apache/spark-website/ > > On Thu, Apr 13, 2017 at 4:20 PM Sam Elamin <hussam.ela...@gmail.com> > wrote: > >> Hey all >>

Updating

2017-04-13 Thread Sam Elamin
Hey all, who do I need to talk to in order to update the Useful developer tools page? I want to update the build instructions for IntelliJ, as they do not apply at the moment. Regards, Sam

Re: Support for decimal separator (comma or period) in spark 2.1

2017-02-23 Thread Sam Elamin
Hi Arkadiusz, I'm not sure there is a localisation option, but I'm sure others will correct me if I'm wrong. What you could do is write a UDF that replaces the commas with a period, assuming you know the column in question. Regards, Sam On Thu, 23 Feb 2017 at 12:31, Arkadiusz Bicz
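
The suggestion above can be sketched in plain Python, assuming the values use a comma as the decimal separator and contain no thousands separators. The function name `comma_decimal_to_float` is hypothetical and not from the thread; in PySpark it could presumably be wrapped with `pyspark.sql.functions.udf` and applied to the column in question.

```python
from typing import Optional


def comma_decimal_to_float(value: Optional[str]) -> Optional[float]:
    """Parse a decimal-comma string such as "3,14" into a float.

    Assumes no thousands separators. Returns None for missing or
    unparseable input instead of raising.
    """
    if value is None:
        return None
    try:
        # Swap the decimal comma for a period, then parse as usual.
        return float(value.strip().replace(",", "."))
    except ValueError:
        return None
```

Returning None rather than raising keeps a single malformed row from failing the whole job, at the cost of hiding bad input, so a real pipeline may want to count or log the rejected values.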

Re: Structured Streaming Spark Summit Demo - Databricks people

2017-02-16 Thread Sam Elamin
ng-guide.html#output-sinks> > / pushes latest answer to the js running in a browser using the > StreamingQueryListener > <https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html>. > This is packaged up nicely in display(),

Re: Spark Improvement Proposals

2017-02-16 Thread Sam Elamin
Hi Folks, I thought I'd chime in as someone new to the process, so feel free to disregard this if it doesn't make sense. I definitely agree that we need a new forum to identify or discuss changes, as JIRA isn't exactly the best place to do that; it's a bug tracker first and foremost. For example I was

Structured Streaming Spark Summit Demo - Databricks people

2017-02-15 Thread Sam Elamin
Hey folks, this one is mainly aimed at the Databricks folks. I have been trying to replicate the CloudTrail demo Michael did at Spark Summit. The code for it can be found here

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Sam Elamin
Congrats Takuya-san! Clearly well deserved! Well done :) On Mon, Feb 13, 2017 at 9:02 PM, Maciej Szymkiewicz wrote: > Congratulations! > > > On 02/13/2017 08:16 PM, Reynold Xin wrote: > > Hi all, > > > > Takuya-san has recently been elected an Apache Spark committer.

Re: [Newbie] spark conf

2017-02-10 Thread Sam Elamin
HADOOP_CONF_DIR.) > > Also this is more of a user@ question. > > On Fri, Feb 10, 2017 at 1:35 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > > Hi All, > > > > > > really newbie question here folks, I have properties like my AWS access > and > >

Re: [Newbie] spark conf

2017-02-10 Thread Sam Elamin
clusters etc) ? Regards Sam On Fri, Feb 10, 2017 at 9:36 PM, Reynold Xin <r...@databricks.com> wrote: > You can put them in spark's own conf/spark-defaults.conf file > > On Fri, Feb 10, 2017 at 10:35 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > >> Hi All, >
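
As a rough sketch of the suggestion quoted above, the S3 credentials could move from core-site.xml into conf/spark-defaults.conf by prefixing the Hadoop property names with `spark.hadoop.`. The helper below only renders the lines to paste into that file. `render_spark_defaults` is a hypothetical name, the `fs.s3a.*` keys assume the S3A connector is the one in use (the thread does not say), and the values are placeholders.

```python
def render_spark_defaults(hadoop_props: dict) -> str:
    """Render Hadoop properties as spark-defaults.conf lines.

    Spark copies any `spark.hadoop.*` property into the Hadoop
    configuration it builds, with the prefix stripped. Each line of
    spark-defaults.conf is a key and a value separated by whitespace.
    """
    lines = [
        f"spark.hadoop.{key} {value}"
        for key, value in sorted(hadoop_props.items())
    ]
    return "\n".join(lines)


print(render_spark_defaults({
    "fs.s3a.access.key": "YOUR_ACCESS_KEY",  # placeholder, not a real key
    "fs.s3a.secret.key": "YOUR_SECRET_KEY",  # placeholder, not a real key
}))
```

Keeping secrets in a plain-text defaults file is convenient on a single machine but may not suit shared clusters, where environment variables or instance roles are common alternatives.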

[Newbie] spark conf

2017-02-10 Thread Sam Elamin
Hi All, really newbie question here folks. I have properties like my AWS access and secret keys in the core-site.xml in Hadoop, among other properties, but that's the only reason I have Hadoop installed, which seems like a bit of overkill. Is there an equivalent of core-site.xml for Spark so I don't

Structured Streaming. S3 To Google BigQuery

2017-02-08 Thread Sam Elamin
Hi All, thank you all for the amazing support! I have written a BigQuery connector for structured streaming that you can find here. I just tweeted about it and would really appreciate it if you

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
again Michael :) On Tue, Feb 7, 2017 at 8:44 PM, Sam Elamin <hussam.ela...@gmail.com> wrote: > Sorry those are methods I wrote so you can ignore them :) > > so just adding a path parameter tells Spark that's where the update log is? > > Do I check for the unique id there and ident

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Batch so that you can > make sure you don't commit the same transaction more than once. > > On Tue, Feb 7, 2017 at 3:29 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > >> Hi Michael >> >> If that's the case for the below example, where should I be reading

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
_metadata` and > only read files that are present in that log (ignore anything else). > > On Tue, Feb 7, 2017 at 1:16 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > >> Ah I see ok so probably it's the retry that's causing it >> >> So when you say I'll

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
above, but we could also optimize this > away by tracking more information about batch progress. > > On Tue, Feb 7, 2017 at 12:25 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > > Hmm ok I understand that but the job is running for a good few mins before > I kill it so ther

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
s is how we get atomic commits even when > there are files in more than one directory. When reading the files with > Spark, we'll detect this directory and use it instead of listStatus to find > the list of valid files. > > On Tue, Feb 7, 2017 at 9:05 AM, Sam Elamin <hussam.ela...@gm

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Regards Sam On Tue, Feb 7, 2017 at 4:58 PM, Sam Elamin <hussam.ela...@gmail.com> wrote: > Thanks Michael! > > > > On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> Here's a JIRA: https://issues.apache.org/jira/browse/SPARK

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Thanks Michael! On Tue, Feb 7, 2017 at 4:49 PM, Michael Armbrust <mich...@databricks.com> wrote: > Here's a JIRA: https://issues.apache.org/jira/browse/SPARK-19497 > > We should add this soon. > > On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin <hussam.ela...@gmail.c

Structured Streaming. Dropping Duplicates

2017-02-07 Thread Sam Elamin
Hi All, when I read a stream off S3 and try to drop duplicates, I get the following error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets;; What's strange is if I

Re: specifing schema on dataframe

2017-02-04 Thread Sam Elamin
> Remove the " from the number and it will work > > On Feb 4, 2017 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com> > wrote: > >> Hi All >> >> I would like to specify a schema when reading from a json but when trying >> to map a numb

specifing schema on dataframe

2017-02-04 Thread Sam Elamin
Hi All, I would like to specify a schema when reading from JSON, but when trying to map a number to a Double it fails. I tried FloatType and IntType with no joy! When inferring the schema, customer id is set to String, and I would like to cast it as Double, so df1 is corrupted while df2 shows
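
One way to see why the mapping fails, assuming the customer id is quoted in the source JSON (as the reply in this thread suggests): a quoted number is a JSON string, so it has to be read as a string and cast afterwards. The sketch below uses only Python's `json` module; the field name `customerid` and the sample record are made up for illustration and are not taken from the original data.

```python
import json

# Hypothetical sample record: the id is wrapped in quotes.
record = '{"customerid": "535137", "foo": "bar"}'

parsed = json.loads(record)

# The quotes make the value a JSON string, not a JSON number.
assert isinstance(parsed["customerid"], str)

# Cast explicitly after reading instead of declaring a numeric type up front.
customer_id = float(parsed["customerid"])
print(customer_id)
```

The same idea carries over to a DataFrame: declare the field as a string in the schema and cast the column to a double afterwards, or remove the quotes at the source.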

Re: Structured Streaming Schema Issue

2017-02-03 Thread Sam Elamin
is called. > > > > On Thu, Feb 2, 2017 at 7:30 AM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > > Hi All > > I've done a bit more digging into where exactly this happens. It seems like > the schema is inferred again after the data leaves the source and

Re: Structured Streaming Schema Issue

2017-02-02 Thread Sam Elamin
gType,true)) On Thu, Feb 2, 2017 at 12:04 AM, Sam Elamin <hussam.ela...@gmail.com> wrote: > There isn't a query per se. I'm writing the entire DataFrame from the > output of the read stream. Once I got that working I was planning to test > the query aspect > > > I'll do a bit

Re: Structured Streaming Schema Issue

2017-02-01 Thread Sam Elamin
tion class in Spark, and debug stuff there. > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L523 > > On Wed, Feb 1, 2017 at 3:49 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > > Yeah

Re: Structured Streaming Schema Issue

2017-02-01 Thread Sam Elamin
getBatch, has the expected schema? > > On Wed, Feb 1, 2017 at 3:33 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > >> Thanks for the quick response TD! >> >> I've been trying to identify where exactly this transformation happens >> >> The readStrea

Re: Structured Streaming Schema Issue

2017-02-01 Thread Sam Elamin
streaming Dataset returned by > `readStream`, and the schema of the DataFrame returned by the source's > getBatch. > > On Wed, Feb 1, 2017 at 3:25 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > >> Hi All >> >> I am writing a BigQuery connector here >>

Structured Streaming Schema Issue

2017-02-01 Thread Sam Elamin
Hi All, I am writing a BigQuery connector here and I am getting a strange error with schemas being overwritten when a DataFrame is passed over to the Sink. For example, the source returns this StructType: WARN streaming.BigQuerySource:

Re: Structured Streaming Source error

2017-01-31 Thread Sam Elamin
t another newer version > to run. As the Source APIs are not stable, Spark doesn't guarantee that > they are binary compatible. > > On Tue, Jan 31, 2017 at 1:39 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > >> Hi Folks >> >> >> I am getting

Structured Streaming Source error

2017-01-31 Thread Sam Elamin
Hi Folks, I am getting a weird error when trying to write a BigQuery Structured Streaming source. Error: java.lang.AbstractMethodError: com.samelamin.spark.bigquery.streaming.BigQuerySource.commit(Lorg/apache/spark/sql/execution/streaming/Offset;)V at