Re: REST Structured Streaming Sink

2020-07-03 Thread Sam Elamin
would be all different for almost every use case - that makes it hard to generalize, and requires the implementation to be quite complicated to be flexible enough. I'm not aware of any custom sink implementing REST so your b

REST Structured Streaming Sink

2020-07-01 Thread Sam Elamin
Hi All, We ingest a lot of RESTful APIs into our lake and I'm wondering if it is at all possible to create a REST sink in structured streaming? For now I'm only focusing on RESTful services that have an incremental ID so my sink can just poll for new data then ingest. I can't seem to find a
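A very rough sketch of that polling idea, assuming a hypothetical endpoint that accepts a since_id parameter (the URL, field names, and lake path below are all made up for illustration):

    import org.apache.spark.sql.SparkSession
    import scala.io.Source

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    // Poll the (hypothetical) endpoint for records past the last ingested ID.
    val lastSeenId = 42L
    val payload = Source.fromURL(s"https://api.example.com/records?since_id=$lastSeenId").mkString

    // Let Spark infer a DataFrame from the raw JSON payload and append it to the lake.
    val df = spark.read.json(Seq(payload).toDS())
    df.write.mode("append").parquet("s3a://my-lake/records/")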

Re: Spark Scala reading from Google Cloud BigQuery table throws error

2018-12-19 Thread Sam Elamin
Hi Mich, I wrote a connector to make it easier to connect BigQuery and Spark. Have a look here: https://github.com/samelamin/spark-bigquery/ Your feedback is always welcome. Kind Regards Sam On Tue, Dec 18, 2018 at 7:46 PM Mich Talebzadeh wrote: > Thanks Jorn. I will try that. Requires

Re: from_json()

2017-08-28 Thread Sam Elamin
Hi jg, Perhaps I am misunderstanding you, but if you just want to create a new schema from a df it's fairly simple, assuming you have a schema already predefined or in a string, i.e. val newSchema = DataType.fromJson(json_schema_string) then all you need to do is re-create the dataframe using
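A minimal sketch of that approach, assuming a SparkSession named spark and an existing df whose rows match the target schema (the schema string here is a shortened example of what df.schema.json produces):

    import org.apache.spark.sql.types.{DataType, StructType}

    val jsonSchemaString =
      """{"type":"struct","fields":[{"name":"customerid","type":"double","nullable":true,"metadata":{}}]}"""

    // Parse the JSON back into a StructType...
    val newSchema = DataType.fromJson(jsonSchemaString).asInstanceOf[StructType]

    // ...and re-create the dataframe against it from the original rows.
    val df2 = spark.createDataFrame(df.rdd, newSchema)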

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-17 Thread Sam Elamin
Well done! This is amazing news :) Congrats and really can't wait to spread the structured streaming love! On Mon, Jul 17, 2017 at 5:25 PM, kant kodali wrote: > +1 > On Tue, Jul 11, 2017 at 3:56 PM, Jean Georges Perrin wrote: > Awesome! Congrats! Can't

Re: Restful API Spark Application

2017-05-12 Thread Sam Elamin
Hi Nipun, Have you checked out the job server: https://github.com/spark-jobserver/spark-jobserver Regards Sam On Fri, 12 May 2017 at 21:00, Nipun Arora wrote: > Hi, > We have written a java spark application (primarily uses spark sql). We > want to expand this to

Re: Spark Testing Library Discussion

2017-04-29 Thread Sam Elamin
getting into our testing 'how-to' stuff this week. I'll scrape our org-specific stuff and put it up to github this week as well. It'll be in Python so maybe we'll get both use cases covered with examples :) G On 27 April 2017 at 03:46, Sam Elamin <hussam.ela.

Re: Spark Testing Library Discussion

2017-04-27 Thread Sam Elamin
Hi @Lucas, I certainly would love to write an integration testing library for workflows. I have a few ideas I would love to share with others, and they are focused around Airflow since that is what we use. As promised here is

Re: help/suggestions to setup spark cluster

2017-04-26 Thread Sam Elamin
Hi Anna, There are a variety of options for launching spark clusters. I doubt people run spark in a single EC2 instance, certainly not in production I don't think. I don't have enough information about what you are trying to do, but if you are just trying to set things up from scratch then I think

Re: How to convert Dstream of JsonObject to Dataframe in spark 2.1.0?

2017-04-24 Thread Sam Elamin
you have 2 options: 1) Clean -> Write your own parser to go through each property and create a dataset 2) Hacky but simple -> Convert to a json string then read in using spark.read.json(jsonString) Please bear in mind the second option is expensive, which is why it is hacky. I wrote my own parser here
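A sketch of the second option for a DStream, assuming each JsonObject renders back to valid JSON text via toString (spark.read.json over an RDD[String] was the 2.1-era API for this):

    dstream.foreachRDD { rdd =>
      val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()
      // Render each JsonObject back to a raw JSON string...
      val jsonStrings = rdd.map(_.toString)
      // ...and let Spark infer the schema, which runs on every micro-batch -
      // that inference cost is why this option is expensive.
      val df = spark.read.json(jsonStrings)
      df.show()
    }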

Deploying Spark Applications. Best Practices And Patterns

2017-04-12 Thread Sam Elamin
n a final manual "deploy" button can address that. Cloud infras let you integrate cluster instantiation into the process, which helps you automate things like "stage the deployment in some new VMs, run acceptance tests (*), th

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-12 Thread Sam Elamin
Hi, To be honest there are a variety of options, but it all comes down to who will be querying these dashboards. If the end user is an engineer then the ELK stack is fine, and I can attest to the ease of use of Kibana since I used it quite heavily. On the other hand, in my experience it isn't the

Re: optimising storage and ec2 instances

2017-04-11 Thread Sam Elamin
Hi Zeming Yu, Steve, Just to add, we are also going down this partitioning route, but you should know if you are in AWS land you are most likely going to use EMRs at any given time. At the moment EMR does not do recursive search on wildcards, see this

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Sam Elamin
and target data to look like. If people are interested I am happy to write a blog about it in the hopes this helps people build more reliable pipelines. Kind Regards Sam On Tue, Apr 11, 2017 at 11:31 AM, Steve Loughran <ste...@hortonworks.com> wrote: > On 7 Apr 2017, at 18:40, Sam El

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Sam Elamin
r some CI workflow, that can do scheduled builds and tests. Works well if you can do some build test before even submitting it to a remote cluster. On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote: > Hi Shyla

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Sam Elamin
Hi Shyla, You have multiple options really, some of which have already been listed, but let me try and clarify. Assuming you have a spark application in a jar, you have a variety of options. You have to have an existing spark cluster that is either running on EMR or somewhere else. *Super simple /

Re: Executor unable to pick postgres driver in Spark standalone cluster

2017-04-04 Thread Sam Elamin
Hi Rishikesh, Sounds like the postgres driver isn't being loaded on the path. To try and debug it, try submitting the application with the --jars flag (which must come before the application jar), e.g. spark-submit --jars /home/ubuntu/downloads/postgres/postgresql-9.4-1200-jdbc41.jar {application.jar} If that does not work then there is a problem

Contributing to Spark

2017-03-19 Thread Sam Elamin
Hi All, I would like to start contributing to Spark if possible; it's an amazing technology and I would love to get involved. The contributing page states: "consult the list of starter tasks in JIRA, or ask the user@spark.apache.org mailing list."

Re: Spark and continuous integration

2017-03-14 Thread Sam Elamin
On 14 Mar 2017, at 12:44, Steve Loughran <ste...@hortonworks.com> wrote: > On 13 Mar 2017, at 13:24, Sam Elamin <hussam.ela...@gmail.com> wrote: > Hi Jorn > Thanks for the prompt reply, really we have 2 main concerns with CD: ensuring tests pass and li

Re: Spark and continuous integration

2017-03-13 Thread Sam Elamin
ocker deployment etc. I am not sure if new starters should be responsible for the build pipeline, thus I am not sure that I understand your concern in this area. From my experience, integration tests for Spark can be run on any of these platforms. Best regards

Spark and continuous integration

2017-03-13 Thread Sam Elamin
Hi Folks, This is more of a general question: what's everyone using for their CI/CD when it comes to spark? We are using PySpark but potentially looking to move to Spark Scala and sbt in the future. One of the suggestions was Jenkins, but I know the UI isn't great for new starters so I'd rather

Re: How to unit test spark streaming?

2017-03-07 Thread Sam Elamin
Hey kant, You can use Holden's spark-testing-base. Have a look at some of the specs I wrote here to give you an idea: https://github.com/samelamin/spark-bigquery/blob/master/src/test/scala/com/samelamin/spark/bigquery/BigQuerySchemaSpecs.scala Basically you abstract your transformations to take in a
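A minimal sketch of that pattern with spark-testing-base (the transformation, column names, and expected values are invented for illustration):

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.upper
    import org.scalatest.FunSuite

    // The transformation under test takes a dataframe in and returns one out.
    object Transforms {
      def withUpperName(df: DataFrame): DataFrame =
        df.withColumn("name", upper(df("name")))
    }

    class TransformSpec extends FunSuite with DataFrameSuiteBase {
      test("names are upper-cased") {
        import spark.implicits._
        val input = Seq((1, "sam")).toDF("id", "name")
        val expected = Seq((1, "SAM")).toDF("id", "name")
        assertDataFrameEquals(expected, Transforms.withUpperName(input))
      }
    }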

Re: using spark to load a data warehouse in real time

2017-03-01 Thread Sam Elamin
Hi Adaryl, Having come from a web background myself, I completely understand your confusion, so let me try to clarify a few things. First and foremost, Spark is a data processing engine, not a general framework. In the web applications and frameworks world you load the entities, map them to the UI

Re: Structured Streaming: How to handle bad input

2017-02-23 Thread Sam Elamin
Hi Jayesh, So you have 2 problems here: 1) Data was loaded in the wrong format 2) Once you handled the wrong data, the spark job will continually retry the failed batch. For 2 it's very easy to go into the checkpoint directory and delete that offset manually and make it seem like it never happened.

Re: quick question: best to use cluster mode or client mode for production?

2017-02-23 Thread Sam Elamin
I personally use spark-submit as it's agnostic to which platform your spark clusters are working on, e.g. EMR, Dataproc, Databricks, etc. On Thu, 23 Feb 2017 at 08:53, nancy henry wrote: > Hi Team, > I have a set of hc.sql("hivequery") kind of scripts which I am running

Re: [Spark Streaming] Starting Spark Streaming application from a specific position in Kinesis stream

2017-02-19 Thread Sam Elamin
, 2017 at 9:23 PM, Sam Elamin <hussam.ela...@gmail.com> wrote: > Hey Neil > No worries! Happy to help you write it if you want, just link me to the repo and we can write it together. > Would be fun! > Regards > Sam On Sun, 19 Feb 2017 at 21:21,

Re: [Spark Streaming] Starting Spark Streaming application from a specific position in Kinesis stream

2017-02-19 Thread Sam Elamin
lementing a structured streaming connector. On Feb 19, 2017, at 11:54 AM, Sam Elamin <hussam.ela...@gmail.com> wrote: > Hi Niel, > My advice would be to write a structured streaming connector. The new structured streaming APIs were brought in to handle exactly the

Re: [Spark Streaming] Starting Spark Streaming application from a specific position in Kinesis stream

2017-02-19 Thread Sam Elamin
Hi Niel, My advice would be to write a structured streaming connector. The new structured streaming APIs were brought in to handle exactly the issues you describe. See this blog. There isn't a structured streaming

Re: Debugging Spark application

2017-02-16 Thread Sam Elamin
I recommend running spark in local mode when you're first debugging your code, just to understand what's happening and step through it, perhaps catch a few errors when you first start off. I personally use IntelliJ because it's my preference. You can follow this guide.

Re: Enrichment with static tables

2017-02-15 Thread Sam Elamin
You can do a join or a union to combine all the dataframes into one fat dataframe, or do a select on the columns you want to produce your transformed dataframe. Not sure if I understand the question though; if the goal is just an end-state transformed dataframe, that can easily be done. Regards Sam

Re: Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Sam Elamin
ood if I read any of the JSON and if I do spark sql and it gave me:

for json1.json

    a | b
    1 | null

for json2.json

    a    | b
    null | 2

On Tue, Feb 14, 2017 at 8:13 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

Re: Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Sam Elamin
I may be missing something super obvious here, but can't you combine them into a single dataframe? Left join perhaps? Try writing it in SQL, "select a from json1 and b from json2", then run explain to give you a hint on how to do it in code. Regards Sam On Tue, 14 Feb 2017 at 14:30, Aseem Bansal
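A sketch of that suggestion, with an assumed shared id key purely for illustration (the thread's files only guarantee columns a and b):

    val df1 = spark.read.json("json1.json") // carries column a
    val df2 = spark.read.json("json2.json") // carries column b

    df1.createOrReplaceTempView("json1")
    df2.createOrReplaceTempView("json2")

    // A full outer join keeps rows that exist on only one side (nulls fill the gap).
    val combined = spark.sql(
      "SELECT json1.a, json2.b FROM json1 FULL OUTER JOIN json2 ON json1.id = json2.id")
    combined.explain() // inspect the plan to mirror it in code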

Re: how to fix the order of data

2017-02-14 Thread Sam Elamin
It's because you are just printing on the rdd. You can sort the df like below: input.toDF().sort().collect() or, if you do not want to convert to a dataframe, you can use sortByKey([ascending], [numTasks]). Regards Sam On Tue, Feb 14, 2017 at 11:41 AM, 萝卜丝炒饭

Re: Etl with spark

2017-02-12 Thread Sam Elamin
On Feb 12, 2017, at 9:41 AM, Sam Elamin <hussam.ela...@gmail.com> wrote: > thanks Ayan but I was hoping to remove the dependency on a file and just > use an in-memory list or dictionary > So from the reading I've done today it seems the concept of a bespoke

Re: Etl with spark

2017-02-12 Thread Sam Elamin
y your function as a map. r = sc.textFile(list_file).map(your_function) HTH On Sun, Feb 12, 2017 at 10:04 PM, Sam Elamin <hussam.ela...@gmail.com> wrote: Hey folks, Really simple question here. I currently have an ETL pipeline that reads from S3 and saves the data to an end store. I have to read

Etl with spark

2017-02-12 Thread Sam Elamin
Hey folks, Really simple question here. I currently have an ETL pipeline that reads from S3 and saves the data to an end store. I have to read from a list of keys in S3 but I am doing a raw extract then saving. Only some of the extracts have a simple transformation but overall the code looks the

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-11 Thread Sam Elamin
Here's a link to the thread: http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-Dropping-Duplicates-td20884.html On Sat, 11 Feb 2017 at 08:47, Sam Elamin <hussam.ela...@gmail.com> wrote: > Hey Egor > You can use a ForeachWriter or you can writ

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-11 Thread Sam Elamin
Hey Egor, You can use a ForeachWriter or you can write a custom sink. I personally went with a custom sink since I get a dataframe per batch: https://github.com/samelamin/spark-bigquery/blob/master/src/main/scala/com/samelamin/spark/bigquery/streaming/BigQuerySink.scala You can have a look at
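For context, a stripped-down sketch of the shape such a custom sink takes in Spark 2.x (this is not the linked connector's actual code; the write target is illustrative):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Sink

    class MyBatchSink(path: String) extends Sink {
      @volatile private var latestBatchId = -1L

      // Each micro-batch arrives as a whole DataFrame, which is the appeal
      // over the row-at-a-time ForeachWriter.
      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        if (batchId > latestBatchId) { // skip batches replayed after recovery
          // Some 2.x versions require re-creating the batch dataframe before
          // writing it out with the usual writers.
          val batch = data.sparkSession.createDataFrame(data.rdd, data.schema)
          batch.write.mode("append").parquet(path)
          latestBatchId = batchId
        }
      }
    }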

Structured Streaming. S3 To Google BigQuery

2017-02-08 Thread Sam Elamin
Hi All, Thank you all for the amazing support! I have written a BigQuery connector for structured streaming that you can find here. I just tweeted about it and would really appreciate it if you

Re: specifying schema on dataframe

2017-02-06 Thread Sam Elamin
and try to match the type. If you find a mismatch, you'd add a withColumn clause to cast to the correct data type (from your "should-be" struct). HTH? Best, Ayan On Mon, Feb 6, 2017 at 8:00 PM, Sam Elamin <hussam.ela...@gmail.com>

Re: specifying schema on dataframe

2017-02-06 Thread Sam Elamin
t, how would you apply the schema? On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin <hussam.ela...@gmail.com> wrote: > Thanks Ayan but I meant how to derive the list automatically. > In your example you are specifying the numeric columns and I would like it

Re: specifying schema on dataframe

2017-02-06 Thread Sam Elamin
>>> for k in numeric_field_list:
...     df = df.withColumn(k, df[k].cast("long"))
...
>>> df.printSchema()
root
 |-- customerid: long (nullable = true)
 |-- foo: string (nullable = true)

On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin

Re: specifying schema on dataframe

2017-02-05 Thread Sam Elamin
re that you got all the corner cases right (i.e. escaping and what not). On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hussam.ela...@gmail.com> wrote: > I see, so for the connector I need to pass in an array/list of numerical columns? > Wouldn't i

Re: specifying schema on dataframe

2017-02-05 Thread Sam Elamin
ions.col

    var df = spark.read.schema(parseSchema).json("...")
    numericColumns.foreach { columnName =>
      df = df.withColumn(columnName, col(columnName).cast("long"))
    }

On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hussam.ela...@gmail.com>

Re: specifying schema on dataframe

2017-02-05 Thread Sam Elamin
o change the type after the data has been loaded <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>. On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...

Re: specifying schema on dataframe

2017-02-04 Thread Sam Elamin
> Remove the " from the number and it will work. On Feb 4, 2017, 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com> wrote: > Hi All > I would like to specify a schema when reading from a json but when trying > to map a numb

specifying schema on dataframe

2017-02-04 Thread Sam Elamin
Hi All, I would like to specify a schema when reading from a JSON, but when trying to map a number to a Double it fails; I tried FloatType and IntType with no joy! When inferring the schema, customer id is set to String, and I would like to cast it as Double, so df1 is corrupted while df2 shows
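The fix the thread converges on, sketched with the column name from the quoted printSchema output (the file name is assumed; read with the inferred schema first, then cast, since declaring DoubleType up front fails when the JSON quotes the number):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DoubleType

    // Schema inference lands customerid as a string...
    var df = spark.read.json("customers.json")

    // ...so cast the numeric-but-quoted column afterwards.
    df = df.withColumn("customerid", col("customerid").cast(DoubleType))
    df.printSchema() // customerid: double (nullable = true)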

Re: java.lang.NoSuchMethodError: scala.runtime.ObjectRef.zero()Lscala/runtime/ObjectRef

2017-02-04 Thread Sam Elamin
Hi Sathyanarayanan, zero() on scala.runtime.VolatileObjectRef was introduced in Scala 2.11. You probably have a library compiled against Scala 2.11 running on a Scala 2.10 runtime. See v2.10: https://github.com/scala/scala/blob/2.10.x/src/library/scala/runtime/VolatileObjectRef.java