Re: What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread sam smith
using save() on the dataset (after the transformations; before them it is OK to perform save() on the dataset). I hope the question is clearer (for anybody who's reading) now. On Sat, Mar 11, 2023 at 20:15, Mich Talebzadeh wrote: > collectAsList brings all the data into the driver which is a single JVM
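A sketch of the alternative Mich is pointing at: keep the result on the executors by writing it out instead of pulling everything into the single driver JVM with collectAsList(); the dataset name and output path below are hypothetical.

    // write the transformed dataset from the executors rather than collecting it to the driver
    transformed.write.mode("overwrite").parquet("hdfs:///tmp/transformed_output")
    // if a local list is genuinely needed, bound its size first
    val preview = transformed.limit(1000).collectAsList()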

Re: What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread sam smith
not sure what you mean by your question, but it is not helping in any case. On Sat, Mar 11, 2023 at 19:54, Mich Talebzadeh wrote: > > > ... To note that if I execute collectAsList on the dataset at the > beginning of the program > > What do you think collectAsList doe

What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread sam smith
Hello guys, I am launching through code (client mode) a Spark program to run on Hadoop. If I execute on the dataset methods such as show(), count() or collectAsList() (which show up in the Spark UI) after performing heavy transformations on the columns, then the mentioned methods

How to allocate vcores to driver (client mode)

2023-03-10 Thread sam smith
Hi, I am launching through code (client mode) a Spark program to run in Hadoop. Whenever I check the executors tab of Spark UI I always get 0 as the number of vcores for the driver. I tried to change that using *spark.driver.cores*, or also *spark.yarn.am.cores* in the SparkSession configuration

How to share a dataset file across nodes

2023-03-09 Thread sam smith
Hello, I use YARN client mode to submit my driver program to Hadoop. The dataset I load is from the local file system; when I invoke load("file://path") Spark complains about the CSV file not being found, which I totally understand, since the dataset is not on any of the workers or the
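A sketch of the two usual fixes for this situation: put the file on storage every node can read, or ship it with the job; the paths and file name below are hypothetical.

    // option 1: load from a filesystem visible to all workers (HDFS, S3, ...)
    val df = spark.read.option("header", "true").csv("hdfs:///data/dataset.csv")

    // option 2: ship the local file with the application (spark-submit --files /local/path/dataset.csv)
    // and resolve the distributed copy at runtime
    val localPath = org.apache.spark.SparkFiles.get("dataset.csv")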

Re: How to explode array columns of a dataframe having the same length

2023-02-16 Thread sam smith
,"C","E"), List("B","D","null"), List("null","null","null")) > and use flatmap with that method. > > In Scala, this would read: > > df.flatMap { row => (row.getSeq[String](0), row.getSeq[String](1), > ro

How to explode array columns of a dataframe having the same length

2023-02-14 Thread sam smith
Hello guys, I have the following dataframe:

col1               col2               col3
["A","B","null"]   ["C","D","null"]   ["E","null","null"]

I want to explode it to the following dataframe:

col1     col2     col3
"A"      "C"      "E"
"B"      "D"      "null"
"null"   "null"   "null"

How to do that (preferably in Java) using
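One way to get this reshaping, assuming Spark 2.4+ where arrays_zip is available (a sketch; the replies in this thread use flatMap instead):

    import org.apache.spark.sql.functions._

    val exploded = df
      .withColumn("zipped", arrays_zip(col("col1"), col("col2"), col("col3")))
      .withColumn("zipped", explode(col("zipped")))
      .select(col("zipped.col1").as("col1"),
              col("zipped.col2").as("col2"),
              col("zipped.col3").as("col3"))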

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-13 Thread sam smith
")).toDF("a", "b", "c") > scala> df.select(df.columns.map(column => > collect_set(col(column)).as(column)): _*).show() > +------------+--------+------+ > | a| b| c| > +------------+--------+------+ > |[1, 2, 3, 4]|[20, 10]|[on

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
lumnName, > collect_set(col(columnName)).as(columnName)); > } > > Then you have a single DataFrame that computes all columns in a single > Spark job. > > But this reads all distinct values into a single partition, which has the > same downside as collect, so this is as bad

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
> > On Sun, Feb 12, 2023 at 10:59 AM sam smith > wrote: > >> @Enrico Minack Thanks for "unpivot" but I am >> using version 3.3.0 (you are taking it way too far as usual :) ) >> @Sean Owen Pls then show me how it can be improved by >> code. >> >

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
) { df= df.withColumn(columnName, df.select(columnName).distinct().col(columnName)); } On Sat, Feb 11, 2023 at 13:11, Enrico Minack wrote: > You could do the entire thing in DataFrame world and write the result to > disk. All you need is unpivot (to be released in Spark 3.4.0, soon). > &

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
lar to > what you do here. Just need to do the cols one at a time. Your current code > doesn't do what you want. > > On Fri, Feb 10, 2023, 3:46 PM sam smith > wrote: > >> Hi Sean, >> >> "You need to select the distinct values of each col one at a time&

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
Hi Apostolos, Can you suggest a better approach while keeping values within a dataframe? On Fri, Feb 10, 2023 at 22:47, Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote: > Dear Sam, > > you are assuming that the data fits in the memory of your local machine. > You ar

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
t() the > result as you do here. > > On Fri, Feb 10, 2023, 3:34 PM sam smith > wrote: > >> I want to get the distinct values of each column in a List (is it good >> practice to use List here?), that contains as first element the column >> name, and the other ele

How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
I want to get the distinct values of each column in a List (is it good practice to use List here?) that contains the column name as its first element and that column's distinct values as the remaining elements, so that for a dataset we get a list of lists. I do it this way (in my opinion not so fast): List> finalList
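A sketch of the approach the replies converge on: compute every column's distinct set in one job with collect_set, then collect the single resulting row (the distinct values still end up on the driver, and they are assumed to be renderable as strings).

    import org.apache.spark.sql.functions._

    // one aggregation job: a single row whose cells hold each column's distinct values
    val distinctPerColumn = df.select(df.columns.map(c => collect_set(col(c)).as(c)): _*)
    val row = distinctPerColumn.collect()(0)

    // rebuild the asked-for shape: [columnName, distinct value 1, distinct value 2, ...]
    val finalList = df.columns.toList.map { c =>
      c :: row.getSeq[Any](row.fieldIndex(c)).map(_.toString).toList
    }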

Can we upload a csv dataset into Hive using SparkSQL?

2022-12-10 Thread sam smith
Hello, I want to create a table in Hive and then load a CSV file's content into it, all by means of Spark SQL. I saw in the docs the example with the .txt file, BUT can we do instead something like the following to accomplish what I want?: String warehouseLocation = new
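A sketch of that flow, assuming Hive support is available to the session; the warehouse location, file path, and table name below are hypothetical.

    val warehouseLocation = "/user/hive/warehouse"
    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("csv-to-hive")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()

    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/data.csv")
      .write.mode("overwrite")
      .saveAsTable("mydb.my_csv_table")   // creates the Hive table and loads the CSV content into it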

Re: Aggregate over a column: the proper way to do

2022-04-10 Thread sam smith
Exactly, one row and two columns. On Sat, Apr 9, 2022 at 17:44, Sean Owen wrote: > But it only has one row, right? > > On Sat, Apr 9, 2022, 10:06 AM sam smith > wrote: > >> Yes. Returns the number of rows in the Dataset as *long*. but in my case >> the aggrega

Re: Aggregate over a column: the proper way to do

2022-04-09 Thread sam smith
Yes, it returns the number of rows in the Dataset as a *long*, but in my case the aggregation returns a table of two columns. On Fri, Apr 8, 2022 at 14:12, Sean Owen wrote: > Dataset.count() returns one value directly? > > On Thu, Apr 7, 2022 at 11:25 PM sam smith > wrote: >

Re: Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
ing is pointless. > > On Thu, Apr 7, 2022, 11:10 PM sam smith > wrote: > >> What if I do avg instead of count? >> >> On Fri, Apr 8, 2022 at 05:32, Sean Owen wrote: >> >>> Wait, why groupBy at all? After the filter only rows with myCol equal to >>>

Re: Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
What if I do avg instead of count? On Fri, Apr 8, 2022 at 05:32, Sean Owen wrote: > Wait, why groupBy at all? After the filter only rows with myCol equal to > your target are left. There is only one group. Don't group, just count after > the filter? > > On Thu, Apr 7, 2022, 10:

Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
I want to aggregate a column by counting the number of rows having the value "myTargetValue" and return the result. I am doing it like the following, in Java: > long result = >
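A sketch of what the replies suggest: after filtering on the target value only one group is left, so a plain count (or any single aggregate) is enough; "myCol" comes from the thread, "someNumericCol" is hypothetical.

    import org.apache.spark.sql.functions._

    // no groupBy needed: only rows matching the target survive the filter
    val result: Long = df.filter(col("myCol") === "myTargetValue").count()

    // the same shape works for other aggregates, e.g. an average over another column
    val avgRow = df.filter(col("myCol") === "myTargetValue").agg(avg(col("someNumericCol"))).first()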

Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
n't answer until this is > cleared up. > > On Mon, Jan 24, 2022 at 10:57 AM sam smith > wrote: > >> I mean the DAG order is somehow altered when executing on Hadoop >> >> On Mon, Jan 24, 2022 at 17:17, Sean Owen wrote: >> >>> Code is not executed by

Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
in files but you can order data. Still not sure what > specifically you are worried about here, but I don't think the kind of > thing you're contemplating can happen, no > > On Mon, Jan 24, 2022 at 9:28 AM sam smith > wrote: > >> I am aware of that, but whenever the chunks of c

Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
uld > something, what, modify the byte code? No > > On Mon, Jan 24, 2022, 9:07 AM sam smith > wrote: > >> My point is could Hadoop go wrong about one Spark execution ? meaning >> that it gets confused (given the concurrent distributed tasks) and then >> adds wrong instr

Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
s here? program execution order is still program execution > order. You are not guaranteed anything about order of concurrent tasks. > Failed tasks can be reexecuted so should be idempotent. I think the answer > is 'no' but not sure what you are thinking of here. > > On Mon, Jan 24

Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
Hello guys, I hope my question does not sound weird, but could a Spark execution on a Hadoop cluster give a different output than the program actually specifies? I mean by that: could the execution order be messed up by Hadoop, or an instruction be executed twice? Thanks for your enlightenment

Re: About some Spark technical help

2021-12-24 Thread sam smith
Thanks for the feedback Andrew. On Sat, Dec 25, 2021 at 03:17, Andrew Davidson wrote: > Hi Sam > > It is kind of hard to review straight code. Adding some sample data, > a unit test and expected results would be a good place to start. I.e. > determine the fidelity of your

Re: About some Spark technical help

2021-12-24 Thread sam smith
why JAVA? > > Regards, > Gourav Sengupta > > On Thu, Dec 23, 2021 at 5:10 PM sam smith > wrote: > >> Hi Andrew, >> >> Thanks, here's the Github repo to the code and the publication : >> https://github.com/SamSmithDevs10/paperReplicationForReview >> >>

Re: About some Spark technical help

2021-12-23 Thread sam smith
Hi Andrew, Thanks, here's the GitHub repo to the code and the publication: https://github.com/SamSmithDevs10/paperReplicationForReview Kind regards. On Thu, Dec 23, 2021 at 17:58, Andrew Davidson wrote: > Hi Sam > > > > Can you tell us more? What is the algorithm? Can you

dataset partitioning algorithm implementation help

2021-12-23 Thread sam smith
Hello All, I am replicating a paper's algorithm about a partitioning approach to anonymize datasets with Spark / Java, and want to ask you for some help to review my 150 lines of code. My github repo, attached below, contains both my java class and the related paper:

About some Spark technical help

2021-12-22 Thread sam smith
Hello guys, I am replicating a paper's algorithm in Spark / Java, and want to ask you guys for some assistance to validate / review about 150 lines of code. My github repo contains both my java class and the related paper. Any interested reviewers here? Thanks.

Re: About some Spark technical assistance

2021-12-13 Thread sam smith
you were added to the repo to contribute, thanks. I included the java class and the paper I am replicating. On Mon, Dec 13, 2021 at 04:27, wrote: > github url please. > > On 2021-12-13 01:06, sam smith wrote: > > Hello guys, > > > > I am replicating a paper'

About some Spark technical assistance

2021-12-12 Thread sam smith
Hello guys, I am replicating a paper's algorithm (graph coloring algorithm) in Spark under Java, and thought about asking you guys for some assistance to validate / review my 600 lines of code. Any volunteers to share the code with? Thanks

[no subject]

2021-11-18 Thread Sam Elamin
unsubscribe

Re: Parquet Metadata

2021-06-23 Thread Sam
Hi, I only know about comments which you can add to each column where you can add these key values. Thanks. On Wed, Jun 23, 2021 at 11:31 AM Bode, Meikel, NMA-CFD < meikel.b...@bertelsmann.de> wrote: > Hi folks, > > > > Maybe not the right audience but maybe you came along such a requirement.

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Sam
Like I said in my previous email, can you try this and let me know how many tasks you see? val repRdd = scoredRdd.repartition(50).cache() repRdd.take(1) Then run your map operation on repRdd. I've done similar map operations in the past and this works. Thanks. On Wed, Jun 9, 2021 at 11:17 AM Tom
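Spelling out the snippet from this reply as a compact sketch (scoredRdd and processRecord are assumed to exist in the original program):

    val repRdd = scoredRdd.repartition(50).cache()
    repRdd.take(1)                                 // small action to trigger execution, as suggested above
    val mapped = repRdd.flatMap(record => processRecord(record))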

Re: REST Structured Steaming Sink

2020-07-03 Thread Sam Elamin
a streaming use-case Thoughts? Regards Sam On Thu, Jul 2, 2020 at 3:31 AM Burak Yavuz wrote: > Well, the difference is, a technical user writes the UDF and a > non-technical user may use this built-in thing (misconfigure it) and shoot > themselves in the foot. > > On Wed, Jul 1, 2020,

REST Structured Steaming Sink

2020-07-01 Thread Sam Elamin
Hi All, We ingest a lot of RESTful APIs into our lake and I'm wondering if it is at all possible to create a REST sink in structured streaming? For now I'm only focusing on RESTful services that have an incremental ID so my sink can just poll for new data then ingest. I can't seem to find a
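A minimal sketch of the foreach-writer route that comes up later in the thread; there is no built-in REST sink, and the HTTP call below is a hypothetical stub.

    import org.apache.spark.sql.{ForeachWriter, Row}

    class RestSink(endpoint: String) extends ForeachWriter[Row] {
      // hypothetical HTTP helper; swap in a real client
      private def postToApi(body: String): Unit = ()
      override def open(partitionId: Long, version: Long): Boolean = true
      override def process(row: Row): Unit = postToApi(row.mkString(","))
      override def close(errorOrNull: Throwable): Unit = ()
    }

    streamingDf.writeStream.foreach(new RestSink("https://example.com/ingest")).start()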

Avro file question

2019-11-04 Thread Sam
Hi, How do we choose between a single large Avro file (size much larger than the HDFS block size) vs multiple smaller Avro files (close to the HDFS block size)? Since Avro is splittable, is there even a need to split a very large Avro file into smaller files? I'm assuming that a single large Avro file can

Re: Spark Scala reading from Google Cloud BigQuery table throws error

2018-12-19 Thread Sam Elamin
Hi Mich I wrote a connector to make it easier to connect Bigquery and Spark Have a look here https://github.com/samelamin/spark-bigquery/ Your feedback is always welcome Kind Regards Sam On Tue, Dec 18, 2018 at 7:46 PM Mich Talebzadeh wrote: > Thanks Jorn. I will try that. Requi

Why is the max iteration for svd not configurable in mllib?

2018-08-10 Thread Sam Lendle
a PR exposing that parameter? I have not contributed to spark before, so I don’t know if a small api change like that would require a discussion beforehand. Thanks! Sam

Re: from_json()

2017-08-28 Thread Sam Elamin
this new dataframe sqlContext.createDataFrame(oldDF.rdd,newSchema) Regards Sam On Mon, Aug 28, 2017 at 5:57 PM, JG Perrin <jper...@lumeris.com> wrote: > Is there a way to not have to specify a schema when using from_json() or > infer the schema? When you read a JSON doc from disk, y
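One way to avoid hand-writing the schema, as a sketch: let Spark infer it from a sample of the raw JSON strings, then reuse that schema with from_json. The dataframe and the "payload" column name are hypothetical.

    import org.apache.spark.sql.functions._
    import spark.implicits._

    // infer a schema once from the raw JSON strings
    val inferredSchema = spark.read.json(jsonDf.select($"payload").as[String]).schema

    // reuse it to parse the same column without specifying the schema by hand
    val parsed = jsonDf.withColumn("parsed", from_json($"payload", inferredSchema))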

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-17 Thread Sam Elamin
Well done! This is amazing news :) Congrats and really can't wait to spread the structured streaming love! On Mon, Jul 17, 2017 at 5:25 PM, kant kodali wrote: > +1 > > On Tue, Jul 11, 2017 at 3:56 PM, Jean Georges Perrin wrote: > >> Awesome! Congrats! Can't

Re: UDAFs for sketching Dataset columns with T-Digests

2017-07-06 Thread Sam Bessalah
This is interesting and very useful. Thanks. On Thu, Jul 6, 2017 at 2:33 AM, Erik Erlandson wrote: > After my talk on T-Digests in Spark at Spark Summit East, there were some > requests for a UDAF-based interface for working with Datasets. I'm > pleased to announce that I

Re: Restful API Spark Application

2017-05-12 Thread Sam Elamin
Hi Nipun Have you checked out the job server https://github.com/spark-jobserver/spark-jobserver Regards Sam On Fri, 12 May 2017 at 21:00, Nipun Arora <nipunarora2...@gmail.com> wrote: > Hi, > > We have written a java spark application (primarily uses spark sql). We &

Re: Spark Testing Library Discussion

2017-04-29 Thread Sam Elamin
of the series since this one is mainly about raw extracts. Thank you very much for the feedback and I will be sure to add it once I have more feedback Maybe we can create a gist of all this or even a tiny book on best practices if people find it useful Looking forward to the PR! Regards Sam On Sat

Re: Spark Testing Library Discussion

2017-04-27 Thread Sam Elamin
rt1/> is the first blog post in a series of posts I hope to write on how we build data pipelines Please feel free to retweet my original tweet <https://twitter.com/samelamin/status/857546231492612096> and share because the more ideas we have the better! Feedback is always welcome! Regards Sam

Re: help/suggestions to setup spark cluster

2017-04-26 Thread Sam Elamin
you can just use EMR which will create a cluster for you and attach a zeppelin instance as well You can also use databricks for ease of use and very little management but you will pay a premium for that abstraction Regards Sam On Wed, 26 Apr 2017 at 22:02, anna stax <annasta...@gmail.com>

Re: How to convert Dstream of JsonObject to Dataframe in spark 2.1.0?

2017-04-24 Thread Sam Elamin
r here <https://github.com/samelamin/spark-bigquery/blob/master/src/main/scala/com/samelamin/spark/bigquery/converters/SchemaConverters.scala> which you can use to convert between JsonObjects to StructType schemas Regards Sam On Sun, Apr 23, 2017 at 7:50 PM, kant kodali <kanth...@gmail.co

Deploying Spark Applications. Best Practices And Patterns

2017-04-12 Thread Sam Elamin
l and would probably be better explained on a blog post, but hey this is the gist of it. If people are still interested I can write it up as a blog post adding code samples and nice diagrams! Kind Regards Sam On Wed, Apr 12, 2017 at 7:33 PM, lucas.g...@gmail.com <lucas.g...@gmail.

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-12 Thread Sam Elamin
mpared to the other services. I suppose in the end you are paying to abstract that knowledge away Happy to answer any questions you might have Kind Regards Sam On Wed, 12 Apr 2017 at 09:36, tencas <diego...@gmail.com> wrote: > Hi Gaurav1809 , > > I was thinking about using elast

Re: optimising storage and ec2 instances

2017-04-11 Thread Sam Elamin
est/ManagementGuide/emr-troubleshoot-errors-io.html#recurseinput> However Spark seems to be able to deal with it fine, so if you don't have a data serving layer to your customers then you should be fine Regards Sam On Tue, Apr 11, 2017 at 1:21 PM, Zeming Yu <zemin...@gmail.com> wrote:

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Sam Elamin
and target data to look like. If people are interested I am happy writing a blog about it in the hopes this helps people build more reliable pipelines Kind Regards Sam On Tue, Apr 11, 2017 at 11:31 AM, Steve Loughran <ste...@hortonworks.com> wrote: > > On 7 Apr 2017, at 18:40, Sam El

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Sam Elamin
r some CI workflow, that can do scheduled >> builds and tests. Works well if you can do some build test before even >> submitting it to a remote cluster >> >> On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote: >> >> Hi Shyla >> >&

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Sam Elamin
and error handling (retries, alerts, etc.). AWS are coming out with Glue <https://aws.amazon.com/glue/> soon that does some Spark jobs but I do not think it's available worldwide just yet. Hope I cleared things up Regards Sam On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <gourav.sengu...@gmail.c

Re: Executor unable to pick postgres driver in Spark standalone cluster

2017-04-04 Thread Sam Elamin
in the application itself and the reason it is working is because you have the dependency in your class path locally Regards Sam On Mon, Apr 3, 2017 at 2:43 PM, Rishikesh Teke <rishikesht...@gmail.com> wrote: > > Hi all, > > I was submitting the play application to spark 2.1

Contributing to Spark

2017-03-19 Thread Sam Elamin
d really appreciate if any of the contributors or PMC members would be willing to mentor or guide me in this. Any help would be greatly appreciated! Regards Sam

Re: Spark and continuous integration

2017-03-14 Thread Sam Elamin
frameworks help with that. Previously we have built data sanity checks that look at counts and numbers to produce graphs using statsd and Grafana (elk stack) but not necessarily looking at test metrics I'll definitely check it out Kind regards Sam On Tue, 14 Mar 2017 at 11:57, Jörn Franke <jorn

Re: Spark and continuous integration

2017-03-13 Thread Sam Elamin
that as well as a variety of other hosted CI tools Happy to write a blog post detailing our findings and sharing it here if people are interested Regards Sam On Mon, Mar 13, 2017 at 1:18 PM, Jörn Franke <jornfra...@gmail.com> wrote: > Hi, > > Jenkins also now supports pipeline as code

Spark and continuous integration

2017-03-13 Thread Sam Elamin
avoid it. I've used TeamCity but that was more focused on .NET development What are people using? Kind Regards Sam

Re: How to unit test spark streaming?

2017-03-07 Thread Sam Elamin
in a dataframe and return one, then you assert on the returned df Regards Sam On Tue, 7 Mar 2017 at 12:05, kant kodali <kanth...@gmail.com> wrote: > Hi All, > > How to unit test spark streaming or spark in general? How do I test the > results of my transformations? Also, more importa

Re: using spark to load a data warehouse in real time

2017-03-01 Thread Sam Elamin
to be reliable and never go down then implement Kafka or Kinesis. If it's a proof of concept or you are trying to validate a theory use structured streaming as it's much quicker to write, weeks and months of set up vs a few hours I hope I clarified things for you Regards Sam Sent from my iPhone

Re: Structured Streaming: How to handle bad input

2017-02-23 Thread Sam Elamin
PARQUET or whatever, I should hope whatever service/company is providing this data is providing it "correctly" to a set definition, otherwise you will have to do a pre cleaning step Perhaps someone else can suggest a better/cleaner approach Regards Sam On Thu, Feb 23, 2017

Re: quick question: best to use cluster mode or client mode for production?

2017-02-23 Thread Sam Elamin
I personally use spark submit as it's agnostic to which platform your spark clusters are working on, e.g. EMR, Dataproc, Databricks, etc. On Thu, 23 Feb 2017 at 08:53, nancy henry wrote: > Hi Team, > > I have set of hc.sql("hivequery") kind of scripts which i am running

Re: [Spark Streaming] Starting Spark Streaming application from a specific position in Kinesis stream

2017-02-19 Thread Sam Elamin
, 2017 at 9:23 PM, Sam Elamin <hussam.ela...@gmail.com> wrote: > Hey Neil > > No worries! Happy to help you write it if you want, just link me to the > repo and we can write it together > > Would be fun! > > > Regards > Sam > On Sun, 19 Feb 2017 at 21:21,

Re: [Spark Streaming] Starting Spark Streaming application from a specific position in Kinesis stream

2017-02-19 Thread Sam Elamin
Hey Neil No worries! Happy to help you write it if you want, just link me to the repo and we can write it together Would be fun! Regards Sam On Sun, 19 Feb 2017 at 21:21, Neil Maheshwari <neil.v.maheshw...@gmail.com> wrote: > Thanks for the advice Sam. I will look into imp

Re: [Spark Streaming] Starting Spark Streaming application from a specific position in Kinesis stream

2017-02-19 Thread Sam Elamin
e/sink Hope that helps Regards Sam On Sun, Feb 19, 2017 at 5:53 PM, Neil Maheshwari < neil.v.maheshw...@gmail.com> wrote: > Thanks for your response Ayan. > > This could be an option. One complication I see with that approach is that > I do not want to miss any records tha

Re: Debugging Spark application

2017-02-16 Thread Sam Elamin
/2016-08-26-How-to-debug-remote-spark-jobs-with-IntelliJ/ Although it's for intellij you can apply the same concepts to eclipse *I think* Regards Sam On Thu, 16 Feb 2017 at 22:00, Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > Hi, > > I was looking for some URLs/docume

Re: Enrichment with static tables

2017-02-15 Thread Sam Elamin
You can do a join or a union to combine all the dataframes into one fat dataframe, or do a select on the columns you want to produce your transformed dataframe. Not sure if I understand the question though. If the goal is just an end-state transformed dataframe, that can easily be done. Regards Sam

Re: Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Sam Elamin
ood if I read any of the JSON and if I do spark sql and it > gave me > > for json1.json > > a | b > 1 | null > > for json2.json > > a | b > null | 2 > > > On Tue, Feb 14, 2017 at 8:13 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > >>

Re: Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Sam Elamin
I may be missing something super obvious here but can't you combine them into a single dataframe. Left join perhaps? Try writing it in SQL "select a from json1 and b from json2" then run explain to give you a hint on how to do it in code Regards Sam On Tue, 14 Feb 2017 at 14:30, As

Re: how to fix the order of data

2017-02-14 Thread Sam Elamin
It's because you are just printing the rdd. You can sort the df like below: input.toDF().sort().collect() or, if you do not want to convert to a dataframe, you can use *sortByKey*([*ascending*], [*numTasks*]) Regards Sam On Tue, Feb 14, 2017 at 11:41 AM, 萝卜丝炒饭 <1427
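The two orderings mentioned in this reply, as a compact sketch; the column name "value" and the pair RDD are hypothetical.

    import spark.implicits._

    // DataFrame route: sort by an explicit column before collecting
    val ordered = input.toDF("value").sort("value").collect()

    // pair-RDD route: sortByKey gives a deterministic ordering by key
    val orderedByKey = pairRdd.sortByKey(ascending = true)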

Re: Etl with spark

2017-02-12 Thread Sam Elamin
> > On Feb 12, 2017, at 9:41 AM, Sam Elamin <hussam.ela...@gmail.com> wrote: > > thanks Ayan but i was hoping to remove the dependency on a file and just > use in memory list or dictionary > > So from the reading I've done today it seems.the concept of a bespoke >

Re: Etl with spark

2017-02-12 Thread Sam Elamin
? Regards Sam On Sun, 12 Feb 2017 at 12:13, ayan guha <guha.a...@gmail.com> wrote: You can store the list of keys (I believe you use them in source file path, right?) in a file, one key per line. Then you can read the file using sc.textFile (So you will get a RDD of file paths) and then appl

Etl with spark

2017-02-12 Thread Sam Elamin
from s3 because it infers my schema Regards Sam

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-11 Thread Sam Elamin
Here's a link to the thread http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-Dropping-Duplicates-td20884.html On Sat, 11 Feb 2017 at 08:47, Sam Elamin <hussam.ela...@gmail.com> wrote: > Hey Egor > > > You can use for each writer or you can writ

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-11 Thread Sam Elamin
at how I implemented something similar to a file sink that, in the event of a failure, skips batches already written Also have a look at Michael's reply to me a few days ago on exactly the same topic. The email subject was called structured streaming. Dropping duplicates Regards Sam On Sat, 11 Feb 2017

Structured Streaming. S3 To Google BigQuery

2017-02-08 Thread Sam Elamin
ed it if you retweeted when you get a chance The more people know about it and use it the more feedback I can get to make the connector better! Of course PRs and feedback are always welcome :) Thanks again! Regards Sam

Re: specifing schema on dataframe

2017-02-06 Thread Sam Elamin
and try to match the type. If > you find a mismatch, you'd add a withColumn clause to cast to the correct > data type (from your "should-be" struct). > > HTH? > > Best > Ayan > > On Mon, Feb 6, 2017 at 8:00 PM, Sam Elamin <hussam.ela...@gmail.com> &

Re: specifing schema on dataframe

2017-02-06 Thread Sam Elamin
t, how would you apply the schema? > > On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin <hussam.ela...@gmail.com> > wrote: > >> Thanks ayan but I meant how to derive the list automatically >> >> In your example you are specifying the numeric columns and I would like >> it

Re: specifing schema on dataframe

2017-02-06 Thread Sam Elamin
>> for k in numeric_field_list: > ... df = df.withColumn(k,df[k].cast("long")) > ... > >>> df.printSchema() > root > |-- customerid: long (nullable = true) > |-- foo: string (nullable = true) > > > On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin

Re: specifing schema on dataframe

2017-02-05 Thread Sam Elamin
the columns in the old df. For each column cast it correctly and generate a new df? Would you recommend that? Regards Sam On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mich...@databricks.com> wrote: > If you already have the expected schema, and you know that all numbers > will always
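A sketch of the iteration described here, assuming expectedSchema is the StructType already in hand: cast each column of the loosely-typed dataframe to the expected type.

    import org.apache.spark.sql.functions.col

    val casted = expectedSchema.fields.foldLeft(df) { (acc, field) =>
      acc.withColumn(field.name, col(field.name).cast(field.dataType))
    }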

Re: specifing schema on dataframe

2017-02-05 Thread Sam Elamin
I see so for the connector I need to pass in an array/list of numerical columns? Wouldn't it be simpler to just regex replace the numbers to remove the quotes? Regards Sam On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <mich...@databricks.com> wrote: > Specifying the schema whe

Re: specifing schema on dataframe

2017-02-05 Thread Sam Elamin
ntify which fields are numbers and which aren't, then recreate the json But to be honest that doesn't seem like the cleanest approach, so happy for advice on this Regards Sam On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mich...@databricks.com> wrote: > -dev > > You can use withColumn t

Re: specifing schema on dataframe

2017-02-04 Thread Sam Elamin
like to convert it into a dataframe which I pass the schema into. What's the best way to do this? I doubt removing all the quotes in the JSON is the best solution, is it? Regards Sam On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho < dirceu.semigh...@gmail.com> wrote: > Hi Sam

specifing schema on dataframe

2017-02-04 Thread Sam Elamin
"535137"}"""))) df1.show(1) df2.show(1) Any help would be appreciated, I am sure I am missing something obvious but for the life of me I can't tell what it is! Kind Regards Sam

Re: java.lang.NoSuchMethodError: scala.runtime.ObjectRef.zero()Lscala/runtime/ObjectRef

2017-02-04 Thread Sam Elamin
v2.11: https://github.com/scala/scala/blob/2.11.x/src/library/scala/runtime/VolatileObjectRef.java Regards Sam On Sat, 4 Feb 2017 at 09:24, sathyanarayanan mudhaliyar < sathyanarayananmudhali...@gmail.com> wrote: > Hi , > I got the error below when executed > > Excepti

Upgrading to Spark 2.0.1 broke array in parquet DataFrame

2016-11-04 Thread Sam Goodwin
I have a table with a few columns, some of which are arrays. Since upgrading from Spark 1.6 to Spark 2.0.1, the array fields are always null when reading in a DataFrame. When writing the Parquet files, the schema of the column is specified as StructField("packageIds",ArrayType(StringType)) The

Re: Spark join and large temp files

2016-08-09 Thread Sam Bessalah
Have you tried to broadcast your small table in order to perform your join? joined = bigDF.join(broadcast(smallDF), ...) On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab wrote: > Hi Deepak, > No...not really. Upping the disk size is a solution, but more expensive as > you
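The suggestion in this reply written out as a sketch; the join key is hypothetical, and smallDF must be small enough to be replicated to every executor.

    import org.apache.spark.sql.functions.broadcast

    // hint Spark to replicate the small side instead of shuffling the big one
    val joined = bigDF.join(broadcast(smallDF), Seq("joinKey"))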

Re: hdfs-ha on mesos - odd bug

2015-09-14 Thread Sam Bessalah
I don't know about the broken url. But are you running HDFS as a mesos framework? If so is it using mesos-dns? Then you should resolve the namenode via hdfs:/// On Mon, Sep 14, 2015 at 3:55 PM, Adrian Bridgett wrote: > I'm hitting an odd issue with running spark on

Re: *Metrics API is odd in MLLib

2015-07-28 Thread Sam
. The repo complete with detailed documentation can be found here https://github.com/samthebest/sceval. Many thanks, Sam On Thu, Jun 18, 2015 at 11:00 AM, Sam samthesav...@gmail.com wrote: Firstly apologies for the header of my email containing some junk, I believe it's due to a copy and paste error

Re: *Metrics API is odd in MLLib

2015-06-18 Thread Sam
/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L127). Feel free to submit a PR to make it public. -Xiangrui On Mon, Jun 15, 2015 at 7:13 AM, Sam samthesav...@gmail.com wrote: Google+ https://plus.google.com/app/basic?nopromo

*Metrics API is odd in MLLib

2015-06-15 Thread Sam

Spark Python with SequenceFile containing numpy deserialized data in str form

2015-06-08 Thread Sam Stoelinga
read back the original data. Will try converting the str to bytearray before storing it to a sequence file. Thanks, Sam Stoelinga

Re: Spark Python with SequenceFile containing numpy deserialized data in str form

2015-06-08 Thread Sam Stoelinga
) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) On Tue, Jun 9, 2015 at 11:04 AM, Sam Stoelinga sammiest...@gmail.com wrote: Hi all, I'm storing an rdd as sequencefile with the following content: key=filename(string) value=python str from numpy.savez(not unicode

Re: Spark Python with SequenceFile containing numpy deserialized data in str form

2015-06-08 Thread Sam Stoelinga
language usable SequenceFile instead of using Picklefile though, so if anybody has pointers would appreciate that :) On Tue, Jun 9, 2015 at 11:35 AM, Sam Stoelinga sammiest...@gmail.com wrote: Update: Using bytearray before storing to RDD is not a solution either. This happens when trying to read

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
. On Fri, Jun 5, 2015 at 2:17 PM, Sam Stoelinga sammiest...@gmail.com wrote: Yea should have emphasized that. I'm running the same code on the same VM. It's a VM with spark in standalone mode and I run the unit test directly on that same VM. So OpenCV is working correctly on that same machine

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
2, 2015 at 5:06 AM, Davies Liu dav...@databricks.com wrote: Could you run the single thread version in worker machine to make sure that OpenCV is installed and configured correctly? On Sat, May 30, 2015 at 6:29 AM, Sam Stoelinga sammiest...@gmail.com wrote: I've verified the issue lies

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
: Please file a bug here: https://issues.apache.org/jira/browse/SPARK/ Could you also provide a way to reproduce this bug (including some datasets)? On Thu, Jun 4, 2015 at 11:30 PM, Sam Stoelinga sammiest...@gmail.com wrote: I've changed the SIFT feature extraction to SURF feature extraction
