Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Mike Metzger
I've not done this in Scala yet, but in PySpark I've run into a similar issue where caching too many dataframes causes memory problems. Unpersist by itself did not clear the memory usage, but setting the variable equal to None as well allowed all the references to be cleared and the memory to be released.
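
A minimal PySpark sketch of the pattern described above; the session setup and data are illustrative, not taken from the thread:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-cleanup").getOrCreate()

    # Build and cache an intermediate DataFrame (illustrative data)
    df = spark.range(0, 1000000).withColumnRenamed("id", "value")
    df.cache()
    df.count()      # materialize the cache

    # unpersist() alone may leave the Python-side reference alive;
    # dropping the reference as well lets the objects be garbage collected.
    df.unpersist()
    df = None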

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mike Metzger
Hi Mich - Can you run a filter on df1 for any rows where p(3).toString != '-' prior to your map, then run your map command? Thanks Mike On Tue, Sep 27, 2016 at 5:06 PM, Mich Talebzadeh wrote: > Thanks guys > > Actually these are the 7 rogue rows.
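
The exchange above is in Scala; a rough PySpark equivalent of the same idea (drop the rogue rows before mapping), assuming an existing SparkContext sc, a hypothetical file path, and that the rogue '-' sits in the fourth field as in the snippet:

    # Split each CSV line, filter out rows whose fourth field is the rogue '-',
    # then run the original transformation on clean rows only.
    raw = sc.textFile("prices.csv")                      # hypothetical path
    parsed = raw.map(lambda line: line.split(","))
    clean = parsed.filter(lambda p: p[3] != "-")
    result = clean.map(lambda p: (p[0], float(p[3])))    # illustrative mapping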

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-19 Thread Mike Metzger
While the SparkListener method is likely better all around, if you just need this quickly you should be able to do an SSH local port redirection through PuTTY. In the PuTTY configuration: - Go to Connection: SSH: Tunnels - In the Source port field, enter 4040 (or another unused port on your machine)

Re: year out of range

2016-09-08 Thread Mike Metzger
My guess is there's some row that does not match up with the expected data. While slower, I've found RDDs easier for troubleshooting this kind of thing until you sort out exactly what's happening. Something like: raw_data = sc.textFile("") rowcounts = raw_data.map(lambda x:
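
One plausible completion of the truncated snippet above, counting fields per line to locate rows that don't match the expected layout; the path and the expected field count of 12 are placeholders:

    # Tally the number of fields per line; an unexpected count points at the bad rows.
    raw_data = sc.textFile("input.csv")
    rowcounts = (raw_data.map(lambda x: (len(x.split(",")), 1))
                         .reduceByKey(lambda a, b: a + b))
    print(rowcounts.collect())

    # Pull a few of the odd rows for inspection
    bad_rows = raw_data.filter(lambda x: len(x.split(",")) != 12)
    print(bad_rows.take(5))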

Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Mike Metzger
Hi Kevin - There's not really a race condition, as the 64 bit value is split into a 31 bit partition id (the upper portion) and a 33 bit incrementing id. In other words, as long as each partition contains fewer than 2^33 (about 8.6 billion) entries there should be no overlap.
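
A small PySpark sketch illustrating the bit layout described above (not code from the thread): the upper 31 bits carry the partition id and the lower 33 bits an in-partition counter, so the two pieces can be pulled apart with a shift and a mask.

    from pyspark.sql.functions import monotonically_increasing_id, shiftRight, col, lit

    df = spark.range(0, 10).repartition(4)
    df = df.withColumn("uid", monotonically_increasing_id())

    # Decompose the 64 bit id: partition id in the upper 31 bits,
    # per-partition counter in the lower 33 bits (2^33 ~ 8.6 billion values).
    df = (df.withColumn("partition_id", shiftRight(col("uid"), 33))
            .withColumn("offset", col("uid").bitwiseAND(lit((1 << 33) - 1))))
    df.show()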

Re: Spark 2.0 - Insert/Update to a DataFrame

2016-08-26 Thread Mike Metzger
…data, based on Product=Item# > So the resultant FCST DF should have data: > Product, fcst_qty > A 85 > B 35 > Hope it helps > If I join the data between the 2 DFs (based on Product# and item#),

Re: Please assist: Building Docker image containing spark 2.0

2016-08-26 Thread Mike Metzger
I would also suggest building the container manually first and setup everything you specifically need. Once done, you can then grab the history file, pull out the invalid commands and build out the completed Dockerfile. Trying to troubleshoot an installation via Dockerfile is often an exercise

Re: Spark 2.0 - Insert/Update to a DataFrame

2016-08-26 Thread Mike Metzger
> …based on the sales qty (coming from the sales order DF) > Hope it helps > Subhajit

Re: Spark 2.0 - Insert/Update to a DataFrame

2016-08-26 Thread Mike Metzger
Without seeing the makeup of the DataFrames or what your logic is for updating them, I'd suggest doing a join of the Forecast DF with the appropriate columns from the SalesOrder DF. Mike On Fri, Aug 26, 2016 at 11:53 AM, Subhajit Purkayastha wrote: > I am using spark 2.0,
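
A hedged PySpark sketch of the suggested approach. The schemas, column names, update rule (subtracting sold quantity from the forecast), and input quantities are all guesses; the inputs are only chosen so the output lines up with the A 85 / B 35 result quoted earlier in the thread.

    from pyspark.sql import functions as F

    forecast = spark.createDataFrame([("A", 100), ("B", 50)], ["product", "fcst_qty"])
    sales    = spark.createDataFrame([("A", 15), ("B", 15)], ["item", "sales_qty"])

    # Join on product/item and derive the adjusted forecast in one pass;
    # DataFrames are immutable, so build a new column rather than updating rows.
    updated = (forecast
               .join(sales, forecast.product == sales.item, "left")
               .withColumn("new_fcst_qty",
                           F.col("fcst_qty") - F.coalesce(F.col("sales_qty"), F.lit(0)))
               .select("product", "new_fcst_qty"))
    updated.show()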

Re: UDF on lpad

2016-08-25 Thread Mike Metzger

Re: UDF on lpad

2016-08-25 Thread Mike Metzger
Are you trying to always add a fixed number of digits / characters, or are you trying to pad to a specific length? If the latter, try using format strings: // Pad to 10 characters with 0's val c = 123 f"$c%010d" // Result is 0000000123 // Pad to 10 total characters with 0's val c = 123.87 f"$c%010.2f" // Result is 0000123.87
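
The examples above are Scala format strings; since the thread subject is lpad, here is a rough PySpark equivalent using sql.functions.lpad (column name and data are illustrative):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([(123,), (98765,)], ["value"])

    # lpad pads to a total length, mirroring the %010d example above.
    df = df.withColumn("padded", F.lpad(F.col("value").cast("string"), 10, "0"))
    # 123   -> "0000000123"
    # 98765 -> "0000098765"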

Re: Sum array values by row in new column

2016-08-15 Thread Mike Metzger
Assuming you know the number of elements in the list, this should work: df.withColumn('total', df["_1"].getItem(0) + df["_1"].getItem(1) + df["_1"].getItem(2)) Mike On Mon, Aug 15, 2016 at 12:02 PM, Javier Rey wrote: > Hi everyone, > > I have one dataframe with one column
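
A sketch that generalizes the line above when the (known) element count is larger; the column name comes from the snippet, the data is made up:

    from functools import reduce
    from pyspark.sql import functions as F

    df = spark.createDataFrame([([1, 2, 3],), ([4, 5, 6],)], ["_1"])

    # Sum getItem(0) .. getItem(n-1) without writing each term by hand;
    # still assumes the number of elements (n) is known up front.
    n = 3
    total = reduce(lambda a, b: a + b, [F.col("_1").getItem(i) for i in range(n)])
    df = df.withColumn("total", total)
    df.show()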

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
>> @mike - this looks great. How can I do this in Java? What is the performance implication on a large dataset? >> @sonal - I can't have a collision in the values. >> On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
>> What is the performance implication on a large dataset? >> @sonal - I can't have a collision in the values. >> On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger wrote: > You can use the monotonically_increasing_id

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
You can use the monotonically_increasing_id method to generate guaranteed unique (but not necessarily consecutive) IDs. Calling something like: df.withColumn("id", monotonically_increasing_id()) You don't mention which language you're using, but you'll need to pull in the sql.functions package.
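
In PySpark the call looks like this; the import line is presumably what the truncated sentence was pointing at (the DataFrame itself is illustrative):

    from pyspark.sql.functions import monotonically_increasing_id

    df = spark.createDataFrame([("a",), ("b",), ("c",)], ["letter"])
    df = df.withColumn("id", monotonically_increasing_id())
    df.show()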

Re: Add column sum as new column in PySpark dataframe

2016-08-04 Thread Mike Metzger
This is a little ugly, but it may do what you're after - df.withColumn('total', expr("+".join([col for col in df.columns]))) I believe this will handle null values ok, but will likely error if there are any string columns present. Mike On Thu, Aug 4, 2016 at 8:41 AM, Javier Rey
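
A variant of the one-liner above that skips non-numeric columns and coalesces nulls to 0 (with a plain +, a null in any column would make that row's total null). Column names and data are made up.

    from functools import reduce
    from pyspark.sql import functions as F

    df = spark.createDataFrame([(1, 2, None), (4, None, 6)], ["a", "b", "c"])

    # Only sum numeric columns, treating nulls as 0.
    numeric_cols = [f.name for f in df.schema.fields
                    if f.dataType.typeName() in ("integer", "long", "double", "float")]
    total = reduce(lambda x, y: x + y,
                   [F.coalesce(F.col(c), F.lit(0)) for c in numeric_cols])
    df = df.withColumn("total", total)
    df.show()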