Re: examples for flattening dataframes using pyspark

2017-05-27 Thread Zeming Yu
Sorry, sent the incomplete email by mistake. Here's the full email: > Hi, > > I need to flatten a nested dataframe and I'm following this example: > https://docs.databricks.com/spark/latest/spark-sql/complex-types.html > > Just wondering: > 1. how can I test for the existence of an item before

examples for flattening dataframes using pyspark

2017-05-27 Thread Zeming Yu
Hi, I need to flatten a nested dataframe and I'm following this example: https://docs.databricks.com/spark/latest/spark-sql/complex-types.html Just wondering: 1. how can I test for the existence of an item before retrieving it? Say, test if "b" exists before adding that into my flat dataframe
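
A minimal sketch of one way to answer question 1, assuming the nested column is a struct (the names "a" and "b" and the input path are stand-ins taken from the question, not a definitive recipe):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("nested_example.parquet")  # hypothetical input

    def has_field(schema, *path):
        # Walk a (possibly nested) StructType and report whether the
        # full field path exists in it.
        for name in path:
            if not isinstance(schema, StructType) or name not in schema.fieldNames():
                return False
            schema = schema[name].dataType
        return True

    # Only add the flattened column when the nested field actually exists.
    if has_field(df.schema, "a", "b"):
        df = df.withColumn("a_b", df["a"]["b"])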

using pandas and pyspark to run ETL job - always failing after about 40 minutes

2017-05-26 Thread Zeming Yu
Hi, I tried running the ETL job a few times. It always fails after 40 minutes or so. When I relaunch jupyter and rerun the job, it runs without error. Then it fails again after some time. Just wondering if anyone else has encountered this before? Here's the error message:

Re: what does this error mean?

2017-05-13 Thread Zeming Yu
969 logger.exception(msg)
--> 970 raise Py4JNetworkError(msg, e)
971
972 def close(self, reset=False):

Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:34166)

On Sat, May 13, 2017 at 10:21 PM, Zeming Yu <zemin...@gmail.co

what does this error mean?

2017-05-13 Thread Zeming Yu
My code runs error-free on my local PC. I just tried running the same code on an Ubuntu machine on EC2 and got the error below. Any idea where to start in terms of debugging? ---Py4JError

how to set up h2o sparkling water on jupyter notebook on a windows machine

2017-05-08 Thread Zeming Yu
Hi, I'm a newbie, so please bear with me. *I'm using a Windows 10 machine. I installed Spark here:* C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7 *I also installed H2O Sparkling Water here:* C:\sparkling-water-2.1.1 *I use this code in command line to launch a jupyter notebook for

how to check whether spill over to hard drive happened or not

2017-05-06 Thread Zeming Yu
Hi, I'm running pyspark on my local PC in standalone mode. After applying a window function to a dataframe, I ran a groupby query on it. The groupby query turned out to be very slow (10+ minutes on a small data set). I then cached the dataframe and re-ran the same query. The
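
The truncated thread doesn't show the answer, but for reference: in local mode the Spark UI at http://localhost:4040 reports per-stage "Shuffle Spill (Memory)" and "Shuffle Spill (Disk)" task metrics, and a non-zero disk figure means a spill to disk happened. A minimal sketch of the cache-then-requery experiment described above (the input path and key column are hypothetical; DataFrame.storageLevel needs Spark 2.1+):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.read.parquet("some_data.parquet")  # hypothetical input

    df.cache()
    df.count()              # forces materialization of the cache
    print(df.storageLevel)  # where the cached data actually lives
    df.groupBy("some_key").count().show()  # hypothetical slow groupby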

Re: take the difference between two columns of a dataframe in pyspark

2017-05-06 Thread Zeming Yu
OK. I've worked it out. df.withColumn('diff', col('A')-col('B')) On Sun, May 7, 2017 at 11:49 AM, Zeming Yu <zemin...@gmail.com> wrote: > Say I have the following dataframe with two numeric columns A and B, > what's the best way to add a column showing the difference between the t

take the difference between two columns of a dataframe in pyspark

2017-05-06 Thread Zeming Yu
Say I have the following dataframe with two numeric columns A and B, what's the best way to add a column showing the difference between the two columns?

+---------+------+
|        A|     B|
+---------+------+
|786.31999|786.12|
|   786.12|
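
A self-contained version of the fix shown in the reply above (sample values taken from the question):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(786.31999, 786.12)], ["A", "B"])

    # row-wise difference of the two numeric columns
    df.withColumn("diff", col("A") - col("B")).show()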

Spark books

2017-05-03 Thread Zeming Yu
I'm trying to decide whether to buy the books Learning Spark, Spark for Machine Learning, etc., or wait for a new edition covering the newer concepts like DataFrames and Datasets. Does anyone have any suggestions?

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
ous to see what others > have to say. > Would a relational database with the right index > (the code field in the above case) perform more efficiently (with Spark, which > uses predicate push-down)? > Hope this helps. > > Thanks, > Muthu > > On Sun, Ap

examples of dealing with nested parquet/ dataframe file

2017-04-30 Thread Zeming Yu
Hi, I'm still trying to decide whether to store my data as a deeply nested or a flat parquet file. The main reason for storing the nested file is that it keeps the data in its raw format, with no information loss. I have two questions: 1. Is it always necessary to flatten a nested dataframe for the purpose of
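
On question 1: flattening is not always necessary, since nested struct fields can be selected and filtered directly by dot path. A minimal sketch (the field names and input path are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("raw_nested.parquet")  # hypothetical input

    # query nested fields without flattening first
    (df.select("trip.origin", "trip.price.total")
       .where(col("trip.origin") == "MEL")
       .show())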

Re: Recommended cluster parameters

2017-04-30 Thread Zeming Yu
I've got a similar question. Would you be able to provide a rough guide (even a range is fine) on the number of nodes, cores, and total amount of RAM required? Do you want to store 1 TB, 1 PB, or far more? - say 6 TB of data in parquet format on S3. Do you want to just read that data,

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
PM, Jörn Franke <jornfra...@gmail.com> wrote: > It depends on your queries, the data structure, etc. Generally flat is better, > but if your query filter is on the highest level then you may get better > performance with a nested structure, but it really depends > > > On 30. Apr 2017,

parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
Hi, We're building a parquet-based data lake. I was under the impression that flat files are more efficient than deeply nested files (say 3 or 4 levels down). Is that correct? Thanks, Zeming

Re: how to find the nearest holiday

2017-04-25 Thread Zeming Yu
ocs/latest/api/python/index.html > > > Thank you, > *Pushkar Gujar* > > > On Tue, Apr 25, 2017 at 8:50 AM, Zeming Yu <zemin...@gmail.com> wrote: > >> How could I access the first element of the holiday column? >> >> I tried the following code

Re: how to find the nearest holiday

2017-04-25 Thread Zeming Yu
How could I access the first element of the holiday column? I tried the following code, but it doesn't work: start_date_test2.withColumn("diff", datediff(start_date_test2.start_date, start_date_test2.holiday*[0]*)).show() On Tue, Apr 25, 2017 at 10:20 PM, Zeming Yu <zemin...@gma
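
One standard fix (an assumption here, since the reply above is truncated): when "holiday" is an array column, element access goes through Column.getItem rather than Python-style indexing.

    from pyspark.sql.functions import col, datediff

    # holiday[0] fails because [] does Python indexing on a Column object;
    # getItem(0) builds the corresponding Spark expression instead
    start_date_test2.withColumn(
        "diff",
        datediff(col("start_date"), col("holiday").getItem(0))
    ).show()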

Re: how to find the nearest holiday

2017-04-25 Thread Zeming Yu
() > > Should transfer the string to Date type when comparing. > > Yu Wenpei. > > > ----- Original message ----- > From: Zeming Yu <zemin...@gmail.com> > To: user <user@spark.apache.org> > Cc: > Subject: how to find the nearest holiday > Date: Tue, Apr 25, 2017

how to find the nearest holiday

2017-04-25 Thread Zeming Yu
I have a column of dates (date type) and I'm just trying to find the nearest holiday to each date. Does anyone have any idea what went wrong below?

start_date_test = flight3.select("start_date").distinct()
start_date_test.show()
holidays = ['2017-09-01', '2017-10-01']

+----------+
|start_date|
+----------+
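
A minimal sketch of one way to get the distance to the nearest holiday, following the advice in the reply above to compare as dates (the holiday list comes from the question; least() needs at least two columns):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import abs as sql_abs, col, datediff, least, lit, to_date

    spark = SparkSession.builder.getOrCreate()
    holidays = ['2017-09-01', '2017-10-01']

    # one absolute day-difference per holiday, then keep the smallest
    diffs = [sql_abs(datediff(col("start_date"), to_date(lit(h)))) for h in holidays]
    start_date_test.withColumn("days_to_nearest_holiday", least(*diffs)).show()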

Re: udf that handles null values

2017-04-24 Thread Zeming Yu
issue today at stackoverflow. > > http://stackoverflow.com/questions/43595201/python-how-to-convert-pyspark-column-to-date-type-if-there-are-null-values/43595728#43595728 > > > Thank you, > *Pushkar Gujar* > > > On Mon, Apr 24, 2017 at 8:22 PM, Zeming Yu <zemin...@gm

one hot encode a column of vector

2017-04-24 Thread Zeming Yu
How do I one-hot encode a column of arrays? e.g. ['TG', 'CA'] FYI, here's my code for one-hot encoding normal categorical columns. How do I make it work for a column of arrays? from pyspark.ml import Pipeline from pyspark.ml.feature import StringIndexer indexers =
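
One approach, not from the thread: CountVectorizer with binary=True multi-hot encodes an array-of-strings column directly (the column name and rows below are made up to match the example above):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import CountVectorizer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(["TG", "CA"],), (["QF"],)], ["carriers"])

    # binary=True yields 0/1 indicators rather than term counts
    cv = CountVectorizer(inputCol="carriers", outputCol="carriers_vec", binary=True)
    cv.fit(df).transform(df).show(truncate=False)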

pyspark vector

2017-04-24 Thread Zeming Yu
Hi all, Beginner question: what does the 3 mean in (3,[0,1,2],[1.0,1.0,1.0])? https://spark.apache.org/docs/2.1.0/ml-features.html

id | texts                | vector
---|----------------------|-------
0  | Array("a", "b", "c") |
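
The leading 3 is the vector's length; the two lists hold the indices and the values of its non-zero entries. A quick check:

    from pyspark.ml.linalg import SparseVector

    v = SparseVector(3, [0, 1, 2], [1.0, 1.0, 1.0])
    print(v.size)       # 3, the total number of elements
    print(v.toArray())  # [1. 1. 1.]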

udf that handles null values

2017-04-24 Thread Zeming Yu
hi all, I tried to write a UDF that handles null values:

def getMinutes(hString, minString):
    if (hString != None) & (minString != None):
        return int(hString) * 60 + int(minString[:-1])
    else:
        return None

flight2 = (flight2.withColumn("duration_minutes", udfGetMinutes("duration_h",
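
A corrected, registered version of the UDF above, as a sketch (assumptions: duration_h holds hour counts like "15", and the second argument, whose column name is truncated above, holds strings like "10m"):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    def get_minutes(h_string, min_string):
        # prefer `is not None` and `and` over `!= None` and bitwise `&`
        if h_string is not None and min_string is not None:
            return int(h_string) * 60 + int(min_string[:-1])
        return None

    udf_get_minutes = udf(get_minutes, IntegerType())
    flight2 = flight2.withColumn(
        "duration_minutes",
        udf_get_minutes("duration_h", "duration_m"))  # "duration_m" is a guess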

Re: how to add new column using regular expression within pyspark dataframe

2017-04-22 Thread Zeming Yu
h').getItem(0)) > > > Thank you, > *Pushkar Gujar* > > > On Thu, Apr 20, 2017 at 4:35 AM, Zeming Yu <zemin...@gmail.com> wrote: > >> Any examples? >> >> On 20 Apr. 2017 3:44 pm, "颜发才(Yan Facai)" <facai@gmail.com> wrote: >>

Re: how to add new column using regular expression within pyspark dataframe

2017-04-20 Thread Zeming Yu
> + https://ragrawal.wordpress.com/2015/10/02/spark-custom-udf-example/ > > > > On Mon, Apr 17, 2017 at 8:25 PM, Zeming Yu <zemin...@gmail.com> wrote: > >> I've got a dataframe with a column looking like this: >> >> display(flight.select("durati

how to add new column using regular expression within pyspark dataframe

2017-04-17 Thread Zeming Yu
I've got a dataframe with a column looking like this:

display(flight.select("duration").show())

+--------+
|duration|
+--------+
|  15h10m|
|   17h0m|
|  21h25m|
|  14h30m|
|  24h50m|
|  26h10m|
|  14h30m|
|   23h5m|
|  21h30m|
|  11h50m|
|  16h10m|
|  15h15m|
|  21h25m|
|  14h25m|
|  14h40m|
|
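
A sketch with regexp_extract, one way to pull the parts out of strings like "15h10m" (the replies above mention split/getItem and a custom-UDF approach; this is an equivalent alternative, and the new column names are my own):

    from pyspark.sql.functions import col, regexp_extract

    flight = (flight
              .withColumn("duration_h", regexp_extract(col("duration"), r"(\d+)h", 1))
              .withColumn("duration_m", regexp_extract(col("duration"), r"(\d+)m", 1)))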

Re: optimising storage and ec2 instances

2017-04-11 Thread Zeming Yu
" <ste...@hortonworks.com> wrote: > > > On 11 Apr 2017, at 11:07, Zeming Yu <zemin...@gmail.com> wrote: > > > > Hi all, > > > > I'm a beginner with Spark, and I'm wondering if someone could provide > guidance on the following 2 questions I have. >

optimising storage and ec2 instances

2017-04-11 Thread Zeming Yu
Hi all, I'm a beginner with Spark, and I'm wondering if someone could provide guidance on the following 2 questions. Background: I have a data set growing by 6 TB p.a. I plan to use Spark to read in all the data, manipulate it, and build a predictive model on it (say GBM). I plan to store