Create an Empty dataframe

2018-06-30 Thread dimitris plakas
I am new to PySpark and want to initialize a new, empty dataframe with sqlContext() that has two columns ("Column1", "Column2"), and I want to append rows to it dynamically in a for loop. Is there any way to achieve this? Thank you in advance.
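A minimal sketch of one way to do this, assuming Spark 2.x with a SparkSession (sqlContext.createDataFrame behaves the same way); the schema types and row values are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    # Empty dataframe with an explicit two-column schema
    schema = StructType([
        StructField("Column1", StringType(), True),
        StructField("Column2", StringType(), True),
    ])
    df = spark.createDataFrame([], schema)

    # Dataframes are immutable, so "appending" builds a new one each pass
    for i in range(3):
        row = spark.createDataFrame([("a%d" % i, "b%d" % i)], schema)
        df = df.union(row)

    df.show()

Unioning inside a loop grows the query plan on every iteration; collecting the rows in a Python list and calling createDataFrame once at the end is usually cheaper.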

Connect to postgresql with pyspark

2018-04-29 Thread dimitris plakas
I am new to PySpark and am learning it in order to complete my thesis project at university. I am trying to create a dataframe by reading from a PostgreSQL database table, but I am facing a problem when I try to connect my PySpark application to the PostgreSQL server. Could you please
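A sketch of how the connection is usually made: read through the JDBC data source, with the PostgreSQL JDBC driver jar on the classpath. The host, database, table, and credentials below are placeholders:

    from pyspark.sql import SparkSession

    # The driver jar must be available, e.g.
    # spark-submit --jars postgresql-42.2.x.jar my_app.py
    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://localhost:5432/mydb")  # placeholder
          .option("dbtable", "my_table")                           # placeholder
          .option("user", "myuser")
          .option("password", "mypassword")
          .option("driver", "org.postgresql.Driver")
          .load())
    df.show()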

Insert a pyspark dataframe in postgresql

2018-08-21 Thread dimitris plakas
Hello everyone, here is a case that I am facing. I have a PySpark application whose last step is to create a PySpark dataframe with two columns (column1, column2). This dataframe has only one row, and I want this row to be inserted into a Postgres db table. In every run this line in the
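A sketch of the write side, assuming the same JDBC setup as in the previous thread; df stands for the one-row, two-column dataframe from the question, and the connection options are placeholders:

    (df.write
       .format("jdbc")
       .option("url", "jdbc:postgresql://localhost:5432/mydb")  # placeholder
       .option("dbtable", "results_table")                      # placeholder
       .option("user", "myuser")
       .option("password", "mypassword")
       .option("driver", "org.postgresql.Driver")
       .mode("append")  # append adds the row on each run instead of failing
       .save())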

Error in show()

2018-09-06 Thread dimitris plakas
Hello everyone, I am new to PySpark and I am facing an issue. Let me explain what exactly the problem is. I have a dataframe and I apply a map() function on it (dataframe2 = dataframe1.rdd.map(custom_function()); dataframe = sqlContext.createDataFrame(dataframe2)) when i have
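One common cause of errors in this pattern is passing the result of the function rather than the function itself: map(custom_function()) calls custom_function immediately, while map(custom_function) hands it to Spark. A minimal sketch, with a hypothetical custom_function:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    dataframe1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    def custom_function(row):
        # hypothetical per-row transformation; must return a tuple or Row
        return (row.id * 2, row.value.upper())

    # Note: map(custom_function), not map(custom_function())
    rdd2 = dataframe1.rdd.map(custom_function)
    dataframe2 = spark.createDataFrame(rdd2, ["id", "value"])
    dataframe2.show()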

Convert scientific notation DecimalType

2018-07-10 Thread dimitris plakas
Hello everyone, I am new to PySpark and I am facing a problem casting some values to DecimalType. To clarify my question I present an example. I have a dataframe in which I store my data, which are trajectories. The dataframe looks like *Id | Trajectory* id1 | [ [x1, y1, t1], [x2, y2,
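Casting the whole nested column element-wise is one approach; a sketch assuming the trajectory points are doubles, with (20, 10) as placeholder precision and scale:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("id1", [[1.23e-5, 4.56e2, 1.0], [7.8e-3, 9.1e1, 2.0]])],
        ["Id", "Trajectory"],
    )

    # Casting an array casts every element, so the nested doubles
    # become decimals and show() prints them without scientific notation
    df2 = df.withColumn(
        "Trajectory",
        F.col("Trajectory").cast("array<array<decimal(20,10)>>"),
    )
    df2.show(truncate=False)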

DataTypes of an ArrayType

2018-07-11 Thread dimitris plakas
Hello everyone, I am new to PySpark and I would like to ask if there is any way to have a dataframe column which is ArrayType and has a different DataType for each element of the ArrayType. For example, to have something like: StructType([StructField("Column_Name",
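ArrayType requires a single element type, so per-element types are usually modelled as an array of structs instead; a sketch with hypothetical field names matching the trajectory example from the other threads:

    from pyspark.sql.types import (
        StructType, StructField, ArrayType, DoubleType, LongType,
    )

    # Each array element is a struct, and each struct field
    # can have its own DataType
    schema = StructType([
        StructField("Column_Name", ArrayType(
            StructType([
                StructField("x", DoubleType()),
                StructField("y", DoubleType()),
                StructField("t", LongType()),
            ])
        )),
    ])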

Pyspark Partitioning

2018-09-30 Thread dimitris plakas
Hello everyone, I am trying to split a dataframe into partitions and I want to apply a custom function on every partition. More precisely, I have a dataframe like the one below:

Group_Id | Id  | Points
1        | id1 | Point1
2        | id2 | Point2

I want to have a partition for every
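One approach, sketched under the assumption that it is acceptable for several groups to share a partition: repartition by the column and run the function with mapPartitions. The per-row work inside custom_function is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "id1", "Point1"), (2, "id2", "Point2")],
        ["Group_Id", "Id", "Points"],
    )

    def custom_function(rows):
        # rows is an iterator over the Rows of one partition
        for row in rows:
            yield (row.Group_Id, row.Id, row.Points.upper())  # placeholder work

    # repartition hashes Group_Id onto partitions; distinct groups
    # may still land together, so the function must tolerate that
    result = (df.repartition("Group_Id")
                .rdd.mapPartitions(custom_function)
                .toDF(["Group_Id", "Id", "Points"]))
    result.show()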

Pyspark Partitioning

2018-10-04 Thread dimitris plakas
Hello everyone, here is an issue that I am facing in partitioning a dataframe. I have a dataframe called data_df. It looks like:

Group_Id | Object_Id | Trajectory
1        | obj1      | Traj1
2        | obj2      | Traj2
1        | obj3      | Traj3
3        |
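When every group must land in its own partition, keying the RDD and supplying an explicit partition function is one option; the sketch below assumes Group_Id values 1-3 as in the excerpt, and the sample rows are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    data_df = spark.createDataFrame(
        [(1, "obj1", "Traj1"), (2, "obj2", "Traj2"),
         (1, "obj3", "Traj3"), (3, "obj4", "Traj4")],
        ["Group_Id", "Object_Id", "Trajectory"],
    )

    # One partition per group: key by Group_Id and send group g to partition g - 1
    keyed = data_df.rdd.keyBy(lambda row: row.Group_Id)
    partitioned = keyed.partitionBy(3, lambda key: key - 1)

    # glom() gathers each partition into a list, so this prints
    # the row count per partition
    print(partitioned.glom().map(len).collect())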

Yarn job is Stuck

2019-03-14 Thread dimitris plakas
Hello everyone, I have set up a 3-node Hadoop cluster according to this tutorial: https://linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/#run-yarn and I ran the YARN example (the one with the books) that is described in this tutorial in order to test whether everything

Apply Kmeans in partitions

2019-01-30 Thread dimitris plakas
Hello everyone, I have a dataframe which has 5040 rows, and these rows are split into 5 groups. So I have a column called "Group_Id" which marks every row with a value from 0-4 depending on which group the row belongs to. I am trying to split my dataframe into 5 partitions and apply KMeans
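A sketch of KMeans per group, using scikit-learn inside a grouped-map function; the feature columns x and y and the sample data are placeholders. applyInPandas is the Spark 3.x spelling; on Spark 2.4 the same idea is written as a pandas_udf with PandasUDFType.GROUPED_MAP:

    from sklearn.cluster import KMeans
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # placeholder data: Group_Id runs 0-4 as in the question
    df = spark.createDataFrame(
        [(g, float(i), float(i * 2)) for g in range(5) for i in range(10)],
        ["Group_Id", "x", "y"],
    )

    def kmeans_per_group(pdf):
        # pdf holds all rows of one Group_Id as a pandas DataFrame
        model = KMeans(n_clusters=2, n_init=10).fit(pdf[["x", "y"]])
        return pdf.assign(cluster=model.labels_.astype("int64"))

    result = df.groupBy("Group_Id").applyInPandas(
        kmeans_per_group,
        schema="Group_Id long, x double, y double, cluster long",
    )
    result.show()

Grouping by Group_Id sidesteps physical partitioning entirely: Spark shuffles each group to one worker and hands it to the function whole.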

Task - Id : Status Failed

2019-06-06 Thread dimitris plakas
will find the output that I get on my screen. Thank you in advance, Dimitris Plakas

19/06/06 23:46:20 INFO client.RMProxy: Connecting to ResourceManager at node-master/192.168.0.1:8032
19/06/06 23:46:22 INFO input.FileInputFormat: Total input files to process : 3
19/06/06 23:46:23 INFO