Task Id : Status Failed

2019-06-06 Thread dimitris plakas
Hello Everyone,

I am trying to set up a YARN cluster with three nodes (one master and two
workers).
I followed this tutorial :
https://linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/


I also tried to execute the YARN wordcount example at the end of this tutorial.
After executing hadoop-mapreduce-examples-2.8.5.jar I get Status : FAILED for
every Task Id, although the example appeared to finish without any error.

Does the FAILED status mean that an error occurred during the YARN job execution?
If so, could you explain what exactly this error is?

In the attachment you will find the output that I get on my screen.

Thank you in advance,
Dimitris Plakas
19/06/06 23:46:20 INFO client.RMProxy: Connecting to ResourceManager at node-master/192.168.0.1:8032
19/06/06 23:46:22 INFO input.FileInputFormat: Total input files to process : 3
19/06/06 23:46:23 INFO mapreduce.JobSubmitter: number of splits:3
19/06/06 23:46:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1559847675487_0010
19/06/06 23:46:24 INFO impl.YarnClientImpl: Submitted application application_1559847675487_0010
19/06/06 23:46:24 INFO mapreduce.Job: The url to track the job: http://node-master:8088/proxy/application_1559847675487_0010/
19/06/06 23:46:24 INFO mapreduce.Job: Running job: job_1559847675487_0010
19/06/06 23:46:38 INFO mapreduce.Job: Job job_1559847675487_0010 running in uber mode : false
19/06/06 23:46:38 INFO mapreduce.Job:  map 0% reduce 0%
19/06/06 23:46:48 INFO mapreduce.Job:  map 33% reduce 0%
19/06/06 23:46:55 INFO mapreduce.Job: Task Id : attempt_1559847675487_0010_m_02_0, Status : FAILED
19/06/06 23:46:55 INFO mapreduce.Job: Task Id : attempt_1559847675487_0010_m_01_0, Status : FAILED
19/06/06 23:47:04 INFO mapreduce.Job: Task Id : attempt_1559847675487_0010_r_00_0, Status : FAILED
19/06/06 23:47:05 INFO mapreduce.Job:  map 67% reduce 0%
19/06/06 23:47:11 INFO mapreduce.Job: Task Id : attempt_1559847675487_0010_m_01_1, Status : FAILED
19/06/06 23:47:21 INFO mapreduce.Job: Task Id : attempt_1559847675487_0010_r_00_1, Status : FAILED
19/06/06 23:47:25 INFO mapreduce.Job: Task Id : attempt_1559847675487_0010_m_01_2, Status : FAILED
19/06/06 23:47:44 INFO mapreduce.Job:  map 67% reduce 22%
19/06/06 23:47:45 INFO mapreduce.Job:  map 100% reduce 100%
19/06/06 23:47:46 INFO mapreduce.Job: Job job_1559847675487_0010 failed with state FAILED due to: Task failed task_1559847675487_0010_m_01
Job failed as tasks failed. failedMaps:1 failedReduces:0

19/06/06 23:47:46 INFO mapreduce.Job: Counters: 42
  File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=1078223
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=394942
    HDFS: Number of bytes written=0
    HDFS: Number of read operations=6
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=0
  Job Counters
    Failed map tasks=5
    Failed reduce tasks=2
    Killed map tasks=1
    Killed reduce tasks=1
    Launched map tasks=7
    Launched reduce tasks=3
    Other local map tasks=4
    Data-local map tasks=3
    Total time spent by all maps in occupied slots (ms)=359388
    Total time spent by all reduces in occupied slots (ms)=191720
    Total time spent by all map tasks (ms)=89847
    Total time spent by all reduce tasks (ms)=47930
    Total vcore-milliseconds taken by all map tasks=89847
    Total vcore-milliseconds taken by all reduce tasks=47930
    Total megabyte-milliseconds taken by all map tasks=46001664
    Total megabyte-milliseconds taken by all reduce tasks=24540160
  Map-Reduce Framework
    Map input records=2989
    Map output records=7432
    Map output bytes=746802
    Map output materialized bytes=762467
    Input split bytes=240
    Combine input records=7432
    Combine output records=7283
    Spilled Records=7283
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=400
    CPU time spent (ms)=5430
    Physical memory (bytes) snapshot=554012672
    Virtual memory (bytes) snapshot=3930853376
    Total committed heap usage (bytes)=408944640
  File Input Format Counters
    Bytes Read=394702


Yarn job is Stuck

2019-03-14 Thread dimitris plakas
Hello everyone,

I have set up a 3-node Hadoop cluster according to this tutorial:
https://linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/#run-yarn
and I ran the YARN example (the one with the books) described in this tutorial
in order to test whether everything works correctly. After executing the command
that runs the YARN job (e.g. yarn jar /hadoop-mapreduce-examples-2.8.1.jar
wordcount "books/*" output) I see that the job is running (screenshot2). But
after I execute the command yarn application -list I see that the job is stuck
in ACCEPTED status and that its progress is 0%.

Could you please help me with how to proceed? Is there any additional
configuration required for YARN?

My cluster has 3 nodes (one node-master and two workers). Their
specifications are:
Node-Master
CORES = 4
RAM = 8GB
DISK = 10GB

Each Worker has:
CORES = 4
RAM = 8GB
DISK = 20GB

Thank you in advance,


Apply Kmeans in partitions

2019-01-30 Thread dimitris plakas
Hello everyone,

I have a dataframe with 5040 rows, and these rows are split into 5 groups. So I
have a column called "Group_Id" which marks every row with a value from 0-4
depending on which group the row belongs to. I am trying to split my dataframe
into 5 partitions and apply KMeans to every partition. I have tried:

rdd=mydataframe.rdd.mapPartitions(function, True)
test = Kmeans.train(rdd, num_of_centers, "random")

but I get an error.

How can I apply KMeans to every partition?

Thank you in advance,
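A hedged sketch of one possible approach (an assumption about intent, not a definitive answer): pyspark.mllib's KMeans.train launches Spark jobs itself, so it cannot run inside mapPartitions() on the executors. One simple alternative is to loop over the 5 Group_Id values on the driver, filter the dataframe per group, and train one model per group. mydataframe and num_of_centers come from the question; the "features" column is a hypothetical placeholder for however the feature vector is extracted from each row.

from pyspark.mllib.clustering import KMeans

num_of_centers = 3  # placeholder value

models = {}
for gid in range(5):  # Group_Id values are 0-4
    # Keep only the rows of this group and extract their feature vectors.
    group_rdd = (mydataframe
                 .filter(mydataframe["Group_Id"] == gid)
                 .rdd
                 .map(lambda row: row["features"]))  # hypothetical feature column
    # Train one KMeans model per group on the driver.
    models[gid] = KMeans.train(group_rdd, num_of_centers,
                               initializationMode="random")

With only 5 groups, the per-group training jobs run one after another, which is usually acceptable; the per-group RDDs stay distributed, only the loop itself runs on the driver.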


Pyspark Partitioning

2018-10-04 Thread dimitris plakas
Hello everyone,

Here is an issue that I am facing in partitioning a dataframe.

I have a dataframe called data_df. It looks like this:

Group_Id | Object_Id | Trajectory
       1 | obj1      | Traj1
       2 | obj2      | Traj2
       1 | obj3      | Traj3
       3 | obj4      | Traj4
       2 | obj5      | Traj5

This dataframe has 5045 rows, where each row has a Group_Id value from 1 to 7,
and the number of rows per Group_Id is arbitrary.
I want to split the RDD produced from this dataframe into 7 partitions, one for
each Group_Id, and then apply mapPartitions() where I call the function
custom_func(). How can I create these partitions from this dataframe? Should I
first apply a group by (to create a grouped_df with 7 rows) and then call
partitioned_rdd = grouped_df.rdd.mapPartitions()?
Which is the optimal way to do it?

Thank you in advance
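A sketch of one way to do this (not necessarily the optimal one): key each row by Group_Id, use partitionBy() with a partitioner that sends each of the 7 groups to its own partition, then run custom_func with mapPartitions(). data_df and custom_func come from the question; custom_func is assumed here to accept an iterator of (Group_Id, Row) pairs.

num_groups = 7

# Key every row by its Group_Id so partitionBy() can route it.
keyed = data_df.rdd.map(lambda row: (row["Group_Id"], row))

# Group_Id values are 1..7, so a modulo partitioner gives one group per partition.
partitioned = keyed.partitionBy(num_groups, partitionFunc=lambda gid: gid % num_groups)

# Each partition now holds exactly one group; custom_func sees an iterator of
# (Group_Id, Row) pairs for that group.
result_rdd = partitioned.mapPartitions(custom_func)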


Pyspark Partitioning

2018-09-30 Thread dimitris plakas
Hello everyone,

I am trying to split a dataframe into partitions and I want to apply a custom
function on every partition. More precisely, I have a dataframe like the one
below:

Group_Id | Id  | Points
       1 | id1 | Point1
       2 | id2 | Point2

I want to have a partition for every Group_Id and apply a function defined by
me on every partition.
I have tried partitionBy('Group_Id').mapPartitions() but I receive an error.
Could you please advise me how to do it?
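A sketch of an alternative that avoids mapPartitions() altogether, assuming Spark 2.3+ with pyarrow installed: a grouped-map pandas UDF receives all rows of one Group_Id as a pandas DataFrame, which is effectively "one partition per group". The output schema and the per-group logic below are hypothetical placeholders; replace them with whatever the custom function actually returns. df is the dataframe from the question (Group_Id | Id | Points).

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, LongType

# Placeholder output schema: one row per group with a single computed value.
out_schema = StructType([
    StructField("Group_Id", LongType()),
    StructField("n_rows", LongType()),
])

@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def apply_custom(pdf):
    # pdf is a pandas DataFrame holding every row of one Group_Id.
    return pd.DataFrame({
        "Group_Id": [pdf["Group_Id"].iloc[0]],
        "n_rows": [len(pdf)],  # stand-in for the real per-group computation
    })

result = df.groupby("Group_Id").apply(apply_custom)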


Error in show()

2018-09-06 Thread dimitris plakas
Hello everyone, I am new to Pyspark and I am facing an issue. Let me explain
exactly what the problem is.

I have a dataframe and I apply a map() function on it:

dataframe2 = dataframe1.rdd.map(custom_function)
dataframe = sqlContext.createDataFrame(dataframe2)

When I call dataframe.show(30, True) it shows the result, but when I use
dataframe.show(60, True) I get the error. The error is in the attachment
Pyspark_Error.txt.

Could you please explain what this error is and how to get past it?

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[Stage 13:===>(23 + 8) / 47]
2018-09-06 22:16:07 ERROR PythonRunner:91 - Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Users\dimitris\PycharmProjects\untitled\venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 214, in main
  File "C:\Users\dimitris\PycharmProjects\untitled\venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\serializers.py", line 685, in read_int
    raise EOFError
EOFError

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:428)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.api.python.SerDeUtil$Auto
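A debugging sketch, under the assumption that custom_function raises on one of the rows that show(60) touches but show(30) does not; the EOFError / "Connection reset" pair above is just the JVM noticing that the Python worker died. Wrapping the function turns per-row failures into data so the offending rows become visible instead of crashing the worker. dataframe1 and custom_function are the names from the question.

def safe_custom_function(row):
    # Run the original function, but capture per-row failures instead of
    # letting them kill the Python worker.
    try:
        return ("ok", custom_function(row))
    except Exception as e:
        return ("error: %s" % e, row)

checked = dataframe1.rdd.map(safe_custom_function)

# Inspect only the rows that failed.
failures = checked.filter(lambda pair: pair[0] != "ok")
print(failures.take(10))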

Insert a pyspark dataframe in postgresql

2018-08-21 Thread dimitris plakas
Hello everyone, here is a case that I am facing.

I have a pyspark application whose last step is to create a pyspark dataframe
with two columns (column1, column2). This dataframe has only one row, and I
want this row to be inserted into a Postgres db table. In every run this row in
the dataframe may be different or the same.

After 10 runs I want to have 10 rows in my Postgres table, so I want to insert
that dataframe into the Postgres table on every run. What I have done up to now
is to use the code below, but it doesn't insert a new row after every run of my
pyspark application; it just overwrites the old row.

Here is my code:

test_write_df.write.mode('append').options(
    url='jdbc:postgresql://localhost:5432/Test_Db',
    dbtable='test_write_df',
    driver='org.postgresql.Driver',
    user='postgres',
    password='my_password')

Can you please advise me whether this is feasible and/or how to achieve it?

Thank you in advance
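A hedged sketch, based only on the snippet as quoted: .options(...) merely configures the DataFrameWriter, so the chain above never triggers a write until a terminal save() (or the write.jdbc() shortcut) is called with the jdbc format. With mode("append"), each run should add a new row rather than overwrite. The URL, table name, and credentials are the ones from the question.

(test_write_df.write
    .format("jdbc")
    .mode("append")
    .option("url", "jdbc:postgresql://localhost:5432/Test_Db")
    .option("dbtable", "test_write_df")
    .option("driver", "org.postgresql.Driver")
    .option("user", "postgres")
    .option("password", "my_password")
    .save())

# Equivalent shortcut:
# test_write_df.write.jdbc(
#     url="jdbc:postgresql://localhost:5432/Test_Db",
#     table="test_write_df",
#     mode="append",
#     properties={"user": "postgres", "password": "my_password",
#                 "driver": "org.postgresql.Driver"})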


DataTypes of an ArrayType

2018-07-11 Thread dimitris plakas
Hello everyone,

I am new to Pyspark and I would like to ask if there is any way to have a
dataframe column which is ArrayType and has a different DataType for each
element of the ArrayType. For example, to have something like:

StructType([StructField("Column_Name",
    ArrayType(ArrayType(FloatType(), FloatType(), DecimalType(), False), False),
    False)])

I want to have an ArrayType column with 2 elements as FloatType and 1 element
as DecimalType.

Thank you in advance
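ArrayType takes a single element type plus a containsNull flag, so it cannot mix FloatType and DecimalType elements directly. A sketch of the usual workaround, assuming the three values can be named fields: use an array of structs, where each struct holds two FloatType fields and one DecimalType field. The field names x, y, t are hypothetical.

from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, ArrayType,
                               FloatType, DecimalType)

spark = SparkSession.builder.getOrCreate()

# One "element": two floats plus one decimal, modelled as a struct.
element = StructType([
    StructField("x", FloatType(), False),
    StructField("y", FloatType(), False),
    StructField("t", DecimalType(10, 0), False),
])

schema = StructType([
    StructField("Column_Name", ArrayType(element, False), False),
])

df = spark.createDataFrame(
    [([(1.0, 2.0, Decimal(1531300000)), (3.0, 4.0, Decimal(1531300010))],)],
    schema)
df.printSchema()
df.show(truncate=False)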


Convert scientific notation DecimalType

2018-07-10 Thread dimitris plakas
Hello everyone,

I am new to Pyspark and I am facing a problem with casting some values to
DecimalType. To clarify my question I present an example.

I have a dataframe in which I store my data, which are some trajectories. The
dataframe looks like:

Id  | Trajectory
id1 | [ [x1, y1, t1], [x2, y2, t2], ... [xn, yn, tn] ]
id2 | [ [x1, y1, t1], [x2, y2, t2], ... [xn, yn, tn] ]

So for each Id I have a trajectory, after applying a group by on a previous
dataframe, where:
x = lon
y = lat
t = time (in seconds).

When I store the data in the dataframe, the time is displayed in scientific
notation. Is there any way to convert t to DecimalType(10,0)?

I tried to convert it at the beginning, but I get an error when I try to store
the trajectories, because the column Trajectory is
ArrayType(ArrayType(FloatType())).
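A hedged sketch of one way around this, assuming the time component can live in its own field: keep the trajectory as an array of structs so t can be DecimalType(10,0) while x and y stay FloatType, and convert the existing float-based trajectories with a UDF. The field names x, y, t and the helper to_points are hypothetical; Trajectory and df refer to the column and dataframe from the question.

from decimal import Decimal
from pyspark.sql.functions import udf
from pyspark.sql.types import (ArrayType, StructType, StructField,
                               FloatType, DecimalType)

# Target element type: x/y stay floats, t becomes a 10-digit decimal so it is
# no longer rendered in scientific notation.
point_type = StructType([
    StructField("x", FloatType()),
    StructField("y", FloatType()),
    StructField("t", DecimalType(10, 0)),
])

def to_points(traj):
    # traj is the original list of [x, y, t] float lists.
    return [(float(p[0]), float(p[1]), Decimal(int(p[2]))) for p in traj]

to_points_udf = udf(to_points, ArrayType(point_type))

# df is assumed to have the Id | Trajectory layout from the question.
converted = df.withColumn("Trajectory", to_points_udf("Trajectory"))
converted.show(truncate=False)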


Create an Empty dataframe

2018-06-30 Thread dimitris plakas
I am new to Pyspark and want to initialize a new, empty dataframe with
sqlContext with two columns ("Column1", "Column2"), and I want to append rows
to it dynamically in a for loop.
Is there any way to achieve this?

Thank you in advance.
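A minimal sketch, assuming the two columns can be strings (swap in the real types): create an empty dataframe from an explicit schema, and remember that dataframes are immutable, so "appending" in a loop means union(). When the rows are generated on the driver anyway, it is usually simpler and cheaper to collect them in a Python list and call createDataFrame once.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("Column1", StringType(), True),
    StructField("Column2", StringType(), True),
])

# An empty dataframe with the desired columns.
df = spark.createDataFrame([], schema)

# Appending dynamically: each iteration unions one more row onto df.
for i in range(3):
    row_df = spark.createDataFrame([("a%d" % i, "b%d" % i)], schema)
    df = df.union(row_df)

# Usually better: accumulate plain tuples and build the dataframe once.
rows = [("a%d" % i, "b%d" % i) for i in range(3)]
df_once = spark.createDataFrame(rows, schema)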


Connect to postgresql with pyspark

2018-04-29 Thread dimitris plakas
I am new to pyspark and I am learning it in order to complete my thesis project
at university.

I am trying to create a dataframe by reading from a PostgreSQL database table,
but I am facing a problem when I try to connect my pyspark application to the
PostgreSQL db server. Could you please explain the steps that are required in
order to have a successful connection with the database? I am using Python 2.7,
spark-2.3.0-bin-hadoop2.7, the PyCharm IDE and a Windows environment.

What I have done is launch a pyspark shell with --jars /path to postgresql jar/
and then run:

df = sqlContext.read.jdbc(url='jdbc:postgresql://localhost:port/[database]?user='username'&password='paswd', table='table name')
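A hedged sketch of the usual recipe, with placeholder jar path, database name, table, and credentials: make the PostgreSQL JDBC driver jar available to Spark (either with --jars on the command line or via spark.jars), then read through the jdbc data source with the connection details passed as separate options rather than embedded in the URL string.

from pyspark.sql import SparkSession

# Placeholder path to the driver jar; on Windows it could also be passed as
#   pyspark --jars C:\path\to\postgresql-42.2.5.jar
spark = (SparkSession.builder
         .appName("postgres-read")
         .config("spark.jars", "C:\\path\\to\\postgresql-42.2.5.jar")
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydatabase")  # placeholder db
      .option("dbtable", "my_table")                                 # placeholder table
      .option("user", "username")
      .option("password", "paswd")
      .option("driver", "org.postgresql.Driver")
      .load())

df.show(5)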

