Task Id : Status FAILED
Hello everyone,

I am trying to set up a YARN cluster with three nodes (one master and two workers). I followed this tutorial: https://linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/ and I also tried to execute the YARN example at the end of the tutorial (the wordcount). After executing hadoop-mapreduce-examples-2.8.5.jar I get Status : FAILED for several Task Ids, even though the example finished without any error. Does the FAILED status mean that an error occurred during the YARN job execution? If so, could you explain to me what exactly this error is? In the attachment you will find the output that I get on my screen.

Thank you in advance,
Dimitris Plakas

19/06/06 23:46:20 INFO client.RMProxy: Connecting to ResourceManager at node-master/192.168.0.1:8032
19/06/06 23:46:22 INFO input.FileInputFormat: Total input files to process : 3
19/06/06 23:46:23 INFO mapreduce.JobSubmitter: number of splits:3
19/06/06 23:46:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1559847675487_0010
19/06/06 23:46:24 INFO impl.YarnClientImpl: Submitted application application_1559847675487_0010
19/06/06 23:46:24 INFO mapreduce.Job: The url to track the job: http://node-master:8088/proxy/application_1559847675487_0010/
19/06/06 23:46:24 INFO mapreduce.Job: Running job: job_1559847675487_0010
19/06/06 23:46:38 INFO mapreduce.Job: Job job_1559847675487_0010 running in uber mode : false
19/06/06 23:46:38 INFO mapreduce.Job: map 0% reduce 0%
19/06/06 23:46:48 INFO mapreduce.Job: map 33% reduce 0%
19/06/06 23:46:55 INFO mapreduce.Job: Task Id : attempt_1559847675487_0010_m_02_0, Status : FAILED
19/06/06 23:46:55 INFO mapreduce.Job: Task Id : attempt_1559847675487_0010_m_01_0, Status : FAILED
19/06/06 23:47:04 INFO mapreduce.Job: Task Id : attempt_1559847675487_0010_r_00_0, Status : FAILED
19/06/06 23:47:05 INFO mapreduce.Job: map 67% reduce 0%
19/06/06 23:47:11 INFO mapreduce.Job: Task Id : attempt_1559847675487_0010_m_01_1, Status : FAILED
19/06/06 23:47:21 INFO mapreduce.Job: Task Id : attempt_1559847675487_0010_r_00_1, Status : FAILED
19/06/06 23:47:25 INFO mapreduce.Job: Task Id : attempt_1559847675487_0010_m_01_2, Status : FAILED
19/06/06 23:47:44 INFO mapreduce.Job: map 67% reduce 22%
19/06/06 23:47:45 INFO mapreduce.Job: map 100% reduce 100%
19/06/06 23:47:46 INFO mapreduce.Job: Job job_1559847675487_0010 failed with state FAILED due to: Task failed task_1559847675487_0010_m_01
Job failed as tasks failed.
failedMaps:1 failedReduces:0
19/06/06 23:47:46 INFO mapreduce.Job: Counters: 42
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=1078223
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=394942
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Failed map tasks=5
        Failed reduce tasks=2
        Killed map tasks=1
        Killed reduce tasks=1
        Launched map tasks=7
        Launched reduce tasks=3
        Other local map tasks=4
        Data-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=359388
        Total time spent by all reduces in occupied slots (ms)=191720
        Total time spent by all map tasks (ms)=89847
        Total time spent by all reduce tasks (ms)=47930
        Total vcore-milliseconds taken by all map tasks=89847
        Total vcore-milliseconds taken by all reduce tasks=47930
        Total megabyte-milliseconds taken by all map tasks=46001664
        Total megabyte-milliseconds taken by all reduce tasks=24540160
    Map-Reduce Framework
        Map input records=2989
        Map output records=7432
        Map output bytes=746802
        Map output materialized bytes=762467
        Input split bytes=240
        Combine input records=7432
        Combine output records=7283
        Spilled Records=7283
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=400
        CPU time spent (ms)=5430
        Physical memory (bytes) snapshot=554012672
        Virtual memory (bytes) snapshot=3930853376
        Total committed heap usage (bytes)=408944640
    File Input Format Counters
        Bytes Read=394702
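The per-task error is not shown in this job-level output; it lives in the container logs of the failed attempts. Assuming log aggregation is enabled on the cluster, one way to pull them is the yarn CLI, using the application id printed in the output above:

    yarn logs -applicationId application_1559847675487_0010

This dumps the stdout/stderr of every container, including the failed map and reduce attempts, which normally names the actual cause (for example an out-of-memory kill or a host-resolution problem). The tracking URL printed by the job (http://node-master:8088/proxy/...) shows the same per-attempt logs in the ResourceManager UI.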
YARN Job Is Stuck
Hello everyone,

I have set up a 3-node Hadoop cluster according to this tutorial: https://linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/#run-yarn and I ran the YARN example described in it (the one with the books) in order to test that everything works correctly. After executing the command that submits the YARN job (e.g. yarn jar /hadoop-mapreduce-examples-2.8.1.jar wordcount "books/*" output) I receive confirmation that the job is running (screenshot2). But when I execute yarn application -list I see that the job is stuck in ACCEPTED status and that its progress is 0%. Could you please help me on how to proceed? Is there any more configuration required for YARN?

My cluster has 3 nodes (one node-master and two workers). Their specifications are:

Node-Master: CORES = 4, RAM = 8GB, DISK = 10GB
Each Worker: CORES = 4, RAM = 8GB, DISK = 20GB

Thank you in advance,
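A common cause of a job sitting in ACCEPTED at 0% on a small cluster is that YARN has not been told how much memory the NodeManagers may hand out, so the scheduler never finds room for the ApplicationMaster container. A minimal yarn-site.xml sketch for 8 GB workers (the property names are standard YARN settings; the values are assumptions to adapt, not taken from the tutorial):

    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>6144</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>6144</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
    </property>

After editing the file on every node, restart YARN (stop-yarn.sh, then start-yarn.sh) and resubmit the job.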
Apply KMeans in Partitions
Hello everyone,

I have a dataframe with 5040 rows, where the rows are split into 5 groups. A column called "Group_Id" marks every row with a value from 0-4 according to the group the row belongs to. I am trying to split my dataframe into 5 partitions and apply KMeans to every partition. I have tried:

    rdd = mydataframe.rdd.mapPartitions(function, True)
    test = KMeans.train(rdd, num_of_centers, "random")

but I get an error. How can I apply KMeans to every partition?

Thank you in advance,
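Two things to note about the snippet above: MLlib's KMeans.train launches its own distributed job, so it cannot be fed the output of mapPartitions this way, and the third positional argument of train is maxIterations, not the initialization mode. A sketch that trains one model per group instead (it assumes a numeric feature column named "features"; that column name is an assumption, num_of_centers and the 0-4 group range come from the question):

    from pyspark.mllib.clustering import KMeans

    models = {}
    for gid in range(5):
        # Keep only the rows of this group and extract the feature vectors.
        group_rdd = (mydataframe
                     .filter(mydataframe.Group_Id == gid)
                     .rdd
                     .map(lambda row: row.features))
        models[gid] = KMeans.train(group_rdd, num_of_centers,
                                   initializationMode="random")

Each filtered RDD is still distributed, so every per-group training run can use the whole cluster rather than a single partition.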
PySpark Partitioning
Hello everyone,

Here is an issue that I am facing in partitioning a dataframe. I have a dataframe called data_df. It looks like:

    Group_Id | Object_Id | Trajectory
    1        | obj1      | Traj1
    2        | obj2      | Traj2
    1        | obj3      | Traj3
    3        | obj4      | Traj4
    2        | obj5      | Traj5

This dataframe has 5045 rows, where each row has a Group_Id value from 1 to 7, and the number of rows per Group_Id is arbitrary. I want to split the RDD produced from this dataframe into 7 partitions, one for each Group_Id, and then apply mapPartitions() where I call a function custom_func(). How can I create these partitions from this dataframe? Should I first apply a group by (creating grouped_df) in order to get a dataframe with 7 rows and then call partitioned_rdd = grouped_df.rdd.mapPartitions()? Which is the optimal way to do it?

Thank you in advance
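A sketch of one way to get exactly one group per partition: key the rows by Group_Id and use partitionBy with a custom partition function, so group ids 1-7 map onto partitions 0-6 without hash collisions (data_df, custom_func, and the group-id range come from the question; everything else is an assumption):

    num_groups = 7

    # Key each row by its group id; partitionBy requires a (key, value) RDD.
    keyed = data_df.rdd.map(lambda row: (row.Group_Id, row))

    # Group id 1..7 -> partition 0..6, exactly one group per partition.
    partitioned = keyed.partitionBy(num_groups, lambda gid: gid - 1)

    # custom_func receives an iterator of (Group_Id, Row) pairs here.
    result = partitioned.mapPartitions(custom_func)

A plain repartition(7, "Group_Id") is not enough on its own, because hashing 7 distinct keys into 7 partitions can put two groups into the same partition and leave another empty.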
PySpark Partitioning
Hello everyone,

I am trying to split a dataframe into partitions and apply a custom function on every partition. More precisely, I have a dataframe like the one below:

    Group_Id | Id  | Points
    1        | id1 | Point1
    2        | id2 | Point2

I want to have a partition for every Group_Id and to apply a function defined by me on every partition. I have tried partitionBy('Group_Id').mapPartitions() but I receive an error. Could you please advise me how to do it?
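partitionBy on an RDD expects a number of partitions and a (key, value) RDD, which is a likely source of the error. Besides the keyed-RDD route shown for the previous question, Spark 2.3+ offers a grouped-map pandas UDF that applies a Python function to all rows of each Group_Id at once. A sketch, assuming the dataframe is called df (the result schema and the placeholder body are assumptions to replace with the real computation):

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # The declared schema must match what the function returns.
    @pandas_udf("Group_Id long, result double", PandasUDFType.GROUPED_MAP)
    def apply_per_group(pdf):
        # pdf is a pandas DataFrame holding every row of one Group_Id.
        # ... compute something per group here ...
        return pdf[["Group_Id"]].assign(result=0.0)

    out = df.groupby("Group_Id").apply(apply_per_group)

This needs pyarrow installed on the workers, and each group must fit in the memory of a single executor.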
Error in show()
Hello everyone,

I am new to PySpark and I am facing an issue. Let me explain what exactly the problem is. I have a dataframe and I apply a map() function on it:

    dataframe2 = dataframe1.rdd.map(custom_function)
    dataframe = sqlContext.createDataFrame(dataframe2)

When I call dataframe.show(30, True) it shows the result, but when I use dataframe.show(60, True) I get the error. The error is in the attachment Pyspark_Error.txt. Could you please explain to me what this error is and how to get past it?

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[Stage 13:===>(23 + 8) / 47]2018-09-06 22:16:07 ERROR PythonRunner:91 - Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Users\dimitris\PycharmProjects\untitled\venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\worker.py", line 214, in main
  File "C:\Users\dimitris\PycharmProjects\untitled\venv\Lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\serializers.py", line 685, in read_int
    raise EOFError
EOFError
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:428)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.api.python.SerDeUtil$Auto
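The EOFError and "Python worker exited unexpectedly (crashed)" mean the Python worker process died mid-task, usually because custom_function raised on some record that show(60) reaches but show(30) does not, or because the worker ran out of memory. A hedged way to locate a bad record (safe_custom_function is a hypothetical wrapper, not part of the original code):

    def safe_custom_function(row):
        # Catch per-record failures so one bad row cannot kill the
        # worker; None marks the offending rows for inspection.
        try:
            return custom_function(row)
        except Exception:
            return None

    dataframe2 = (dataframe1.rdd
                  .map(safe_custom_function)
                  .filter(lambda r: r is not None))

If no record fails, the remaining suspect is worker memory, in which case reducing partition sizes (e.g. dataframe1.rdd.repartition(200)) is worth a try.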
Insert a PySpark dataframe into PostgreSQL
Hello everyone,

Here is a case that I am facing: I have a PySpark application whose last step is to create a dataframe with two columns (column1, column2). This dataframe has only one row, and I want that row to be inserted into a PostgreSQL table. On every run this row may be different or the same, and after 10 runs I want to have 10 rows in my table, so I want to insert the dataframe into the table on every run. What I have done up to now is to use the code below, but it doesn't insert the new row after every run of my application; it just overwrites the old row. Here is my code:

    test_write_df.write.mode('append').options(url='jdbc:postgresql://localhost:5432/Test_Db', dbtable='test_write_df', driver='org.postgresql.Driver', user='postgres', password='my_password')

Could you please advise me whether this is feasible and how to achieve it?

Thank you in advance
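One observation: the snippet above configures the writer but never triggers the write, since neither .format('jdbc') nor .save() is called. A sketch of the complete call chain (URL, table name, and credentials copied from the question):

    (test_write_df.write
        .format("jdbc")
        .mode("append")
        .option("url", "jdbc:postgresql://localhost:5432/Test_Db")
        .option("dbtable", "test_write_df")
        .option("driver", "org.postgresql.Driver")
        .option("user", "postgres")
        .option("password", "my_password")
        .save())

With mode("append"), every run should add its row to the existing table instead of replacing it.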
DataTypes of an ArrayType
Hello everyone,

I am new to PySpark and I would like to ask if there is any way to have a dataframe column which is ArrayType and has a different DataType for each element of the array. For example, to have something like:

    StructType([StructField("Column_Name", ArrayType(ArrayType(FloatType(), FloatType(), DecimalType(), False), False), False)])

I want to have an ArrayType column with 2 elements as FloatType and 1 element as DecimalType.

Thank you in advance
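ArrayType takes a single element type, so the constructor above is not valid; mixed per-element types are usually modeled with a StructType element instead. A sketch (the field names x, y, t and the decimal precision are assumptions):

    from pyspark.sql.types import (StructType, StructField, ArrayType,
                                   FloatType, DecimalType)

    # Each array element is a struct of (float, float, decimal).
    schema = StructType([
        StructField("Column_Name", ArrayType(
            StructType([
                StructField("x", FloatType(), False),
                StructField("y", FloatType(), False),
                StructField("t", DecimalType(10, 0), False),
            ]),
            False), False)
    ])

Rows then carry tuples rather than lists for the elements, e.g. [(1.0, 2.0, Decimal(3)), ...].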
Convert scientific notation to DecimalType
Hello everyone,

I am new to PySpark and I am facing a problem casting some values to DecimalType. To clarify my question I present an example. I have a dataframe in which I store my data, which are trajectories. The dataframe looks like:

    Id  | Trajectory
    id1 | [ [x1, y1, t1], [x2, y2, t2], ... [xn, yn, tn] ]
    id2 | [ [x1, y1, t1], [x2, y2, t2], ... [xn, yn, tn] ]

So for each Id I have a trajectory, produced by applying a group by on a previous dataframe, where x = lon, y = lat, and t = time (in seconds). When I store the data in the dataframe, the time is displayed in scientific notation. Is there any way to convert t to DecimalType(10,0)? I tried to convert it at the beginning, but I get an error when I try to store the trajectories, because the Trajectory column is ArrayType(ArrayType(FloatType())).
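A sketch of the cast done before the group by, while t is still a flat column (points_df and the column name are assumptions based on the description): casting there lets the collected array carry decimals instead of floats, at the cost of changing the Trajectory element type away from FloatType:

    from pyspark.sql.functions import col
    from pyspark.sql.types import DecimalType

    # Cast t to DecimalType(10, 0) while it is still a scalar column,
    # before the rows are grouped into the Trajectory array.
    points_df = points_df.withColumn("t", col("t").cast(DecimalType(10, 0)))

Note that x, y, t then no longer share one type, so the inner array would need to become a struct (as in the previous question) or all three columns would need the same decimal type.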
Create an Empty dataframe
I am new to PySpark and want to initialize a new empty dataframe with sqlContext with two columns ("Column1", "Column2"), and I want to append rows to it dynamically in a for loop. Is there any way to achieve this? Thank you in advance.
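A sketch of one way to do it (the StringType columns and the my_rows input are assumptions):

    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField("Column1", StringType(), True),
        StructField("Column2", StringType(), True),
    ])

    # An empty dataframe with the desired schema.
    df = sqlContext.createDataFrame([], schema)

    # "Appending" means unioning a one-row dataframe each time,
    # since dataframes are immutable.
    for value1, value2 in my_rows:
        df = df.union(sqlContext.createDataFrame([(value1, value2)], schema))

Because each union extends the query lineage, a long loop gets slow; collecting the tuples in a Python list and calling createDataFrame once at the end is usually the better design.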
Connect to PostgreSQL with PySpark
I am new to PySpark and I am learning it in order to complete my thesis project at university. I am trying to create a dataframe by reading from a PostgreSQL database table, but I am facing a problem when I try to connect my PySpark application to the PostgreSQL server. Could you please explain the steps that are required in order to have a successful connection to the database? I am using Python 2.7, spark-2.3.0-bin-hadoop2.7, the PyCharm IDE, and a Windows environment. What I have done is launch a pyspark shell with --jars /path to postgresql jar/ and then:

    df = sqlContext.read.jdbc(url='jdbc:postgresql://localhost:port/[database]?user='username'&password='paswd', table='table name')
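A sketch of a connection through the generic JDBC reader (host, port, database, table, and credentials are placeholders; the jar must also be visible to the driver, e.g. via --driver-class-path):

    df = (sqlContext.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://localhost:5432/mydatabase")
          .option("dbtable", "mytable")
          .option("user", "username")
          .option("password", "password")
          .option("driver", "org.postgresql.Driver")
          .load())

Passing the credentials as options, rather than embedding them in the URL string, avoids the quoting problems visible in the snippet above.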