[ https://issues.apache.org/jira/browse/SPARK-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Min Li updated SPARK-1967:
--------------------------

Description:

I was trying to use the parallelize method to create an RDD, from Java. It is a simple wordcount program, except that I first read the input into memory and then use parallelize to create the RDD, rather than the textFile method used in the given example.

Pseudocode:

JavaSparkContext ctx = new JavaSparkContext($SparkMasterURL, $NAME, $SparkHome, $jars);
List<String> input = #read lines from the input file into an ArrayList<String>
JavaRDD lines = ctx.parallelize(input); //followed by wordcount ---- this is not working
JavaRDD lines = ctx.textFile(file); //followed by wordcount ---- this is working

The log is:

14/05/29 16:18:43 INFO Slf4jLogger: Slf4jLogger started
14/05/29 16:18:43 INFO Remoting: Starting remoting
14/05/29 16:18:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@spark:55224]
14/05/29 16:18:43 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@spark:55224]
14/05/29 16:18:43 INFO SparkEnv: Registering BlockManagerMaster
14/05/29 16:18:43 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140529161843-836a
14/05/29 16:18:43 INFO MemoryStore: MemoryStore started with capacity 1056.0 MB.
14/05/29 16:18:43 INFO ConnectionManager: Bound socket to port 42942 with id = ConnectionManagerId(spark,42942)
14/05/29 16:18:43 INFO BlockManagerMaster: Trying to register BlockManager
14/05/29 16:18:43 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager spark:42942 with 1056.0 MB RAM
14/05/29 16:18:43 INFO BlockManagerMaster: Registered BlockManager
14/05/29 16:18:43 INFO HttpServer: Starting HTTP Server
14/05/29 16:18:43 INFO HttpBroadcast: Broadcast server started at http://10.227.119.185:43522
14/05/29 16:18:43 INFO SparkEnv: Registering MapOutputTracker
14/05/29 16:18:43 INFO HttpFileServer: HTTP File server directory is /tmp/spark-3704a621-789c-4d97-b1fc-9654236dba3e
14/05/29 16:18:43 INFO HttpServer: Starting HTTP Server
14/05/29 16:18:43 INFO SparkUI: Started Spark Web UI at http://spark:4040
14/05/29 16:18:44 INFO SparkContext: Added JAR /home/maxmin/tmp/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar at http://10.227.119.185:55286/jars/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar with timestamp 1401394724045
14/05/29 16:18:44 INFO AppClient$ClientActor: Connecting to master spark://spark:7077...
14/05/29 16:18:44 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140529161844-0001
14/05/29 16:18:44 INFO AppClient$ClientActor: Executor added: app-20140529161844-0001/0 on worker-20140529155406-spark-59658 (spark:59658) with 8 cores

The app hangs here forever, and neither spark:8080 nor spark:4040 shows anything unusual. The Spark Stages page shows reduceByKey as the active stage, with tasks Succeeded/Total at 0/2. I have also tried calling lines.count directly after parallelize, and the app gets stuck at the count stage. I have also tried parallelizing a static list of strings; the app still hangs, but the Stages page shows nothing active, and the log is similar. I used spark-0.9.1 with the default spark-env.sh. The slaves file contains only one host.
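For reference, the wordcount pipeline that follows the parallelize call can be sketched in plain Java 8 streams, with no Spark dependency, so the intended logic is unambiguous; the corresponding Spark RDD operations (flatMap, mapToPair, reduceByKey) are noted in comments. This is an illustrative sketch of the logic described above, not the reporter's actual code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Equivalent Spark pipeline on a JavaRDD<String> lines:
    //   lines.flatMap(s -> Arrays.asList(s.split("\\s+")))   // split lines into words
    //        .mapToPair(w -> new Tuple2<>(w, 1))             // pair each word with 1
    //        .reduceByKey((a, b) -> a + b)                   // sum counts per word
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+"))) // flatMap step
                .filter(w -> !w.isEmpty())                          // drop empty tokens
                .collect(Collectors.groupingBy(                     // reduceByKey step
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("to be or not to be", "be");
        Map<String, Long> counts = wordCount(input);
        System.out.println(counts.get("be")); // prints 3
        System.out.println(counts.get("to")); // prints 2
    }
}
```

With parallelize, this same input list would be built on the driver and shipped to the executors, whereas textFile reads the data on the workers; the logic after RDD creation is identical in both cases.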
I used Maven to compile a fat jar with Spark specified as provided, and modified the run-example script to submit the jar.
> Using parallelize method to create RDD, wordcount app just hanging there
> without errors or warnings
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-1967
>                 URL: https://issues.apache.org/jira/browse/SPARK-1967
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>         Environment: Ubuntu-12.04, single machine spark standalone, 8 core,
> 8G mem, spark 0.9.1, java-1.7
>            Reporter: Min Li
>

--
This message was sent by Atlassian JIRA
(v6.2#6252)