[ https://issues.apache.org/jira/browse/SPARK-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Min Li updated SPARK-1967:
--------------------------

    Description: 
I was trying to use the parallelize method to create an RDD, in Java. It's a 
simple wordcount program, except that I first read the input into memory and 
then use parallelize to create the RDD, rather than the textFile method used 
in the bundled example.
Pseudo code:
JavaSparkContext ctx = new JavaSparkContext($SparkMasterURL, $NAME, $SparkHome, 
$jars);
List<String> input = ...; // read lines from the input file into an ArrayList<String>
JavaRDD<String> lines = ctx.parallelize(input);
// followed by wordcount
---- the above does not work.
JavaRDD<String> lines = ctx.textFile(file);
// followed by wordcount
---- this works.
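For completeness, a self-contained version of the failing path, modeled on the
JavaWordCount example that ships with 0.9 (the class name and literal input
values are illustrative, not the exact code I ran):

import java.util.Arrays;
import java.util.List;

import scala.Tuple2;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

public class ParallelizeWordCount {
    public static void main(String[] args) {
        // Same constructor shape as the pseudo code above; master URL,
        // Spark home, and jar path are passed in on the command line.
        JavaSparkContext ctx = new JavaSparkContext(args[0], "ParallelizeWordCount",
                args[1], new String[] { args[2] });

        // Stands in for reading the input file into memory.
        List<String> input = Arrays.asList("to be or not to be",
                "that is the question");

        // Failing path: build the RDD from the in-memory list.
        JavaRDD<String> lines = ctx.parallelize(input);

        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) {
                return Arrays.asList(line.split(" "));
            }
        });

        JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) {
                return a + b;
            }
        });

        // Never returns in my setup; with ctx.textFile(...) instead of
        // parallelize(...), the same pipeline completes.
        for (Tuple2<String, Integer> t : counts.collect()) {
            System.out.println(t._1() + ": " + t._2());
        }
    }
}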

The log is:
14/05/29 16:18:43 INFO Slf4jLogger: Slf4jLogger started
14/05/29 16:18:43 INFO Remoting: Starting remoting
14/05/29 16:18:43 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://spark@spark:55224]
14/05/29 16:18:43 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://spark@spark:55224]
14/05/29 16:18:43 INFO SparkEnv: Registering BlockManagerMaster
14/05/29 16:18:43 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20140529161843-836a
14/05/29 16:18:43 INFO MemoryStore: MemoryStore started with capacity 1056.0 MB.
14/05/29 16:18:43 INFO ConnectionManager: Bound socket to port 42942 with id = 
ConnectionManagerId(spark,42942)
14/05/29 16:18:43 INFO BlockManagerMaster: Trying to register BlockManager
14/05/29 16:18:43 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
block manager spark:42942 with 1056.0 MB RAM
14/05/29 16:18:43 INFO BlockManagerMaster: Registered BlockManager
14/05/29 16:18:43 INFO HttpServer: Starting HTTP Server
14/05/29 16:18:43 INFO HttpBroadcast: Broadcast server started at 
http://10.227.119.185:43522
14/05/29 16:18:43 INFO SparkEnv: Registering MapOutputTracker
14/05/29 16:18:43 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-3704a621-789c-4d97-b1fc-9654236dba3e
14/05/29 16:18:43 INFO HttpServer: Starting HTTP Server
14/05/29 16:18:43 INFO SparkUI: Started Spark Web UI at http://spark:4040
14/05/29 16:18:44 INFO SparkContext: Added JAR 
/home/maxmin/tmp/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar at 
http://10.227.119.185:55286/jars/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar
 with timestamp 1401394724045
14/05/29 16:18:44 INFO AppClient$ClientActor: Connecting to master 
spark://spark:7077...
14/05/29 16:18:44 INFO SparkDeploySchedulerBackend: Connected to Spark cluster 
with app ID app-20140529161844-0001
14/05/29 16:18:44 INFO AppClient$ClientActor: Executor added: 
app-20140529161844-0001/0 on worker-20140529155406-spark-59658 (spark:59658) 
with 8 cores

The app hangs here forever, and spark:8080 and spark:4040 show nothing 
unusual. The Spark Stages page lists reduceByKey as the active stage, with 
tasks Succeeded/Total at 0/2. I've also tried calling lines.count directly 
after parallelize, and the app gets stuck at the count stage instead.

I've also tried passing a statically defined list of strings to parallelize to 
create the RDD. The app still hangs, but this time the Stages page shows 
nothing active. The log is similar.
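In pseudo code, that variant was essentially (literal values illustrative):

// Same context setup as above; input is a hard-coded list, not file contents.
JavaRDD<String> lines = ctx.parallelize(Arrays.asList("to be or not to be",
        "that is the question"));
// followed by the same wordcount pipeline -- still hangs, no active stage shown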

I'm using spark-0.9.1 with the default spark-env.sh, and the slaves file 
contains only one host. I used Maven to build a fat jar with Spark declared as 
a provided dependency, and I modified the run-example script to submit the jar.



> Using the parallelize method to create an RDD, the wordcount app just hangs 
> without errors or warnings
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-1967
>                 URL: https://issues.apache.org/jira/browse/SPARK-1967
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>         Environment: Ubuntu 12.04, single-machine Spark standalone, 8 cores, 
> 8 GB memory, Spark 0.9.1, Java 1.7
>            Reporter: Min Li
>


