Re: Read Time from a remote data source
First, Spark worker not have the ability to compute.In fact,executor is responsible for computation. Executor running tasks is distributed by driver. Each Task just read some section of data in normal, but the stage have only one partition. IF your operators not contains the operator that will pull middle result from each task, like collect or show,driver will not store any data. Each Executor not store the end result in memory by default, unless your operator contains the operator that cache data to memory, like cache or persist. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Read Time from a remote data source
I am running a model where the workers should not have the data stored in them. They are only for execution purpose. The other cluster (its just a single node) which I am receiving data from is just acting as a file server, for which I could have used any other way like nfs or ftp. So I went with hdfs so that it would not have to worry about partitioning of data and also it does not effect my experiment. So I just had this question that does spark worker read all the data before computation once its first task start, and then distribute it among the workers memory or do they read it chunk by chunk, by each worker and then store the end result in memory to send the final result. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Read Time from a remote data source
You said your hdfs cluster and spark cluster is running on different cluster.This is not a good idea,because you should consider data locality.Your spark node need config hdfs client configuration. Spark Job is composed of stages,each stage have one or more partitions。Parallelism of job decided by these partitions. Shuffle process is decided by your operator,like reduceByKey、repartition、sortBy and so on. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Read Time from a remote data source
Hi, I am new to spark. I am running a hdfs file system on a remote cluster whereas my spark workers are on another cluster. When my textFile RDD gets executed, does spark worker read from the file according to hdfs partitions task by task, or do they read it once when the blockmanager sets after the start of first task and distributes it among the memory of spark cluster? I have this question because I have a situation where, when I have only one worker executing a job it shows less run time per task (shown in history server) then when I have two workers executing the same job in parallel. Even though the total duration is almost the same. I am running a simple grep application and no shuffles within the cluster. Text file is on a remote hdfs cluster and is of 813MB distributed into 7 chunks of 128MB, last chunk is left over size. Thanks -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org