Re: Read Time from a remote data source

2018-12-19 Thread Jiaan Geng
First, a Spark worker does not itself perform computation; the executor is responsible for computation. The tasks an executor runs are distributed by the driver. Each task normally reads only one section of the data, unless the stage has only one partition. If your operators do not contain an operator that will
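A minimal sketch of the driver/executor split described above, assuming a running Spark cluster; the HDFS path and app name are placeholders:

```scala
// Sketch only: "hdfs://data-node:9000/logs.txt" is a placeholder path.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-demo").getOrCreate()
val sc = spark.sparkContext

// textFile splits the input into partitions; the driver schedules one
// task per partition onto the executors, so each task reads only its
// own section of the file.
val rdd = sc.textFile("hdfs://data-node:9000/logs.txt")
println(rdd.getNumPartitions) // number of tasks in the read stage

// mapPartitionsWithIndex makes the per-task sections visible: each
// task sees only the records of its own partition.
val perTaskCounts = rdd.mapPartitionsWithIndex { (idx, iter) =>
  Iterator((idx, iter.size))
}.collect()
```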

Re: Read Time from a remote data source

2018-12-19 Thread swastik mittal
I am running a model where the workers should not have the data stored on them. They are only for execution purposes. The other cluster (it's just a single node) which I am receiving data from is just acting as a file server, for which I could have used any other way like NFS or FTP. So I went with

Re: Read Time from a remote data source

2018-12-18 Thread jiaan.geng
You said your HDFS cluster and Spark cluster are running as different clusters. This is not a good idea, because you should consider data locality. Your Spark nodes need the HDFS client configuration. A Spark job is composed of stages, and each stage has one or more partitions. The parallelism of a job is decided by
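One way to give the Spark nodes the HDFS client configuration mentioned above is to pass Hadoop settings through Spark's `spark.hadoop.` prefix (or to point `HADOOP_CONF_DIR` at a directory containing `core-site.xml` and `hdfs-site.xml`). A sketch, with a placeholder namenode host and port:

```scala
// Sketch: configuring a Spark application to talk to a remote HDFS
// cluster. "remote-namenode:8020" is a placeholder address.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("remote-hdfs")
  // Any Hadoop client property can be forwarded with the
  // "spark.hadoop." prefix; here we set the default filesystem.
  .config("spark.hadoop.fs.defaultFS", "hdfs://remote-namenode:8020")
  .getOrCreate()

// With fs.defaultFS set, a bare path resolves against the remote HDFS.
val rdd = spark.sparkContext.textFile("/data/input.txt")
```

Note that this only makes the remote cluster reachable; it does not restore data locality, since no executor runs on the HDFS nodes.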

Read Time from a remote data source

2018-12-18 Thread swastik mittal
Hi, I am new to Spark. I am running an HDFS file system on a remote cluster, whereas my Spark workers are on another cluster. When my textFile RDD gets executed, do the Spark workers read from the file according to HDFS partitions, task by task, or do they read it once when the blockmanager sets after
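For the question above, a sketch of how `textFile` behaves, assuming a `SparkContext` named `sc` and a placeholder HDFS path: by default it plans roughly one partition per HDFS block (subject to `minPartitions`), and each partition is read lazily by the task that processes it, not fetched in one shot up front.

```scala
// Sketch: "hdfs://remote-nn:8020/data/big.txt" is a placeholder path.
// minPartitions is a lower bound on the number of read tasks.
val rdd = sc.textFile("hdfs://remote-nn:8020/data/big.txt", minPartitions = 8)

// Nothing is read yet; the read happens task by task when an action runs.
val totalChars = rdd.map(_.length.toLong).sum()
```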