Thanks Marcelo. Let me spin this toward a parallel trajectory then, as the title change implies. I will read further into some of the articles at https://spark.apache.org/research.html, but my basic understanding is that Spark keeps data in memory: it pulls inputs from HDFS and, at most, writes the final output of a job back to it, rather than depositing the output of each intermediate step into HDFS files as Hadoop MapReduce typically does. Is that right?
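To make that concrete, here is a minimal sketch of such a pipeline (paths, master URL, and app name are illustrative, not from this thread; assumes the Spark dependency on the classpath): only the input read and the final write touch HDFS, while intermediate results stay with the executors.

```scala
// Minimal sketch of the in-memory pipeline described above.
// Paths, master URL, and app name are hypothetical examples.
import org.apache.spark.{SparkConf, SparkContext}

object InMemoryPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("in-memory-pipeline").setMaster("spark://master:7077"))

    val lines  = sc.textFile("hdfs:///input/logs")        // read from HDFS
    val errors = lines.filter(_.contains("ERROR"))        // intermediate: memory only
    val counts = errors.map(l => (l.split(" ")(0), 1))    // intermediate: memory only
                       .reduceByKey(_ + _)                // shuffle spills to local disk, not HDFS

    counts.saveAsTextFile("hdfs:///output/error-counts")  // only this writes back to HDFS
    sc.stop()
  }
}
```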
Just one small departing question, then: does it also work with the *Gluster* or *Ceph* distributed file systems, not just HDFS? Reading some of the documentation, I gather that if my Scala code can read a file from either of those, and I have a Spark standalone cluster (or one managed by Mesos), then I am bound neither to HDFS nor to Hadoop. I assume that architecture would consume a lot of bandwidth pulling the inputs from the file system, since it could not place its workers/executors alongside the data nodes as it does with Hadoop/HDFS (thus acting as a compute cluster that takes very long to bootstrap an application). Perhaps, however, Hadoop's ability to work over GlusterFS or CephFS would provide that data-locality benefit after all? Or is the local data pull on the storage cluster machines bound specifically to Hadoop's HDFS API?

Thanks,
Matan

On Fri, Oct 24, 2014 at 4:19 AM, Marcelo Vanzin [via Apache Spark User List] <ml-node+s1001560n17170...@n3.nabble.com> wrote:

> Your assessment is mostly correct. I think the only thing I'd reword is
> the comment about splitting the data, since Spark itself doesn't do
> that, but read on.
>
> On Thu, Oct 23, 2014 at 6:12 PM, matan <[hidden email]> wrote:
> > In case I nailed it, how then does it handle a distributed hdfs file? does
> > it pull all of the file to/through one Spark server
>
> Well, Spark here just piggybacks on what HDFS already gives you, since
> it's a distributed file system. In HDFS, files are broken into blocks
> and each block is stored on one or more machines. Spark uses Hadoop
> classes that understand this and give you the information about where
> those blocks are.
>
> If there are Spark executors on those machines holding the blocks,
> Spark will try to run tasks on those executors.
> Otherwise, it will
> assign some other executor to do the computation, and that executor
> will pull that particular block from HDFS over the network.
>
> It can be a lot more complicated than that (since each file format may
> have different ways of partitioning data, or you can create your own
> way, or repartition data, or Spark may give up waiting for the right
> executor, or...), but that's a good first start.
>
> --
> Marcelo

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re-Spark-using-non-HDFS-data-on-a-distributed-file-system-cluster-tp17239.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
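On the data-locality question above: the placement hints come from the Hadoop FileSystem/InputFormat layer rather than from HDFS specifically, so a file-system connector that reports block locations could in principle feed Spark the same hints. A hedged sketch, assuming a third-party connector that registers a `glusterfs:///` URI scheme (the scheme, connector, and path are assumptions, not confirmed in this thread):

```scala
// Sketch: asking the Hadoop FileSystem layer where a file's blocks live.
// The "glusterfs" scheme and the path are hypothetical; they would require
// a third-party Hadoop connector and are not part of stock Hadoop.
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockLocations {
  def main(args: Array[String]): Unit = {
    val fs     = FileSystem.get(new URI("glusterfs:///"), new Configuration())
    val status = fs.getFileStatus(new Path("/input/data.txt"))

    // These host lists are the placement hints Spark consults when choosing
    // preferred task locations; if a connector fills them in, locality-aware
    // scheduling can work much as it does with HDFS.
    fs.getFileBlockLocations(status, 0, status.getLen)
      .foreach(loc => println(loc.getHosts.mkString(", ")))
  }
}
```

Alternatively, if the volume is simply POSIX-mounted at the same path on every worker, something like `sc.textFile("file:///mnt/gluster/input")` works with no connector at all, though Spark then has no block-location information and every read crosses the storage network.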