Thanks Marcelo. Let me spin this toward a parallel trajectory then, as the title change implies. I will read further into some of the articles at https://spark.apache.org/research.html, but my basic understanding is that Spark keeps data in memory: it pulls inputs from HDFS and, at most, writes the final output of a job back to it, rather than depositing the output of each intermediate step into HDFS files as Hadoop MapReduce typically does. Is that right?
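To make that concrete, here is a minimal sketch of such a pipeline (paths, master URL, and app name are illustrative, not from this thread; assumes the Spark dependency on the classpath): only the input read and the final write touch HDFS, while intermediate results stay with the executors.

```scala
// Minimal sketch of the in-memory pipeline described above.
// Paths, master URL, and app name are hypothetical examples.
import org.apache.spark.{SparkConf, SparkContext}

object InMemoryPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("in-memory-pipeline").setMaster("spark://master:7077"))

    val lines  = sc.textFile("hdfs:///input/logs")        // read from HDFS
    val errors = lines.filter(_.contains("ERROR"))        // intermediate: memory only
    val counts = errors.map(l => (l.split(" ")(0), 1))    // intermediate: memory only
                       .reduceByKey(_ + _)                // shuffle spills to local disk, not HDFS

    counts.saveAsTextFile("hdfs:///output/error-counts")  // only this writes back to HDFS
    sc.stop()
  }
}
```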
Just one small departing question, then: does it also work with the *Gluster* or *Ceph* distributed file systems, not just HDFS? Reading some of the documentation, I gather that if my Scala code can read a file from either of those, and I have a Spark standalone cluster (or one managed by Mesos), then I am bound neither to HDFS nor to Hadoop. I assume that architecture would consume a lot of bandwidth pulling the inputs from the file system, since it could not place its workers/executors alongside the data nodes as it does with Hadoop/HDFS (thus acting as a compute cluster that takes very long to bootstrap an application). Perhaps, however, Hadoop's ability to work over GlusterFS or CephFS would provide that data-locality benefit after all? Or is the local data pull on the storage cluster machines bound specifically to Hadoop's HDFS API?

Thanks,
Matan

On Fri, Oct 24, 2014 at 4:19 AM, Marcelo Vanzin [via Apache Spark User List] <ml-node+s1001560n17170...@n3.nabble.com> wrote:

> Your assessment is mostly correct. I think the only thing I'd reword is
> the comment about splitting the data, since Spark itself doesn't do
> that, but read on.
>
> On Thu, Oct 23, 2014 at 6:12 PM, matan <[hidden email]> wrote:
> > In case I nailed it, how then does it handle a distributed hdfs file? does
> > it pull all of the file to/through one Spark server
>
> Well, Spark here just piggybacks on what HDFS already gives you, since
> it's a distributed file system. In HDFS, files are broken into blocks
> and each block is stored on one or more machines. Spark uses Hadoop
> classes that understand this and give you the information about where
> those blocks are.
>
> If there are Spark executors on those machines holding the blocks,
> Spark will try to run tasks on those executors.
> Otherwise, it will
> assign some other executor to do the computation, and that executor
> will pull that particular block from HDFS over the network.
>
> It can be a lot more complicated than that (since each file format may
> have different ways of partitioning data, or you can create your own
> way, or repartition data, or Spark may give up waiting for the right
> executor, or...), but that's a good first start.
>
> --
> Marcelo

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re-Spark-using-non-HDFS-data-on-a-distributed-file-system-cluster-tp17239.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
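On the data-locality question above: the placement hints come from the Hadoop FileSystem/InputFormat layer rather than from HDFS specifically, so a file-system connector that reports block locations could in principle feed Spark the same hints. A hedged sketch, assuming a third-party connector that registers a `glusterfs:///` URI scheme (the scheme, connector, and path are assumptions, not confirmed in this thread):

```scala
// Sketch: asking the Hadoop FileSystem layer where a file's blocks live.
// The "glusterfs" scheme and the path are hypothetical; they would require
// a third-party Hadoop connector and are not part of stock Hadoop.
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockLocations {
  def main(args: Array[String]): Unit = {
    val fs     = FileSystem.get(new URI("glusterfs:///"), new Configuration())
    val status = fs.getFileStatus(new Path("/input/data.txt"))

    // These host lists are the placement hints Spark consults when choosing
    // preferred task locations; if a connector fills them in, locality-aware
    // scheduling can work much as it does with HDFS.
    fs.getFileBlockLocations(status, 0, status.getLen)
      .foreach(loc => println(loc.getHosts.mkString(", ")))
  }
}
```

Alternatively, if the volume is simply POSIX-mounted at the same path on every worker, something like `sc.textFile("file:///mnt/gluster/input")` works with no connector at all, though Spark then has no block-location information and every read crosses the storage network.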