Streaming partition-by data locality for state lookup on executor

2022-04-13 Thread Sandip Khanzode
Kinesis? What I would finally want to achieve is that the flatMapGroupsWithState() that I call later in the pipeline should have the same (partition) key internally for key lookups in the (RocksDB) state, so that data locality can be achieved. Is this redundant, implicit, or not possible?
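For reference, a minimal Structured Streaming sketch of the pattern being asked about, using the built-in rate source as a stand-in for Kinesis; the key name, state type, and update function are placeholders, and whether an extra explicit repartition before the stateful operator would be redundant is exactly the open question in the thread:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    val spark = SparkSession.builder.appName("state-locality-sketch").getOrCreate()
    import spark.implicits._

    case class KeyedEvent(key: String, value: Long)
    case class KeyState(count: Long)

    // Per-key state update; the state store partition for `key` lives on the
    // executor that owns the corresponding shuffle partition.
    def update(key: String, events: Iterator[KeyedEvent],
               state: GroupState[KeyState]): Iterator[(String, Long)] = {
      val next = KeyState(state.getOption.map(_.count).getOrElse(0L) + events.size)
      state.update(next)
      Iterator((key, next.count))
    }

    val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
      .select($"value").as[Long]
      .map(v => KeyedEvent(s"key-${v % 8}", v))   // placeholder key extraction

    val counts = events
      .groupByKey(_.key)   // hash-partitions by the same key used for state lookups
      .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout())(update)

    val query = counts.writeStream.outputMode("update").format("console").start()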

Re: [Spark Core][Advanced]: Problem with data locality when running Spark query with local nature on apache Hadoop

2021-04-13 Thread Russell Spitzer
which IP or hostname of the data-nodes is returned from the name-node to Spark? Or can you offer me a debug approach? > On Farvardin 24, 1400 AP, at 17:45, Russell Spitzer wrote: > Data locality can only occur if the Spark

Re: [Spark Core][Advanced]: Problem with data locality when running Spark query with local nature on apache Hadoop

2021-04-13 Thread Russell Spitzer
Data locality can only occur if the Spark executor's IP address string matches the preferred location returned by the file system. So this job would only have local tasks if the datanode replicas for the files in question had the same IP address as the Spark executors you are using. If they don't
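A small diagnostic sketch along those lines (the HDFS path is a placeholder): print the preferred-location strings HDFS reports for each input partition next to the host strings the executors registered with, so a mismatch is visible directly.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("locality-check").getOrCreate()
    val sc = spark.sparkContext

    // Hosts the file system advertises as preferred locations per input split.
    val rdd = sc.textFile("hdfs:///data/events/part-00000")
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index}: preferred = ${rdd.preferredLocations(p).mkString(", ")}")
    }

    // "host:port" strings for the executors (plus the driver) known to the scheduler.
    sc.getExecutorMemoryStatus.keys.foreach(h => println(s"executor: $h"))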

[Spark Core][Advanced]: Problem with data locality when running Spark query with local nature on apache Hadoop

2021-04-13 Thread Mohamadreza Rostami
https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache

[ spark-streaming ] - Data Locality issue

2020-02-04 Thread Karthik Srinivas
Hi, I am using Spark 2.3.2 and I am facing issues due to data locality: even after setting spark.locality.wait.rack=200, the locality level is always RACK_LOCAL. Can someone help me with this? Thank you

Data locality

2020-02-04 Thread Karthik Srinivas
Hi all, I am using Spark 2.3.2 and I am facing issues due to data locality: even after setting spark.locality.wait.rack=200, the locality level is always RACK_LOCAL. Can someone help me with this? Thank you
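For reference, these are the locality-wait knobs the two posts above are adjusting; a minimal sketch with illustrative time values (not recommendations) and a placeholder app name:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("locality-wait-sketch")
      .config("spark.locality.wait", "3s")         // base wait before dropping a locality level
      .config("spark.locality.wait.process", "3s") // PROCESS_LOCAL -> NODE_LOCAL
      .config("spark.locality.wait.node", "3s")    // NODE_LOCAL -> RACK_LOCAL
      .config("spark.locality.wait.rack", "3s")    // RACK_LOCAL -> ANY
      .getOrCreate()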

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
thanks for answering! >> Although the Spark task scheduler is aware of rack-level data locality, it seems that only YARN implements the support for it. > This explains why the script that I configured in

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Karamba
>> Although the Spark task scheduler is aware of rack-level data locality, it seems that only YARN implements the support for it. > This explains why the script that I configured in core-site.xml topology.script.file.name is not called by the Spark

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
>> Although the Spark task scheduler is aware of rack-level data locality, it seems that only YARN implements the support for it. > This explains why the script that I configured in core-site.xml topology.script.file.name is not called by the Spark

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Karamba
Hi Sun Rui, thanks for answering! > Although the Spark task scheduler is aware of rack-level data locality, it seems that only YARN implements the support for it. This explains why the script that I configured in core-site.xml topology.script.file.name is not called by the

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-27 Thread Sun Rui
Although the Spark task scheduler is aware of rack-level data locality, it seems that only YARN implements the support for it. However, node-level locality can still work for Standalone. It is not necessary to copy the Hadoop config files into the Spark conf directory. Set HADOOP_CONF_DIR
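A quick sanity check (a sketch, not from the thread) that the standalone driver actually sees the Hadoop configuration pointed to by HADOOP_CONF_DIR, which is what lets it resolve HDFS block locations and the topology script setting mentioned in this thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("hadoop-conf-check").getOrCreate()
    val hc = spark.sparkContext.hadoopConfiguration
    println(s"fs.defaultFS = ${hc.get("fs.defaultFS")}")                           // should point at the HDFS namenode
    println(s"topology.script.file.name = ${hc.get("topology.script.file.name")}") // rack topology script, if configured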

[Spark 2.0.2 HDFS]: no data locality

2016-12-26 Thread Karamba
Hi, I am running a couple of Docker hosts, each with an HDFS node and a Spark worker in a Spark standalone cluster. In order to get data locality awareness, I would like to configure racks for each host, so that a Spark worker container knows from which HDFS node container it should load its data

Re: Does Spark use data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Eugene Morozov
> Does Spark use data locality information from HDFS when running in standalone mode? Or is running on YARN mandatory for such a purpose? I can't find this information in the docs, and on Google I am only finding contrasting opinions on that. > Regards, Marco Capuccini

Re: Does Spark use data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Mich Talebzadeh
will know about the datanodes from $HADOOP_HOME/etc/hadoop/slaves. HTH, Dr Mich Talebzadeh

Re: Does Spark use data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Mich Talebzadeh

Re: Does Spark use data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Marco Capuccini
On 5 June 2016 at 10:50, Marco Capuccini <marco.capucc...@farmbio.uu.se> wrote: Dear all, does Spark use data locality information from HDFS when running in standalone

Re: Does Spark use data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Mich Talebzadeh
Marco Capuccini <marco.capucc...@farmbio.uu.se> wrote: > Dear all, does Spark use data locality information from HDFS when running in standalone mode? Or is running on YARN mandatory for such a purpose? I can't find this information in the docs, and on Google I am only finding

Does Spark use data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Marco Capuccini
Dear all, does Spark use data locality information from HDFS when running in standalone mode? Or is running on YARN mandatory for such a purpose? I can't find this information in the docs, and on Google I am only finding contrasting opinions on that. Regards, Marco Capuccini

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread Yuval.Itzchakov
benefit from low IO latency and high throughput.

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread Diwakar Dhanuskodi
I would definitely try to avoid hosting Kafka and Spark on the same servers. Kafka and Spark will be doing a lot of IO between them, so you'll want to maximize those resources and not share them on the same server. You'll want e

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread أنس الليثي

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread Diwakar Dhanuskodi
We are using Spark in two ways: 1. YARN with Spark support, Kafka running along with data nodes. 2. Spark master and workers running with some of the Kafka brokers. Data locality is important. Regards, Diwakar

Apache Spark data locality when integrating with Kafka

2016-02-06 Thread fanooos

RE: Apache Spark data locality when integrating with Kafka

2016-02-06 Thread Diwakar Dhanuskodi
Yes, to reduce network latency. > From: fanooos <dev.fano...@gmail.com>, Subject: Apache Spark data locality when integrating with Kafka: Dears, if I wi

Re: Apache Spark data locality when integrating with Kafka

2016-02-06 Thread Koert Kuipers
Spark can benefit from data locality and will try to launch tasks on the node where the Kafka partition resides. However, I think in production many organizations run a dedicated Kafka cluster. On Sat, Feb 6, 2016 at 11:27 PM, Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> wrote:

Re: How data locality is honored when spark is running on yarn

2016-01-27 Thread Saisai Shao
this is the same for different cluster managers. Thanks, Saisai. On Thu, Jan 28, 2016 at 10:50 AM, Todd <bit1...@163.com> wrote: > Hi, I am kind of confused about how data locality is honored when Spark is running on YARN (client or cluster mode); can someone please elaborate on this? Thanks!

How data locality is honored when spark is running on yarn

2016-01-27 Thread Todd
Hi, I am kind of confused about how data locality is honored when Spark is running on YARN (client or cluster mode). Can someone please elaborate on this? Thanks!

Data Locality Issue

2015-11-15 Thread Renu Yadav
Hi, I am working on Spark 1.4, reading an ORC table using a DataFrame and converting that DF to an RDD. In the Spark UI I observe that 50% of the tasks are running at ANY locality and very few at LOCAL. What would be the possible reason for this? Please help. I have even changed the locality settings. Thanks

Re: Data Locality Issue

2015-11-15 Thread Renu Yadav
What are the parameters on which locality depends? On Sun, Nov 15, 2015 at 5:54 PM, Renu Yadav wrote: > Hi, I am working on Spark 1.4, reading an ORC table using a DataFrame and converting that DF to an RDD. In the Spark UI I observe that 50% of the tasks are running at ANY locality and

Re: How does Spark coordinate with Tachyon wrt data locality

2015-10-23 Thread Calvin Jia
Hi Shane, Tachyon provides an API to get the block locations of the file, which Spark uses when scheduling tasks. Hope this helps, Calvin. On Fri, Oct 23, 2015 at 8:15 AM, Kinsella, Shane <shane.kinse...@aspect.com> wrote: > Hi all, I am looking into how Spark hand

How does Spark coordinate with Tachyon wrt data locality

2015-10-23 Thread Kinsella, Shane
Hi all, I am looking into how Spark handles data locality wrt Tachyon. My main concern is how this is coordinated. Will it send a task based on a file loaded from Tachyon to a node that it knows has that file locally, and how does it know which nodes have what? Kind regards, Shane

Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Ashish Soni
Hi all, just wanted to find out if there is any benefit to installing Kafka brokers and Spark nodes on the same machine. Is it possible that Spark can pull data from Kafka if it is local to the node, i.e. the broker or partition is on the same machine? Thanks, Ashish

Re: Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Adrian Tanase
seconds. -adrian. From: Cody Koeninger <c...@koeninger.org>, Sent: Monday, September 21, 2015 10:19 PM, Subject: Re: Spark Streaming and Kafka MultiNode Setup - Data Locality: The direct stream already uses the Kafka leader for a

Re: Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Cody Koeninger
The direct stream already uses the Kafka leader for a given partition as the preferred location. I don't run Kafka on the same nodes as Spark, and I don't know anyone who does, so that situation isn't particularly well tested. On Mon, Sep 21, 2015 at 1:15 PM, Ashish Soni
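A sketch of the same idea with the later spark-streaming-kafka-0-10 integration, where the broker preference Cody describes is an explicit LocationStrategy; topic names and bootstrap servers are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val conf = new SparkConf().setAppName("kafka-locality-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092,broker2:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "locality-demo",
      "auto.offset.reset" -> "latest")

    // PreferBrokers only helps when executors run on the broker hosts;
    // otherwise PreferConsistent is the usual choice.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferBrokers,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD(rdd => println(s"batch records = ${rdd.count()}"))
    ssc.start()
    ssc.awaitTermination()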

Re: Data locality with HDFS not being seen

2015-08-21 Thread Sameer Farooqui
Hi Sunil, have you seen this fix in Spark 1.5 that may address the locality issue: https://issues.apache.org/jira/browse/SPARK-4352? On Thu, Aug 20, 2015 at 4:09 AM, Sunil <sdhe...@gmail.com> wrote: Hello, I am seeing some unexpected issues with achieving HDFS data locality. I expect

Data locality with HDFS not being seen

2015-08-20 Thread Sunil
Hello. I am seeing some unexpected issues with achieving HDFS data locality. I expect the tasks to be executed only on the node which has the data, but this is not happening (of course, unless the node is busy, in which case I understand tasks can go to some other node). Could anyone

Poor HDFS Data Locality on Spark-EC2

2015-08-04 Thread Jerry Lam
Hi Spark users and developers, I have been trying to use spark-ec2. After I launched the Spark cluster (1.4.1) with ephemeral HDFS (using Hadoop 2.4.0), I tried to execute a job where the data is stored in the ephemeral HDFS. No matter what I tried to do, there is no data locality at all

data locality in spark

2015-04-27 Thread Grandl Robert
Hi guys, I am running some SQL queries, but all my tasks are reported as either NODE_LOCAL or PROCESS_LOCAL. In the Hadoop world, the reduce tasks are RACK or NON_RACK LOCAL because they have to aggregate data from multiple hosts. However, in Spark even the aggregation stages are reported

Re: Data locality across jobs

2015-04-02 Thread Sandy Ryza
At the end of the day, a daily job is launched, which works on the outputs of the hourly jobs. For data locality and speed, we wish that when the daily job launches, it finds all instances of a given key at a single executor rather than fetching them from others during the shuffle. Is it possible

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-04-01 Thread Haoyuan Li
Response inline. On Tue, Mar 31, 2015 at 10:41 PM, Sean Bigdatafun <sean.bigdata...@gmail.com> wrote: (resending...) I was thinking of the same setup… but the more I think about this problem, the more interesting it becomes. If we allocate 50% of total memory to Tachyon statically, then the

Data locality across jobs

2015-04-01 Thread kjsingh
Hi, we are running an hourly job using Spark 1.2 on YARN. It saves an RDD of Tuple2. At the end of the day, a daily job is launched, which works on the outputs of the hourly jobs. For data locality and speed, we wish that when the daily job launches, it finds all instances of a given key at a single
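A hedged sketch of one common approach to this: have the hourly jobs partition their output by key with a fixed partitioner, and have the daily job apply the same partitioner so every instance of a key hashes to the same partition index. The paths, key extraction, and partition count are assumptions, and whether the daily job's executors actually land next to the hourly output blocks is a separate scheduling question.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("hourly-daily-sketch").getOrCreate()
    val sc = spark.sparkContext
    val partitioner = new HashPartitioner(200)   // same partitioner in both jobs

    // Hourly job: partition the Tuple2 output by key before saving.
    val hourly = sc.textFile("hdfs:///input/hour=12")
      .map(line => (line.split(",")(0), line))
      .partitionBy(partitioner)
    hourly.saveAsTextFile("hdfs:///hourly/hour=12")

    // Daily job: re-apply the same partitioner; all occurrences of a key end up
    // in the same partition, so later per-key work avoids a second wide shuffle.
    val daily = (0 to 23)
      .map(h => sc.textFile(s"hdfs:///hourly/hour=$h").map(line => (line.split(",")(0), line)))
      .reduce(_ union _)
      .partitionBy(partitioner)
    daily.groupByKey().mapValues(_.size).saveAsTextFile("hdfs:///daily/counts")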

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Sean Bigdatafun
(resending...) I was thinking of the same setup… but the more I think about this problem, the more interesting it becomes. If we allocate 50% of total memory to Tachyon statically, then the Mesos benefits of dynamically scheduling resources go away altogether. Can Tachyon be resource-managed by

deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Ankur Chauhan
Hi, I am fairly new to the Spark ecosystem and I have been trying to set up a Spark on Mesos deployment. I can't seem to figure out the best practices around HDFS and Tachyon. The documentation about Spark's data-locality section seems to point

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Haoyuan Li
deployment. I can't seem to figure out the best practices around HDFS and Tachyon. The documentation's section on Spark's data locality seems to indicate that each of my Mesos slave nodes should also run an HDFS datanode. This seems fine, but I can't seem to figure out how I would go about pushing

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Ankur Chauhan
...@brightcove.com wrote: Hi, I am fairly new to the Spark ecosystem and I have been trying to set up a Spark on Mesos deployment. I can't seem to figure out the best practices around HDFS and Tachyon. The documentation about Spark's data locality seems to indicate that each of my Mesos

Re: How does Spark honor data locality when allocating computing resources for an application

2015-03-14 Thread eric wong
data locality: // Pack each app into as few nodes as possible until we've assigned all its cores for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) { for (app <- waitingApps if app.coresLeft > 0) { if (canUse(app, worker)) { val coresToUse

How does Spark honor data locality when allocating computing resources for an application

2015-03-13 Thread bit1...@163.com
Hi, sparkers. When I read the code about computing-resource allocation for a newly submitted application in the Master#schedule method, I got a question about data locality: // Pack each app into as few nodes as possible until we've assigned all its cores for (worker <- workers

Ensuring data locality when opening files

2015-03-09 Thread Daniel Haviv
Hi, we wrote a Spark Streaming app that receives file names on HDFS from Kafka and opens them using Hadoop's libraries. The problem with this method is that I'm not utilizing data locality, because any worker might open any file without giving precedence to data locality. I can't open the files
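A hedged sketch of one way to attack this (the file names and the per-file processing are placeholders): look up each file's HDFS block hosts and build the RDD of file names with those hosts as per-element location preferences via SparkContext.makeRDD, so the tasks that open the files prefer local nodes.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("filename-locality-sketch").getOrCreate()
    val sc = spark.sparkContext

    val fileNames = Seq("hdfs:///incoming/a.bin", "hdfs:///incoming/b.bin")  // e.g. received from Kafka

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val withHosts: Seq[(String, Seq[String])] = fileNames.map { name =>
      val status = fs.getFileStatus(new Path(name))
      val hosts = fs.getFileBlockLocations(status, 0, status.getLen).flatMap(_.getHosts).distinct.toSeq
      (name, hosts)
    }

    // makeRDD accepts per-element preferred locations (hostnames of Spark nodes).
    sc.makeRDD(withHosts).foreach { file =>
      println(s"opening $file on a node that should hold its blocks")
    }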

Re: Data Locality

2015-01-28 Thread hnahak

Re: Data Locality

2015-01-28 Thread Harihar Nahak

Re: data locality in logs

2015-01-28 Thread hnahak

Re: Data locality running Spark on Mesos

2015-01-11 Thread Michael V Le
as with Mesos. Looking at the logs again, it looks like the locality info between the standalone and Mesos coarse-grained modes is very similar. I must have been hallucinating earlier, thinking somehow the data locality information was different. So this whole thing might just simply be due to the fact

Re: Data locality running Spark on Mesos

2015-01-10 Thread Timothy Chen
for every task? Of course, any perceived slowdown will probably be very dependent on the workload. I just want to get a feel for the possible overhead (e.g., a factor of 2 or 3 slowdown?). If it is not a data locality issue, perhaps this overhead can be a factor in the slowdown I observed, at least

Re: Data locality running Spark on Mesos

2015-01-09 Thread Michael V Le
executors for every task? Of course, any perceived slowdown will probably be very dependent on the workload. I just want to get a feel for the possible overhead (e.g., a factor of 2 or 3 slowdown?). If it is not a data locality issue, perhaps this overhead can be a factor in the slowdown I observed, at least

Data locality running Spark on Mesos

2015-01-08 Thread mvle
especially for coarse-grained mode, as the executors supposedly do not go away until job completion. Any ideas? Thanks, Mike

Re: Data locality running Spark on Mesos

2015-01-08 Thread Tim Chen
do not go away until job completion. Any ideas? Thanks, Mike

Re: Data Locality

2015-01-06 Thread Andrew Ash
You can also read about locality here in the docs: http://spark.apache.org/docs/latest/tuning.html#data-locality On Tue, Jan 6, 2015 at 8:37 AM, Cody Koeninger <c...@koeninger.org> wrote: No, not all RDDs have location information, and in any case tasks may be scheduled on non-local nodes

Re: Data Locality

2015-01-06 Thread Cody Koeninger
is local, i.e. Node1 and Node2 (assuming Node1 and Node2 have enough resources to execute the tasks)? Gaurav

Data Locality

2015-01-06 Thread gtinside
the data is local, i.e. Node1 and Node2 (assuming Node1 and Node2 have enough resources to execute the tasks)? Gaurav

Re: data locality, task distribution

2014-11-13 Thread Nathan Kronenfeld
I am seeing skewed execution times. As far as I can tell, they are attributable to differences in data locality - tasks with locality PROCESS_LOCAL run fast, NODE_LOCAL, slower, and ANY, slowest. This seems entirely as it should be - the question is, why the different locality levels? I am

Re: data locality, task distribution

2014-11-13 Thread Aaron Davidson
. As far as I can tell, they are attributable to differences in data locality - tasks with locality PROCESS_LOCAL run fast, NODE_LOCAL, slower, and ANY, slowest. This seems entirely as it should be - the question is, why the different locality levels? I am seeing skewed caching, as I

Re: data locality, task distribution

2014-11-12 Thread Aaron Davidson
...@oculusinfo.com wrote: Can anyone point me to a good primer on how spark decides where to send what task, how it distributes them, and how it determines data locality? I'm trying a pretty simple task - it's doing a foreach over cached data, accumulating some (relatively complex) values. So I see

Re: data locality, task distribution

2014-11-12 Thread Nathan Kronenfeld
, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: Can anyone point me to a good primer on how spark decides where to send what task, how it distributes them, and how it determines data locality? I'm trying a pretty simple task - it's doing a foreach over cached data, accumulating some

Re: data locality, task distribution

2014-11-12 Thread Aaron Davidson
point me to a good primer on how spark decides where to send what task, how it distributes them, and how it determines data locality? I'm trying a pretty simple task - it's doing a foreach over cached data, accumulating some (relatively complex) values. So I see several inconsistencies I don't

data locality, task distribution

2014-11-11 Thread Nathan Kronenfeld
Can anyone point me to a good primer on how spark decides where to send what task, how it distributes them, and how it determines data locality? I'm trying a pretty simple task - it's doing a foreach over cached data, accumulating some (relatively complex) values. So I see several

problem with data locality api

2014-09-28 Thread qinwei
Hi, everyone. I have come across a problem with data locality. I found this example code in 《Spark-on-YARN-A-Deep-Dive-Sandy-Ryza.pdf》: val locData = InputFormatInfo.computePreferredLocations(Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt
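A cleaned-up sketch of that snippet (Spark 1.x-era API; the preferred-locations SparkContext constructor was later deprecated, the path is given here as a plain string, and exact signatures vary between versions). As the replies below note, the Hadoop Configuration passed to InputFormatInfo and the SparkConf are two different objects even though both are informally called "conf".

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.scheduler.InputFormatInfo

    val hadoopConf = new Configuration()   // Hadoop side: used to resolve HDFS block locations
    val sparkConf = new SparkConf().setAppName("preferred-locations-sketch")

    val locData = InputFormatInfo.computePreferredLocations(
      Seq(new InputFormatInfo(hadoopConf, classOf[TextInputFormat], "hdfs:///myfile.txt")))

    // Hints to the YARN allocator about where containers should preferably be placed.
    val sc = new SparkContext(sparkConf, locData)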

RE: problem with data locality api

2014-09-28 Thread Shao, Saisai
Subject: problem with data locality api. Hi, everyone. I have come across a problem with data locality. I found this example code in 《Spark-on-YARN-A-Deep-Dive-Sandy-Ryza.pdf》: val locData = InputFormatInfo.computePreferredLocations(Seq(new InputFormatInfo(conf, classOf[TextInputFormat

Re: RE: problem with data locality api

2014-09-28 Thread qinwei
for your reply! qinwei. From: Shao, Saisai; Sent: 2014-09-28 14:42; To: qinwei; Cc: user; Subject: RE: problem with data locality api. Hi, the first conf is used for Hadoop to determine the locality distribution of the HDFS file. The second conf is used for Spark; though they have the same name, they are actually two

Re: data locality

2014-08-30 Thread Chris Fregly
, 2014 at 4:13 AM, Tsai Li Ming mailingl...@ltsai.com wrote: Hi, In the standalone mode, how can we check data locality is working as expected when tasks are assigned? Thanks! On 23 Jul, 2014, at 12:49 am, Sandy Ryza sandy.r...@cloudera.com wrote: On standalone there is still special

Re: data locality

2014-07-22 Thread Sandy Ryza
for your patience! From: Sandy Ryza <sandy.r...@cloudera.com>, Sent: July 22, 2014 9:47, To: user@spark.apache.org, Subject: Re: data locality. This currently only works for YARN. The standalone default is to place an executor on every node

RE: data locality

2014-07-21 Thread Haopu Wang
you for your patience! From: Sandy Ryza <sandy.r...@cloudera.com>, Sent: July 22, 2014 9:47, To: user@spark.apache.org, Subject: Re: data locality. This currently only works for YARN. The standalone default is to place an executor on every node for every

data locality

2014-07-18 Thread Haopu Wang
I have a standalone Spark cluster and an HDFS cluster which share some nodes. When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask HDFS for the location of each file block in order to pick the right worker node? How about a Spark cluster on YARN? Thank you very much!

Re: data locality

2014-07-18 Thread Sandy Ryza
any information about where the input data for the jobs is located. If the executors occupy significantly fewer nodes than exist in the cluster, it can be difficult for Spark to achieve data locality. The workaround for this is an API that allows passing in a set of preferred locations when

RE: data locality

2014-07-18 Thread Haopu Wang
executors to use for this application? Thanks again! From: Sandy Ryza <sandy.r...@cloudera.com>, Sent: Friday, July 18, 2014 3:44 PM, To: user@spark.apache.org, Subject: Re: data locality. Hi Haopu, Spark will ask HDFS for file block locations

Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Nilesh Chakraborty
HDFS on the same cluster as Spark, write the data from the Actors to HDFS, and then use HDFS as input source for Spark Streaming. Does this result in better performance due to data locality (with HDFS data replication turned on)? I think performance should be almost the same with actors, since

Re: Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Michael Cutler
fault tolerance, and the ability to checkpoint and recover even if the master fails. Cheers, Nilesh