Hm, now I wonder if it's the same issue here:
https://issues.apache.org/jira/browse/SPARK-10149

Does the setting described there help?
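
If I'm remembering that ticket right, the setting it describes is the
shuffle reduce-locality flag; something like the following (the property
name is my recollection, so please verify against the JIRA):

  val conf = new org.apache.spark.SparkConf()
    .set("spark.shuffle.reduceLocality.enabled", "false")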

On Mon, Oct 26, 2015 at 11:39 AM, Jinfeng Li <liji...@gmail.com> wrote:

> Hi, I have already tried the same code with Spark 1.3.1, and there is no
> such problem; the configuration files were copied directly from the Spark
> 1.5.1 setup. I feel it is a bug in Spark 1.5.1.
>
> Thanks a lot for your response.
>
> On Mon, Oct 26, 2015 at 7:21 PM Sean Owen <so...@cloudera.com> wrote:
>
>> Yeah, are these stats actually reflecting data read locally, for example
>> through the loopback interface? I'm no expert on the internals here
>> either, but these may effectively be measuring local reads. Or are you
>> sure they're not?
>>
>> On Mon, Oct 26, 2015 at 11:14 AM, Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>>>
>>> > On 26 Oct 2015, at 09:28, Jinfeng Li <liji...@gmail.com> wrote:
>>> >
>>> > Replication factor is 3 and we have 18 data nodes. We checked the HDFS
>>> > web UI; the data is evenly distributed among the 18 machines.
>>> >
>>>
>>>
>>> Every block in HDFS (usually 64, 128, or 256 MB) is replicated across
>>> three machines, meaning 3 machines have it local and the other 15 have
>>> it remote.
>>>
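>>> With 18 nodes and 3 replicas, any single node holds copies of roughly
>>> 3/18, about 17%, of the blocks, so a read scheduled without regard to
>>> placement lands remote roughly 83% of the time. To confirm where a
>>> file's blocks actually live, fsck lists every replica location
>>> (substitute your own path):
>>>
>>>   hdfs fsck /path/to/input -files -blocks -locations
>>>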
>>> For data locality to work properly, you need the executors to be reading
>>> the blocks of data local to them, not data from other parts of the
>>> files. Spark does try to schedule for locality, but if there is only a
>>> limited set of executors, more of the workload ends up remote rather
>>> than local.
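>>>
>>> As a sketch, the knobs I'd start with (these are standard Spark
>>> configuration properties, but the values are just for illustration):
>>>
>>>   val conf = new org.apache.spark.SparkConf()
>>>     .set("spark.executor.instances", "18") // one executor per datanode, on YARN
>>>     .set("spark.locality.wait", "10s")     // wait longer for a node-local slot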
>>>
>>> I don't know of an obvious way to get metrics on local vs. remote reads
>>> here; I don't see the HDFS client library tracking that, though it would
>>> be the natural place to collect stats on local/remote/domain-socket-direct
>>> IO. Does anyone know of anything in the Spark metrics that tracks
>>> placement locality? If not, both layers could have some more metrics
>>> added.
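>>>
>>> That said, one place worth checking, with the caveat that this is an
>>> assumption about recent Hadoop 2.x clients rather than something
>>> verified here: DFSInputStream appears to keep per-stream ReadStatistics,
>>> reachable when the stream is an HdfsDataInputStream. A minimal sketch
>>> (the path is hypothetical):
>>>
>>>   import org.apache.hadoop.conf.Configuration
>>>   import org.apache.hadoop.fs.{FileSystem, Path}
>>>   import org.apache.hadoop.hdfs.client.HdfsDataInputStream
>>>
>>>   val fs = FileSystem.get(new Configuration())
>>>   val in = fs.open(new Path("/path/to/input"))  // hypothetical path
>>>   in match {
>>>     case h: HdfsDataInputStream =>
>>>       val buf = new Array[Byte](1 << 16)
>>>       while (h.read(buf) != -1) {}        // drain the stream
>>>       val rs = h.getReadStatistics        // per-stream read counters
>>>       val local = rs.getTotalLocalBytesRead
>>>       println(s"local=$local, " +
>>>         s"short-circuit=${rs.getTotalShortCircuitBytesRead}, " +
>>>         s"remote=${rs.getTotalBytesRead - local}")
>>>     case _ => println("not an HDFS stream")
>>>   }
>>>   in.close()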
