If the data file is the same, it should have a similar distribution of keys.
A few questions:

1. Did you compare the number of partitions in both cases? (A quick way to
check is sketched after these questions.)
2. Did you compare the resource allocation for the Spark shell vs. the
submitted Scala program?
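
For instance, something like this, run from both the shell and the submitted
program, would show the partition count and how the records are spread
across partitions (just a sketch; u and sc are from your snippet below):

println(s"number of partitions: ${u.getNumPartitions}")

// count the records in each partition without shuffling the data
u.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
  .collect()
  .foreach { case (i, n) => println(s"partition $i: $n records") }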

Also, could you please share the details of the Spark context, environment,
and executors (e.g. from the web UI tabs) when you run via the Scala program?
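
One way to capture those details from inside the program itself (a minimal
sketch, using the same sc) is:

// dump the effective Spark configuration the program actually runs with
sc.getConf.getAll.sortBy(_._1).foreach { case (k, v) => println(s"$k = $v") }

// list the executors that actually registered, with their memory status
sc.getExecutorMemoryStatus.foreach { case (exec, (max, free)) =>
  println(s"$exec: maxMem=$max, free=$free")
}

If only one executor shows up there, every cached partition will land on
that single node.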

On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
m.vijayaragh...@gmail.com> wrote:

> Hello All,
>
> We are using HashPartitioner in the following way on a 3 node cluster (1
> master and 2 worker nodes).
>
> val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt")
>   .map[(Int, Int)](line => line.split("\\|") match {
>     case Array(x, y) => (y.toInt, x.toInt) })
>   .partitionBy(new HashPartitioner(8)).setName("u").persist()
>
> u.count()
>
> If we run this from the Spark shell, the data (52 MB) is split across the
> two worker nodes. But if we put this in a Scala program and run it, all
> the data goes to only one node. We have run it multiple times, but this
> behavior does not change. This seems strange.
>
> Is there some problem with the way we use HashPartitioner?
>
> Thanks in advance.
>
> Regards,
> Raghava.
>
