The maven command needs to be passed through the --mvn option.

Cheers
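P.S. A concrete sketch of that invocation, assuming for illustration that your own Maven lives at /usr/local/bin/mvn (adjust the path to your install):

  ./make-distribution.sh --tgz --mvn /usr/local/bin/mvn -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver

Also drop the trailing "clean package": if I recall the 1.4 script correctly, make-distribution.sh supplies its own Maven goals, and everything you pass is forwarded verbatim to the "mvn help:evaluate -Dexpression=project.version" call visible at the end of your trace. The stray goals turn that quick version lookup into a full build whose output is swallowed by "grep -v INFO | tail -n 1", which is why the script appears to hang with no progress.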
On Sun, Jun 28, 2015 at 12:56 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:

> Running this now:
>
> ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0
> -Phive -Phive-thriftserver -DskipTests clean package
>
> Waiting for it to complete. There is no progress after the initial log
> messages.
>
> //LOGS
>
> $ ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0
> -Phive -Phive-thriftserver -DskipTests clean package
> +++ dirname ./make-distribution.sh
> ++ cd .
> ++ pwd
> + SPARK_HOME=/Users/dvasthimal/ebay/projects/ep/spark-1.4.0
> + DISTDIR=/Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist
> + SPARK_TACHYON=false
> + TACHYON_VERSION=0.6.4
> + TACHYON_TGZ=tachyon-0.6.4-bin.tar.gz
> + TACHYON_URL=https://github.com/amplab/tachyon/releases/download/v0.6.4/tachyon-0.6.4-bin.tar.gz
> + MAKE_TGZ=false
> + NAME=none
> + MVN=/Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn
> + (( 9 ))
> + case $1 in
> + MAKE_TGZ=true
> + shift
> + (( 8 ))
> + case $1 in
> + break
> + '[' -z /Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/ ']'
> + '[' -z /Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/ ']'
> ++ command -v git
> + '[' /usr/bin/git ']'
> ++ git rev-parse --short HEAD
> ++ :
> + GITREV=
> + '[' '!' -z '' ']'
> + unset GITREV
> ++ command -v /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn
> + '[' '!' /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn ']'
> ++ /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn help:evaluate -Dexpression=project.version -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
> ++ grep -v INFO
> ++ tail -n 1
>
> //LOGS
>
> On Sun, Jun 28, 2015 at 12:17 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>
>> I just did that. Where can I find that "spark-1.4.0-bin-hadoop2.4.tgz"
>> file?
>>
>> On Sun, Jun 28, 2015 at 12:15 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> You can use the following command to build Spark after applying the
>>> pull request:
>>>
>>> mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package
>>>
>>> Cheers
>>>
>>> On Sun, Jun 28, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>
>>>> I see that block join support did not make it into the Spark 1.4
>>>> release.
>>>>
>>>> Can you share instructions for building Spark with this support for a
>>>> Hadoop 2.4.x distribution?
>>>>
>>>> Much appreciated.
>>>>
>>>> On Fri, Jun 26, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>>
>>>>> This is nice. Which version of Spark has this support? Or do I need
>>>>> to build it? I have never built Spark from git; please share
>>>>> instructions for Hadoop 2.4.x YARN.
>>>>>
>>>>> I am struggling a lot to get a join to work between 200G and 2TB
>>>>> datasets.
>>>>> I am constantly getting this exception; thousands of executors are
>>>>> failing with:
>>>>>
>>>>> 15/06/26 13:05:28 ERROR storage.ShuffleBlockFetcherIterator: Failed to get block(s) from phxdpehdc9dn2125.stratus.phx.ebay.com:60162
>>>>> java.io.IOException: Failed to connect to executor_host_name/executor_ip_address:60162
>>>>>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>>>>>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>>>>>         at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>>>>>         at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>>>>>         at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>>>>>         at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>>>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>> On Fri, Jun 26, 2015 at 3:20 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>>> We went through a similar process, switching from Scalding (where
>>>>>> everything just works on large datasets) to Spark (where it does not).
>>>>>>
>>>>>> Spark can be made to work on very large datasets; it just requires a
>>>>>> little more effort. Pay attention to your storage levels (should be
>>>>>> memory-and-disk or disk-only), the number of partitions (should be
>>>>>> large, a multiple of the number of executors), and avoid groupByKey.
>>>>>>
>>>>>> Also see:
>>>>>> https://github.com/tresata/spark-sorted (for avoiding in-memory
>>>>>> operations for certain types of reduce operations)
>>>>>> https://github.com/apache/spark/pull/6883 (for blockJoin)
>>>>>>
>>>>>> On Fri, Jun 26, 2015 at 5:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>>>>
>>>>>>> Not far at all. On large datasets everything simply fails with
>>>>>>> Spark. Worst of all, I am not able to figure out the reason for the
>>>>>>> failures: the logs run into millions of lines, and I do not know
>>>>>>> which keywords to search for to find the failure reason.
>>>>>>>
>>>>>>> On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf <nightwolf...@gmail.com> wrote:
>>>>>>>
>>>>>>>> How far did you get?
>>>>>>>>
>>>>>>>> On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We use Scoobi + MR to perform joins, and we particularly use the
>>>>>>>>> blockJoin() API of Scoobi:
>>>>>>>>>
>>>>>>>>> /** Perform an equijoin with another distributed list where this list is considerably smaller
>>>>>>>>>  * than the right (but too large to fit in memory), and where the keys of right may be
>>>>>>>>>  * particularly skewed. */
>>>>>>>>> def blockJoin[B : WireFormat](right: DList[(K, B)]): DList[(K, (A, B))] =
>>>>>>>>>   Relational.blockJoin(left, right)
>>>>>>>>>
>>>>>>>>> I am trying to do a POC; which Spark join API(s) are recommended
>>>>>>>>> to achieve something similar?
>>>>>>>>>
>>>>>>>>> Please suggest.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Deepak
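Coming back to the blockJoin question at the bottom of the thread: until the pull request above (https://github.com/apache/spark/pull/6883) lands in a release, one common hand-rolled approximation with the plain RDD API is a salted join: replicate the smaller side n times and scatter the skewed side's records across those copies with a random salt. The sketch below is illustrative only (SkewJoin and saltedJoin are made-up names, not a Spark API) and assumes, like Scoobi's blockJoin, a small side that is too big to broadcast but cheap enough to replicate a few times:

import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

object SkewJoin {
  // Illustrative sketch, not a Spark API: approximate a block join by
  // replicating the smaller side n times and salting the larger, possibly
  // skewed side, so each hot key is spread over n reducers instead of one.
  def saltedJoin[K: ClassTag, A: ClassTag, B: ClassTag](
      small: RDD[(K, A)],  // considerably smaller, but too big to broadcast
      large: RDD[(K, B)],  // large side whose keys may be heavily skewed
      n: Int               // replication factor; higher spreads skew further
    ): RDD[(K, (A, B))] = {
    // Replicate every small-side record under n distinct composite keys.
    val replicated = small.flatMap { case (k, a) =>
      (0 until n).iterator.map(i => ((k, i), a))
    }
    // Tag each large-side record with a random salt in [0, n).
    val salted = large.map { case (k, b) => ((k, Random.nextInt(n)), b) }
    // Join on (key, salt) with a large partition count, per the advice
    // above, then strip the salt from the result.
    replicated
      .join(salted, new HashPartitioner(large.partitions.length * 2))
      .map { case ((k, _), (a, b)) => (k, (a, b)) }
  }
}

Usage would be something like SkewJoin.saltedJoin(smallRdd, largeRdd, 32), with both inputs persisted at StorageLevel.MEMORY_AND_DISK as Koert suggests. The n-fold replication of the small side is the price you pay, and the blockJoin in the PR is smarter about it, so treat this as a stopgap rather than a substitute.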