This is nice. Which version of Spark has this support? Or do I need to build it myself? I have never built Spark from git; please share instructions for Hadoop 2.4.x on YARN.
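(The usual way to build from git for that combination, per the "Building Spark" docs for the 1.x line, is a Maven profile build along these lines; treat it as a sketch and verify the profiles against the branch you check out:

    git clone https://github.com/apache/spark.git
    cd spark
    build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
)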
I am struggling to get a join working between a 200 GB and a 2 TB dataset. Thousands of executors keep failing with this exception:

15/06/26 13:05:28 ERROR storage.ShuffleBlockFetcherIterator: Failed to get block(s) from phxdpehdc9dn2125.stratus.phx.ebay.com:60162
java.io.IOException: Failed to connect to executor_host_name/executor_ip_address:60162
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

On Fri, Jun 26, 2015 at 3:20 PM, Koert Kuipers <ko...@tresata.com> wrote:

> We went through a similar process, switching from Scalding (where
> everything just works on large datasets) to Spark (where it does not).
>
> Spark can be made to work on very large datasets; it just requires a
> little more effort. Pay attention to your storage levels (should be
> memory-and-disk or disk-only), the number of partitions (should be
> large, a multiple of the number of executors), and avoid groupByKey.
>
> Also see:
> https://github.com/tresata/spark-sorted (for avoiding in-memory
> operations for certain types of reduce operations)
> https://github.com/apache/spark/pull/6883 (for blockjoin)
>
> On Fri, Jun 26, 2015 at 5:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>
>> Not far at all. On large datasets everything simply fails with Spark.
>> Worst of all, I am not able to figure out the reason for the failures:
>> the logs run into millions of lines, and I do not know which keywords
>> to search for.
>>
>> On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf <nightwolf...@gmail.com> wrote:
>>
>>> How far did you get?
>>>
>>> On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>
>>>> We use Scoobi + MR to perform joins, and in particular we use
>>>> Scoobi's blockJoin() API:
>>>>
>>>> /** Perform an equijoin with another distributed list where this list
>>>>  *  is considerably smaller than the right (but too large to fit in
>>>>  *  memory), and where the keys of right may be particularly skewed. */
>>>> def blockJoin[B : WireFormat](right: DList[(K, B)]): DList[(K, (A, B))] =
>>>>   Relational.blockJoin(left, right)
>>>>
>>>> I am doing a POC: which Spark join API(s) are recommended to achieve
>>>> something similar?
>>>>
>>>> Please suggest.
>>>>
>>>> --
>>>> Deepak
>>
>> --
>> Deepak

--
Deepak
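To make Koert's advice concrete, here is a minimal sketch that can be pasted into spark-shell (where sc already exists); the input path, the tab-delimited parsing, and the partition count are illustrative placeholders, not values from this thread:

    import org.apache.spark.SparkContext._ // pair-RDD implicits; already in scope in spark-shell
    import org.apache.spark.storage.StorageLevel

    // Persist to disk as well as memory so large shuffle inputs spill
    // instead of OOMing or being recomputed from scratch.
    val pairs = sc.textFile("hdfs:///path/to/input")
      .map { line =>
        val fields = line.split('\t')
        (fields(0), 1L)
      }
      .persist(StorageLevel.MEMORY_AND_DISK)

    // reduceByKey combines values map-side; groupByKey would ship every
    // value for a key to a single task and can blow up on skewed data.
    val counts = pairs.reduceByKey(_ + _, 2048) // many partitions, a multiple of num executors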
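As for a Spark analogue of Scoobi's blockJoin: there is no built-in equivalent in the released RDD API (the pull request linked above proposes one), but the same replicate-and-salt idea can be hand-rolled. The helper below is a hypothetical sketch, not a Spark API; `replication` controls how many partitions a single hot key is spread across, at the cost of duplicating the smaller side that many times:

    import scala.reflect.ClassTag
    import scala.util.Random

    import org.apache.spark.rdd.RDD

    // Hypothetical helper mirroring Scoobi's blockJoin: `left` is the
    // smaller side (but too big to broadcast); `right` may have skewed keys.
    def blockJoinRdd[K: ClassTag, A: ClassTag, B: ClassTag](
        left: RDD[(K, A)],
        right: RDD[(K, B)],
        replication: Int,
        numPartitions: Int): RDD[(K, (A, B))] = {
      // Copy each left record once per salt value 0..replication-1 ...
      val saltedLeft = left.flatMap { case (k, a) =>
        (0 until replication).map(i => ((k, i), a))
      }
      // ... and give each right record one random salt, so one hot key's
      // values land in `replication` different partitions. (Salts are
      // regenerated on stage retry; seed per partition if that matters.)
      val saltedRight = right.map { case (k, b) =>
        ((k, Random.nextInt(replication)), b)
      }
      saltedLeft
        .join(saltedRight, numPartitions)
        .map { case ((k, _), (a, b)) => (k, (a, b)) }
    }

With replication = 10, for example, a key holding a tenth of the right side is processed by ten tasks instead of one, which is the same trade-off Scoobi's blockJoin makes.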