The maven command needs to be passed through the --mvn option.

Cheers
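P.S. A concrete sketch of that invocation, assuming for illustration that your own Maven lives at /usr/local/bin/mvn (adjust the path to your install):

  ./make-distribution.sh --tgz --mvn /usr/local/bin/mvn -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver

Also drop the trailing "clean package": if I recall the 1.4 script correctly, make-distribution.sh supplies its own Maven goals, and everything you pass is forwarded verbatim to the "mvn help:evaluate -Dexpression=project.version" call visible at the end of your trace. The stray goals turn that quick version lookup into a full build whose output is swallowed by "grep -v INFO | tail -n 1", which is why the script appears to hang with no progress.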
On Sun, Jun 28, 2015 at 12:56 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:

> Running this now:
>
> ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0
> -Phive -Phive-thriftserver -DskipTests clean package
>
> Waiting for it to complete. There is no progress after the initial log
> messages.
>
> //LOGS
>
> $ ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0
> -Phive -Phive-thriftserver -DskipTests clean package
> +++ dirname ./make-distribution.sh
> ++ cd .
> ++ pwd
> + SPARK_HOME=/Users/dvasthimal/ebay/projects/ep/spark-1.4.0
> + DISTDIR=/Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist
> + SPARK_TACHYON=false
> + TACHYON_VERSION=0.6.4
> + TACHYON_TGZ=tachyon-0.6.4-bin.tar.gz
> + TACHYON_URL=https://github.com/amplab/tachyon/releases/download/v0.6.4/tachyon-0.6.4-bin.tar.gz
> + MAKE_TGZ=false
> + NAME=none
> + MVN=/Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn
> + (( 9 ))
> + case $1 in
> + MAKE_TGZ=true
> + shift
> + (( 8 ))
> + case $1 in
> + break
> + '[' -z /Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/ ']'
> + '[' -z /Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/ ']'
> ++ command -v git
> + '[' /usr/bin/git ']'
> ++ git rev-parse --short HEAD
> ++ :
> + GITREV=
> + '[' '!' -z '' ']'
> + unset GITREV
> ++ command -v /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn
> + '[' '!' /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn ']'
> ++ /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/build/mvn help:evaluate -Dexpression=project.version -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
> ++ grep -v INFO
> ++ tail -n 1
>
> //LOGS
>
> On Sun, Jun 28, 2015 at 12:17 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>
>> I just did that. Where can I find that "spark-1.4.0-bin-hadoop2.4.tgz"
>> file?
>>
>> On Sun, Jun 28, 2015 at 12:15 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> You can use the following command to build Spark after applying the
>>> pull request:
>>>
>>> mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package
>>>
>>> Cheers
>>>
>>> On Sun, Jun 28, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>
>>>> I see that block join support did not make it into the Spark 1.4
>>>> release.
>>>>
>>>> Can you share instructions for building Spark with this support for a
>>>> Hadoop 2.4.x distribution?
>>>>
>>>> Much appreciated.
>>>>
>>>> On Fri, Jun 26, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>>
>>>>> This is nice. Which version of Spark has this support? Or do I need
>>>>> to build it? I have never built Spark from git; please share
>>>>> instructions for Hadoop 2.4.x YARN.
>>>>>
>>>>> I am struggling a lot to get a join to work between 200G and 2TB
>>>>> datasets.
>>>>> I am constantly getting this exception; thousands of executors are
>>>>> failing with:
>>>>>
>>>>> 15/06/26 13:05:28 ERROR storage.ShuffleBlockFetcherIterator: Failed to get block(s) from phxdpehdc9dn2125.stratus.phx.ebay.com:60162
>>>>> java.io.IOException: Failed to connect to executor_host_name/executor_ip_address:60162
>>>>>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>>>>>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>>>>>         at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>>>>>         at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>>>>>         at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>>>>>         at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>>>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>> On Fri, Jun 26, 2015 at 3:20 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>>> We went through a similar process, switching from Scalding (where
>>>>>> everything just works on large datasets) to Spark (where it does not).
>>>>>>
>>>>>> Spark can be made to work on very large datasets; it just requires a
>>>>>> little more effort. Pay attention to your storage levels (should be
>>>>>> memory-and-disk or disk-only), the number of partitions (should be
>>>>>> large, a multiple of the number of executors), and avoid groupByKey.
>>>>>>
>>>>>> Also see:
>>>>>> https://github.com/tresata/spark-sorted (for avoiding in-memory
>>>>>> operations for certain types of reduce operations)
>>>>>> https://github.com/apache/spark/pull/6883 (for blockJoin)
>>>>>>
>>>>>> On Fri, Jun 26, 2015 at 5:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>>>>
>>>>>>> Not far at all. On large datasets everything simply fails with
>>>>>>> Spark. Worst of all, I am not able to figure out the reason for the
>>>>>>> failures: the logs run into millions of lines, and I do not know
>>>>>>> which keywords to search for to find the failure reason.
>>>>>>>
>>>>>>> On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf <nightwolf...@gmail.com> wrote:
>>>>>>>
>>>>>>>> How far did you get?
>>>>>>>>
>>>>>>>> On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We use Scoobi + MR to perform joins, and we particularly use the
>>>>>>>>> blockJoin() API of Scoobi:
>>>>>>>>>
>>>>>>>>> /** Perform an equijoin with another distributed list where this list is considerably smaller
>>>>>>>>>  * than the right (but too large to fit in memory), and where the keys of right may be
>>>>>>>>>  * particularly skewed. */
>>>>>>>>> def blockJoin[B : WireFormat](right: DList[(K, B)]): DList[(K, (A, B))] =
>>>>>>>>>   Relational.blockJoin(left, right)
>>>>>>>>>
>>>>>>>>> I am trying to do a POC; which Spark join API(s) are recommended
>>>>>>>>> to achieve something similar?
>>>>>>>>>
>>>>>>>>> Please suggest.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Deepak
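Coming back to the blockJoin question at the bottom of the thread: until the pull request above (https://github.com/apache/spark/pull/6883) lands in a release, one common hand-rolled approximation with the plain RDD API is a salted join: replicate the smaller side n times and scatter the skewed side's records across those copies with a random salt. The sketch below is illustrative only (SkewJoin and saltedJoin are made-up names, not a Spark API) and assumes, like Scoobi's blockJoin, a small side that is too big to broadcast but cheap enough to replicate a few times:

import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

object SkewJoin {
  // Illustrative sketch, not a Spark API: approximate a block join by
  // replicating the smaller side n times and salting the larger, possibly
  // skewed side, so each hot key is spread over n reducers instead of one.
  def saltedJoin[K: ClassTag, A: ClassTag, B: ClassTag](
      small: RDD[(K, A)],  // considerably smaller, but too big to broadcast
      large: RDD[(K, B)],  // large side whose keys may be heavily skewed
      n: Int               // replication factor; higher spreads skew further
    ): RDD[(K, (A, B))] = {
    // Replicate every small-side record under n distinct composite keys.
    val replicated = small.flatMap { case (k, a) =>
      (0 until n).iterator.map(i => ((k, i), a))
    }
    // Tag each large-side record with a random salt in [0, n).
    val salted = large.map { case (k, b) => ((k, Random.nextInt(n)), b) }
    // Join on (key, salt) with a large partition count, per the advice
    // above, then strip the salt from the result.
    replicated
      .join(salted, new HashPartitioner(large.partitions.length * 2))
      .map { case ((k, _), (a, b)) => (k, (a, b)) }
  }
}

Usage would be something like SkewJoin.saltedJoin(smallRdd, largeRdd, 32), with both inputs persisted at StorageLevel.MEMORY_AND_DISK as Koert suggests. The n-fold replication of the small side is the price you pay, and the blockJoin in the PR is smarter about it, so treat this as a stopgap rather than a substitute.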