Oh, and in reply to the question (many threads ago) about globs/directories for input paths: yes, they work just fine.
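For readers of the archive, a minimal sketch of what that looks like; the class name and paths below are placeholders, not anything from the thread:

    import com.cloudera.crunch.PCollection;
    import com.cloudera.crunch.Pipeline;
    import com.cloudera.crunch.impl.mr.MRPipeline;

    public class GlobReadExample {
      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(GlobReadExample.class);
        // A glob works the same way a single file or a directory does here.
        PCollection<String> lines = pipeline.readTextFile("data/lineitems*");
        pipeline.writeTextFile(lines, "out");
        pipeline.done();
      }
    }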
On Sun, Jun 17, 2012 at 9:59 PM, Josh Wills <[email protected]> wrote:
> Hey Stefan,
>
> Reply inlined.
>
> On Sat, Jun 16, 2012 at 6:03 AM, <[email protected]> wrote:
>> Hey Josh,
>>
>> @TupleWritables: Yes, I saw that. I didn't think too much about the
>> performance implications, though. Writing all the class info is
>> unnecessary because it is statically known, and an identifier would help
>> with that. But using reflection costs much more CPU time than avoiding
>> it, as my little experiment showed, and an identifier will not help with
>> that.
>>
>> Kryo uses a neat trick for cutting down serialization size. By forcing
>> the user to register all classes in a well-defined sequence, it can use
>> the index of a class in that sequence to describe it. Not as good as
>> describing a whole schema, but generally applicable for serialization.
>>
>> Using Avro internally sounds like a good idea. I don't know it too well,
>> though. So you would give up the TupleWritables completely, replace them
>> with AvroWritables, and use the mentioned commit to use Writables within
>> Avro? What do you mean by "I think there would be some cool stuff we
>> could do if we could assume how all shuffle serializations worked, ..."?
>> Well, the reflection call would stay, but that can be solved
>> independently, I guess (for example by having an optional factory
>> argument in BytesToWritableMapFn).
>
> Re: cool stuff, I suspect that having Avro for all of the intermediate
> data transfer would dramatically simplify the implementation of MSCR
> fusion, which I've been putting off b/c having an abstraction that would
> handle fusion for both Writables and Avro makes my head hurt. I'm going
> to start a thread on [email protected] advocating for it.
>
>> What still bothers me is that the partitioner for the join, which only
>> has to ensure that the bit indicating whether a record is left or right
>> is ignored, still does much too much work. If it could just get the
>> serialized bytes for the key, it could compute the hash code over the
>> byte array directly, simply ignoring the last byte. The deserialization
>> there is unnecessary and is, at least in my profiling, what seemed to
>> hurt the performance of the join so badly. Maybe some more benchmarking
>> is needed for this, though.
>
> That doesn't sound right-- isn't the Partitioner called at the end of
> the map task, before the data is ever serialized?
>
>> Guess I should contact Alexy Khrabrov then.
>>
>> Cheers,
>> Stefan
>>
>> PS: Sorry, I replied directly instead of to the mailing list.
>>
>> On Friday, June 15, 2012 at 17:37:06 UTC+2, Josh Wills wrote:
>>>
>>> On Fri, Jun 15, 2012 at 2:24 AM, <[email protected]> wrote:
>>> > Hey Josh,
>>> >
>>> > I actually first found Pangool and from there concluded that Crunch
>>> > is worth a try. So you are saying that the TupleWritables are quite
>>> > slow in general, and the performance of Crunch is in that case not
>>> > comparable to pure Hadoop? If it really makes a difference, we
>>> > should look into generating TupleWritable subclasses for our project
>>> > (if there is another round of benchmarks).
>>>
>>> The impl of TupleWritable in Crunch is too conservative, in that it
>>> passes along the names of the Writable classes that it serialized in
>>> the tuple along with the actual data, which leads to a pretty massive
>>> blowup in the amount of data that gets passed around. My initial
>>> thought in doing this was that I wanted to be sure that the Crunch
>>> output was always readable by anything-- e.g., you didn't have to use
>>> Crunch in order to read the data back, since all of the information on
>>> what the data contained was there. This is essentially what Avro gets
>>> you, although the Avro file format is smarter about putting the schema
>>> once at the head of the file instead of copying it over and over
>>> again, which is why we generally recommend Avro + Crunch. The #s in
>>> the Pangool benchmark were all based on Avro.
>>>
>>> The tuple-oriented frameworks handle this by having integer
>>> identifiers for the different data types that are supported, and
>>> that's certainly one way to improve the performance here. I've also
>>> been tossing around the idea of using Avro for everything internal to
>>> the framework (e.g., any data transferred during the shuffle stage),
>>> since I just added a way for AvroTypes to support arbitrary Writables:
>>>
>>> https://github.com/cloudera/crunch/commit/224102ac4813fc0e124114026438a2e3884f858b
>>>
>>> I think there would be some cool stuff we could do if we could assume
>>> how all shuffle serializations worked, but I haven't benchmarked it
>>> yet or tossed the idea around with the other committers.
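To make the registration idea above concrete, here is a minimal sketch of a Kryo-style registry for Writable classes. Nothing here is actual Crunch or Kryo code; WritableRegistry and its methods are hypothetical, shown only to illustrate how a fixed registration order turns class names into small integer IDs:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableUtils;

    // Hypothetical registry in the spirit of Kryo: every job registers the
    // same classes in the same order, so a small varint index can replace
    // the class name on the wire and Class.forName() leaves the read path.
    public class WritableRegistry {
      private final List<Class<? extends Writable>> byIndex =
          new ArrayList<Class<? extends Writable>>();
      private final Map<Class<?>, Integer> byClass =
          new HashMap<Class<?>, Integer>();

      // Must be called in the same well-defined sequence on every node.
      public void register(Class<? extends Writable> clazz) {
        byClass.put(clazz, byIndex.size());
        byIndex.add(clazz);
      }

      public void writeClass(DataOutput out, Class<?> clazz) throws IOException {
        // Usually a single byte, versus a full UTF-8 class name per field.
        WritableUtils.writeVInt(out, byClass.get(clazz));
      }

      public Writable newInstance(DataInput in) throws IOException {
        try {
          return byIndex.get(WritableUtils.readVInt(in)).newInstance();
        } catch (Exception e) {
          throw new IOException(e);
        }
      }
    }

With a registry like this, a tuple serializer could emit a one-byte index per field instead of a full class name; the cost is the Kryo-style contract that every job registers the same classes in the same order.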
>>>
>>> > You didn't comment on the SourceTargetHelper. Do glob patterns now
>>> > work fine as inputs? It just failed in 0.2.4 when I specified
>>> > lineitems* as input in local mode (no HDFS), while it worked fine in
>>> > Scoobi.
>>>
>>> Sorry I missed that-- I was under the impression it did work, but I'll
>>> take a look at it and report back.
>>>
>>> > Right now I just need to finish my thesis, so currently I am not
>>> > planning to commit anywhere.
>>>
>>> It never hurts to ask. I saw your threads on the Spark and Scoobi
>>> mailing lists and enjoyed them-- please keep us posted on what you're
>>> up to. Alexy Khrabrov at Klout was also interested in an abstraction
>>> layer for Scala MapReduce vs. Spark-- did you talk to him?
>>>
>>> > Cheers,
>>> > Stefan
>>> >
>>> > On Thursday, June 14, 2012 at 16:42:43 UTC+2, Josh Wills wrote:
>>> >>
>>> >> Hey Stefan,
>>> >>
>>> >> Thanks for your email. Re: join performance, I agree with you that
>>> >> the current implementation that uses Writables is pretty terrible,
>>> >> and I've been thinking about good ways to do away with it for a
>>> >> while now. The good news is that the Avro-based implementation of
>>> >> PTypes is pretty fast, and that's what most people end up using in
>>> >> practice. For example, when the Pangool guys were benchmarking
>>> >> frameworks, they used the Avro PTypes, and Crunch usually runs
>>> >> pretty close to native MR: http://pangool.net/benchmark.html
>>> >>
>>> >> Re: the gist, I will take a look at it. The good news about the
>>> >> pull request is that we just submitted Crunch to the Apache
>>> >> Incubator and are in the process of moving over all of the
>>> >> infrastructure. It would be great to have your patches in when
>>> >> we're done-- we're always on the lookout for people who are
>>> >> interested in becoming committers.
>>> >>
>>> >> Best,
>>> >> Josh
>>> >>
>>> >> On Thu, Jun 14, 2012 at 6:53 AM, <[email protected]> wrote:
>>> >> > Hi,
>>> >> >
>>> >> > I did some more investigations. You set the partitioner
>>> >> > correctly, but you do not set the comparator. Actually, the
>>> >> > comparator might not be needed, because the value tagged with 0
>>> >> > will always come first under the default comparator? If you rely
>>> >> > on that, then maybe you should put a comment there, as it is not
>>> >> > immediately obvious.
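A note on the raw-bytes idea from earlier in the thread: Josh's point about the Partitioner seeing deserialized keys stands, but Hadoop's sort and grouping comparators do run over serialized bytes, so that is where raw comparison can pay off. A minimal sketch of a grouping comparator for a tagged join key, assuming (hypothetically) that the left/right tag is serialized as the single last byte of the key; the class name is made up:

    import org.apache.hadoop.io.RawComparator;
    import org.apache.hadoop.io.WritableComparator;

    // Hypothetical raw grouping comparator for a tagged join key. It assumes
    // the left/right tag is the single trailing byte of the serialized key,
    // so grouping can compare raw bytes and simply leave that byte out --
    // no deserialization, no reflection.
    public class TaggedKeyGroupingComparator implements RawComparator<Object> {

      @Override
      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Compare everything except the trailing tag byte (assumes l1, l2 >= 1).
        return WritableComparator.compareBytes(b1, s1, l1 - 1, b2, s2, l2 - 1);
      }

      @Override
      public int compare(Object a, Object b) {
        throw new UnsupportedOperationException("raw comparison only");
      }
    }

Grouping on everything but the tag byte keeps the left and right records for a key in one reduce group, while the sort comparator can still order tag 0 before tag 1.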
>>> >> >
>>> >> > The performance problem with joins (as shown by profiling) is
>>> >> > that Hadoop does not know about the PTypes. So to deserialize
>>> >> > during sorting, the general TupleWritable is used, which is very
>>> >> > inefficient because it uses reflection. I added a cache for the
>>> >> > Class.forName call in readFields and it improves performance a
>>> >> > lot, but it is not a definitive answer to the problem. Maybe you
>>> >> > could add a special writable for TaggedKeys, which knows that one
>>> >> > of them is an Integer. To define the partitioning, the actual
>>> >> > content does not matter, just that the tagging value is ignored.
>>> >> > Scoobi even generates classes for TaggedKeys and TaggedValues;
>>> >> > that would most likely be even faster.
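The caching change described above could look roughly like this; ClassCache is a hypothetical name, not the actual code from the gist:

    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical cache in the spirit of the change described above:
    // readFields() resolves each class name once and reuses the Class
    // object afterwards, instead of calling Class.forName() per record.
    public final class ClassCache {
      private static final ConcurrentHashMap<String, Class<?>> CACHE =
          new ConcurrentHashMap<String, Class<?>>();

      private ClassCache() {}

      public static Class<?> forName(String name) throws ClassNotFoundException {
        Class<?> clazz = CACHE.get(name);
        if (clazz == null) {
          clazz = Class.forName(name);
          CACHE.putIfAbsent(name, clazz); // racing threads resolve the same class
        }
        return clazz;
      }
    }

Inside readFields, Class.forName(className) would then become ClassCache.forName(className), so the lookup hits the map on every record after the first.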
>>> >> >
>>> >> > Here are some numbers for a 1GB join on my laptop; the second
>>> >> > column is the total number of seconds.
>>> >> >
>>> >> > My version of the join, which tags the values by adding a boolean
>>> >> > to them. Note that in most cases, where the size of one group does
>>> >> > not come close to the available RAM, this is totally fine.
>>> >> > Implemented in [1].
>>> >> > crunch_4  33.42  13  2.06   26.82  6526368  86%
>>> >> > Crunch's original join:
>>> >> > crunch_4 150.11  13  2.79  143.50  7924496  97%
>>> >> > Crunch's join with my caching changes in TupleWritable:
>>> >> > crunch_4  69.67  13  2.51   59.99  7965808  89%
>>> >> >
>>> >> > Too lazy to open a pull request; here is a gist of my changes for
>>> >> > the join: https://gist.github.com/2930414
>>> >> > These are for 0.2.4; I hope they still apply to trunk.
>>> >> >
>>> >> > Regards,
>>> >> > Stefan Ackermann
>>> >> >
>>> >> > [1] https://github.com/Stivo/Distributed/blob/master/crunch/src/main/scala/ch/epfl/distributed/utils/CrunchUtils.scala#L83
>>> >> >
>>> >> > On Thursday, June 14, 2012 at 12:24:34 UTC+2, [email protected]
>>> >> > wrote:
>>> >> >>
>>> >> >> Hi,
>>> >> >>
>>> >> >> I am writing a distributed-collections DSL with compiler
>>> >> >> optimizations as my master's thesis. I have added Crunch as a
>>> >> >> backend, since we were not happy with the performance we were
>>> >> >> getting from our other backend for Hadoop. And I must say, I am
>>> >> >> quite impressed by Crunch's performance. Here is some sample code
>>> >> >> our DSL generates, in case you are interested:
>>> >> >>
>>> >> >> https://github.com/Stivo/Distributed/blob/bigdata2012/crunch/src/main/scala/generated/v4/
>>> >> >>
>>> >> >> We are impressed with the performance, except for the joins. In
>>> >> >> local benchmarks, joins performed 100x worse than my own join
>>> >> >> implementation (no joke, 100x). It also seems that the current
>>> >> >> implementation has some bugs. You are setting the partitioner
>>> >> >> class to the comparator in
>>> >> >>
>>> >> >> https://github.com/cloudera/crunch/blob/master/src/main/java/com/cloudera/crunch/lib/Join.java#L144
>>> >> >>
>>> >> >> and you are not setting the partitioner class itself. It seems to
>>> >> >> me the code is all there; it's just not linked up correctly.
>>> >> >> For the performance, maybe you should have different versions of
>>> >> >> the comparator / partitioner for different key types instead of
>>> >> >> writing the key out just to compare it. String comparison, for
>>> >> >> example, only checks characters until it finds a difference.
>>> >> >>
>>> >> >> I encountered lots of problems with SourceTargetHelper in 0.2.4.
>>> >> >> I know you changed it since then, but I think I also had troubles
>>> >> >> with the new version. Does it support using glob patterns or
>>> >> >> directories as input? In any case, it should not prevent the
>>> >> >> program from running, imho. I spent quite a while just trying to
>>> >> >> work around that bug.
>>> >> >>
>>> >> >> Regards,
>>> >> >> Stefan Ackermann
>>> >>
>>> >> --
>>> >> Director of Data Science
>>> >> Cloudera
>>> >> Twitter: @josh_wills
>>>
>>> --
>>> Director of Data Science
>>> Cloudera
>>> Twitter: @josh_wills
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills

--
Director of Data Science
Cloudera
Twitter: @josh_wills
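A postscript on the per-key-type comparator idea above: Hadoop itself ships such a comparator for strings (Text.Comparator), which compares the serialized UTF-8 bytes directly and stops at the first differing byte. A sketch of that pattern, which a tuple key type could imitate; the class name is made up:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.io.WritableUtils;

    // Hypothetical per-type raw comparator in the style of Hadoop's built-in
    // Text.Comparator: it walks the serialized bytes and stops at the first
    // difference, never deserializing a String just to compare it.
    public class RawTextComparator extends WritableComparator {
      public RawTextComparator() {
        super(Text.class);
      }

      @Override
      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // A Text value is serialized as a vint length followed by UTF-8 bytes.
        int n1 = WritableUtils.decodeVIntSize(b1[s1]);
        int n2 = WritableUtils.decodeVIntSize(b2[s2]);
        return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
      }
    }

Registering a comparator like this for the shuffle avoids deserializing two keys just to order them, which is exactly the cost the thread is complaining about.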
