Hey Micah,

This should fix the join issue: https://issues.apache.org/jira/browse/CRUNCH-160
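For anyone skimming the thread below: the failing assertions print output like [dog,dog, cat,cat], which is exactly what an inner join produces when both inputs are the same dataset, consistent with the two HBase table reads collapsing into one. A minimal, self-contained sketch of that symptom (plain Java maps stand in for the two PTables; all names here are illustrative, not the actual test code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SelfJoinSymptom {

  // A toy inner join over two key->value maps, standing in for joining
  // the "words" table with the "others" table in the failing test.
  static List<String> innerJoin(Map<String, String> left, Map<String, String> right) {
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, String> e : left.entrySet()) {
      String other = right.get(e.getKey());
      if (other != null) {
        out.add(e.getValue() + "," + other);
      }
    }
    Collections.sort(out); // deterministic ordering for display
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> words = new HashMap<>();
    words.put("k1", "cat");
    words.put("k2", "dog");
    Map<String, String> others = new HashMap<>();
    others.put("k1", "zebra");
    others.put("k2", "bird");

    // Expected behavior: each word pairs with the other table's value.
    System.out.println(innerJoin(words, others)); // prints [cat,zebra, dog,bird]

    // The bug's symptom: if both splits end up reading the same underlying
    // table, the join effectively pairs the dataset with itself.
    System.out.println(innerJoin(words, words)); // prints [cat,cat, dog,dog]
  }
}
```

The second call mirrors the failure mode reported in the thread: identical left,right pairs on every key instead of cross-table pairs.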
Let me know if it works for you.

J

On Wed, Jan 30, 2013 at 6:08 AM, Josh Wills <[email protected]> wrote:
> Okay, good to know. I'll be back in SF on Friday and will sit down with some
> of my friends who know HBase better than I do and take another look.
>
> J
>
> On Tue, Jan 29, 2013 at 9:12 AM, Micah Whitacre <[email protected]> wrote:
>> Unfortunately, it doesn't look like this is just a test failure, as
>> running against a CDH4.1.1 cluster fails in exactly the same manner.
>> Here is a copy of the code I used [1].
>>
>> [1] - http://pastebin.com/QLEc5fmG
>>
>> On Tue, Jan 29, 2013 at 8:44 AM, Micah Whitacre <[email protected]> wrote:
>>> The problem of reading from the same table twice seems interesting.
>>> At one point, while trying to figure out the problem, I tweaked the test
>>> to run the joined table through the same wordcount steps to make sure
>>> everything was read and then persisted correctly. So the flow of the
>>> test became:
>>>
>>> write to wordcount table
>>> wordcount
>>> write to join table
>>> wordcount the join table (output to a different table)
>>> attempt to join words with others
>>>
>>> That flow would work as expected but still fail on the last join. So
>>> it seems like it is reading in correctly from HBase.
>>>
>>> I am working on building a standalone example and will report back
>>> the findings.
>>>
>>> thanks for your help,
>>> micah
>>>
>>> On Mon, Jan 28, 2013 at 11:55 PM, Josh Wills <[email protected]> wrote:
>>>> I have to call it a night, but this is an odd one.
>>>>
>>>> The basic problem seems to be that we are reading from the same table
>>>> twice -- it seems like the HTable object is the same on both splits
>>>> (always reading from the words table, or always reading from the
>>>> joinTableName table), but the Scan object appears to get updated.
>>>> I verified this by using a different column family on the joinTableName
>>>> table and seeing that the test returned no output for the join, which is
>>>> what we would expect if one of the reads had no input.
>>>>
>>>> Looking in the code, I don't see a place where the 0.92.1 and 0.90.4 code
>>>> differ significantly in terms of the input format, record reader, etc. I'm
>>>> on the road this week, but I'd like to work on this one some more when I'm
>>>> back in SF and can sit down with my co-workers who know more about HBase
>>>> than I do.
>>>>
>>>> Out of curiosity -- is it just the unit test that fails, or can you run a
>>>> real HBase MR job that suffers from this problem?
>>>>
>>>> J
>>>>
>>>> On Mon, Jan 28, 2013 at 7:26 PM, Josh Wills <[email protected]> wrote:
>>>>> Ack, sorry -- I was checking email on my phone and didn't see the patch.
>>>>> I can replicate it locally; digging in now.
>>>>>
>>>>> On Mon, Jan 28, 2013 at 6:47 PM, Whitacre, Micah <[email protected]> wrote:
>>>>>> The patch should contain the specifics, but I've tested using 4.1.1,
>>>>>> 4.1.2, and 4.1.3. Each gives the same results.
>>>>>>
>>>>>> On Jan 28, 2013, at 20:44, "Josh Wills" <[email protected]> wrote:
>>>>>>
>>>>>> I usually run them in Eclipse, but not using a particularly special run
>>>>>> configuration (I think). Let me see if I can replicate that one -- which
>>>>>> CDH version?
>>>>>>
>>>>>> On Mon, Jan 28, 2013 at 3:13 PM, Micah Whitacre <[email protected]> wrote:
>>>>>>> Related to this thread, where I asked how to save off the intermediate
>>>>>>> state: in general, how do you debug the project, specifically the IT
>>>>>>> tests? Do you typically run through Eclipse with special profiles?
>>>>>>> I'm still trying to track down an odd failure in crunch-hbase when
>>>>>>> swapping out the dependencies to use CDH4.1.x. The test failure seems
>>>>>>> to indicate that the test is joining the same PCollection with itself.
>>>>>>>
>>>>>>> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 63.13 sec <<< FAILURE!
>>>>>>> testWordCount(org.apache.crunch.io.hbase.WordCountHBaseIT)  Time elapsed: 62.789 sec <<< FAILURE!
>>>>>>> java.lang.AssertionError: expected:<[cat,zebra, cat,donkey, dog,bird]>
>>>>>>> but was:<[bird,bird, zebra,zebra, horse,horse, donkey,donkey]>
>>>>>>>     at org.junit.Assert.fail(Assert.java:93)
>>>>>>>     at org.junit.Assert.failNotEquals(Assert.java:647)
>>>>>>>     at org.junit.Assert.assertEquals(Assert.java:128)
>>>>>>>     at org.junit.Assert.assertEquals(Assert.java:147)
>>>>>>>     at org.apache.crunch.io.hbase.WordCountHBaseIT.run(WordCountHBaseIT.java:257)
>>>>>>>     at org.apache.crunch.io.hbase.WordCountHBaseIT.testWordCount(WordCountHBaseIT.java:202)
>>>>>>>
>>>>>>> and sometimes:
>>>>>>>
>>>>>>> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 71.958 sec <<< FAILURE!
>>>>>>> testWordCount(org.apache.crunch.io.hbase.WordCountHBaseIT)  Time elapsed: 71.469 sec <<< FAILURE!
>>>>>>> java.lang.AssertionError: expected:<[cat,zebra, cat,donkey, dog,bird]>
>>>>>>> but was:<[dog,dog, cat,cat]>
>>>>>>>     at org.junit.Assert.fail(Assert.java:93)
>>>>>>>     at org.junit.Assert.failNotEquals(Assert.java:647)
>>>>>>>     at org.junit.Assert.assertEquals(Assert.java:128)
>>>>>>>     at org.junit.Assert.assertEquals(Assert.java:147)
>>>>>>>     at org.apache.crunch.io.hbase.WordCountHBaseIT.run(WordCountHBaseIT.java:259)
>>>>>>>     at org.apache.crunch.io.hbase.WordCountHBaseIT.testWordCount(WordCountHBaseIT.java:202)
>>>>>>>
>>>>>>> Most likely for the same reason Crunch requires a special build of
>>>>>>> HBase 0.94.1, I've found I need to mix and match CDH4 versions, as
>>>>>>> shown by the attached patch. For the Crunch core build I need to use
>>>>>>> all of the latest 2.0.0 code, but for testing crunch-hbase I need to
>>>>>>> use the mrv1 fork for hadoop-core and hadoop-minicluster. I wouldn't
>>>>>>> think that either of those would affect the tests unless somehow the
>>>>>>> temporary files used for the intermediate state were not being stored
>>>>>>> correctly. The fact that the test fails differently each time does
>>>>>>> make me wonder about a concurrency issue, but I'm not sure where.
>>>>>>>
>>>>>>> Any pointers on debugging would be helpful.
>>>>>>> Micah
>>>>>>>
>>>>>>> On Thu, Jan 24, 2013 at 2:24 PM, Micah Whitacre <[email protected]> wrote:
>>>>>>>> I am creating an entirely new profile simply to keep my changes
>>>>>>>> separate from what is in apache/master.
>>>>>>>>
>>>>>>>> Thanks for the hint about the "naive" approach.
>>>>>>>> Previously I had the following:
>>>>>>>>
>>>>>>>>   <hadoop.version>2.0.0-cdh4.1.1</hadoop.version>
>>>>>>>>   <hadoop.client.version>2.0.0-mr1-cdh4.1.1</hadoop.client.version>
>>>>>>>>   <hbase.version>0.92.1-cdh4.1.1</hbase.version>
>>>>>>>>
>>>>>>>> If I follow what you did and change it to:
>>>>>>>>
>>>>>>>>   <hadoop.version>2.0.0-cdh4.1.1</hadoop.version>
>>>>>>>>   <hadoop.client.version>2.0.0-cdh4.1.1</hadoop.client.version>
>>>>>>>>   <hbase.version>0.92.1-cdh4.1.1</hbase.version>
>>>>>>>>
>>>>>>>> the build gets farther. I now have a different failure in
>>>>>>>> crunch-hbase that I'll start working on.
>>>>>>>>
>>>>>>>> Thanks for your help.
>>>>>>>> Micah
>>>>>>>>
>>>>>>>> On Thu, Jan 24, 2013 at 12:23 PM, Josh Wills <[email protected]> wrote:
>>>>>>>>> Micah,
>>>>>>>>>
>>>>>>>>> I did the naive thing and just swapped in 2.0.0-cdh4.1.2 for
>>>>>>>>> 2.0.0-alpha in the crunch.platform=2 profile in the top-level POM
>>>>>>>>> and then added in the Cloudera repositories. That works for me --
>>>>>>>>> does it work for you? It sounds to me like you're creating an
>>>>>>>>> entirely new profile.
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>> On Thu, Jan 24, 2013 at 7:58 AM, Micah Whitacre <[email protected]> wrote:
>>>>>>>>>> Running dependency:tree on both projects shows that the version of
>>>>>>>>>> Avro is 1.7.0 when running under both profiles. I wish it was that
>>>>>>>>>> easy.
>>>>>>>>>> :)
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 24, 2013 at 9:53 AM, Josh Wills <[email protected]> wrote:
>>>>>>>>>>> On Thu, Jan 24, 2013 at 6:40 AM, Micah Whitacre <[email protected]> wrote:
>>>>>>>>>>>> Taking a step back and comparing against what is generated for a
>>>>>>>>>>>> normal successful test run of "-Dcrunch.platform=2", I do see a p1
>>>>>>>>>>>> and p2 directory being created, with the expected materialized
>>>>>>>>>>>> output in the p1 directory. So I'm still curious about tracking
>>>>>>>>>>>> all of the intermediate state, but it doesn't look like it is an
>>>>>>>>>>>> issue with the output being created in the wrong directory.
>>>>>>>>>>>
>>>>>>>>>>> That's a relief. :)
>>>>>>>>>>>
>>>>>>>>>>> I think the issue with temp outputs has to do with our use of the
>>>>>>>>>>> TemporaryPath libraries for creating, well, temporary paths. We do
>>>>>>>>>>> this so we play nicely with CI frameworks, but you might need to
>>>>>>>>>>> disable it for investigating intermediate outputs.
>>>>>>>>>>>
>>>>>>>>>>> Re: the specific error you're seeing, that looks interesting. I
>>>>>>>>>>> wonder if it's an Avro version change or some such thing. Will see
>>>>>>>>>>> if I can replicate it.
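The "naive" version swap described in the thread amounts to a small change in the top-level POM's crunch.platform=2 profile. A sketch of what that change might look like (the property name matches the one quoted in the thread; the repository id and URL are my assumptions, based on Cloudera's public Maven repository, not taken from the thread):

```xml
<!-- Sketch: inside the crunch.platform=2 profile of the top-level pom.xml.
     The repository id/url below are assumptions for illustration. -->
<properties>
  <hadoop.version>2.0.0-cdh4.1.2</hadoop.version>
</properties>
<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>
```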
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Director of Data Science
>>>>>>>>>>> Cloudera <http://www.cloudera.com>
>>>>>>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
