Hi Ethan, I just tried the script on toy data and I could reproduce this erroneous behavior when run in Hadoop mode -- both local and Spark modes are fine. I will look into it.
BTW, you forgot to attach the scripts.

Shirish

On Thu, Apr 14, 2016 at 5:02 PM, Ethan Xu <[email protected]> wrote:
> OK, this is interesting:
>
> Scenario 1
> I slightly modified 'sample.dml' to add statements to print the dimensions
> of SM, P and iX, and ran it on the same data. The dimensions AND the output
> were correct. That is, subsets '1' and '2' contain roughly 80% and 20% of
> the original data.
>
> Please see attached:
> sample-debug.dml:
>     sample.dml with 3 print functions inserted
> train-test-debug_1.mtd
> train-test-debug_2.mtd:
>     metadata of the outputs. Note 'rows' are correct.
>
> Scenario 2
> This is confusing, so I commented out the 'print' statements in
> 'sample.dml' and ran it on the same data, and the output was INCORRECT.
> That is, subsets '1' and '2' contain the same rows as the original data.
>
> Please see attached:
> sample-debug-noprint.dml:
>     the 3 print functions were commented out
> train-test-debug-noprint_1.mtd
> train-test-debug-noprint_2.mtd:
>     metadata of the outputs. Note 'rows' are incorrect.
>
> There were no errors in either trial.
>
> Ethan
>
> On Thu, Apr 14, 2016 at 4:37 PM, Ethan Xu <[email protected]> wrote:
>
>> Hello,
>>
>> I encountered unexpected behavior from 'sample.dml' on a dataset on
>> Hadoop. Instead of splitting the data, it replaced rows of the original
>> data with 0's. Here are the details:
>>
>> I called sample.dml in an attempt to split a 35 million by 2396 numeric
>> matrix into 80% and 20% subsets. The two output subsets '1' and '2' both
>> still contain 35 million rows, instead of 35M*80% and 35M*20% rows.
>>
>> However, it looks like 20% of the rows in '1' are replaced with 0's (but
>> not removed). It is as if line 66 of sample.dml
>> (https://github.com/apache/incubator-systemml/blob/master/scripts/utils/sample.dml),
>> which calls removeEmpty(), doesn't exist.
>>
>> Here is the submission script:
>>
>> printf "0.8\n0.2" | hadoop fs -put - /path/split-perc.csv
>> echo '{"data_type": "matrix", "value_type": "double", "rows": 2, "cols":
>> 1, "format": "csv"}' | hadoop fs -put - /path/split-perc.csv.mtd
>>
>> ## Split file into training and test sets
>> hadoop jar $sysJar /path.to.systemML/scripts/utils/sample.dml
>> -config=$sysConfCust -nvargs X=/path/originalData.csv
>> sv=/path/split-perc.csv O=/path/train-test ofmt=csv
>>
>> There were no error messages and all MR jobs executed successfully.
>> What other information can I provide to diagnose the issue?
>>
>> Thanks,
>>
>> Ethan
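For anyone skimming the thread: the splitting behavior Ethan expects from the removeEmpty() call on line 66 of sample.dml can be sketched outside DML. Below is a minimal NumPy illustration (the array and function names are made up for this sketch, not taken from sample.dml): each output subset is obtained by zeroing out the rows assigned to the other subset and then dropping the all-zero rows. The bug described above corresponds to the final drop step having no effect, so every "subset" keeps all original rows with the other subset's rows zeroed in place.

```python
import numpy as np

# Toy stand-in for the original data: 10 rows, 3 columns, all entries nonzero.
X = np.arange(30, dtype=float).reshape(10, 3) + 1.0

# Stand-in for the sample-group assignment: 1 -> ~80% subset, 2 -> ~20% subset.
rng = np.random.default_rng(0)
groups = rng.choice([1, 2], size=X.shape[0], p=[0.8, 0.2])

def split_subset(X, groups, g):
    # Zero out the rows that do not belong to group g...
    masked = X * (groups == g)[:, None]
    # ...then drop the all-zero rows, mimicking removeEmpty(..., margin="rows").
    # Skipping this line reproduces the reported symptom: full row count,
    # with the other subset's rows left as 0's.
    return masked[~np.all(masked == 0, axis=1)]

subset1 = split_subset(X, groups, 1)
subset2 = split_subset(X, groups, 2)

# When the drop step runs, the two subsets partition the original rows.
assert subset1.shape[0] + subset2.shape[0] == X.shape[0]
```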
