Hi Ethan, I just tried the script on toy data and I could reproduce this erroneous behavior when run in Hadoop mode -- both local and Spark modes are fine. I will look into it.
BTW, you forgot to attach the scripts.

Shirish

On Thu, Apr 14, 2016 at 5:02 PM, Ethan Xu <[email protected]> wrote:
> OK, this is interesting:
>
> Scenario 1
> I slightly modified 'sample.dml' to add statements to print the dimensions
> of SM, P and iX, and ran it on the same data. The dimensions AND the output
> were correct. That is, subsets '1' and '2' contain roughly 80% and 20% of
> the original data.
>
> Please see attached:
> sample-debug.dml:
>     sample.dml with 3 print functions inserted
> train-test-debug_1.mtd
> train-test-debug_2.mtd:
>     metadata of the outputs. Note 'rows' are correct.
>
> Scenario 2
> This is confusing, so I commented out the 'print' statements in
> 'sample.dml' and ran it on the same data, and the output was INCORRECT.
> That is, subsets '1' and '2' contain the same rows as the original data.
>
> Please see attached:
> sample-debug-noprint.dml:
>     the 3 print functions were commented out
> train-test-debug-noprint_1.mtd
> train-test-debug-noprint_2.mtd:
>     metadata of the outputs. Note 'rows' are incorrect.
>
> There were no errors in either trial.
>
> Ethan
>
> On Thu, Apr 14, 2016 at 4:37 PM, Ethan Xu <[email protected]> wrote:
>
>> Hello,
>>
>> I encountered unexpected behavior from 'sample.dml' on a dataset on
>> Hadoop. Instead of splitting the data, it replaced rows of the original
>> data with 0's. Here are the details:
>>
>> I called sample.dml in an attempt to split a 35 million by 2396 numeric
>> matrix into 80% and 20% subsets. The two output subsets '1' and '2' both
>> still contain 35 million rows, instead of 35M*80% and 35M*20% rows.
>>
>> However, it looks like 20% of the rows in '1' are replaced with 0's (but
>> not removed). It is as if line 66 of sample.dml
>> (https://github.com/apache/incubator-systemml/blob/master/scripts/utils/sample.dml),
>> which calls removeEmpty(), doesn't exist.
>>
>> Here is the submission script:
>>
>> printf "0.8\n0.2" | hadoop fs -put - /path/split-perc.csv
>> echo '{"data_type": "matrix", "value_type": "double", "rows": 2, "cols":
>> 1, "format": "csv"}' | hadoop fs -put - /path/split-perc.csv.mtd
>>
>> ## Split file into training and test sets
>> hadoop jar $sysJar /path.to.systemML/scripts/utils/sample.dml
>> -config=$sysConfCust -nvargs X=/path/originalData.csv
>> sv=/path/split-perc.csv O=/path/train-test ofmt=csv
>>
>> There were no error messages and all MR jobs executed successfully.
>> What other information can I provide to diagnose the issue?
>>
>> Thanks,
>>
>> Ethan
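For anyone skimming the thread: the splitting behavior Ethan expects from the removeEmpty() call on line 66 of sample.dml can be sketched outside DML. Below is a minimal NumPy illustration (the array and function names are made up for this sketch, not taken from sample.dml): each output subset is obtained by zeroing out the rows assigned to the other subset and then dropping the all-zero rows. The bug described above corresponds to the final drop step having no effect, so every "subset" keeps all original rows with the other subset's rows zeroed in place.

```python
import numpy as np

# Toy stand-in for the original data: 10 rows, 3 columns, all entries nonzero.
X = np.arange(30, dtype=float).reshape(10, 3) + 1.0

# Stand-in for the sample-group assignment: 1 -> ~80% subset, 2 -> ~20% subset.
rng = np.random.default_rng(0)
groups = rng.choice([1, 2], size=X.shape[0], p=[0.8, 0.2])

def split_subset(X, groups, g):
    # Zero out the rows that do not belong to group g...
    masked = X * (groups == g)[:, None]
    # ...then drop the all-zero rows, mimicking removeEmpty(..., margin="rows").
    # Skipping this line reproduces the reported symptom: full row count,
    # with the other subset's rows left as 0's.
    return masked[~np.all(masked == 0, axis=1)]

subset1 = split_subset(X, groups, 1)
subset2 = split_subset(X, groups, 2)

# When the drop step runs, the two subsets partition the original rows.
assert subset1.shape[0] + subset2.shape[0] == X.shape[0]
```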
