Hi all! I am running some tests on online machine learning for data streams from social media. I came across Apache SAMOA, and it seems like a very interesting framework. However, I could not figure out how to get it to train and test on sparse tf-idf feature vectors. I provide the data in the standard WEKA ARFF format, and although it runs, the output is something along the lines of:

2015-05-12 22:58:58,993 [main] INFO com.yahoo.labs.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:189) - com.yahoo.labs.samoa.evaluation.EvaluatorProcessorid = 0
evaluation instances,classified instances,classifications correct (percent),Kappa Statistic (percent),Kappa Temporal Statistic (percent)
100.0,100.0,100.0,100.0,?
200.0,200.0,100.0,100.0,?
300.0,300.0,100.0,100.0,?
400.0,400.0,100.0,100.0,?
500.0,500.0,100.0,100.0,?
600.0,600.0,100.0,100.0,?
700.0,700.0,100.0,100.0,?
800.0,800.0,100.0,100.0,?
900.0,900.0,100.0,100.0,?
1000.0,1000.0,100.0,100.0,?
1100.0,1100.0,100.0,100.0,?
1200.0,1200.0,100.0,100.0,?
1300.0,1300.0,100.0,100.0,?
1400.0,1400.0,100.0,100.0,?
1500.0,1500.0,100.0,100.0,?
1600.0,1600.0,100.0,100.0,?
1700.0,1700.0,100.0,100.0,?
1800.0,1800.0,100.0,100.0,?
1900.0,1900.0,100.0,100.0,?

I have read the documentation on the SAMOA project page, but I wasn't able to figure out how to get classification results per instance. Could you please point me in the right direction regarding the formats SAMOA accepts as stream input? Does the data need to include a labeled training set? Any examples beyond those already in the documentation would be most welcome!

Kind Regards,
Ilias Bertsimas.
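
P.S. To be clear about what "sparse" means above: in WEKA's sparse ARFF layout, each data row is a brace-enclosed list of "index value" pairs with 0-based attribute indices in ascending order, and any attribute left out is read as 0. The snippet below is only a minimal sketch of a writer for that layout, not necessarily how my files were produced. The function name, relation name, attribute names, and class labels are all hypothetical.

```python
def to_sparse_arff(relation, vocab, class_values, rows):
    """Render tf-idf rows as sparse ARFF text.

    vocab        -- attribute names (hypothetical; names containing spaces
                    or special characters would need quoting)
    class_values -- nominal class labels
    rows         -- list of (weights, label), where weights maps a 0-based
                    term index to its tf-idf weight
    """
    lines = ["@relation " + relation, ""]
    for term in vocab:
        lines.append("@attribute " + term + " numeric")
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines += ["", "@data"]

    class_index = len(vocab)  # the class is declared last, after all terms
    for weights, label in rows:
        # Only non-zero weights are written; indices must be ascending.
        cells = ["%d %.4f" % (i, w) for i, w in sorted(weights.items()) if w != 0.0]
        # The class value is always written explicitly.
        cells.append("%d %s" % (class_index, label))
        lines.append("{" + ", ".join(cells) + "}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = to_sparse_arff(
        "tweets",                              # hypothetical relation name
        ["t_stream", "t_learn", "t_data"],     # hypothetical vocabulary
        ["pos", "neg"],                        # hypothetical class labels
        [({0: 0.5, 2: 0.25}, "pos"), ({1: 0.75}, "neg")],
    )
    print(example)
```

Every row here carries a class label in the last attribute, which is the situation my question about a labeled training set refers to.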
