Hi Pat,

Thanks for the reply. Yes, I think there are a lot of problems.

There are 4 data sources, and each uses a different categorisation convention, some single-level, some multi-level, so I basically picked one source that accounts for about 500K of the 3 million entries.
I do have the prices. The data is already separated into fields in Solr, so I can extract title, category and price.

My confusion is in working out classifier vs clustering. As I understand it, clustering is for when you don't have labelled data, but I do have labels for some of it. Am I looking for a hybrid classifier/clustering approach (k-means, say), or is an SVM alone sufficient?

To make matters more complicated, there are categories and then sub-categories, e.g. "Cell Phones & Accessories" => "Accessories". I don't know if that means I have to train separate models.

Example data snippet:

"2800mAh External Battery Backup Power Bank and Leather Case for iPhone 5 - White","Cell Phones & Accessories - Accessories",529.0
"Apple iPhone 5C 16GB (Green) - Refurbished","Cell Phones & Accessories - Cell Phones & Smartphones",4589.0
"Orange PLA 3D Printer Filament 1.75mm 1kg","Computers & Networking - Printers",375.0
"Canon LV-7292 S Projector","Electronics - TVs & Projectors",6998.0

Perhaps I'm overcomplicating the problem...

Many thanks,
David

On Thu, Aug 13, 2015 at 3:35 AM, Pat Ferrel <[email protected]> wrote:

> You have a lot of problems to solve here.
>
> 1) can you find the price? Is it in text or in structured data? If text
> you have an NLP problem. You can use regex for price.
> 2) how do you associate a price with the object, there may be several
> money amounts in the ad. Some do this with proximity so how many
> characters away from the item id is the price.
> 3) can you find the item id? Some say iphone, some iPhone, some "iPhone 6
> plus", and it gets worse for things with lots of numbers and modifiers in
> the name like "super whiz bang deLux 5G XLS". The right level of
> de-duplication vs fragmentation is a deep and hard problem.
>
> How much is an NLP problem and what structure does the data have? Unless I
> misunderstand your problem, extracting the data will be the hardest part
> and not something Mahout can help with.
>
> On Aug 12, 2015, at 5:49 AM, David Kaplan <[email protected]> wrote:
>
> Hi all,
> Hope someone can please point me in the right direction,
> Very new to mahout..
> Here's my scenario:
>
> I have written a system that collects Classifieds items from multiple
> websites - phones,cars,antiques and many more using scrapy, all the items
> are then ingested into Solr - +- 3 million entries.
> This is then the backend for my search engine
>
> I want to be able to extract meaningful information to accurately
> calculate realistic price average etc. I need guidance/perhaps examples in
> accurate outlier detection, categorization etc extreme beginner in machine
> learning so need to know if that's what I should be using
>
> Part of my challenge is the broad range of items/categories, different
> levels of skewed data etc. e.g. finding outliers with "iphone" results when
> many of those are cheap iphone accessories.
>
> Basically it seems i need to cluster/classify but not sure exactly how to
> go about it, because i do already have the categories for 500K of the
> entries, example category "Cell Phones & Accessories - Accessories"
>
> And then actually connecting Mahout to Solr...
>
> Many thanks!
> David
>
>
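For reference, the price-outlier part of the question can be sketched in a few lines once title, category and price are already separate fields. This is only a rough illustration in plain Python, not Mahout or Solr code. It assumes the first " - " in the category string separates category from sub-category (as in the example rows), and it uses a median/MAD "modified z-score" with the conventional 3.5 cutoff, which is a choice made here and not something proposed in the thread. The function names are hypothetical.

```python
import csv
import io
import statistics
from collections import defaultdict

# Rows copied from the example data snippet in this thread: title, category, price.
SAMPLE = '''"2800mAh External Battery Backup Power Bank and Leather Case for iPhone 5 - White","Cell Phones & Accessories - Accessories",529.0
"Apple iPhone 5C 16GB (Green) - Refurbished","Cell Phones & Accessories - Cell Phones & Smartphones",4589.0
"Orange PLA 3D Printer Filament 1.75mm 1kg","Computers & Networking - Printers",375.0
"Canon LV-7292 S Projector","Electronics - TVs & Projectors",6998.0
'''


def load_rows(text):
    """Parse CSV text into (title, category, sub_category, price) tuples."""
    rows = []
    for title, category, price in csv.reader(io.StringIO(text)):
        # Assumption: the first " - " separates category from sub-category.
        top, _, sub = category.partition(" - ")
        rows.append((title, top, sub, float(price)))
    return rows


def prices_by_subcategory(rows):
    """Group prices under their (category, sub_category) pair."""
    groups = defaultdict(list)
    for _title, top, sub, price in rows:
        groups[(top, sub)].append(price)
    return dict(groups)


def flag_outliers(prices, threshold=3.5):
    """Return the prices whose modified z-score exceeds the threshold.

    The score is 0.6745 * |price - median| / MAD, where MAD is the median
    absolute deviation. Median and MAD are barely moved by a handful of
    extreme values, unlike the mean and standard deviation.
    """
    med = statistics.median(prices)
    mad = statistics.median([abs(p - med) for p in prices])
    if mad == 0:
        # All (or most) prices identical: nothing can be scored.
        return []
    return [p for p in prices if 0.6745 * abs(p - med) / mad > threshold]


if __name__ == "__main__":
    groups = prices_by_subcategory(load_rows(SAMPLE))
    for key, prices in groups.items():
        print(key, prices, flag_outliers(prices))
```

With only one row per sub-category in the sample, nothing is flagged. The check only becomes meaningful once each group holds many prices. Grouping by sub-category first is what keeps cheap accessories from being compared against phones.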
