What exactly is your goal? Taking those names and de-duping them to see
which are talking about the same thing?
Here is an example of weird data. A refurbished iPhone 5C for 4589.0?
"Apple iPhone 5C 16GB (Green) - Refurbished","Cell Phones & Accessories - Cell Phones & Smartphones",4589.0
Honestly I wouldn’t know where to begin.
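One possible starting point for flagging a price like this is a robust
outlier check within a single category, using the median and the median
absolute deviation (MAD) rather than the mean. The sketch below is only an
illustration: the function name, the 3.5 threshold and every price except
4589.0 are assumptions made up for the example, not values from the real
data.

```python
from statistics import median

def mad_outliers(prices, threshold=3.5):
    """Return the prices whose modified z-score exceeds the threshold."""
    med = median(prices)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median(abs(p - med) for p in prices)
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [p for p in prices if 0.6745 * abs(p - med) / mad > threshold]

# Hypothetical prices for one category; only 4589.0 comes from the thread.
phone_prices = [1200.0, 1350.0, 1500.0, 1650.0, 1800.0, 4589.0]
print(mad_outliers(phone_prices))
```

This only makes sense when the prices being compared belong to comparable
items, which is why the categorisation question matters first.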
On Aug 13, 2015, at 6:53 AM, David Kaplan wrote:
Hi Pat,
Thanks for the reply.
Yes, I think there are a lot of problems.
There are 4 data sources, and each uses different categorisation
conventions, some one-level, some multi-level, so I basically picked one
source, which accounts for about 500K of the 3 million entries.
I do have the prices: the data is separated in Solr, so I can extract
title, category and price.
My confusion is in working out classifier vs clustering. As I understand
it, clustering is for when you don't have labelled data, but I do have
labels for some of it. Am I looking for a hybrid classifier/clustering
approach (k-means), or is an SVM alone sufficient?
To make matters more complicated, there are categories and then
sub-categories, e.g. "Cell Phones & Accessories" => "Accessories".
I don't know if that means I have to train separate models.
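To make the supervised option concrete: because some entries already carry
labels, one route is to train a classifier on the labelled titles and use
it to label the rest. The sketch below is a toy multinomial naive Bayes
with add-one smoothing in plain Python, not Mahout code; the class name,
the tokenizer and the query titles are assumptions for illustration. It
treats the combined "category - sub-category" string as one flat label, so
a single model covers both levels; training one model per top-level
category would be an alternative.

```python
import math
import re
from collections import Counter, defaultdict

def tokens(title):
    """Lower-case the title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())

class NaiveBayes:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, titles, labels):
        for title, label in zip(titles, labels):
            self.label_counts[label] += 1
            for w in tokens(title):
                self.word_counts[label][w] += 1
                self.vocab.add(w)
        return self

    def predict(self, title):
        total = sum(self.label_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in tokens(title):
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Labelled titles taken from the data snippet in this message.
titles = [
    "2800mAh External Battery Backup Power Bank and Leather Case for iPhone 5 - White",
    "Apple iPhone 5C 16GB (Green) - Refurbished",
    "Orange PLA 3D Printer Filament 1.75mm 1kg",
    "Canon LV-7292 S Projector",
]
labels = [
    "Cell Phones & Accessories - Accessories",
    "Cell Phones & Accessories - Cell Phones & Smartphones",
    "Computers & Networking - Printers",
    "Electronics - TVs & Projectors",
]
model = NaiveBayes().fit(titles, labels)
# Hypothetical unlabelled title, made up for the example.
print(model.predict("Leather Case for iPhone 5C"))
```

Four training rows are far too few for a real model; the point is only the
shape of the approach, which would be trained on the 500K labelled entries.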
Example data snippet:
"2800mAh External Battery Backup Power Bank and Leather Case for iPhone 5 - White","Cell Phones & Accessories - Accessories",529.0
"Apple iPhone 5C 16GB (Green) - Refurbished","Cell Phones & Accessories - Cell Phones & Smartphones",4589.0
"Orange PLA 3D Printer Filament 1.75mm 1kg","Computers & Networking - Printers",375.0
"Canon LV-7292 S Projector","Electronics - TVs & Projectors",6998.0
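For reference, rows in this shape can be read with a standard CSV parser
and the category split into its two levels. This is a minimal sketch,
assuming each record sits on one line (mail wrapping removed) and that
" - " never occurs inside the top-level category name; the function and
field names are hypothetical.

```python
import csv
import io

raw = '''"2800mAh External Battery Backup Power Bank and Leather Case for iPhone 5 - White","Cell Phones & Accessories - Accessories",529.0
"Apple iPhone 5C 16GB (Green) - Refurbished","Cell Phones & Accessories - Cell Phones & Smartphones",4589.0
"Orange PLA 3D Printer Filament 1.75mm 1kg","Computers & Networking - Printers",375.0
"Canon LV-7292 S Projector","Electronics - TVs & Projectors",6998.0
'''

def parse_items(text):
    """Parse title,category,price rows into dicts with a two-level category."""
    items = []
    for title, category, price in csv.reader(io.StringIO(text)):
        # Split on the first " - " only: top-level category, then sub-category.
        top, _, sub = category.partition(" - ")
        items.append({"title": title, "top": top, "sub": sub,
                      "price": float(price)})
    return items

items = parse_items(raw)
for item in items:
    print(item["top"], "|", item["sub"], "|", item["price"])
```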
Perhaps I'm overcomplicating the problem...
Many thanks,
David
On Thu, Aug 13, 2015 at 3:35 AM, Pat Ferrel wrote:
> You have a lot of problems to solve here.
>
> 1) Can you find the price? Is it in text or in structured data? If text,
> you have an NLP problem. You can use a regex for the price.
> 2) How do you associate a price with the object? There may be several
> money amounts in the ad. Some do this with proximity, i.e. how many
> characters away from the item id the price is.
> 3) Can you find the item id? Some say iphone, some iPhone, some "iPhone 6
> plus", and it gets worse for things with lots of numbers and modifiers in
> the name, like "super whiz bang deLux 5G XLS". The right level of
> de-duplication vs fragmentation is a deep and hard problem.
>
> How much of this is an NLP problem, and what structure does the data
> have? Unless I misunderstand your problem, extracting the data will be
> the hardest part and not something Mahout can help with.
>
> On Aug 12, 2015, at 5:49 AM, David Kaplan wrote:
>
> Hi all,
> I hope someone can point me in the right direction.
> I'm very new to Mahout.
> Here's my scenario:
>
> I have written a system that uses Scrapy to collect classifieds items
> (phones, cars, antiques and many more) from multiple websites. All the
> items are then ingested into Solr, roughly 3 million entries.
> This is then the backend for my search engine.
>
> I want to be able to extract meaningful information so that I can
> accurately calculate realistic average prices etc. I need guidance, and
> perhaps examples, on accurate outlier detection, categorization etc. I'm
> an extreme beginner in machine learning, so I need to know if that's what
> I should be using.
>
> Part of my challenge is the broad range of items/categories, different
> levels of skew in the data etc., e.g. finding outliers in "iphone" results
> when many of those are cheap iPhone accessories.
>
> Basically it seems I need to cluster/classify, but I'm not sure exactly
> how to go about it, because I already have the categories for 500K of the
> entries, for example the category "Cell Phones & Accessories - Accessories".
>
> And then actually connecting Mahout to Solr...
>
> Many thanks!
> David
>
>