Re: Time Series Stuff

2015-08-14 Thread Andrew Musselman
Agreed; let us know if you want some help getting started.

On Friday, August 14, 2015, Dmitriy Lyubimov  wrote:

> Not that I know of. Would be nice to have.
>
> On Fri, Aug 14, 2015 at 4:42 PM, Nick Kolegraff wrote:
>
> > Hey Mahouts,
> > Looking for some time series analysis stuff I can use in Mahout.  I don't
> > see much, other than this legacy HMM stuff.
> >
> > https://mahout.apache.org/users/classification/hidden-markov-models.html
> >
> > Are there plans to develop more time series analysis functionality, or
> > does some already exist?  The last commit that mentions HMMs is from
> > '11, per "git log -Shmm".
> >
> > Thanks,
> > Nick
> >
>


Re: Time Series Stuff

2015-08-14 Thread Dmitriy Lyubimov
Not that I know of. Would be nice to have.

On Fri, Aug 14, 2015 at 4:42 PM, Nick Kolegraff 
wrote:

> Hey Mahouts,
> Looking for some time series analysis stuff I can use in Mahout.  I don't
> see much, other than this legacy HMM stuff.
>
> https://mahout.apache.org/users/classification/hidden-markov-models.html
>
> Are there plans to develop more time series analysis functionality, or
> does some already exist?  The last commit that mentions HMMs is from
> '11, per "git log -Shmm".
>
> Thanks,
> Nick
>


Time Series Stuff

2015-08-14 Thread Nick Kolegraff
Hey Mahouts,
Looking for some time series analysis stuff I can use in Mahout.  I don't
see much, other than this legacy HMM stuff.

https://mahout.apache.org/users/classification/hidden-markov-models.html

Are there plans to develop more time series analysis functionality, or
does some already exist?  The last commit that mentions HMMs is from
'11, per "git log -Shmm".

Thanks,
Nick


Re: Mahout Clustering Help Please

2015-08-14 Thread Pat Ferrel
What exactly is your goal? Taking those names and de-duping them to see
which refer to the same thing?

Here is an example of weird data. A refurbished iPhone 5C for 4589.0?

"Apple iPhone 5C 16GB (Green) - Refurbished","Cell Phones & Accessories - Cell Phones & Smartphones",4589.0

Honestly I wouldn’t know where to begin.
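
One possible starting point for spotting prices like that one is a robust
per-category check. The sketch below is only an illustration, not anything
from Mahout: it flags values by modified z-score (median and median absolute
deviation). The function name mad_outliers, the 3.5 cutoff, and every price
except 4589.0 are assumptions made up for the example.

```python
from statistics import median


def mad_outliers(prices, threshold=3.5):
    """Return the prices whose modified z-score exceeds the threshold."""
    med = median(prices)
    # Median absolute deviation: a spread measure that outliers barely move.
    mad = median(abs(p - med) for p in prices)
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [p for p in prices if 0.6745 * abs(p - med) / mad > threshold]


# Hypothetical prices for one sub-category; only 4589.0 appears in the thread.
phone_prices = [299.0, 310.0, 280.0, 325.0, 4589.0]
suspicious = mad_outliers(phone_prices)
```

Whether a flagged price is an error, a different currency, or a genuinely
different item still needs a human look.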


On Aug 13, 2015, at 6:53 AM, David Kaplan  wrote:

Hi Pat,
Thanks for the reply,
Yes I think there are a lot of problems,

There are 4 data sources, and each uses different categorisation
conventions: some have one level, some are multilevel. So I picked one
source that accounts for about 500K of the 3 million entries.

I do have the prices. The data is separated in Solr, so I can extract
title, category and price.

My confusion is in working out classifier vs clustering. As I understand
it, clustering is for when you don't have labelled data, but I do have
labels for some of it. Am I looking for a hybrid classifier/clustering
approach (k-means), or is just an SVM sufficient?

To make matters more complicated, there are categories and then
sub-categories, so "Cell Phones & Accessories" => "Accessories".
I don't know if that means I have to train separate models?

Example data snippet:

"2800mAh External Battery Backup Power Bank and Leather Case for iPhone 5 - White","Cell Phones & Accessories - Accessories",529.0
"Apple iPhone 5C 16GB (Green) - Refurbished","Cell Phones & Accessories - Cell Phones & Smartphones",4589.0
"Orange PLA 3D Printer Filament 1.75mm 1kg","Computers & Networking - Printers",375.0
"Canon LV-7292 S Projector","Electronics - TVs & Projectors",6998.0
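
As a rough illustration of how rows like these could be split into a
top-level category and a sub-category (so each level is available as its own
label), here is a minimal sketch using Python's csv module. It assumes every
row has exactly three fields and that " - " separates the two category
levels, as in this source; the function name parse_rows and the dictionary
keys are made up for the example.

```python
import csv
import io

# The four sample rows from the snippet, one record per line.
SAMPLE = '''"2800mAh External Battery Backup Power Bank and Leather Case for iPhone 5 - White","Cell Phones & Accessories - Accessories",529.0
"Apple iPhone 5C 16GB (Green) - Refurbished","Cell Phones & Accessories - Cell Phones & Smartphones",4589.0
"Orange PLA 3D Printer Filament 1.75mm 1kg","Computers & Networking - Printers",375.0
"Canon LV-7292 S Projector","Electronics - TVs & Projectors",6998.0
'''


def parse_rows(text):
    """Parse title/category/price rows and split the category at the first ' - '."""
    rows = []
    for title, category, price in csv.reader(io.StringIO(text)):
        # Assumption: the first " - " separates top level from sub-category.
        top, _, sub = category.partition(" - ")
        rows.append({
            "title": title,
            "category": top,
            "subcategory": sub,
            "price": float(price),
        })
    return rows


rows = parse_rows(SAMPLE)
```

With the two levels separated, one option is a single model for the top
level plus one per top-level category for the sub-categories, but that is a
design choice rather than a requirement.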

Perhaps I'm overcomplicating the problem...

Many thanks,
David



On Thu, Aug 13, 2015 at 3:35 AM, Pat Ferrel  wrote:

> You have a lot of problems to solve here.
> 
> 1) can you find the price? Is it in text or in structured data? If text
> you have an NLP problem. You can use regex for price.
> 2) how do you associate a price with the object? There may be several
> money amounts in the ad. Some do this with proximity, i.e. how many
> characters away from the item id the price is.
> 3) can you find the item id? Some say iphone, some iPhone, some "iPhone 6
> plus", and it gets worse for things with lots of numbers and modifiers in
> the name like "super whiz bang deLux 5G XLS". The right level of
> de-duplication vs fragmentation is a deep and hard problem.
> 
> How much is an NLP problem and what structure does the data have? Unless I
> misunderstand your problem, extracting the data will be the hardest part
> and not something Mahout can help with.
> 
> On Aug 12, 2015, at 5:49 AM, David Kaplan  wrote:
> 
> Hi all,
> Hope someone can please point me in the right direction,
> Very new to mahout..
> Here's my scenario:
> 
> I have written a system that collects classifieds items from multiple
> websites (phones, cars, antiques and many more) using Scrapy. All the items
> are then ingested into Solr, roughly 3 million entries.
> This is then the backend for my search engine.
> 
> I want to be able to extract meaningful information to accurately
> calculate realistic average prices etc. I need guidance, perhaps examples,
> on accurate outlier detection, categorization etc. I'm an extreme beginner
> in machine learning, so I need to know if that's what I should be using.
> 
> Part of my challenge is the broad range of items/categories, different
> levels of skewed data etc. e.g. finding outliers with "iphone" results when
> many of those are cheap iphone accessories.
> 
> Basically it seems I need to cluster/classify, but I'm not sure exactly how
> to go about it, because I do already have the categories for 500K of the
> entries, for example the category "Cell Phones & Accessories - Accessories".
> 
> And then actually connecting Mahout to Solr...
> 
> Many thanks!
> David
> 
>
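
The regex and proximity ideas from the quoted reply (points 1 and 2) could
be sketched roughly as follows. This is an illustration under stated
assumptions, not a tested extractor: it assumes prices are written with a
leading $, £ or € and optional thousands commas, and it picks the price whose
text lies the fewest characters from the item mention. The names PRICE_RE,
find_prices and nearest_price, and the sample ad, are made up for the example.

```python
import re

# Assumption: prices look like "$250.00", "$1,250" or "€ 15".
PRICE_RE = re.compile(r"[$£€]\s?(\d+(?:,\d{3})*(?:\.\d{1,2})?)")


def find_prices(text):
    """Return (start, end, amount) for every price-like match in the text."""
    return [
        (m.start(), m.end(), float(m.group(1).replace(",", "")))
        for m in PRICE_RE.finditer(text)
    ]


def nearest_price(text, item):
    """Return the amount closest (in characters) to the first mention of item."""
    pos = text.lower().find(item.lower())  # case-insensitive: iphone == iPhone
    prices = find_prices(text)
    if pos < 0 or not prices:
        return None
    item_end = pos + len(item)

    def gap(price):
        start, end, _ = price
        if start >= item_end:      # price comes after the item mention
            return start - item_end
        if end <= pos:             # price comes before the item mention
            return pos - end
        return 0                   # spans overlap

    return min(prices, key=gap)[2]


# Hypothetical ad text with two money amounts.
ad = "Selling iPhone 5C 16GB $250.00. Also have a leather case for $15."
phone_price = nearest_price(ad, "iphone")
case_price = nearest_price(ad, "leather case")
```

Proximity is a fragile heuristic: rewording the ad so the case is mentioned
right after the phone's price would change the answer, which is part of why
the extraction step is hard.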