Re: [FRIAM] Sorting Algorithm? AI? Identifying "types" within data

2023-01-12 Thread Eric Charles
I don't mind building something, but I don't know where to start.

What are some keywords to look for, or some articles to start from?

I'm asking here precisely because neither I, nor the two data scientists who
now ostensibly work for me, can figure out where to start.

(Obviously I would have preferred to find that there WAS something
out-of-the-box, but I'd be happy with anything appreciably ahead of
starting-from-scratch, because if starting from scratch is where we are,
it's never happening.)



On Thu, Jan 12, 2023 at 9:51 AM glen  wrote:

> Well, it *is* a "thing". We're doing something very similar on our
> project, classifying patient types. It's just that there's no
> standard/generic/singular way to do it. I get the feeling you're looking
> for some sort of black box process you can blindly apply. And that's not a
> thing. But there's loads of research and methods on how to classify such
> things. Which one will actually work with your data is a question only
> those looking at the data can answer.
>
> You could anonymize that data and post it here (or wherever) and hold a
> contest to see who gets the best classifier. Offer a $1000 reward. 8^D
>
> On 1/11/23 18:08, Eric Charles wrote:
> > I'm also trying to do some sort of career classification game,
> originally because I thought it wouldn't be too hard. IF it was possible to
> do the career classification game, it would de facto assist with the
> attrition prediction. But if that just kind of isn't a thing, then I guess
> it isn't a thing .
>
> --
> ꙮ Mɥǝu ǝlǝdɥɐuʇs ɟᴉƃɥʇ' ʇɥǝ ƃɹɐss snɟɟǝɹs˙ ꙮ
>
> -. --- - / ...- .- .-.. .. -.. / -- --- .-. ... . / -.-. --- -.. .
> FRIAM Applied Complexity Group listserv
> Fridays 9a-12p Friday St. Johns Cafe   /   Thursdays 9a-12p Zoom
> https://bit.ly/virtualfriam
> to (un)subscribe http://redfish.com/mailman/listinfo/friam_redfish.com
> FRIAM-COMIC http://friam-comic.blogspot.com/
> archives:  5/2017 thru present
> https://redfish.com/pipermail/friam_redfish.com/
>   1/2003 thru 6/2021  http://friam.383.s1.nabble.com/
>


Re: [FRIAM] Sorting Algorithm? AI? Identifying "types" within data

2023-01-12 Thread glen

Well, it *is* a "thing". We're doing something very similar on our project, 
classifying patient types. It's just that there's no standard/generic/singular way to do 
it. I get the feeling you're looking for some sort of black box process you can blindly 
apply. And that's not a thing. But there's loads of research and methods on how to 
classify such things. Which one will actually work with your data is a question only 
those looking at the data can answer.

You could anonymize that data and post it here (or wherever) and hold a contest 
to see who gets the best classifier. Offer a $1000 reward. 8^D

On 1/11/23 18:08, Eric Charles wrote:

I'm also trying to do some sort of career classification game, originally because I 
thought it wouldn't be too hard. IF it was possible to do the career classification 
game, it would de facto assist with the attrition prediction. But if that just kind 
of isn't a thing, then I guess it isn't a thing .


--
ꙮ Mɥǝu ǝlǝdɥɐuʇs ɟᴉƃɥʇ' ʇɥǝ ƃɹɐss snɟɟǝɹs˙ ꙮ



Re: [FRIAM] Sorting Algorithm? AI? Identifying "types" within data

2023-01-11 Thread Eric Charles
Yeah, there are two different efforts I'm trying to play with
simultaneously in that area... in addition to the 4 or 5 efforts in
unrelated areas.


We ARE trying to do a relatively clean attrition-prediction model, and that
will likely be something like what you were suggesting at the end.

I'm also trying to do some sort of career classification game, originally
because I thought it wouldn't be too hard. IF it was possible to do the
career classification game, it would de facto assist with the attrition
prediction. But if that just kind of isn't a thing, then I guess it isn't a
thing.

Eventually, I'd really like to do agent-based modeling of the entire
federal workforce... which I think the career-typing would also help
with... but that's quite a ways off... not sure if it will ever happen...
or if I could clearly articulate value... but I THINK it would let me do
cool things...

I'm hoping to be at Thuram for a bit tomorrow, but I've been in two days of
Agile/Scrum training, so I'm not sure how swamped I'll be tomorrow.



On Tue, Jan 10, 2023 at 1:24 PM glen  wrote:

> One tangential solution I've seen work well enough in synthetic health
> data is to treat the longitudinal data as a sequence in the same way the
> LLMs treat text. Rather than focus on the 2nd problem EricC mentioned
> (clustering based on *similarity*), focus more on the 1st ("around 10
> different types of changes that could happen").
>
> I suggest this because "we've" made lots of progress in such sequence
> prediction, and less progress in *object* detection. By asking for "types
> of career", you're conflating the two problems, lowering the efficacy of
> the sequential task and perhaps raising the efficacy of the object
> detection task.
>
> But a second idea might come from Russ' comment. Obviously, these are
> fairly well integrated humans. (My rants against narrativity aside.) So the
> causes of one seemingly independent feature *are* intertwined with the
> causes of some other independent feature. You can imagine a graph that
> includes all the features as nodes, but that also includes something like a
> Markov blanket just inside those measured features, and an internal causal
> kernel inside that blanket. This is akin to structural equation modeling et
> al. What you're looking for is a reconstructed state space of any human in
> the database ... what the cool kids are calling a "digital twin", these
> days.
>
> And, given your (Eric's) rants about bureaucracy and stupidity being at
> least in part due to the system these agents navigate, you *might* be able
> to keep the model simple by modeling the *options* any given agent might
> have ... as defined by the space. I.e. model the environment, the dual of
> the agent(s). We did that in a model for optimizing building and urban
> evacuation using so-called stupid agents.
>
> If you're looking for the One True Algorithm, my guess is you'll get lost.
> But if you do find it, write an Excel macro for the rest of us. 8^D
>
> On 1/10/23 09:16, Russ Abbott wrote:
> > Interesting problem.
> >
> > Eric, as you said earlier, K-means requires a way to measure the
> distance between objects -- so that those with smaller distances can be
> grouped together. A problem is that there are a number of features, which
> may not be correlated. For example, there is an income trajectory, a change
> of company trajectory, a change of level-of-responsibility trajectory, a
> change of subject-matter-focus trajectory, and probably more.  You might
> build separate trajectories for each person and then see if you can group
> the trajectories. For example, a "company man" may or may not have an
> increasing responsibility trajectory. You would then have a
> multi-dimensional space into which to put people.
> > -- Russ
> >
> >
> > On Mon, Jan 9, 2023 at 10:11 AM Nicholas Thompson <
> thompnicks...@gmail.com > wrote:
> >
> > To my uneducated eye, this seemed like one of Jon’s problems.
> >
> > Sent from my Dumb Phone
> >
> > On Jan 7, 2023, at 6:23 AM, Frank Wimberly <wimber...@gmail.com> wrote:
> >
> > 
> > This answer seems reasonable to me.  I worked on Project Talent
> during 1967 which had some similar goals and data.  See
> >
> > https://en.m.wikipedia.org/wiki/Project_Talent
> >
> > Our data was for thousands of high school students and our software
> > was all written in Fortran.
> >
> > ---
> > Frank C. Wimberly
> > 140 Calle Ojo Feliz,
> > Santa Fe, NM 87505
> >
> > 505 670-9918
> > Santa Fe, NM
> >
> > On Fri, Jan 6, 2023, 11:32 PM Pieter Steenekamp <
> piet...@randcontrols.co.za > wrote:
> >
> > I asked https://chat.openai.com/chat and here is the conversation:
> >
> > *Pieter Steenekamp*
> > can you suggest a solution for the 

Re: [FRIAM] Sorting Algorithm? AI? Identifying "types" within data

2023-01-10 Thread glen

One tangential solution I've seen work well enough in synthetic health data is to treat 
the longitudinal data as a sequence in the same way the LLMs treat text. Rather than 
focus on the 2nd problem EricC mentioned (clustering based on *similarity*), focus more 
on the 1st ("around 10 different types of changes that could happen").
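A minimal sketch of that sequence framing, under stated assumptions: the event codes and toy careers below are entirely hypothetical, and a first-order Markov model of next-event probabilities stands in for anything fancier (an LLM-style sequence model would replace this counting step, not the framing).

```python
from collections import Counter, defaultdict

# Toy careers: each is a sequence of categorical event codes, one per period.
# Hypothetical codes: "H"=hired, "P"=promotion, "L"=left, "R"=returned, "N"=no change.
careers = [
    ["H", "N", "P", "N", "N", "P"],
    ["H", "P", "L", "R", "P", "N"],
    ["H", "N", "N", "N", "N", "N"],
]

# Count first-order transitions: how often each event follows each other event.
transitions = defaultdict(Counter)
for seq in careers:
    for cur, nxt in zip(seq, seq[1:]):
        transitions[cur][nxt] += 1

def next_event_probs(event):
    """Empirical distribution over the next event, given the current one."""
    counts = transitions[event]
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

print(next_event_probs("H"))
```

The point of the sketch is only that the sequence task is well-posed once careers are tokenized; people whose transition profiles look alike can then be grouped without ever defining "types" up front.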

I suggest this because "we've" made lots of progress in such sequence prediction, and 
less progress in *object* detection. By asking for "types of career", you're conflating 
the two problems, lowering the efficacy of the sequential task and perhaps raising the efficacy of 
the object detection task.

But a second idea might come from Russ' comment. Obviously, these are fairly well 
integrated humans. (My rants against narrativity aside.) So the causes of one seemingly 
independent feature *are* intertwined with the causes of some other independent feature. 
You can imagine a graph that includes all the features as nodes, but that also includes 
something like a Markov blanket just inside those measured features, and an internal 
causal kernel inside that blanket. This is akin to structural equation modeling et al. 
What you're looking for is a reconstructed state space of any human in the database ... 
what the cool kids are calling a "digital twin", these days.

And, given your (Eric's) rants about bureaucracy and stupidity being at least 
in part due to the system these agents navigate, you *might* be able to keep 
the model simple by modeling the *options* any given agent might have ... as 
defined by the space. I.e. model the environment, the dual of the agent(s). We 
did that in a model for optimizing building and urban evacuation using 
so-called stupid agents.

If you're looking for the One True Algorithm, my guess is you'll get lost. But 
if you do find it, write an Excel macro for the rest of us. 8^D

On 1/10/23 09:16, Russ Abbott wrote:

Interesting problem.

Eric, as you said earlier, K-means requires a way to measure the distance between objects 
-- so that those with smaller distances can be grouped together. A problem is that there 
are a number of features, which may not be correlated. For example, there is an income 
trajectory, a change of company trajectory, a change of level-of-responsibility 
trajectory, a change of subject-matter-focus trajectory, and probably more.  You might 
build separate trajectories for each person and then see if you can group the 
trajectories. For example, a "company man" may or may not have an increasing 
responsibility trajectory. You would then have a multi-dimensional space into which to 
put people.
-- Russ


On Mon, Jan 9, 2023 at 10:11 AM Nicholas Thompson <thompnicks...@gmail.com> wrote:

To my uneducated eye, this seemed like one of Jon’s problems.

Sent from my Dumb Phone

On Jan 7, 2023, at 6:23 AM, Frank Wimberly <wimber...@gmail.com> wrote:


This answer seems reasonable to me.  I worked on Project Talent during 1967 
which had some similar goals and data.  See

https://en.m.wikipedia.org/wiki/Project_Talent

Our data was for thousands of high school students and our software was all
written in Fortran.

---
Frank C. Wimberly
140 Calle Ojo Feliz,
Santa Fe, NM 87505

505 670-9918
Santa Fe, NM

On Fri, Jan 6, 2023, 11:32 PM Pieter Steenekamp <piet...@randcontrols.co.za> wrote:

I asked https://chat.openai.com/chat and here is the conversation:

*Pieter Steenekamp*
can you suggest a solution for the following problem "I'm hoping someone here could help out. Let's imagine I 
had some data where each row was a person's career. We could list major events every year. For example: 2004 they were 
hired, 2007 they get a promotion, 2010 they leave for a different company, 2012 they come back at a higher level, 2015 
get a promotion, then no change until 2022.  Let's say I had data like this for roughly 2 million people, and that there are 
around 10 different types of changes that could happen during any time period (could be yearly, quarterly, monthly, I can 
make it how I want).  I was hoping we could ask a computer to tell us if there were "types of careers" that people 
had. We could say "put all these careers into 4 buckets" or "7 buckets" based on similarity. Then we 
could look at the piles the computer made and try to make sense of them. One type might be "company man" for 
people who tend to stay in place for 20 or more years,
another type could be a "rotator", who leaves and returns every 3 years or 
so. Etc. The point is, I want a computer to make the piles for me, rather than trying to come 
up with potential piles a priori. Are there methods for doing this? I know it's a problem 
we've *talked* about a lot, but I don't know if there are solutions. "
*https://chat.openai.com/chat 

Re: [FRIAM] Sorting Algorithm? AI? Identifying "types" within data

2023-01-10 Thread Russ Abbott
Interesting problem.

Eric, as you said earlier, K-means requires a way to measure the distance
between objects -- so that those with smaller distances can be grouped
together. A problem is that there are a number of features, which may not
be correlated. For example, there is an income trajectory, a change of
company trajectory, a change of level-of-responsibility trajectory, a
change of subject-matter-focus trajectory, and probably more.  You might
build separate trajectories for each person and then see if you can group
the trajectories. For example, a "company man" may or may not have an
increasing responsibility trajectory. You would then have a
multi-dimensional space into which to put people.
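One rough reading of the multi-trajectory idea, sketched with made-up numbers: the feature summaries below (income slope, count of company changes, responsibility slope) are placeholder stand-ins for whatever each trajectory actually yields. Reduce each person to such a vector, standardize so no single feature dominates, and let k-means make the buckets.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-person trajectory summaries (one row per person):
# [income slope, number of company changes, responsibility-level slope]
X = np.array([
    [0.8, 0, 0.9],   # steady riser who stays put ("company man"?)
    [0.7, 1, 0.8],
    [0.2, 5, 0.1],   # frequent mover ("rotator"?)
    [0.1, 6, 0.2],
])

# Standardize so the change-count feature doesn't swamp the slopes.
Xs = StandardScaler().fit_transform(X)

# Ask for two buckets; the labels are arbitrary names for the piles.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xs)
print(labels)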

-- Russ


On Mon, Jan 9, 2023 at 10:11 AM Nicholas Thompson 
wrote:

> To my uneducated eye, this seemed like one of Jon’s problems.
>
> Sent from my Dumb Phone
>
> On Jan 7, 2023, at 6:23 AM, Frank Wimberly  wrote:
>
> 
> This answer seems reasonable to me.  I worked on Project Talent during
> 1967 which had some similar goals and data.  See
>
> https://en.m.wikipedia.org/wiki/Project_Talent
>
> Our data was for thousands of high school students and our software was all
> written in Fortran.
>
> ---
> Frank C. Wimberly
> 140 Calle Ojo Feliz,
> Santa Fe, NM 87505
>
> 505 670-9918
> Santa Fe, NM
>
> On Fri, Jan 6, 2023, 11:32 PM Pieter Steenekamp <
> piet...@randcontrols.co.za> wrote:
>
>> I asked https://chat.openai.com/chat and here is the conversation:
>>
>> *Pieter Steenekamp*
>> can you suggest a solution for the following problem "I'm hoping someone
>> here could help out. Let's imagine I had some data where each row was a
>> person's career. We could list major events every year. For example: 2004
>> they were hired, 2007 they get a promotion, 2010 they leave for a
>> different company, 2012 they come back at a higher level, 2015 get a
>> promotion, then no change until 2022.  Let's say I had data like this for
>> roughly 2 million people, and that there are around 10 different types of
>> changes that could happen during any time period (could be yearly,
>> quarterly, monthly, I can make it how I want).  I was hoping we could ask a
>> computer to tell us if there were "types of careers" that people had. We
>> could say "put all these careers into 4 buckets" or "7 buckets" based on
>> similarity. Then we could look at the piles the computer made and try to
>> make sense of them. One type might be "company man" for people who tend to
>> stay in place for 20 or more years, another type could be a "rotator", who
>> leaves and returns every 3 years or so. Etc. The point is, I want a
>> computer to make the piles for me, rather than trying to come up with
>> potential piles a priori. Are there methods for doing this? I know it's a
>> problem we've *talked* about a lot, but I don't know if there are
>> solutions. "
>> *https://chat.openai.com/chat*
>> There are several approaches that you could use to group the careers of
>> these 2 million people into clusters based on similarity. One approach is
>> to use unsupervised machine learning techniques, specifically clustering
>> algorithms, to automatically group the data into clusters. Some popular
>> clustering algorithms that you could use for this problem include k-means,
>> hierarchical clustering, and DBSCAN (Density-Based Spatial Clustering of
>> Applications with Noise). You could also use dimensionality reduction
>> techniques, such as PCA (Principal Component Analysis) or t-SNE
>> (t-Distributed Stochastic Neighbor Embedding), to reduce the dimensionality
>> of the data and then use a clustering algorithm to group the data into
>> clusters. Another approach is to use supervised machine learning
>> techniques, where you would need to manually label a subset of the data
>> with the cluster labels, and then use this labeled data to train a
>> classifier to predict the cluster labels for the remaining data. It's worth
>> noting that these approaches will require some feature engineering, where
>> you will need to extract relevant features from the data that can be used
>> to represent the careers of the people in the dataset. Once you have
>> identified and extracted these features, you can then use them as input to
>> the machine learning algorithms to group the data into clusters.
>> *Pieter Steenekamp*
>> would it then be possible to get, for example, if there were "types of
>> careers" that people had?
>> *https://chat.openai.com/chat*
>> Yes, once you have grouped the careers of the 2 million people into
>> clusters using one of the approaches I mentioned above, you can then
>> analyze the characteristics of each cluster to identify common patterns and
>> trends within each cluster. This will allow you to identify "types of
>> careers" that people have, based on the similarity of their career events.
>> For example, if one of the clusters is characterized by people who have had
>> long 

Re: [FRIAM] Sorting Algorithm? AI? Identifying "types" within data

2023-01-09 Thread Nicholas Thompson
To my uneducated eye, this seemed like one of Jon’s problems.

Sent from my Dumb Phone

On Jan 7, 2023, at 6:23 AM, Frank Wimberly <wimber...@gmail.com> wrote:

This answer seems reasonable to me.  I worked on Project Talent during 1967 which had some similar goals and data.  See

https://en.m.wikipedia.org/wiki/Project_Talent

Our data was for thousands of high school students and our software was all written in Fortran.

---
Frank C. Wimberly
140 Calle Ojo Feliz,
Santa Fe, NM 87505

505 670-9918
Santa Fe, NM

On Fri, Jan 6, 2023, 11:32 PM Pieter Steenekamp <piet...@randcontrols.co.za> wrote:

I asked https://chat.openai.com/chat and here is the conversation:

Pieter Steenekamp
can you suggest a solution for the following problem "I'm hoping someone here could help out. Let's imagine I had some data where each row was a person's career. We could list major events every year. For example: 2004 they were hired, 2007 they get a promotion, 2010 they leave for a different company, 2012 they come back at a higher level, 2015 get a promotion, then no change until 2022.  Let's say I had data like this for roughly 2 million people, and that there are around 10 different types of changes that could happen during any time period (could be yearly, quarterly, monthly, I can make it how I want).  I was hoping we could ask a computer to tell us if there were "types of careers" that people had. We could say "put all these careers into 4 buckets" or "7 buckets" based on similarity. Then we could look at the piles the computer made and try to make sense of them. One type might be "company man" for people who tend to stay in place for 20 or more years, another type could be a "rotator", who leaves and returns every 3 years or so. Etc. The point is, I want a computer to make the piles for me, rather than trying to come up with potential piles a priori. Are there methods for doing this? I know it's a problem we've *talked* about a lot, but I don't know if there are solutions. 
"

https://chat.openai.com/chat
There are several approaches that you could use to group the careers of these 2 million people into clusters based on similarity. One approach is to use unsupervised machine learning techniques, specifically clustering algorithms, to automatically group the data into clusters. Some popular clustering algorithms that you could use for this problem include k-means, hierarchical clustering, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). You could also use dimensionality reduction techniques, such as PCA (Principal Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding), to reduce the dimensionality of the data and then use a clustering algorithm to group the data into clusters. Another approach is to use supervised machine learning techniques, where you would need to manually label a subset of the data with the cluster labels, and then use this labeled data to train a classifier to predict the cluster labels for the remaining data. It's worth noting that these approaches will require some feature engineering, where you will need to extract relevant features from the data that can be used to represent the careers of the people in the dataset. Once you have identified and extracted these features, you can then use them as input to the machine learning algorithms to group the data into clusters.

Pieter Steenekamp
would it then be possible to get, for example, if there were "types of careers" that people had?

https://chat.openai.com/chat
Yes, once you have grouped the careers of the 2 million people into clusters using one of the approaches I mentioned above, you can then analyze the characteristics of each cluster to identify common patterns and trends within each cluster. This will allow you to identify "types of careers" that people have, based on the similarity of their career events. 
For example, if one of the clusters is characterized by people who have had long tenures at a single company, you could label this cluster as "company man" careers. Similarly, if another cluster is characterized by people who frequently switch between companies, you could label this cluster as "rotator" careers. You can then use this information to gain insights into the types of careers that are represented in the dataset, and potentially draw conclusions about the factors that may influence career paths and success.

Pieter Steenekamp
can you propose python code to do something like this please

https://chat.openai.com/chat
Sure, here is an example of how you could use the scikit-learn library in Python to group the careers of the 2 million people into clusters using the k-means algorithm:

from sklearn.cluster import KMeans
# Load the data into a NumPy array
data = ...
# Create the k-means model
model = KMeans(n_clusters=4)
# Fit the model to the data
model.fit(data)
# Predict the cluster labels for each data point
labels = model.predict(data)
# Print the cluster labels
print(labels)

This code will group the data into 4 clusters, as specified by the 

Re: [FRIAM] Sorting Algorithm? AI? Identifying "types" within data

2023-01-09 Thread Eric Charles
From what I can tell, "one-hot encoding" is just another term for dummy
coding the data, i.e., making it a bunch of 1/0 columns. H2O seems more
promising, but seems to require a backbone of quantitative data that you
can substitute (based on something akin to a regression) for the
categorical variables.

It seems weird to me that this doesn't exist. It seems like such a basic
thing to want the computer to do for you. The lack of this reinforces my
belief that a ton of AI work is just people who don't understand basic
analytic techniques trying hard not to learn how to do them. If most of the
AI efforts are just giving me the results of a regression, that's not super
exciting when I am comfortable running regressions myself.





On Sat, Jan 7, 2023 at 3:30 PM Pieter Steenekamp 
wrote:

> One way to handle categorical input data for machine learning is to
> convert it using one-hot encoding - it's not difficult but a bit cumbersome.
> Fortunately there are other options. H2O is a machine learning library
> available in both Python and R that does this conversion "under the hood".
>
> I asked ChatGPT about "machine learning with categorical inputs using
> H2O", she answered:
>
> H2O is a popular open-source library for training machine learning models
> in Python and R. It is designed to be scalable, fast, and easy to use, and
> it includes a number of algorithms for classification, regression, and
> clustering.
>
> H2O can handle categorical variables natively, so you don't need to use
> one-hot encoding or any other special techniques to work with them. When
> you import your data into H2O, you can specify which columns are
> categorical, and H2O will take care of the rest.
>
> Here is an example of how you might use H2O to train a machine learning
> model with categorical inputs in Python:
>
> import h2o
> from h2o.estimators import H2ORandomForestEstimator
> # Initialize H2O
> h2o.init()
> # Import your data into H2O
> df = h2o.import_file("path/to/your/data.csv")
> # Specify which columns are categorical
> df['input_variable'] = df['input_variable'].asfactor()
> # Train a model using the categorical input_variable column
> model = H2ORandomForestEstimator(categorical_encoding="one_hot_explicit")
> model.train(x=["input_variable"], y="output_variable", training_frame=df)
> Here is an example of how you might use H2O to train a machine learning
> model with categorical inputs in R:
> library(h2o)
> # Initialize H2O
> h2o.init()
> # Import your data into H2O
> df <- h2o.importFile("path/to/your/data.csv")
> # Specify which columns are categorical
> df[, "input_variable"] <- as.factor(df[, "input_variable"])
> # Train a model using the categorical input_variable column
> model <- h2o.randomForest(x = "input_variable", y = "output_variable",
> training_frame = df)
>
> On Sat, 7 Jan 2023 at 17:37, Eric Charles 
> wrote:
>
>> That's somewhat helpful. Having looked up several of these
>> algorithms (I'm still checking a few), it seems like they all input some
>> sort of distance measure between the items (analogous to the distance
>> between their coordinates on a cartesian graph), and then do some sort of
>> distance-minimization function. The challenge here is that I don't have
>> anything equivalent to that: The data is primarily categorical.
>>
>> Does anyone on here actually have experience doing that kind of work?
>>
>> It's not that it would be impossible for me to change the categorical
>> data into something more quantitative, but doing so would bake in my
>> assumptions about how the categories should be determined.
>> 
>>
>>
>> On Sat, Jan 7, 2023 at 1:32 AM Pieter Steenekamp <
>> piet...@randcontrols.co.za> wrote:
>>
>>> I asked https://chat.openai.com/chat and here is the conversation:
>>>
>>> *Pieter Steenekamp*
>>> can you suggest a solution for the following problem "I'm hoping someone
>>> here could help out. Let's imagine I had some data where each row was a
>>> person's career. We could list major events every year. For example: 2004
>>> they were hired, 2007 they get a promotion, 2010 they leave for a
>>> different company, 2012 they come back at a higher level, 2015 get a
>>> promotion, then no change until 2022.  Let's say I had data like this for
>>> roughly 2 million people, and that there are around 10 different types of
>>> changes that could happen during any time period (could be yearly,
>>> quarterly, monthly, I can make it how I want).  I was hoping we could ask a
>>> computer to tell us if there were "types of careers" that people had. We
>>> could say "put all these careers into 4 buckets" or "7 buckets" based on
>>> similarity. Then we could look at the piles the computer made and try to
>>> make sense of them. One type might be "company man" for people who tend to
>>> stay in place for 20 or more years, another type could be a "rotator", who
>>> leaves and returns every 3 years or so. Etc. The point is, I want a
>>> computer to make the piles for me, rather than trying to come up with
>>> potential piles a priori. 

Re: [FRIAM] Sorting Algorithm? AI? Identifying "types" within data

2023-01-07 Thread Pieter Steenekamp
One way to handle categorical input data for machine learning is to convert
it using one-hot encoding - it's not difficult but a bit cumbersome.
Fortunately there are other options. H2O is a machine learning library
available in both Python and R that does this conversion "under the hood".

I asked ChatGPT about "machine learning with categorical inputs using H2O",
she answered:

H2O is a popular open-source library for training machine learning models
in Python and R. It is designed to be scalable, fast, and easy to use, and
it includes a number of algorithms for classification, regression, and
clustering.

H2O can handle categorical variables natively, so you don't need to use
one-hot encoding or any other special techniques to work with them. When
you import your data into H2O, you can specify which columns are
categorical, and H2O will take care of the rest.

Here is an example of how you might use H2O to train a machine learning
model with categorical inputs in Python:

import h2o
from h2o.estimators import H2ORandomForestEstimator
# Initialize H2O
h2o.init()
# Import your data into H2O
df = h2o.import_file("path/to/your/data.csv")
# Specify which columns are categorical
df['input_variable'] = df['input_variable'].asfactor()
# Train a model using the categorical input_variable column
model = H2ORandomForestEstimator(categorical_encoding="one_hot_explicit")
model.train(x=["input_variable"], y="output_variable", training_frame=df)
Here is an example of how you might use H2O to train a machine learning
model with categorical inputs in R:
library(h2o)
# Initialize H2O
h2o.init()
# Import your data into H2O
df <- h2o.importFile("path/to/your/data.csv")
# Specify which columns are categorical
df[, "input_variable"] <- as.factor(df[, "input_variable"])
# Train a model using the categorical input_variable column
model <- h2o.randomForest(x = "input_variable", y = "output_variable",
training_frame = df)

On Sat, 7 Jan 2023 at 17:37, Eric Charles 
wrote:

> That's somewhat helpful. Having looked up several of these algorithms (I'm
> still checking a few), it seems like they all input some sort of distance
> measure between the items (analogous to the distance between
> their coordinates on a cartesian graph), and then do some sort of
> distance-minimization function. The challenge here is that I don't have
> anything equivalent to that: The data is primarily categorical.
>
> Does anyone on here actually have experience doing that kind of work?
>
> It's not that it would be impossible for me to change the categorical data
> into something more quantitative, but doing so would bake in my assumptions
> about how the categories should be determined.
> 

Re: [FRIAM] Sorting Algorithm? AI? Identifying "types" within data

2023-01-07 Thread Eric Charles
That's somewhat helpful. Having looked up several of these algorithms (I'm
still checking a few), it seems like they all take as input some sort of
distance measure between the items (analogous to the distance between
their coordinates on a Cartesian graph) and then apply some sort of
distance-minimization function. The challenge here is that I don't have
anything equivalent to that: the data is primarily categorical.

Does anyone on here actually have experience doing that kind of work?

It's not that it would be impossible for me to change the categorical data
into something more quantitative, but doing so would bake in my assumptions
about how the categories should be determined.
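[One family of methods that avoids a numeric conversion entirely is k-modes-style clustering, which compares categorical sequences with a simple matching (Hamming) dissimilarity instead of Euclidean distance. A minimal sketch of the idea, with invented event codes and hand-picked cluster medoids; a real k-modes run would learn the cluster modes iteratively (the third-party `kmodes` Python package implements Huang's algorithm):]

```python
def matching_distance(a, b):
    """Fraction of periods where two event sequences differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Invented example careers, one event code per period
careers = {
    "p1": ["hired", "stay", "stay", "promo", "stay"],
    "p2": ["hired", "stay", "leave", "return", "promo"],
    "p3": ["hired", "stay", "stay", "stay", "stay"],
}

# Hand-picked medoids standing in for learned cluster modes
medoids = {"company_man": careers["p3"], "rotator": careers["p2"]}

# Assign each career to its nearest medoid
labels = {
    person: min(medoids, key=lambda m: matching_distance(seq, medoids[m]))
    for person, seq in careers.items()
}
print(labels)  # {'p1': 'company_man', 'p2': 'rotator', 'p3': 'company_man'}
```

[No coordinates are ever invented; the only assumption baked in is that careers are aligned period by period.]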



On Sat, Jan 7, 2023 at 1:32 AM Pieter Steenekamp 
wrote:


Re: [FRIAM] Sorting Algorithm? AI? Identifying "types" within data

2023-01-07 Thread Frank Wimberly
This answer seems reasonable to me.  I worked on Project Talent during 1967
which had some similar goals and data.  See

https://en.m.wikipedia.org/wiki/Project_Talent

Our data was for thousands of high school students, and our software was all
written in Fortran.

---
Frank C. Wimberly
140 Calle Ojo Feliz,
Santa Fe, NM 87505

505 670-9918
Santa Fe, NM

On Fri, Jan 6, 2023, 11:32 PM Pieter Steenekamp 
wrote:


Re: [FRIAM] Sorting Algorithm? AI? Identifying "types" within data

2023-01-06 Thread Pieter Steenekamp
I asked https://chat.openai.com/chat and here is the conversation:

*Pieter Steenekamp*
can you suggest a solution for the following problem "I'm hoping someone
here could help out. Let's imagine I had some data where each row was a
person's career. We could list major events every year. For example: 2004
they were hired, 2007 they get a promotion, 2010 they leave for a
different company, 2012 they come back at a higher level, 2015 get a
promotion, then no change until 2022.  Let's say I had data like this for
roughly 2 million people, and that there are around 10 different types of
changes that could happen during any time period (could be yearly,
quarterly, monthly, I can make it how I want).  I was hoping we could ask a
computer to tell us if there were "types of careers" that people had. We
could say "put all these careers into 4 buckets" or "7 buckets" based on
similarity. Then we could look at the piles the computer made and try to
make sense of them. One type might be "company man" for people who tend to
stay in place for 20 or more years, another type could be a "rotator", who
leaves and returns every 3 years or so. Etc. The point is, I want a
computer to make the piles for me, rather than trying to come up with
potential piles a priori. Are there methods for doing this? I know it's a
problem we've *talked* about a lot, but I don't know if there are
solutions. "
*https://chat.openai.com/chat *
There are several approaches that you could use to group the careers of
these 2 million people into clusters based on similarity. One approach is
to use unsupervised machine learning techniques, specifically clustering
algorithms, to automatically group the data into clusters. Some popular
clustering algorithms that you could use for this problem include k-means,
hierarchical clustering, and DBSCAN (Density-Based Spatial Clustering of
Applications with Noise). You could also use dimensionality reduction
techniques, such as PCA (Principal Component Analysis) or t-SNE
(t-Distributed Stochastic Neighbor Embedding), to reduce the dimensionality
of the data and then use a clustering algorithm to group the data into
clusters. Another approach is to use supervised machine learning
techniques, where you would need to manually label a subset of the data
with the cluster labels, and then use this labeled data to train a
classifier to predict the cluster labels for the remaining data. It's worth
noting that these approaches will require some feature engineering, where
you will need to extract relevant features from the data that can be used
to represent the careers of the people in the dataset. Once you have
identified and extracted these features, you can then use them as input to
the machine learning algorithms to group the data into clusters.
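[The feature-engineering step is the crux for event-sequence data like this. A hedged sketch of one option (the event codes and the particular features are invented for illustration): reduce each career to a small numeric vector that any of the clustering algorithms above can consume.]

```python
def longest_run(events, code="no_change"):
    """Length of the longest unbroken run of a given event code."""
    best = run = 0
    for e in events:
        run = run + 1 if e == code else 0
        best = max(best, run)
    return best

def career_features(events):
    """Summarize an event sequence as a numeric feature vector."""
    return [
        len(events),                # observed career length in periods
        events.count("promotion"),  # number of promotions
        events.count("leave"),      # number of departures
        longest_run(events),        # longest stretch with no change
    ]

career = ["hire", "no_change", "promotion", "leave", "return", "no_change"]
print(career_features(career))  # [6, 1, 1, 1]
```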
*Pieter Steenekamp*
would it then be possible to get, for example, the "types of careers" that
people had?
*https://chat.openai.com/chat *
Yes, once you have grouped the careers of the 2 million people into
clusters using one of the approaches I mentioned above, you can then
analyze the characteristics of each cluster to identify common patterns and
trends within each cluster. This will allow you to identify "types of
careers" that people have, based on the similarity of their career events.
For example, if one of the clusters is characterized by people who have had
long tenures at a single company, you could label this cluster as "company
man" careers. Similarly, if another cluster is characterized by people who
frequently switch between companies, you could label this cluster as
"rotator" careers. You can then use this information to gain insights into
the types of careers that are represented in the dataset, and potentially
draw conclusions about the factors that may influence career paths and
success.
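[Interpreting the resulting piles can start as simply as summarizing one raw statistic per cluster. A toy sketch with invented cluster labels and departure counts:]

```python
from collections import defaultdict

# Invented example: cluster label and departure count per person
labels = {"p1": 0, "p2": 1, "p3": 0, "p4": 1}
departures = {"p1": 0, "p2": 3, "p3": 1, "p4": 4}

# Group departure counts by cluster
by_cluster = defaultdict(list)
for person, cluster in labels.items():
    by_cluster[cluster].append(departures[person])

# Mean departures per cluster: a low value suggests a "company man" pile,
# a high value a "rotator" pile
summary = {c: sum(v) / len(v) for c, v in by_cluster.items()}
print(summary)  # {0: 0.5, 1: 3.5}
```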
*Pieter Steenekamp*
can you propose Python code to do something like this please?
*https://chat.openai.com/chat *
Sure, here is an example of how you could use the scikit-learn library in
Python to group the careers of the 2 million people into clusters using the
k-means algorithm:
from sklearn.cluster import KMeans
import numpy as np

# Load the data into a NumPy array
data = np.loadtxt('career_data.txt')
# Create the k-means model
model = KMeans(n_clusters=4)
# Fit the model to the data
model.fit(data)
# Predict the cluster labels for each data point
labels = model.predict(data)
# Print the cluster labels
print(labels)
This code will group the data into 4 clusters, as specified by the
n_clusters parameter. You can adjust this parameter to change the number of
clusters that the data is grouped into.
To extract features from the data that can be used as input to the k-means
model, you will need to preprocess the data and select relevant features.
This will likely involve some feature engineering, which will depend on the
specific characteristics of the data and the goals of your analysis.
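[One concrete preprocessing option for categorical event sequences is position-wise one-hot encoding, after which the squared Euclidean distance that k-means minimizes is just twice the number of periods where two careers differ. A stdlib-only sketch with invented sequences:]

```python
# Invented example careers, aligned period by period
careers = [
    ["hired", "stay", "promo"],
    ["hired", "leave", "return"],
]

# Vocabulary of all event codes seen in the data
vocab = sorted({e for seq in careers for e in seq})

def one_hot(seq):
    """Concatenate one one-hot vector per period."""
    return [1 if e == v else 0 for e in seq for v in vocab]

encoded = [one_hot(seq) for seq in careers]
# Squared Euclidean distance = 2 x number of periods that differ
d2 = sum((a - b) ** 2 for a, b in zip(encoded[0], encoded[1]))
print(d2 // 2)  # 2 periods differ
```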