[
https://issues.apache.org/jira/browse/DATAFU-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950069#comment-16950069
]
Eyal Allweil edited comment on DATAFU-150 at 10/12/19 3:34 PM:
---------------------------------------------------------------
[~russell.jurney], I was trying to understand what this does and found the
[OneHotEncoder in Spark ML|https://org.apache.spark.ml.feature.OneHotEncoder].
Can you explain the difference between the two? (I'm afraid my data science
skills aren't what they should be)
was (Author: eyal):
[~russell.jurney], I was trying to understand what this does and found the
[OneHotEncoder in Spark ML|org.apache.spark.ml.feature.OneHotEncoder]. Can you
explain the difference between the two? (I'm afraid my data science skills
aren't what they should be)
> Add MultiLabelOneHotEncoder
> ---------------------------
>
> Key: DATAFU-150
> URL: https://issues.apache.org/jira/browse/DATAFU-150
> Project: DataFu
> Issue Type: Improvement
> Reporter: Russell Jurney
> Assignee: Russell Jurney
> Priority: Major
>
> I have created the following code in Python to one-hot encode multilabel data
> and would like to add it to DataFu:
> {code:python}
> from pyspark.sql import Row
> import pyspark.sql.types as T
>
> # filtered_lists, remaining_tags_df and spark are assumed to be defined
> # earlier in the session.
> questions_tags = filtered_lists.map(
>     lambda x: Row(
>         _Body=x[0],
>         _Tags=x[1]
>     )
> ).toDF()
> questions_tags.show()
>
> # Create indexes for each multilabel tag
> enumerated_labels = list(enumerate(
>     sorted(
>         remaining_tags_df.rdd
>         .map(lambda x: x.tag)
>         .collect()
>     )
> ))
> tag_index = {x: i for i, x in enumerated_labels}
> index_tag = {i: x for i, x in enumerated_labels}  # reverse lookup: index to tag
>
> def one_hot_encode(tag_list, enumerated_labels):
>     """PySpark can't one-hot-encode multilabel data, so we do it ourselves."""
>     tags = set(tag_list)
>     one_hot_row = [1 if label in tags else 0 for _, label in enumerated_labels]
>     assert len(one_hot_row) == len(enumerated_labels)
>     return one_hot_row
>
> # One-hot encode the tags of each question
> one_hot_questions = questions_tags.rdd.map(
>     lambda x: Row(
>         _Body=x._Body,
>         _Tags=one_hot_encode(x._Tags, enumerated_labels)
>     )
> )
>
> # Create a DataFrame
> schema = T.StructType([
>     T.StructField("_Body", T.ArrayType(
>         T.StringType()
>     )),
>     T.StructField("_Tags", T.ArrayType(
>         T.IntegerType()
>     ))
> ])
> one_hot_df = spark.createDataFrame(
>     one_hot_questions,
>     schema
> )
> one_hot_df.show()
> {code}
> Which shows:
> {code}
> +--------------------+--------------------+
> | _Body| _Tags|
> +--------------------+--------------------+
> |[Convert, Decimal...|[0, 0, 0, 0, 0, 0...|
> |[Percentage, widt...|[0, 0, 0, 0, 0, 0...|
> |[How, I, calculat...|[0, 1, 0, 0, 0, 0...|
> |[Calculate, relat...|[0, 0, 0, 0, 0, 0...|
> |[Determine, user,...|[0, 0, 0, 0, 0, 0...|
> |[Difference, Math...|[0, 1, 0, 0, 0, 0...|
> |[Filling, DataSet...|[0, 0, 1, 0, 0, 0...|
> |[Binary, Data, My...|[0, 0, 0, 0, 0, 0...|
> |[What, fastest, w...|[0, 0, 0, 0, 0, 0...|
> |[Throw, error, My...|[0, 0, 0, 0, 0, 0...|
> |[How, use, C, soc...|[0, 0, 0, 0, 0, 0...|
> |[Unloading, ByteA...|[0, 0, 0, 0, 0, 0...|
> |[Check, changes, ...|[0, 0, 0, 0, 0, 0...|
> |[Reliable, timer,...|[0, 1, 0, 0, 0, 0...|
> |[Best, way, allow...|[0, 0, 0, 0, 0, 0...|
> |[Multiple, submit...|[0, 0, 0, 0, 0, 0...|
> |[How, I, get, dis...|[0, 0, 1, 0, 0, 0...|
> |[Paging, collecti...|[0, 0, 1, 0, 0, 0...|
> |[How, I, add, exi...|[0, 0, 0, 0, 0, 0...|
> |[Getting, Subclip...|[0, 0, 0, 0, 0, 0...|
> +--------------------+--------------------+
> {code}
>
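For comparison with the Spark ML OneHotEncoder mentioned in the comment: that transformer takes a single category index per row and produces a vector with at most one non-zero entry, whereas the code in the description sets a 1 for every tag a question carries. The following is a minimal plain-Python sketch of the multilabel case, with no Spark dependency. The function names and the sample tags are hypothetical and are not part of DataFu or of the issue.

```python
def build_tag_index(tag_lists):
    """Map each distinct tag to a stable column position (sorted order)."""
    distinct = sorted({tag for tags in tag_lists for tag in tags})
    return {tag: i for i, tag in enumerate(distinct)}


def multi_hot_encode(tags, tag_index):
    """Return a 0/1 list with a 1 at the position of every tag present."""
    row = [0] * len(tag_index)
    for tag in tags:
        if tag in tag_index:  # tags outside the vocabulary are ignored
            row[tag_index[tag]] = 1
    return row


# Hypothetical sample data, not taken from the issue.
questions = [["python", "spark"], ["java"], ["spark", "java", "hadoop"]]
tag_index = build_tag_index(questions)
encoded = [multi_hot_encode(tags, tag_index) for tags in questions]
```

With this sample data the second question yields a row with a single 1, while the other two yield rows with several, which is the situation a single-label encoder does not cover.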