[jira] [Commented] (SPARK-10578) pyspark.ml.classification.RandomForestClassifer does not return `rawPrediction` column
    [ https://issues.apache.org/jira/browse/SPARK-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743831#comment-14743831 ]

Karen Yin-Yee Ng commented on SPARK-10578:
------------------------------------------

Thanks [~josephkb] and [~viirya] for the quick response.

> pyspark.ml.classification.RandomForestClassifer does not return `rawPrediction` column
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-10578
>                 URL: https://issues.apache.org/jira/browse/SPARK-10578
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.4.0, 1.4.1
>         Environment: CentOS, PySpark 1.4.1, Scala 2.10
>            Reporter: Karen Yin-Yee Ng
>            Assignee: Joseph K. Bradley
>             Fix For: 1.5.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> To use `pyspark.ml.classification.RandomForestClassifier` with
> `BinaryClassificationEvaluator`, the output of `RandomForestClassifier`
> must include a column called `rawPrediction`.
> The PySpark documentation example for `LogisticRegression` outputs the
> `rawPrediction` column, but `RandomForestClassifier` does not.
> As a result, `RandomForestClassifier` can neither be used with the
> evaluator nor be placed in a pipeline with cross validation.
> Code that reproduces the bug can be found at:
> https://gist.github.com/karenyyng/cf61ae655b032f754bfb
> A related post, possibly caused by this bug, can be found at:
> http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-running-CrossValidator-with-RandomForestClassifier-on-dataset-td23791.html

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10578) pyspark.ml.classification.RandomForestClassifer does not return `rawPrediction` column
Karen Yin-Yee Ng created SPARK-10578:
-------------------------------------

             Summary: pyspark.ml.classification.RandomForestClassifer does not return `rawPrediction` column
                 Key: SPARK-10578
                 URL: https://issues.apache.org/jira/browse/SPARK-10578
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 1.4.1, 1.4.0
         Environment: CentOS, PySpark 1.4.1, Scala 2.10
            Reporter: Karen Yin-Yee Ng

To use `pyspark.ml.classification.RandomForestClassifier` with
`BinaryClassificationEvaluator`, the output of `RandomForestClassifier`
must include a column called `rawPrediction`.

The PySpark documentation example for `LogisticRegression` outputs the
`rawPrediction` column, but `RandomForestClassifier` does not.

As a result, `RandomForestClassifier` can neither be used with the
evaluator nor be placed in a pipeline with cross validation.

Code that reproduces the bug can be found at:
https://gist.github.com/karenyyng/cf61ae655b032f754bfb

A related post, possibly caused by this bug, can be found at:
http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-running-CrossValidator-with-RandomForestClassifier-on-dataset-td23791.html
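
The report above comes down to a required column being absent from the classifier's output. As a minimal sketch (not taken from the issue or the linked gist), the check below lists which columns an evaluator expects but a transformed DataFrame does not provide. The helper name `missing_columns` and the sample column lists are hypothetical; the only PySpark facts assumed are that `BinaryClassificationEvaluator` reads `rawPrediction` and `label` by default and that a DataFrame exposes its column names through `.columns`.

```python
# Hypothetical helper, not part of PySpark: report which columns an
# evaluator expects but a transformed DataFrame does not provide.
REQUIRED_BY_BINARY_EVALUATOR = ("rawPrediction", "label")  # evaluator defaults


def missing_columns(available, required=REQUIRED_BY_BINARY_EVALUATOR):
    """Return the required column names absent from `available`, in order."""
    present = set(available)
    return [name for name in required if name not in present]


# Illustrative column lists, assumed for this sketch only.
without_raw = ["label", "features", "prediction"]
with_raw = ["label", "features", "rawPrediction", "probability", "prediction"]

print(missing_columns(without_raw))  # ['rawPrediction']
print(missing_columns(with_raw))     # []
```

In a PySpark session the same check could be run as `missing_columns(model.transform(test_df).columns)`, where `model` and `test_df` stand for a fitted model and a test DataFrame that are not shown here.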
[jira] [Commented] (SPARK-9807) pyspark.sql.createDataFrame does not infer data type of parsed TSV
    [ https://issues.apache.org/jira/browse/SPARK-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712593#comment-14712593 ]

Karen Yin-Yee Ng commented on SPARK-9807:
-----------------------------------------

I have an ad hoc piece of Python code that parses a DataFrame schema from Python strings, similar to what Yanbo Liang has mentioned.
But that is not the point. The basic functionality of parsing CSV / TSV should be part of PySpark.
I should have submitted, and will submit, a feature request.

pyspark.sql.createDataFrame does not infer data type of parsed TSV
------------------------------------------------------------------

                Key: SPARK-9807
                URL: https://issues.apache.org/jira/browse/SPARK-9807
            Project: Spark
         Issue Type: Bug
         Components: PySpark
   Affects Versions: 1.4.1
        Environment: CentOS 6, Python version 2.7.10, Scala version 2-10
           Reporter: Karen Yin-Yee Ng
  Original Estimate: 24h
 Remaining Estimate: 24h

I tried parsing a space-separated file from HDFS and using `pyspark.sqlContext.createDataFrame` to convert the parsed lines to a PySpark DataFrame.
However, all entries are parsed as string type regardless of what the correct data type is.
An example of my code and output can be found at:
https://gist.github.com/karenyyng/a1264d6344c54df4fcc5
[jira] [Commented] (SPARK-9807) pyspark.sql.createDataFrame does not infer data type of parsed TSV
    [ https://issues.apache.org/jira/browse/SPARK-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712597#comment-14712597 ]

Karen Yin-Yee Ng commented on SPARK-9807:
-----------------------------------------

It just means that the DataFrame keeps the data type from the RDD. It has not done any type inference from a user's perspective.

pyspark.sql.createDataFrame does not infer data type of parsed TSV
------------------------------------------------------------------

                Key: SPARK-9807
                URL: https://issues.apache.org/jira/browse/SPARK-9807
            Project: Spark
         Issue Type: Bug
         Components: PySpark
   Affects Versions: 1.4.1
        Environment: CentOS 6, Python version 2.7.10, Scala version 2-10
           Reporter: Karen Yin-Yee Ng
  Original Estimate: 24h
 Remaining Estimate: 24h

I tried parsing a space-separated file from HDFS and using `pyspark.sqlContext.createDataFrame` to convert the parsed lines to a PySpark DataFrame.
However, all entries are parsed as string type regardless of what the correct data type is.
An example of my code and output can be found at:
https://gist.github.com/karenyyng/a1264d6344c54df4fcc5
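
Because the DataFrame keeps whatever Python types the RDD rows hold, fields produced by splitting a line of text arrive as strings. One possible workaround is to convert each field before the rows reach `createDataFrame`. The sketch below is an illustration only, not the code from the gist or the ad hoc parser mentioned in the thread; the names `infer_value` and `parse_line` and the sample record are made up.

```python
def infer_value(token):
    """Convert one text field to int or float when possible, else keep str."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            continue
    return token


def parse_line(line):
    """Split a whitespace-separated line and convert each field."""
    return [infer_value(token) for token in line.split()]


sample = "42 3.14 spark"  # made-up record, assumed for this sketch
row = parse_line(sample)
print(row)                                      # [42, 3.14, 'spark']
print([type(value).__name__ for value in row])  # ['int', 'float', 'str']
```

In a PySpark session the rows could then be passed on with something like `sqlContext.createDataFrame(lines.map(parse_line), column_names)`, where `lines` (an RDD of strings) and `column_names` are placeholders. A column whose fields convert to different types in different rows may still not be inferred cleanly, so an explicit schema is likely the safer choice for real data.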
[jira] [Issue Comment Deleted] (SPARK-9807) pyspark.sql.createDataFrame does not infer data type of parsed TSV
     [ https://issues.apache.org/jira/browse/SPARK-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karen Yin-Yee Ng updated SPARK-9807:
------------------------------------
    Comment: was deleted

(was: It just means that the DataFrame keeps the data type from the RDD. It has not done any type inference from a user's perspective.)

pyspark.sql.createDataFrame does not infer data type of parsed TSV
------------------------------------------------------------------

                Key: SPARK-9807
                URL: https://issues.apache.org/jira/browse/SPARK-9807
            Project: Spark
         Issue Type: Bug
         Components: PySpark
   Affects Versions: 1.4.1
        Environment: CentOS 6, Python version 2.7.10, Scala version 2-10
           Reporter: Karen Yin-Yee Ng
  Original Estimate: 24h
 Remaining Estimate: 24h

I tried parsing a space-separated file from HDFS and using `pyspark.sqlContext.createDataFrame` to convert the parsed lines to a PySpark DataFrame.
However, all entries are parsed as string type regardless of what the correct data type is.
An example of my code and output can be found at:
https://gist.github.com/karenyyng/a1264d6344c54df4fcc5
[jira] [Commented] (SPARK-9807) pyspark.sql.createDataFrame does not infer data type of parsed TSV
    [ https://issues.apache.org/jira/browse/SPARK-9807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712583#comment-14712583 ]

Karen Yin-Yee Ng commented on SPARK-9807:
-----------------------------------------

The documentation at
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=createdataframe#pyspark.sql.SQLContext.createDataFrame
says:

    When schema is a list of column names, the type of each column will be inferred from data.

I did supply the `sqlContext.createDataFrame` method with the column names in my example.
Please correct the documentation if the type inference is not supposed to work.

pyspark.sql.createDataFrame does not infer data type of parsed TSV
------------------------------------------------------------------

                Key: SPARK-9807
                URL: https://issues.apache.org/jira/browse/SPARK-9807
            Project: Spark
         Issue Type: Bug
         Components: PySpark
   Affects Versions: 1.4.1
        Environment: CentOS 6, Python version 2.7.10, Scala version 2-10
           Reporter: Karen Yin-Yee Ng
  Original Estimate: 24h
 Remaining Estimate: 24h

I tried parsing a space-separated file from HDFS and using `pyspark.sqlContext.createDataFrame` to convert the parsed lines to a PySpark DataFrame.
However, all entries are parsed as string type regardless of what the correct data type is.
An example of my code and output can be found at:
https://gist.github.com/karenyyng/a1264d6344c54df4fcc5
[jira] [Created] (SPARK-9807) pyspark.sql.createDataFrame does not infer data type of parsed TSV
Karen Yin-Yee Ng created SPARK-9807:
------------------------------------

             Summary: pyspark.sql.createDataFrame does not infer data type of parsed TSV
                 Key: SPARK-9807
                 URL: https://issues.apache.org/jira/browse/SPARK-9807
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.4.1
         Environment: CentOS 6, Python version 2.7.10, Scala version 2-10
            Reporter: Karen Yin-Yee Ng

I tried parsing a space-separated file from HDFS and using `pyspark.sqlContext.createDataFrame` to convert the parsed lines to a PySpark DataFrame.
However, all entries are parsed as string type regardless of what the correct data type is.
An example of my code and output can be found at:
https://gist.github.com/karenyyng/a1264d6344c54df4fcc5