[ 
https://issues.apache.org/jira/browse/SPARK-11234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15022830#comment-15022830
 ] 

Joseph K. Bradley commented on SPARK-11234:
-------------------------------------------

[~yinxusen] Thank you for working through this task!  Here are some of my 
thoughts:

{quote}1. Currently, multi-line per record JSON file is hard to handle, I have 
to load the data with JsonInputFormat in the json-pxf-ext package.
{quote}
* This is a work in progress in [SPARK-7366], but there is no clear ETA.

{quote}2. String indexer is easy to use. But it is hard to do beyond existing 
transformers. Like in the code, when I want to add all vectors that belong to 
the same id together, I have to write an aggregate function.
{quote}
* Does the SQLTransformer help?  If you could pick any API to write this 
operation, what would be ideal for you?  (I'm envisioning something analogous 
to a UDF for ML Pipelines, but that is almost provided by the SQLTransformer.)
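
For reference, the operation described in the quote (adding together all vectors that share an id) can be sketched in plain Python. This is only an illustration of the aggregation itself, not Spark or SQLTransformer code; the function name and the (id, vector) row layout are assumptions made for the example.

```python
def sum_vectors_by_id(rows):
    """Sum equal-length vectors that share the same id.

    `rows` is an iterable of (id, vector) pairs; this layout is an
    assumption for illustration, not a Spark data structure.
    """
    totals = {}
    for key, vec in rows:
        if key not in totals:
            # Copy so the caller's list is never mutated.
            totals[key] = list(vec)
            continue
        acc = totals[key]
        if len(acc) != len(vec):
            raise ValueError("vectors for id %r differ in length" % (key,))
        for i, value in enumerate(vec):
            acc[i] += value
    return totals


rows = [("a", [1.0, 0.0]), ("b", [0.0, 2.0]), ("a", [0.5, 1.0])]
summed = sum_vectors_by_id(rows)
```

An API that let a user plug a function like this into a Pipeline stage would be one possible shape for the "UDF for ML Pipelines" idea.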

{quote}3. ParamGridBuilder accepts discrete parameter candidates, but I need to 
add some parameters with guess like Array(1.0, 0.1, 0.01). I don't know which 
parameter is suitable and how to fill in the array will get a better result. 
How about giving a range of real numbers so that the ParamGridBuilder can 
generate candidates for me like [0.0001, 1]?
{quote}

Do you mean it should automatically zoom in on regions which seem to get good 
results?  I agree this can help in practice; I did something like this for a 
different ML library.
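
To make the "zoom in" idea concrete, here is a minimal sketch in plain Python. It is not ParamGridBuilder code: the helper names `log_spaced` and `zoom_in` are hypothetical, the scores are made-up placeholders, and it assumes the candidates are sorted in increasing order and that a higher score is better.

```python
import math


def log_spaced(low, high, num):
    """Return `num` candidates evenly spaced on a log10 scale over [low, high]."""
    if low <= 0 or high <= low or num < 2:
        raise ValueError("need 0 < low < high and num >= 2")
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (num - 1)
    return [10 ** (lo + i * step) for i in range(num)]


def zoom_in(candidates, scores, num):
    """Build a finer grid between the neighbours of the best-scoring candidate."""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    low = candidates[max(best - 1, 0)]
    high = candidates[min(best + 1, len(candidates) - 1)]
    return log_spaced(low, high, num)


# Coarse pass over the range [0.0001, 1] mentioned in the quote.
coarse = log_spaced(1e-4, 1.0, 5)
# Placeholder scores, one per coarse candidate (not real results).
scores = [0.61, 0.70, 0.74, 0.68, 0.55]
# Second pass: refine around the best coarse candidate.
fine = zoom_in(coarse, scores, 5)
```

Spacing the candidates on a log scale matches the Array(1.0, 0.1, 0.01) style of guess, and each extra zoom pass costs another round of model fitting, so the number of passes would need to be bounded.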

{quote}4. The evaluator forces me to select a metric method. But sometimes I 
want to see all the evaluation results, say F1, precision-recall, AUC, etc.
{quote}

Do you want the metrics (a) for the sake of viewing performance at the end of a 
test?  Or do you want the metrics (b) for model selection?  If it's for (a) 
viewing at the end of a test, then model summaries are probably the way to go.  
Only LinearRegression and LogisticRegression have summaries currently, but we 
should add them for other models too.
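
As a rough illustration of case (a), the sketch below computes several metrics in one pass over labels and predictions. It is plain Python and not the Evaluator or model summary API; the function name `binary_metrics` and the binary-label setup are assumptions for the example.

```python
def binary_metrics(labels, predictions, positive=1):
    """Return precision, recall, F1 and accuracy for binary predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy labels and predictions, made up for the example.
report = binary_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
```

A summary object could expose a report like this after fitting, while model selection (case b) would still optimize a single chosen metric.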

{quote}5. ML transformers will get stuck when facing with Int type. It's 
strange that we have to transform all Int values to double values before hand. 
I think a wise auto casting is helpful.
{quote}

I agree that too many Transformers are brittle when it comes to accepting 
multiple Numeric types.  I created an umbrella JIRA for this [SPARK-11107], 
but perhaps we can find a way to make this change everywhere at once, rather 
than case by case.

> What's cooking classification
> -----------------------------
>
>                 Key: SPARK-11234
>                 URL: https://issues.apache.org/jira/browse/SPARK-11234
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Xusen Yin
>
> I add the subtask to post the work on this dataset:  
> https://www.kaggle.com/c/whats-cooking



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
