[jira] [Commented] (SPARK-20133) User guide for spark.ml.stat.ChiSquareTest

2017-03-30 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949418#comment-15949418
 ] 

Benjamin Fradet commented on SPARK-20133:
-

Can I take this one?

> User guide for spark.ml.stat.ChiSquareTest
> --
>
> Key: SPARK-20133
> URL: https://issues.apache.org/jira/browse/SPARK-20133
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add new user guide section for spark.ml.stat, and document ChiSquareTest.  
> This may involve adding new example scripts.
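A minimal sketch of the kind of example script the new guide section might include (assuming a SparkSession named `spark`; the example ultimately added to the guide may differ):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest

val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)
val df = spark.createDataFrame(data).toDF("label", "features")

// ChiSquareTest.test returns a one-row DataFrame with the test results.
val chi = ChiSquareTest.test(df, "features", "label").head
println(s"pValues = ${chi.getAs[Vector](0)}")
println(s"degreesOfFreedom = ${chi.getSeq[Int](1).mkString("[", ",", "]")}")
println(s"statistics = ${chi.getAs[Vector](2)}")
```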



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20097) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR

2017-03-25 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-20097:
---

 Summary: Fix visibility discrepancy with numInstances and 
degreesOfFreedom in LR and GLR
 Key: SPARK-20097
 URL: https://issues.apache.org/jira/browse/SPARK-20097
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.1.0
Reporter: Benjamin Fradet
Priority: Trivial


- numInstances is public in LR but private in GLR
- degreesOfFreedom is private in LR but public in GLR






[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException

2016-10-27 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611926#comment-15611926
 ] 

Benjamin Fradet commented on SPARK-16857:
-

I was wondering why a KMeansEvaluator computing the WSSSE hasn't been implemented 
yet.

Any ideas why not?

> CrossValidator and KMeans throws IllegalArgumentException
> -
>
> Key: SPARK-16857
> URL: https://issues.apache.org/jira/browse/SPARK-16857
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
> Environment: spark-jobserver docker image.  Spark 1.6.1 on ubuntu, 
> Hadoop 2.4
>Reporter: Ryan Claussen
>
> I am attempting to use CrossValidator to train a KMeans model. When I attempt 
> to fit the data, Spark throws an IllegalArgumentException as below, since the 
> KMeans algorithm outputs an Integer into the prediction column instead of a 
> Double. Before I go too far: is using CrossValidator with KMeans supported?
> Here's the exception:
> {quote}
> java.lang.IllegalArgumentException: requirement failed: Column prediction 
> must be of type DoubleType but was actually IntegerType.
>  at scala.Predef$.require(Predef.scala:233)
>  at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>  at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
>  at 
> spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> Here is the code I'm using to set up my cross validator. As the stack trace 
> above indicates, it is failing at the fit step:
> {quote}
> ...
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
> val labelConverter = new 
> IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
> val pipeline = new Pipeline().setStages(Array(labelIndexer, 
> featureIndexer, mpc, labelConverter))
> val evaluator = new 
> MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction")
> val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 
> 200, 500)).build()
> val cv = new 
> CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
> val cvModel = cv.fit(trainingData)
> {quote}
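One possible workaround while this is open (an untested sketch, not an official fix): cast the Integer prediction column to Double before the evaluator sees it, for example with an extra `SQLTransformer` stage appended to the pipeline:

```scala
import org.apache.spark.ml.feature.SQLTransformer

// Hypothetical extra pipeline stage: copies the Integer "prediction" column
// into a Double column the evaluator can consume.
val castPrediction = new SQLTransformer()
  .setStatement("SELECT *, CAST(prediction AS DOUBLE) AS predictionD FROM __THIS__")

// Then point the evaluator at the casted column instead:
// evaluator.setPredictionCol("predictionD")
```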






[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-05-28 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305275#comment-15305275
 ] 

Benjamin Fradet commented on SPARK-15581:
-

Thanks! Maybe we should add it to the roadmap, don't you think?


> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) 
> (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete.  More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify 
> persistence) (SPARK-15574)
> * Feature parity: The main goal of the Python API is to have feature parity 
> with the Scala/Java API. You can find a [complete list here| 
> 

[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-05-27 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304869#comment-15304869
 ] 

Benjamin Fradet commented on SPARK-15581:
-

[~josephkb] Just out of curiosity: I don't see any mention of supporting 
multiclass classification for GBT or logistic regression. Is this something 
that is still planned?

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>

[jira] [Commented] (SPARK-15200) Add documentation and examples for GaussianMixture

2016-05-08 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275524#comment-15275524
 ] 

Benjamin Fradet commented on SPARK-15200:
-

Whoops, didn't see it was linked to SPARK-15101.

> Add documentation and examples for GaussianMixture
> -
>
> Key: SPARK-15200
> URL: https://issues.apache.org/jira/browse/SPARK-15200
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Benjamin Fradet
>Priority: Minor
>







[jira] [Commented] (SPARK-15200) Add documentation and examples for GaussianMixture

2016-05-07 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275227#comment-15275227
 ] 

Benjamin Fradet commented on SPARK-15200:
-

I've started working on this

> Add documentation and examples for GaussianMixture
> -
>
> Key: SPARK-15200
> URL: https://issues.apache.org/jira/browse/SPARK-15200
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Benjamin Fradet
>Priority: Minor
>







[jira] [Created] (SPARK-15200) Add documentation and examples for GaussianMixture

2016-05-07 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-15200:
---

 Summary: Add documentation and examples for GaussianMixture
 Key: SPARK-15200
 URL: https://issues.apache.org/jira/browse/SPARK-15200
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Benjamin Fradet
Priority: Minor
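A rough sketch of the kind of example the documentation could include (assuming a DataFrame `dataset` with a "features" Vector column; not the final example):

```scala
import org.apache.spark.ml.clustering.GaussianMixture

// Fit a two-component Gaussian mixture model.
val gmm = new GaussianMixture().setK(2)
val model = gmm.fit(dataset)

// Print the mixture weights and the parameters of each Gaussian.
for (i <- 0 until model.getK) {
  println(s"Gaussian $i: weight=${model.weights(i)}, " +
    s"mu=${model.gaussians(i).mean}, sigma=\n${model.gaussians(i).cov}")
}
```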









[jira] [Commented] (SPARK-14985) Update LinearRegression, LogisticRegression summary internals to handle model copy

2016-04-30 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265261#comment-15265261
 ] 

Benjamin Fradet commented on SPARK-14985:
-

I'll take this one if you guys don't mind.

> Update LinearRegression, LogisticRegression summary internals to handle model 
> copy
> --
>
> Key: SPARK-14985
> URL: https://issues.apache.org/jira/browse/SPARK-14985
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See parent JIRA + the PR for [SPARK-14852] for details.  The summaries should 
> handle creating an internal copy of the model.






[jira] [Commented] (SPARK-14817) ML 2.0 QA: Programming guide update and migration guide

2016-04-22 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254196#comment-15254196
 ] 

Benjamin Fradet commented on SPARK-14817:
-

Count me in!

> ML 2.0 QA: Programming guide update and migration guide
> ---
>
> Key: SPARK-14817
> URL: https://issues.apache.org/jira/browse/SPARK-14817
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>
> Before the release, we need to update the MLlib Programming Guide.  Updates 
> will include:
> * Make the DataFrame-based API (spark.ml) front-and-center, to make it clear 
> the RDD-based API is the older, maintenance-mode one.
> ** No docs for spark.mllib will be deleted; they will just be reorganized and 
> put in a subsection.
> ** If spark.ml docs are less complete, or if spark.ml docs say "refer to the 
> spark.mllib docs for details," then we should copy those details to the 
> spark.ml docs.
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs.
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work (which should be broken into pieces for 
> this larger 2.0 release).






[jira] [Commented] (SPARK-14570) Log instrumentation in Random forests

2016-04-20 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250067#comment-15250067
 ] 

Benjamin Fradet commented on SPARK-14570:
-

I'll take this one if you guys don't mind.

> Log instrumentation in Random forests
> -
>
> Key: SPARK-14570
> URL: https://issues.apache.org/jira/browse/SPARK-14570
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>







[jira] [Commented] (SPARK-14730) Expose ColumnPruner as feature transformer

2016-04-20 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250050#comment-15250050
 ] 

Benjamin Fradet commented on SPARK-14730:
-

[~jlaskowski], [~yanboliang] is either of you working on this?

> Expose ColumnPruner as feature transformer
> --
>
> Key: SPARK-14730
> URL: https://issues.apache.org/jira/browse/SPARK-14730
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Jacek Laskowski
>Priority: Minor
>
> From d...@spark.apache.org:
> {quote}
> Jacek:
> Came across `private class ColumnPruner` with "TODO(ekl) make this a
> public transformer" in scaladoc, cf.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L317.
> Why is this private and is there a JIRA for the TODO(ekl)?
> {quote}
> {quote}
> Yanbo Liang:
> This is because ColumnPruner is currently only used by RFormula; we did not 
> expose it as a feature transformer.
> Please feel free to create JIRA and work on it.
> {quote}






[jira] [Created] (SPARK-12983) Correct metrics.properties.template

2016-01-25 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-12983:
---

 Summary: Correct metrics.properties.template
 Key: SPARK-12983
 URL: https://issues.apache.org/jira/browse/SPARK-12983
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Spark Core
Reporter: Benjamin Fradet
Priority: Minor


There are some typos or plain unintelligible sentences in the metrics template.






[jira] [Closed] (SPARK-12858) Remove duplicated code in metrics

2016-01-24 Thread Benjamin Fradet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Fradet closed SPARK-12858.
---
Resolution: Not A Problem

> Remove duplicated code in metrics
> -
>
> Key: SPARK-12858
> URL: https://issues.apache.org/jira/browse/SPARK-12858
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Benjamin Fradet
>Priority: Minor
>
> I noticed there is some duplicated code in the sinks regarding the poll 
> period.
> Also, parts of the metrics.properties template are unclear.






[jira] [Created] (SPARK-12858) Remove duplicated code in metrics

2016-01-16 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-12858:
---

 Summary: Remove duplicated code in metrics
 Key: SPARK-12858
 URL: https://issues.apache.org/jira/browse/SPARK-12858
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Benjamin Fradet
Priority: Minor


I noticed there is some duplicated code in the sinks regarding the poll period.
Also, parts of the metrics.properties template are unclear.






[jira] [Commented] (SPARK-9716) BinaryClassificationEvaluator should accept Double prediction column

2015-12-24 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070992#comment-15070992
 ] 

Benjamin Fradet commented on SPARK-9716:


Somewhat related, I think `RegressionEvaluator` should accept all numeric types 
as the prediction column.
For example, in the case of ALS, which produces Float predictions, we currently 
need to cast those to Double before we're able to use the 
`RegressionEvaluator`.
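As a stopgap, the cast can be done by hand. A sketch (the `alsModel`, `testData`, and "rating" names here are assumptions, not part of any proposed API):

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast ALS's Float predictions to Double so RegressionEvaluator accepts them.
val predictions = alsModel.transform(testData)
  .withColumn("prediction", col("prediction").cast(DoubleType))

val rmse = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
  .evaluate(predictions)
```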

> BinaryClassificationEvaluator should accept Double prediction column
> 
>
> Key: SPARK-9716
> URL: https://issues.apache.org/jira/browse/SPARK-9716
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> BinaryClassificationEvaluator currently expects the rawPrediction column, of 
> type Vector.  It should also accept a Double prediction column, with a 
> different set of supported metrics.






[jira] [Comment Edited] (SPARK-9716) BinaryClassificationEvaluator should accept Double prediction column

2015-12-24 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070992#comment-15070992
 ] 

Benjamin Fradet edited comment on SPARK-9716 at 12/24/15 1:19 PM:
--

Somewhat related, I think `RegressionEvaluator` should accept all numeric types 
as the prediction column.

For example, in the case of ALS, which produces Float predictions, we currently 
need to cast those to Double before we're able to use the 
`RegressionEvaluator`.

Conversely, we could make ALS produce Double predictions in order to keep things 
consistent across Estimators.



> BinaryClassificationEvaluator should accept Double prediction column
> 
>
> Key: SPARK-9716
> URL: https://issues.apache.org/jira/browse/SPARK-9716
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> BinaryClassificationEvaluator currently expects the rawPrediction column, of 
> type Vector.  It should also accept a Double prediction column, with a 
> different set of supported metrics.






[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-24 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071062#comment-15071062
 ] 

Benjamin Fradet commented on SPARK-12247:
-

The [PR|https://github.com/apache/spark/pull/10411] has been out for a few days.

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS
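For reference, the spark.ml ALS portion of such an example could look roughly like this (the column names and the `training`/`test` DataFrames are placeholders, not the final documentation example):

```scala
import org.apache.spark.ml.recommendation.ALS

// Fit an ALS model on (user, item, rating) triples.
val als = new ALS()
  .setMaxIter(10)
  .setRegParam(0.1)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val model = als.fit(training)

// Score the held-out set; predictions land in a "prediction" column.
val predictions = model.transform(test)
```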






[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-23 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069366#comment-15069366
 ] 

Benjamin Fradet commented on SPARK-12247:
-

Yup, I was thinking of keeping only the RMSE calculation too.
We could also compute the RMSE using the `RegressionEvaluator` instead of 
doing it "manually"; what do you think?

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS






[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-22 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067726#comment-15067726
 ] 

Benjamin Fradet commented on SPARK-12247:
-

[~thunterdb] Do you think I should also include the calculation of false 
positives, as in [the MovieLens 
example|https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/MovieLensALS.scala#L167]?

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS






[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-21 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067021#comment-15067021
 ] 

Benjamin Fradet commented on SPARK-12247:
-

Ok thanks, I'll rework the examples accordingly.

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS






[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-19 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065417#comment-15065417
 ] 

Benjamin Fradet commented on SPARK-12247:
-

By the way, should we repurpose 
[MovieLensALS|https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/MovieLensALS.scala]
 or keep it alongside the documentation example?

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS






[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-19 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065344#comment-15065344
 ] 

Benjamin Fradet commented on SPARK-12247:
-

I've started working on this.

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS






[jira] [Commented] (SPARK-9716) BinaryClassificationEvaluator should accept Double prediction column

2015-12-19 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065342#comment-15065342
 ] 

Benjamin Fradet commented on SPARK-9716:


[~lkhamsurenl] Are you working on it or can I take over?

> BinaryClassificationEvaluator should accept Double prediction column
> 
>
> Key: SPARK-9716
> URL: https://issues.apache.org/jira/browse/SPARK-9716
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> BinaryClassificationEvaluator currently expects the rawPrediction column, of 
> type Vector.  It should also accept a Double prediction column, with a 
> different set of supported metrics.






[jira] [Created] (SPARK-12368) Better doc for the binary classification evaluator setMetricName method

2015-12-16 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-12368:
---

 Summary: Better doc for the binary classification evaluator 
setMetricName method
 Key: SPARK-12368
 URL: https://issues.apache.org/jira/browse/SPARK-12368
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, ML
Reporter: Benjamin Fradet
Priority: Minor


For the BinaryClassificationEvaluator, the scaladoc doesn't mention that 
"areaUnderPR" is supported, only that the default is "areaUnderROC".

Also, in the documentation, it is said that:
"The default metric used to choose the best ParamMap can be overridden by the 
setMetric method in each of these evaluators."
However, the method is called setMetricName.
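For context, the corrected usage would look roughly like this sketch (assuming the spark.ml evaluator API described in the issue; names are taken from the issue text, not verified against a particular Spark release):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// The metric is chosen via setMetricName (not setMetric); supported
// values are "areaUnderROC" (the default) and "areaUnderPR".
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderPR")
```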






[jira] [Commented] (SPARK-12368) Better doc for the binary classification evaluator setMetricName method

2015-12-16 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060167#comment-15060167
 ] 

Benjamin Fradet commented on SPARK-12368:
-

I've started working on this.

> Better doc for the binary classification evaluator setMetricName method
> ---
>
> Key: SPARK-12368
> URL: https://issues.apache.org/jira/browse/SPARK-12368
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Benjamin Fradet
>Priority: Minor
>
> For the BinaryClassificationEvaluator, the scaladoc doesn't mention that 
> "areaUnderPR" is supported, only that the default is "areaUnderROC".
> Also, in the documentation, it is said that:
> "The default metric used to choose the best ParamMap can be overridden by the 
> setMetric method in each of these evaluators."
> However, the method is called setMetricName.






[jira] [Updated] (SPARK-12368) Better doc for the binary classification evaluator' metricName

2015-12-16 Thread Benjamin Fradet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Fradet updated SPARK-12368:

Summary: Better doc for the binary classification evaluator' metricName  
(was: Better doc for the binary classification evaluator setMetricName method)

> Better doc for the binary classification evaluator' metricName
> --
>
> Key: SPARK-12368
> URL: https://issues.apache.org/jira/browse/SPARK-12368
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Benjamin Fradet
>Priority: Minor
>
> For the BinaryClassificationEvaluator, the scaladoc doesn't mention that 
> "areaUnderPR" is supported, only that the default is "areaUnderROC".
> Also, in the documentation, it is said that:
> "The default metric used to choose the best ParamMap can be overridden by the 
> setMetric method in each of these evaluators."
> However, the method is called setMetricName.






[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2015-12-12 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15054664#comment-15054664
 ] 

Benjamin Fradet commented on SPARK-7425:


Is there anyone working on this?
If not, I'm considering taking over this JIRA.

I started writing some unit tests for a few predictors and I'm wondering if I 
should write unit tests for all the predictors.
Input welcome.

> spark.ml Predictor should support other numeric types for label
> ---
>
> Key: SPARK-7425
> URL: https://issues.apache.org/jira/browse/SPARK-7425
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>  Labels: starter
>
> Currently, the Predictor abstraction expects the input labelCol type to be 
> DoubleType, but we should support other numeric types.  This will involve 
> updating the PredictorParams.validateAndTransformSchema method.
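A rough sketch of the schema check described above; the helper name and structure are illustrative assumptions, not the actual patch:

```scala
import org.apache.spark.sql.types.{NumericType, StructType}

// Illustrative only: accept any NumericType label column instead of
// requiring labelCol to already be DoubleType.
def validateLabelColumn(schema: StructType, labelCol: String): Unit = {
  val dataType = schema(labelCol).dataType
  require(dataType.isInstanceOf[NumericType],
    s"Label column $labelCol must be numeric, but was $dataType.")
}
```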






[jira] [Commented] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-10 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051698#comment-15051698
 ] 

Benjamin Fradet commented on SPARK-12217:
-

Sorry [~srowen], my bad. I wanted to duplicate the values from a previous JIRA 
but didn't know the implications.

> Document invalid handling for StringIndexer
> ---
>
> Key: SPARK-12217
> URL: https://issues.apache.org/jira/browse/SPARK-12217
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Benjamin Fradet
>Priority: Minor
>
> Documentation is needed regarding the handling of invalid labels in 
> StringIndexer






[jira] [Commented] (SPARK-9059) Update Python Direct Kafka Word count examples to show the use of HasOffsetRanges

2015-12-09 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049116#comment-15049116
 ] 

Benjamin Fradet commented on SPARK-9059:


There is a Python code snippet like the Java and Scala ones in the docs on 
master 
[here|https://github.com/apache/spark/blob/master/docs/streaming-kafka-integration.md#approach-2-direct-approach-no-receivers].
However, my understanding was that this wasn't the point of this jira. As I 
understood it, it was originally to incorporate in the code examples, or 
duplicate into a new example, the use of `HasOffsetRanges` like the [scala 
one|https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala].

> Update Python Direct Kafka Word count examples to show the use of 
> HasOffsetRanges
> -
>
> Key: SPARK-9059
> URL: https://issues.apache.org/jira/browse/SPARK-9059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> Update Python examples of Direct Kafka word count to access the offset ranges 
> using HasOffsetRanges and print it. For example in Scala,
>  
> {code}
> var offsetRanges: Array[OffsetRange] = _
> ...
> directKafkaDStream.foreachRDD { rdd => 
> offsetRanges = rdd.asInstanceOf[HasOffsetRanges]  
> }
> ...
> transformedDStream.foreachRDD { rdd => 
> // some operation
> println("Processed ranges: " + offsetRanges)
> }
> {code}
> See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for 
> more info, and the master source code for more updated information on python. 






[jira] [Comment Edited] (SPARK-9059) Update Python Direct Kafka Word count examples to show the use of HasOffsetRanges

2015-12-09 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049116#comment-15049116
 ] 

Benjamin Fradet edited comment on SPARK-9059 at 12/10/15 6:49 AM:
--

There is a Python code snippet like the Java and Scala ones in the docs on 
master 
[here|https://github.com/apache/spark/blob/master/docs/streaming-kafka-integration.md#approach-2-direct-approach-no-receivers].
However, my understanding was that this wasn't the point of this jira. As I 
understood it, it was originally to incorporate in the code examples, or 
duplicate into a new example, the use of `HasOffsetRanges`.


was (Author: benfradet):
There is a Python code snippet like the Java and Scala ones in the docs on 
master 
[here|https://github.com/apache/spark/blob/master/docs/streaming-kafka-integration.md#approach-2-direct-approach-no-receivers].
However, my understanding was that this wasn't the point of this jira. As I 
understood it, it was originally to incorporate in the code examples, or 
duplicate into a new example, the use of `HasOffsetRanges` like the [scala 
one|https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala].

> Update Python Direct Kafka Word count examples to show the use of 
> HasOffsetRanges
> -
>
> Key: SPARK-9059
> URL: https://issues.apache.org/jira/browse/SPARK-9059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> Update Python examples of Direct Kafka word count to access the offset ranges 
> using HasOffsetRanges and print it. For example in Scala,
>  
> {code}
> var offsetRanges: Array[OffsetRange] = _
> ...
> directKafkaDStream.foreachRDD { rdd => 
> offsetRanges = rdd.asInstanceOf[HasOffsetRanges]  
> }
> ...
> transformedDStream.foreachRDD { rdd => 
> // some operation
> println("Processed ranges: " + offsetRanges)
> }
> {code}
> See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for 
> more info, and the master source code for more updated information on python. 






[jira] [Commented] (SPARK-9059) Update Python Direct Kafka Word count examples to show the use of HasOffsetRanges

2015-12-08 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048177#comment-15048177
 ] 

Benjamin Fradet commented on SPARK-9059:


Hi [~neelesh77],

I know the documentation has been updated, but I don't see any use of 
`HasOffsetRanges` in the Scala or Java examples.
Pinging [~tdas] to get more information.

> Update Python Direct Kafka Word count examples to show the use of 
> HasOffsetRanges
> -
>
> Key: SPARK-9059
> URL: https://issues.apache.org/jira/browse/SPARK-9059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Minor
>  Labels: starter
>
> Update Python examples of Direct Kafka word count to access the offset ranges 
> using HasOffsetRanges and print it. For example in Scala,
>  
> {code}
> var offsetRanges: Array[OffsetRange] = _
> ...
> directKafkaDStream.foreachRDD { rdd => 
> offsetRanges = rdd.asInstanceOf[HasOffsetRanges]  
> }
> ...
> transformedDStream.foreachRDD { rdd => 
> // some operation
> println("Processed ranges: " + offsetRanges)
> }
> {code}
> See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for 
> more info, and the master source code for more updated information on python. 






[jira] [Created] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-08 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-12217:
---

 Summary: Document invalid handling for StringIndexer
 Key: SPARK-12217
 URL: https://issues.apache.org/jira/browse/SPARK-12217
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Benjamin Fradet
Priority: Minor
 Fix For: 1.6.1, 2.0.0


Documentation is needed regarding the handling of invalid labels in 
StringIndexer






[jira] [Commented] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-08 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047545#comment-15047545
 ] 

Benjamin Fradet commented on SPARK-12217:
-

I've started working on this.

> Document invalid handling for StringIndexer
> ---
>
> Key: SPARK-12217
> URL: https://issues.apache.org/jira/browse/SPARK-12217
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Benjamin Fradet
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> Documentation is needed regarding the handling of invalid labels in 
> StringIndexer






[jira] [Commented] (SPARK-12159) Add user guide section for IndexToString transformer

2015-12-05 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043605#comment-15043605
 ] 

Benjamin Fradet commented on SPARK-12159:
-

I've started working on this.

> Add user guide section for IndexToString transformer
> 
>
> Key: SPARK-12159
> URL: https://issues.apache.org/jira/browse/SPARK-12159
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add a user guide section for the IndexToString transformer as reported on the 
> mailing list ( 
> https://www.mail-archive.com/dev@spark.apache.org/msg12263.html )






[jira] [Created] (SPARK-11902) Unhandled case in VectorAssembler#transform

2015-11-21 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-11902:
---

 Summary: Unhandled case in VectorAssembler#transform
 Key: SPARK-11902
 URL: https://issues.apache.org/jira/browse/SPARK-11902
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.5.2
Reporter: Benjamin Fradet
Priority: Minor


I noticed that there is an unhandled case in the transform method of 
VectorAssembler if one of the input columns doesn't have one of the supported 
types: DoubleType, NumericType, BooleanType, or VectorUDT. 

So, if you try to transform a column of StringType you get a cryptic 
"scala.MatchError: StringType".

Will submit a PR shortly
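The kind of fix in question might look like the following sketch, where a default case turns the bare MatchError into a meaningful error message (illustrative only, not the submitted patch):

```scala
import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.sql.types._

// Illustrative sketch of the transform-time type check:
dataType match {
  case DoubleType => // pass the value through
  case _: NumericType | BooleanType => // cast to Double
  case _: VectorUDT => // flatten the vector
  case other => throw new IllegalArgumentException(
    s"VectorAssembler does not support the $other type")
}
```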






[jira] [Commented] (SPARK-9002) KryoSerializer initialization does not include 'Array[Int]'

2015-07-22 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636832#comment-14636832
 ] 

Benjamin Fradet commented on SPARK-9002:


[~rake] are you planning on opening a PR?

 KryoSerializer initialization does not include 'Array[Int]'
 ---

 Key: SPARK-9002
 URL: https://issues.apache.org/jira/browse/SPARK-9002
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
 Environment: MacBook Pro, OS X 10.10.4, Spark 1.4.0, master=local[*], 
 IntelliJ IDEA.
Reporter: Randy Kerber
Priority: Minor
  Labels: easyfix, newbie
   Original Estimate: 1h
  Remaining Estimate: 1h

 The object KryoSerializer (inside KryoRegistrator.scala) contains a list of 
 classes that are automatically registered with Kryo.  That list includes:
 Array\[Byte], Array\[Long], and Array\[Short].  Array\[Int] is missing from 
 that list.  Can't think of any good reason it shouldn't also be included.
 Note: This is my first time creating an issue or contributing code to an Apache 
 project. Apologies if I'm not following the process correctly. Appreciate any 
 guidance or assistance.






[jira] [Commented] (SPARK-9057) Add Scala, Java and Python example to show DStream.transform

2015-07-21 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635945#comment-14635945
 ] 

Benjamin Fradet commented on SPARK-9057:


It would also be interesting to demonstrate a use of an accumulator or a 
broadcast variable that is resilient to restarts from a checkpoint, as detailed 
in [SPARK-5206|https://issues.apache.org/jira/browse/SPARK-5206]

 Add Scala, Java and Python example to show DStream.transform
 

 Key: SPARK-9057
 URL: https://issues.apache.org/jira/browse/SPARK-9057
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tathagata Das
  Labels: starter

 Currently there is no example to show the use of transform. Would be good to 
 add an example, that uses transform to join a static RDD with the RDDs of a 
 DStream.
 Need to be done for all 3 supported languages.






[jira] [Commented] (SPARK-9059) Update Python Direct Kafka Word count examples to show the use of HasOffsetRanges

2015-07-21 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635941#comment-14635941
 ] 

Benjamin Fradet commented on SPARK-9059:


I have a version with the updated doc regarding Python; I don't know if I 
should wait for the PR to be closed before opening mine.

 Update Python Direct Kafka Word count examples to show the use of 
 HasOffsetRanges
 -

 Key: SPARK-9059
 URL: https://issues.apache.org/jira/browse/SPARK-9059
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tathagata Das
Priority: Minor
  Labels: starter

 Update Python examples of Direct Kafka word count to access the offset ranges 
 using HasOffsetRanges and print it. For example in Scala,
  
 {code}
 var offsetRanges: Array[OffsetRange] = _
 ...
 directKafkaDStream.foreachRDD { rdd => 
 offsetRanges = rdd.asInstanceOf[HasOffsetRanges]  
 }
 ...
 transformedDStream.foreachRDD { rdd => 
 // some operation
 println("Processed ranges: " + offsetRanges)
 }
 {code}
 See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for 
 more info, and the master source code for more updated information on python. 






[jira] [Commented] (SPARK-9059) Update Direct Kafka Word count examples to show the use of HasOffsetRanges

2015-07-17 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630984#comment-14630984
 ] 

Benjamin Fradet commented on SPARK-9059:


Agreed.

 Update Direct Kafka Word count examples to show the use of HasOffsetRanges
 --

 Key: SPARK-9059
 URL: https://issues.apache.org/jira/browse/SPARK-9059
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tathagata Das
  Labels: starter

 Update Scala, Java and Python examples of Direct Kafka word count to access 
 the offset ranges using HasOffsetRanges and print it. For example in Scala,
  
 {code}
 var offsetRanges: Array[OffsetRange] = _
 ...
 directKafkaDStream.foreachRDD { rdd => 
 offsetRanges = rdd.asInstanceOf[HasOffsetRanges]  
 }
 ...
 transformedDStream.foreachRDD { rdd => 
 // some operation
 println("Processed ranges: " + offsetRanges)
 }
 {code}
 See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for 
 more info, and the master source code for more updated information on python. 






[jira] [Commented] (SPARK-9059) Update Direct Kafka Word count examples to show the use of HasOffsetRanges

2015-07-17 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631142#comment-14631142
 ] 

Benjamin Fradet commented on SPARK-9059:


We could also demonstrate restarting from a specific set of offsets.
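Restarting from a specific set of offsets could be sketched like this, assuming the direct Kafka API of that era (`KafkaUtils.createDirectStream` with a `fromOffsets` map); `ssc`, `kafkaParams`, and `savedOffset` are assumed to exist and the topic name is hypothetical:

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Illustrative: resume the direct stream from previously saved offsets
// instead of the latest ones.
val fromOffsets = Map(TopicAndPartition("wordcount", 0) -> savedOffset)
val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
```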

 Update Direct Kafka Word count examples to show the use of HasOffsetRanges
 --

 Key: SPARK-9059
 URL: https://issues.apache.org/jira/browse/SPARK-9059
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tathagata Das
  Labels: starter

 Update Scala, Java and Python examples of Direct Kafka word count to access 
 the offset ranges using HasOffsetRanges and print it. For example in Scala,
  
 {code}
 var offsetRanges: Array[OffsetRange] = _
 ...
 directKafkaDStream.foreachRDD { rdd => 
 offsetRanges = rdd.asInstanceOf[HasOffsetRanges]  
 }
 ...
 transformedDStream.foreachRDD { rdd => 
 // some operation
 println("Processed ranges: " + offsetRanges)
 }
 {code}
 See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for 
 more info, and the master source code for more updated information on python. 






[jira] [Commented] (SPARK-9059) Update Direct Kafka Word count examples to show the use of HasOffsetRanges

2015-07-16 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629770#comment-14629770
 ] 

Benjamin Fradet commented on SPARK-9059:


I've started working on this.

 Update Direct Kafka Word count examples to show the use of HasOffsetRanges
 --

 Key: SPARK-9059
 URL: https://issues.apache.org/jira/browse/SPARK-9059
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tathagata Das
  Labels: starter

 Update Scala, Java and Python examples of Direct Kafka word count to access 
 the offset ranges using HasOffsetRanges and print it. For example in Scala,
  
 {code}
 var offsetRanges: Array[OffsetRange] = _
 ...
 directKafkaDStream.foreachRDD { rdd => 
 offsetRanges = rdd.asInstanceOf[HasOffsetRanges]  
 }
 ...
 transformedDStream.foreachRDD { rdd => 
 // some operation
 println("Processed ranges: " + offsetRanges)
 }
 {code}
 See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for 
 more info, and the master source code for more updated information on python. 






[jira] [Commented] (SPARK-8575) Deprecate callUDF in favor of udf

2015-06-23 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598353#comment-14598353
 ] 

Benjamin Fradet commented on SPARK-8575:


I've started working on this issue.

 Deprecate callUDF in favor of udf
 -

 Key: SPARK-8575
 URL: https://issues.apache.org/jira/browse/SPARK-8575
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Benjamin Fradet
Priority: Minor
 Fix For: 1.5.0


 Follow-up of [SPARK-8356|https://issues.apache.org/jira/browse/SPARK-8356] to 
 use {{callUDF}} in favor of {{udf}} wherever possible.






[jira] [Updated] (SPARK-8575) Deprecate callUDF in favor of udf

2015-06-23 Thread Benjamin Fradet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Fradet updated SPARK-8575:
---
Description: Follow-up of 
[SPARK-8356|https://issues.apache.org/jira/browse/SPARK-8356] to use 
{{callUDF}} in favor of {{udf}} wherever possible.  (was: Follow-up of 
[SPARK-8356|https://issues.apache.org/jira/browse/SPARK-8356] to deprecate 
callUDF in favor of udf wherever possible.)

 Deprecate callUDF in favor of udf
 -

 Key: SPARK-8575
 URL: https://issues.apache.org/jira/browse/SPARK-8575
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Benjamin Fradet
Priority: Minor
 Fix For: 1.5.0


 Follow-up of [SPARK-8356|https://issues.apache.org/jira/browse/SPARK-8356] to 
 use {{callUDF}} in favor of {{udf}} wherever possible.






[jira] [Commented] (SPARK-8115) Remove TestData

2015-06-21 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595052#comment-14595052
 ] 

Benjamin Fradet commented on SPARK-8115:


I've started working on this.

 Remove TestData
 ---

 Key: SPARK-8115
 URL: https://issues.apache.org/jira/browse/SPARK-8115
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Minor

 TestData was from the era when we didn't have easy ways to generate test 
 datasets. Now that we have implicits on Seq + toDF, it'd make more sense to put 
 the test datasets closer to the test suites.






[jira] [Commented] (SPARK-8478) Harmonize UDF-related code to use uniformly UDF instead of Udf

2015-06-19 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593368#comment-14593368
 ] 

Benjamin Fradet commented on SPARK-8478:


As discussed on [SPARK-8356|https://issues.apache.org/jira/browse/SPARK-8356], 
it'd be cool to harmonize code regarding UDFs,
I've started working on this.

 Harmonize UDF-related code to use uniformly UDF instead of Udf
 --

 Key: SPARK-8478
 URL: https://issues.apache.org/jira/browse/SPARK-8478
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Benjamin Fradet
Priority: Minor

 Some UDF-related code uses the Udf naming instead of UDF.
 This JIRA unifies the naming in favor of UDF.






[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf

2015-06-19 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593362#comment-14593362
 ] 

Benjamin Fradet commented on SPARK-8356:


I'll create a separate JIRA for harmonizing the naming in UDF-related code: 
[SPARK-8478|https://issues.apache.org/jira/browse/SPARK-8478].

 Reconcile callUDF and callUdf
 -

 Key: SPARK-8356
 URL: https://issues.apache.org/jira/browse/SPARK-8356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical
  Labels: starter

 Right now we have two functions {{callUDF}} and {{callUdf}}.  I think the 
 former is used for calling Java functions (and the documentation is wrong) 
 and the latter is for calling functions by name.  Either way this is 
 confusing and we should unify or pick different names.  Also, let's make sure 
 the docs are right.






[jira] [Created] (SPARK-8478) Harmonize UDF-related code to use uniformly UDF instead of Udf

2015-06-19 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-8478:
--

 Summary: Harmonize UDF-related code to use uniformly UDF instead 
of Udf
 Key: SPARK-8478
 URL: https://issues.apache.org/jira/browse/SPARK-8478
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Benjamin Fradet
Priority: Minor


Some UDF-related code uses Udf naming instead of UDF.
This JIRA standardizes the naming in favor of UDF.






[jira] [Comment Edited] (SPARK-8356) Reconcile callUDF and callUdf

2015-06-19 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593362#comment-14593362
 ] 

Benjamin Fradet edited comment on SPARK-8356 at 6/19/15 12:02 PM:
--

I've created a separate JIRA for harmonizing the naming in UDF-related code: 
[SPARK-8478|https://issues.apache.org/jira/browse/SPARK-8478].


was (Author: benfradet):
I'll create a separate JIRA for harmonizing the naming in UDF-related code: 
[SPARK-8478|https://issues.apache.org/jira/browse/SPARK-8478].

 Reconcile callUDF and callUdf
 -

 Key: SPARK-8356
 URL: https://issues.apache.org/jira/browse/SPARK-8356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical
  Labels: starter

 Right now we have two functions {{callUDF}} and {{callUdf}}.  I think the 
 former is used for calling Java functions (and the documentation is wrong) 
 and the latter is for calling functions by name.  Either way this is 
 confusing and we should unify or pick different names.  Also, let's make sure 
 the docs are right.






[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf

2015-06-17 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590491#comment-14590491
 ] 

Benjamin Fradet commented on SPARK-8356:


Somewhat related, regarding coherence: there are {{PythonUDF}} and 
{{ScalaUdf}}. Maybe we should straighten this out as well.

 Reconcile callUDF and callUdf
 -

 Key: SPARK-8356
 URL: https://issues.apache.org/jira/browse/SPARK-8356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical
  Labels: starter

 Right now we have two functions {{callUDF}} and {{callUdf}}.  I think the 
 former is used for calling Java functions (and the documentation is wrong) 
 and the latter is for calling functions by name.  Either way this is 
 confusing and we should unify or pick different names.  Also, let's make sure 
 the docs are right.






[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf

2015-06-17 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590513#comment-14590513
 ] 

Benjamin Fradet commented on SPARK-8356:


Ok, I'll make sure Udf disappears. Should I open another JIRA, or can I add it to 
the PR for this one?

 Reconcile callUDF and callUdf
 -

 Key: SPARK-8356
 URL: https://issues.apache.org/jira/browse/SPARK-8356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical
  Labels: starter

 Right now we have two functions {{callUDF}} and {{callUdf}}.  I think the 
 former is used for calling Java functions (and the documentation is wrong) 
 and the latter is for calling functions by name.  Either way this is 
 confusing and we should unify or pick different names.  Also, let's make sure 
 the docs are right.






[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf

2015-06-17 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590478#comment-14590478
 ] 

Benjamin Fradet commented on SPARK-8356:


[~marmbrus] Are we sure {{callUDF}} is used for calling Java functions?

 Reconcile callUDF and callUdf
 -

 Key: SPARK-8356
 URL: https://issues.apache.org/jira/browse/SPARK-8356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical
  Labels: starter

 Right now we have two functions {{callUDF}} and {{callUdf}}.  I think the 
 former is used for calling Java functions (and the documentation is wrong) 
 and the latter is for calling functions by name.  Either way this is 
 confusing and we should unify or pick different names.  Also, let's make sure 
 the docs are right.






[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf

2015-06-17 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590521#comment-14590521
 ] 

Benjamin Fradet commented on SPARK-8356:


Ok, thanks a lot for your pointers.

 Reconcile callUDF and callUdf
 -

 Key: SPARK-8356
 URL: https://issues.apache.org/jira/browse/SPARK-8356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical
  Labels: starter

 Right now we have two functions {{callUDF}} and {{callUdf}}.  I think the 
 former is used for calling Java functions (and the documentation is wrong) 
 and the latter is for calling functions by name.  Either way this is 
 confusing and we should unify or pick different names.  Also, let's make sure 
 the docs are right.






[jira] [Created] (SPARK-8399) Overlap between histograms and axis' name in Spark Streaming UI

2015-06-16 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-8399:
--

 Summary: Overlap between histograms and axis' name in Spark 
Streaming UI
 Key: SPARK-8399
 URL: https://issues.apache.org/jira/browse/SPARK-8399
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 1.4.0
Reporter: Benjamin Fradet
Priority: Minor


If you have a histogram skewed towards the maximum of the displayed values, as 
is the case, for example, with the number of messages processed per 
batchInterval with the Kafka direct API (since it's a constant), the histogram 
will overlap with the name of the X axis (#batches).

Unfortunately, I don't have any screenshots available.






[jira] [Commented] (SPARK-8399) Overlap between histograms and axis' name in Spark Streaming UI

2015-06-16 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588358#comment-14588358
 ] 

Benjamin Fradet commented on SPARK-8399:


I'll submit a patch shortly.

 Overlap between histograms and axis' name in Spark Streaming UI
 ---

 Key: SPARK-8399
 URL: https://issues.apache.org/jira/browse/SPARK-8399
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 1.4.0
Reporter: Benjamin Fradet
Priority: Minor

 If you have a histogram skewed towards the maximum of the displayed values, 
 as is the case, for example, with the number of messages processed per 
 batchInterval with the Kafka direct API (since it's a constant), the histogram 
 will overlap with the name of the X axis (#batches).
 Unfortunately, I don't have any screenshots available.






[jira] [Commented] (SPARK-8356) Reconcile callUDF and callUdf

2015-06-16 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588579#comment-14588579
 ] 

Benjamin Fradet commented on SPARK-8356:


I've started working on this issue.

 Reconcile callUDF and callUdf
 -

 Key: SPARK-8356
 URL: https://issues.apache.org/jira/browse/SPARK-8356
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical
  Labels: starter

 Right now we have two functions {{callUDF}} and {{callUdf}}.  I think the 
 former is used for calling Java functions (and the documentation is wrong) 
 and the latter is for calling functions by name.  Either way this is 
 confusing and we should unify or pick different names.  Also, let's make sure 
 the docs are right.






[jira] [Created] (SPARK-7255) spark.streaming.kafka.maxRetries not documented

2015-04-29 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-7255:
--

 Summary: spark.streaming.kafka.maxRetries not documented
 Key: SPARK-7255
 URL: https://issues.apache.org/jira/browse/SPARK-7255
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Streaming
Affects Versions: 1.3.1
Reporter: Benjamin Fradet
Priority: Minor
 Fix For: 1.4.0


I noticed that 
[spark.streaming.kafka.maxRetries|https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L66]
 was not documented in the [configuration 
page|http://spark.apache.org/docs/latest/configuration.html#spark-streaming].

Is this on purpose?
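For reference, a hypothetical sketch of how the property could be set once documented; the property name comes from the linked DirectKafkaInputDStream source, where the default appears to be 1, so treat the value below as an assumption, not an official recommendation:

```properties
# Hypothetical spark-defaults.conf entry (name taken from the linked source;
# the value 3 is only an example)
spark.streaming.kafka.maxRetries   3
```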






[jira] [Commented] (SPARK-7255) spark.streaming.kafka.maxRetries not documented

2015-04-29 Thread Benjamin Fradet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520266#comment-14520266
 ] 

Benjamin Fradet commented on SPARK-7255:


Otherwise, I'd be glad to add it to the docs.

 spark.streaming.kafka.maxRetries not documented
 ---

 Key: SPARK-7255
 URL: https://issues.apache.org/jira/browse/SPARK-7255
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Streaming
Affects Versions: 1.3.1
Reporter: Benjamin Fradet
Priority: Minor
 Fix For: 1.4.0


 I noticed that 
 [spark.streaming.kafka.maxRetries|https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala#L66]
  was not documented in the [configuration 
 page|http://spark.apache.org/docs/latest/configuration.html#spark-streaming].
 Is this on purpose?


