[jira] [Commented] (SPARK-11215) Add multiple columns support to StringIndexer

2018-08-24 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591667#comment-16591667 ] Barry Becker commented on SPARK-11215: -- Is the main motivation for this feature performance? Can

[jira] [Commented] (SPARK-9610) Class and instance weighting for ML

2018-08-15 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581253#comment-16581253 ] Barry Becker commented on SPARK-9610: - All ML models should support having an optional weighting

[jira] [Commented] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's

2018-08-03 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568718#comment-16568718 ] Barry Becker commented on SPARK-21986: -- Here are a couple more test cases that show the problem:

[jira] [Created] (SPARK-24394) Nodes in decision tree sometimes have negative impurity values

2018-05-25 Thread Barry Becker (JIRA)
Barry Becker created SPARK-24394: Summary: Nodes in decision tree sometimes have negative impurity values Key: SPARK-24394 URL: https://issues.apache.org/jira/browse/SPARK-24394 Project: Spark

[jira] [Commented] (SPARK-24019) AnalysisException for Window function expression to compute derivative

2018-04-19 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444202#comment-16444202 ] Barry Becker commented on SPARK-24019: -- Lowering to minor because I found a way to specify the

[jira] [Comment Edited] (SPARK-24019) AnalysisException for Window function expression to compute derivative

2018-04-19 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444202#comment-16444202 ] Barry Becker edited comment on SPARK-24019 at 4/19/18 3:07 PM: --- Lowering to

[jira] [Updated] (SPARK-24019) AnalysisException for Window function expression to compute derivative

2018-04-19 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-24019: - Priority: Minor (was: Major) > AnalysisException for Window function expression to compute

[jira] [Created] (SPARK-24019) AnalysisException for Window function expression to compute derivative

2018-04-18 Thread Barry Becker (JIRA)
Barry Becker created SPARK-24019: Summary: AnalysisException for Window function expression to compute derivative Key: SPARK-24019 URL: https://issues.apache.org/jira/browse/SPARK-24019 Project:

[jira] [Created] (SPARK-23824) Make inpurityStats publicly accessible in ml.tree.Node

2018-03-29 Thread Barry Becker (JIRA)
Barry Becker created SPARK-23824: Summary: Make inpurityStats publicly accessible in ml.tree.Node Key: SPARK-23824 URL: https://issues.apache.org/jira/browse/SPARK-23824 Project: Spark Issue

[jira] [Commented] (SPARK-6162) Handle missing values in GBM

2018-03-27 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415600#comment-16415600 ] Barry Becker commented on SPARK-6162: - If we all agree that this is something that would be very nice to

[jira] [Commented] (SPARK-8529) Set metadata for MinMaxScaler

2018-02-19 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369194#comment-16369194 ] Barry Becker commented on SPARK-8529: - Complementing the output metadata in what way? Need more info

[jira] [Comment Edited] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-11-07 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064972#comment-16064972 ] Barry Becker edited comment on SPARK-20226 at 11/7/17 6:09 PM: --- Calling

[jira] [Commented] (SPARK-9610) Class and instance weighting for ML

2017-10-25 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16219005#comment-16219005 ] Barry Becker commented on SPARK-9610: - Frequent item sets (associations) could use it too. > Class

[jira] [Commented] (SPARK-7276) withColumn is very slow on dataframe with large number of columns

2017-09-15 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167814#comment-16167814 ] Barry Becker commented on SPARK-7276: - Isn't there still a problem with withColumn performance in

[jira] [Commented] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's

2017-09-12 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16163768#comment-16163768 ] Barry Becker commented on SPARK-21986: -- But wait, the dataset I discovered the problem with was not

[jira] [Created] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's

2017-09-12 Thread Barry Becker (JIRA)
Barry Becker created SPARK-21986: Summary: QuantileDiscretizer picks wrong split point for data with lots of 0's Key: SPARK-21986 URL: https://issues.apache.org/jira/browse/SPARK-21986 Project: Spark

[jira] [Commented] (SPARK-14155) Hide UserDefinedType in Spark 2.0

2017-09-06 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155790#comment-16155790 ] Barry Becker commented on SPARK-14155: -- Does it work with datasets now in 2.1? > Hide

[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-06-27 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064972#comment-16064972 ] Barry Becker commented on SPARK-20226: -- Calling cache() on the dataframe after the addColumn

[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2017-05-21 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16018840#comment-16018840 ] Barry Becker commented on SPARK-16845: -- I checked out the v2.1.1 tag of spark from github, but

[jira] [Commented] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once

2017-05-10 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004688#comment-16004688 ] Barry Becker commented on SPARK-20542: -- @viirya, your implementation of MultipleBucketizer relies on

[jira] [Commented] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once

2017-05-09 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002609#comment-16002609 ] Barry Becker commented on SPARK-20542: -- This is a great improvement, @viirya! According to your

[jira] [Commented] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV

2017-05-09 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002582#comment-16002582 ] Barry Becker commented on SPARK-19581: -- I think it's just a matter of sending a feature vector of

[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-05-08 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001388#comment-16001388 ] Barry Becker commented on SPARK-13747: -- Good to hear that your workaround was successful. How did

[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-05-08 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001303#comment-16001303 ] Barry Becker commented on SPARK-13747: -- @saif1988, just to clarify, did you add the following?

[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-05-08 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001186#comment-16001186 ] Barry Becker commented on SPARK-13747: -- I also tried the "thread-pool-executor" workaround suggested

[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-05-08 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000830#comment-16000830 ] Barry Becker commented on SPARK-13747: -- There seems to be some related discussion here

[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-27 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987706#comment-15987706 ] Barry Becker commented on SPARK-20392: -- Thanks for working on a fix. Do you have any idea which

[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-24 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-20392: - Attachment: model_9756.zip blockbuster_fewCols.csv attaching

[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-24 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981386#comment-15981386 ] Barry Becker commented on SPARK-20392: -- [~viirya] that is correct. If I reduce the dataset to just

[jira] [Comment Edited] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-21 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979049#comment-15979049 ] Barry Becker edited comment on SPARK-20392 at 4/21/17 4:49 PM: --- Yes

[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-21 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-20392: - Attachment: model_9754.zip Attaching the parquet pipeline (as zip). > Slow performance when

[jira] [Comment Edited] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-21 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979049#comment-15979049 ] Barry Becker edited comment on SPARK-20392 at 4/21/17 4:46 PM: --- Yes

[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-21 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979049#comment-15979049 ] Barry Becker commented on SPARK-20392: -- Yes [~kiszk], I was able to create a simple program that

[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-19 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-20392: - Attachment: giant_query_plan_for_fitting_pipeline.txt Giant nested query plan using when calling

[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-19 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-20392: - Attachment: blockbuster.csv Attaching blockbuster.csv data file with many columns, but few rows.

[jira] [Created] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-19 Thread Barry Becker (JIRA)
Barry Becker created SPARK-20392: Summary: Slow performance when calling fit on ML pipeline for dataset with many columns but few rows Key: SPARK-20392 URL: https://issues.apache.org/jira/browse/SPARK-20392

[jira] [Commented] (SPARK-6509) MDLP discretizer

2017-04-18 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972744#comment-15972744 ] Barry Becker commented on SPARK-6509: - As further proof of relevance, I will be giving a

[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-07 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960868#comment-15960868 ] Barry Becker commented on SPARK-20226: -- Only 11 columns. I did not want to wait for 10 or 20 minutes

[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-07 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960806#comment-15960806 ] Barry Becker commented on SPARK-20226: -- OK, I set the flag using

[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-06 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959134#comment-15959134 ] Barry Becker commented on SPARK-20226: -- Yes. We are running through spark job-server, and local.conf

[jira] [Comment Edited] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-06 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959024#comment-15959024 ] Barry Becker edited comment on SPARK-20226 at 4/6/17 2:45 PM: -- I set

[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-06 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959024#comment-15959024 ] Barry Becker commented on SPARK-20226: -- I set spark.sql.constraintPropagation.enabled to false in

[jira] [Updated] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-05 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-20226: - Attachment: profile_indexer2.PNG A snapshot of the hotspot sampler from JVisualVM while

[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-05 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15957732#comment-15957732 ] Barry Becker commented on SPARK-20226: -- I did some profiling using the sampler in JVisualVM and took

[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-05 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15957489#comment-15957489 ] Barry Becker commented on SPARK-20226: -- I thought the problem was in the cacheTable call because

[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-05 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15957457#comment-15957457 ] Barry Becker commented on SPARK-20226: -- It seems like it has to do with the interaction between the

[jira] [Updated] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-05 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-20226: - Labels: cache (was: ) > Call to sqlContext.cacheTable takes an incredibly long time in some

[jira] [Comment Edited] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-05 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15957296#comment-15957296 ] Barry Becker edited comment on SPARK-20226 at 4/5/17 5:36 PM: -- We noticed

[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-05 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15957296#comment-15957296 ] Barry Becker commented on SPARK-20226: -- We noticed that this is reproducible just by adding a new

[jira] [Updated] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-05 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-20226: - Attachment: xyzzy.csv Attaching the datafile, but I don't think it is significant. This problem

[jira] [Updated] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-05 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-20226: - Description: I have a case where the call to sqlContext.cacheTable can take an arbitrarily long

[jira] [Created] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-05 Thread Barry Becker (JIRA)
Barry Becker created SPARK-20226: Summary: Call to sqlContext.cacheTable takes an incredibly long time in some cases Key: SPARK-20226 URL: https://issues.apache.org/jira/browse/SPARK-20226 Project:

[jira] [Commented] (SPARK-20071) StringIndexer overflows Kryo serialization buffer when run on column with many long distinct values

2017-03-23 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938475#comment-15938475 ] Barry Becker commented on SPARK-20071: -- Yes. I agree. I wanted to report the issue, but wasn't sure

[jira] [Created] (SPARK-20071) StringIndexer overflows Kryo serialization buffer when run on column with many long distinct values

2017-03-23 Thread Barry Becker (JIRA)
Barry Becker created SPARK-20071: Summary: StringIndexer overflows Kryo serialization buffer when run on column with many long distinct values Key: SPARK-20071 URL:

[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-03-22 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936997#comment-15936997 ] Barry Becker commented on SPARK-13747: -- We have hit this on rare instances in our production

[jira] [Created] (SPARK-19699) createOrReplaceTable does not always replace an existing table of the same name

2017-02-22 Thread Barry Becker (JIRA)
Barry Becker created SPARK-19699: Summary: createOrReplaceTable does not always replace an existing table of the same name Key: SPARK-19699 URL: https://issues.apache.org/jira/browse/SPARK-19699

[jira] [Commented] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV

2017-02-13 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863819#comment-15863819 ] Barry Becker commented on SPARK-19581: -- I agree with minor prioritization, since there is an easy

[jira] [Created] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV

2017-02-13 Thread Barry Becker (JIRA)
Barry Becker created SPARK-19581: Summary: running NaiveBayes model with 0 features can crash the executor with D rorreGEMV Key: SPARK-19581 URL: https://issues.apache.org/jira/browse/SPARK-19581

[jira] [Commented] (SPARK-4049) Storage web UI "fraction cached" shows as > 100%

2017-01-25 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838617#comment-15838617 ] Barry Becker commented on SPARK-4049: - I read the comments, but I'm still not really sure what over

[jira] [Commented] (SPARK-19317) UnsupportedOperationException: empty.reduceLeft in LinearSeqOptimized

2017-01-23 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15834790#comment-15834790 ] Barry Becker commented on SPARK-19317: -- I figured out a workaround for this problem. The problem was

[jira] [Updated] (SPARK-19317) UnsupportedOperationException: empty.reduceLeft in LinearSeqOptimized

2017-01-23 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-19317: - Priority: Minor (was: Major) > UnsupportedOperationException: empty.reduceLeft in

[jira] [Updated] (SPARK-19317) UnsupportedOperationException: empty.reduceLeft in LinearSeqOptimized

2017-01-23 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-19317: - Description: I wish I had more of a simple reproducible case to give, but I got the below

[jira] [Commented] (SPARK-19317) UnsupportedOperationException: empty.reduceLeft in LinearSeqOptimized

2017-01-23 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15834700#comment-15834700 ] Barry Becker commented on SPARK-19317: -- As far as I can tell, this only occurs when filtering for

[jira] [Updated] (SPARK-19317) UnsupportedOperationException: empty.reduceLeft in LinearSeqOptimized

2017-01-23 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-19317: - Description: I wish I had more of a simple reproducible case to give, but I got the below

[jira] [Created] (SPARK-19317) UnsupportedOperationException: empty.reduceLeft in LinearSeqOptimized

2017-01-20 Thread Barry Becker (JIRA)
Barry Becker created SPARK-19317: Summary: UnsupportedOperationException: empty.reduceLeft in LinearSeqOptimized Key: SPARK-19317 URL: https://issues.apache.org/jira/browse/SPARK-19317 Project: Spark

[jira] [Created] (SPARK-19245) Cannot build spark-assembly jar

2017-01-16 Thread Barry Becker (JIRA)
Barry Becker created SPARK-19245: Summary: Cannot build spark-assembly jar Key: SPARK-19245 URL: https://issues.apache.org/jira/browse/SPARK-19245 Project: Spark Issue Type: Documentation

[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-12-20 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762805#comment-15762805 ] Barry Becker edited comment on SPARK-16845 at 12/20/16 9:24 PM: I found a

[jira] [Commented] (SPARK-11293) ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods

2016-12-20 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765096#comment-15765096 ] Barry Becker commented on SPARK-11293: -- Not sure if this is related, but I am running on spark 2.0.2

[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-12-19 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762805#comment-15762805 ] Barry Becker commented on SPARK-16845: -- I found a workaround that allows me to avoid the 64 KB

[jira] [Commented] (SPARK-11215) Add multiple columns support to StringIndexer

2016-12-05 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722696#comment-15722696 ] Barry Becker commented on SPARK-11215: -- This would be a good feature. It might be nice to add an

[jira] [Commented] (SPARK-18502) Spark does not handle columns that contain backquote (`)

2016-11-29 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706389#comment-15706389 ] Barry Becker commented on SPARK-18502: -- Is there a way to escape the backtick when it appears in a

[jira] [Comment Edited] (SPARK-13913) DataFrame.withColumn fails when trying to replace existing column with dot in name

2016-11-18 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677144#comment-15677144 ] Barry Becker edited comment on SPARK-13913 at 11/18/16 5:02 PM: I can

[jira] [Commented] (SPARK-13913) DataFrame.withColumn fails when trying to replace existing column with dot in name

2016-11-18 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677144#comment-15677144 ] Barry Becker commented on SPARK-13913: -- I can still reproduce this using spark 1.6.3. My dataframe

[jira] [Created] (SPARK-18502) Spark does not handle columns that contain backquote (`)

2016-11-18 Thread Barry Becker (JIRA)
Barry Becker created SPARK-18502: Summary: Spark does not handle columns that contain backquote (`) Key: SPARK-18502 URL: https://issues.apache.org/jira/browse/SPARK-18502 Project: Spark

[jira] [Commented] (SPARK-11977) Support accessing a DataFrame column using its name without backticks if the name contains '.'

2016-11-18 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15676856#comment-15676856 ] Barry Becker commented on SPARK-11977: -- I would also like to know how to handle columns that contain

[jira] [Commented] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

2016-11-17 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674129#comment-15674129 ] Barry Becker commented on SPARK-12965: -- This is a big issue for us because we don't control the

[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-11-14 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15664866#comment-15664866 ] Barry Becker commented on SPARK-16845: -- I am encountering a similar exception in spark 1.6.3 when

[jira] [Commented] (SPARK-14138) Generated SpecificColumnarIterator code can exceed JVM size limit for cached DataFrames

2016-11-14 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15664842#comment-15664842 ] Barry Becker commented on SPARK-14138: -- I am using spark 1.6.3 on a DataFrame with 204 columns.

[jira] [Commented] (SPARK-18181) Huge managed memory leak (2.7G) when running reduceByKey

2016-10-31 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623018#comment-15623018 ] Barry Becker commented on SPARK-18181: -- For this case to leak a lot of memory, I bin the numeric

[jira] [Created] (SPARK-18181) Huge managed memory leak (2.7G) when running reduceByKey

2016-10-31 Thread Barry Becker (JIRA)
Barry Becker created SPARK-18181: Summary: Huge managed memory leak (2.7G) when running reduceByKey Key: SPARK-18181 URL: https://issues.apache.org/jira/browse/SPARK-18181 Project: Spark

[jira] [Commented] (SPARK-14363) Executor OOM due to a memory leak in Sorter

2016-10-31 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622975#comment-15622975 ] Barry Becker commented on SPARK-14363: -- I am hitting this issue in 1.6.2. In fact, I can make a case

[jira] [Commented] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type

2016-10-22 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15597816#comment-15597816 ] Barry Becker commented on SPARK-18054: -- Ah. That is quite likely the problem. I will verify next

[jira] [Commented] (SPARK-16216) CSV data source does not write date and timestamp correctly

2016-10-21 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15596382#comment-15596382 ] Barry Becker commented on SPARK-16216: -- Yes, that worked. Thanks for the workaround! If I use

[jira] [Created] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type

2016-10-21 Thread Barry Becker (JIRA)
Barry Becker created SPARK-18054: Summary: Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type Key: SPARK-18054 URL:

[jira] [Commented] (SPARK-16216) CSV data source does not write date and timestamp correctly

2016-10-21 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15595619#comment-15595619 ] Barry Becker commented on SPARK-16216: -- If timezone is not specified, the date should be interpreted

[jira] [Comment Edited] (SPARK-16216) CSV data source does not write date and timestamp correctly

2016-10-21 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15595619#comment-15595619 ] Barry Becker edited comment on SPARK-16216 at 10/21/16 4:41 PM: If

[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-10-07 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1968#comment-1968 ] Barry Becker commented on SPARK-17219: -- I'll make another attempt to clarify my use case. Nulls are

[jira] [Commented] (SPARK-14234) Executor crashes for TaskRunner thread interruption

2016-08-31 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453414#comment-15453414 ] Barry Becker commented on SPARK-14234: -- Is it a lot of work to backport this fix to 1.6.3? We have an

[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-25 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436819#comment-15436819 ] Barry Becker commented on SPARK-17219: -- In my opinion, yes. It is something that applies to all

[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-25 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436767#comment-15436767 ] Barry Becker commented on SPARK-17219: -- If you support the different strategies as R does, please

[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-24 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435651#comment-15435651 ] Barry Becker commented on SPARK-17219: -- If the decision is to have an additional null/NaN bucket,

[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-24 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435484#comment-15435484 ] Barry Becker commented on SPARK-17219: -- Nulls were not accepted in the column. I had to change them

[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-24 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435403#comment-15435403 ] Barry Becker commented on SPARK-17219: -- There needs to be some way to handle null values when

[jira] [Updated] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-24 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-17086: - Attachment: titanic.csv > QuantileDiscretizer throws InvalidArgumentException (parameter splits

[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-24 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435236#comment-15435236 ] Barry Becker commented on SPARK-17086: -- Thanks. BTW, I hope there are some test cases where the

[jira] [Comment Edited] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-24 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434805#comment-15434805 ] Barry Becker edited comment on SPARK-17086 at 8/24/16 12:18 PM: Is it

[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-24 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434805#comment-15434805 ] Barry Becker commented on SPARK-17086: -- Is it possible to get this fix into 2.0.1? Maybe it would be

[jira] [Commented] (SPARK-6509) MDLP discretizer

2016-08-22 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431341#comment-15431341 ] Barry Becker commented on SPARK-6509: - I may have missed the reasoning somewhere, but why was this

[jira] [Commented] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-18 Thread Barry Becker (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426549#comment-15426549 ] Barry Becker commented on SPARK-17086: -- I think I agree with the discussion. Here is a summary of

[jira] [Created] (SPARK-17086) QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data

2016-08-16 Thread Barry Becker (JIRA)
Barry Becker created SPARK-17086: Summary: QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data Key: SPARK-17086 URL:
