[jira] [Assigned] (SPARK-14374) PySpark ml GBTClassifier, Regressor support export/import

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14374:


Assignee: (was: Apache Spark)

> PySpark ml GBTClassifier, Regressor support export/import
> -
>
> Key: SPARK-14374
> URL: https://issues.apache.org/jira/browse/SPARK-14374
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14374) PySpark ml GBTClassifier, Regressor support export/import

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14374:


Assignee: Apache Spark

> PySpark ml GBTClassifier, Regressor support export/import
> -
>
> Key: SPARK-14374
> URL: https://issues.apache.org/jira/browse/SPARK-14374
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14374) PySpark ml GBTClassifier, Regressor support export/import

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240652#comment-15240652
 ] 

Apache Spark commented on SPARK-14374:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12383

> PySpark ml GBTClassifier, Regressor support export/import
> -
>
> Key: SPARK-14374
> URL: https://issues.apache.org/jira/browse/SPARK-14374
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14463) read.text broken for partitioned tables

2016-04-13 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240640#comment-15240640
 ] 

Cheng Lian commented on SPARK-14463:


Should we simply throw an exception when the text data source is used together 
with partitioning?
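
In the meantime, one possible workaround (a hedged, untested sketch): go through 
the generic load path, which returns a plain DataFrame and sidesteps the 
{{Dataset\[String]}} encoder check.

{code}
// Hypothetical workaround sketch: load the text source as a generic DataFrame
// instead of the strongly typed Dataset[String], so the discovered partition
// column "a" no longer breaks the encoder validation.
val df = sqlContext.read.format("text").load("/home/michael/text-part-bug")
df.printSchema()               // expected: value: string, a: int
df.select("value", "a").show()
{code}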

> read.text broken for partitioned tables
> ---
>
> Key: SPARK-14463
> URL: https://issues.apache.org/jira/browse/SPARK-14463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Critical
>
> Strongly typing the return values of {{read.text}} as {{Dataset\[String]}} 
> breaks when trying to load a partitioned table (or any table where the path 
> looks partitioned).
> {code}
> Seq((1, "test"))
>   .toDF("a", "b")
>   .write
>   .format("text")
>   .partitionBy("a")
>   .save("/home/michael/text-part-bug")
> sqlContext.read.text("/home/michael/text-part-bug")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: Try to map struct 
> to Tuple1, but failed as the number of fields does not line up.
>  - Input schema: struct
>  - Target schema: struct;
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.org$apache$spark$sql$catalyst$encoders$ExpressionEncoder$$fail$1(ExpressionEncoder.scala:265)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.validate(ExpressionEncoder.scala:279)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:197)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:168)
>   at org.apache.spark.sql.Dataset$.apply(Dataset.scala:57)
>   at org.apache.spark.sql.Dataset.as(Dataset.scala:357)
>   at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:450)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14624) Error at the end of installing Spark 1.6.1 using Spark-ec2 script

2016-04-13 Thread Mohaed Alibrahim (JIRA)
Mohaed Alibrahim created SPARK-14624:


 Summary: Error at the end of installing Spark 1.6.1 using 
Spark-ec2 script
 Key: SPARK-14624
 URL: https://issues.apache.org/jira/browse/SPARK-14624
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Mohaed Alibrahim


I installed Spark 1.6.1 on Amazon EC2 using the spark-ec2 script. Everything was 
OK, but it failed to start httpd at the end of the installation. I followed the 
instructions exactly and repeated the process many times, with no luck.

-
[timing] rstudio setup:  00h 00m 00s
Setting up ganglia
RSYNC'ing /etc/ganglia to 
slaves...ec.us-west-2.compute.amazonaws.com
Shutting down GANGLIA gmond:   [FAILED]
Starting GANGLIA gmond:[  OK  ]
Shutting down GANGLIA gmond:   [FAILED]
Starting GANGLIA gmond:[  OK  ]
Connection to ec2-.us-west-2.compute.amazonaws.com 
closed.
Shutting down GANGLIA gmetad:  [FAILED]
Starting GANGLIA gmetad:   [  OK  ]
Stopping httpd:[FAILED]
Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: 
Cannot load /etc/httpd/modules/mod_authz_core.so into server: 
/etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No such 
file or directory
   [FAILED]
[timing] ganglia setup:  00h 00m 01s
Connection to ec2-.us-west-2.compute.amazonaws.com closed.
Spark standalone cluster started at 
http://ec2-...us-west-2.compute.amazonaws.com:8080
Ganglia started at 
http://ec2-.us-west-2.compute.amazonaws.com:5080/ganglia
Done!
--

httpd.conf:

line 154:
 
LoadModule authz_core_module modules/mod_authz_core.so

So, if I comment this line out, further errors show up for the following lines:

LoadModule unixd_module modules/mod_unixd.so
LoadModule access_compat_module modules/mod_access_compat.so
LoadModule mpm_prefork_module modules/mod_mpm_prefork.so
LoadModule php5_module modules/libphp-5.6.so

---



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14507) Decide if we should still support CREATE EXTERNAL TABLE AS SELECT

2016-04-13 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240616#comment-15240616
 ] 

Yan commented on SPARK-14507:
-

In terms of Hive support vs. Spark SQL support, the "external table" concept in 
Spark SQL seems to go beyond Hive's, and not just for CTAS. In Hive, an 
"external table" exists only for the "schema-on-read" scenario over data on, 
say, HDFS; it has fairly unique DDL semantics and security features that differ 
from normal SQL databases. A Spark SQL external table, as far as I understand, 
could also be a mapping to a data source table. I'm not sure whether this 
mapping needs the same special consideration of DDL semantics and security 
models as Hive external tables. 

> Decide if we should still support CREATE EXTERNAL TABLE AS SELECT
> -
>
> Key: SPARK-14507
> URL: https://issues.apache.org/jira/browse/SPARK-14507
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> It looks like we support CREATE EXTERNAL TABLE AS SELECT by accident. Should 
> we still support it? Hive does not seem to support it, while, based on the 
> Impala docs, Impala does. Right now, the CreateTables rule in 
> HiveMetastoreCatalog.scala does not respect the EXTERNAL keyword when 
> {{hive.convertCTAS}} is true and the CTAS query does not provide any storage 
> format. In that case, the table becomes a MANAGED_TABLE stored in the default 
> metastore location (not the user-specified location). 
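
For concreteness, the shape of the statement in question (a sketch only; the 
table name and location are made up for illustration):

{code}
// Accidentally supported today: CREATE EXTERNAL TABLE ... AS SELECT.
// With hive.convertCTAS=true and no storage format, the EXTERNAL keyword
// and LOCATION are reportedly ignored and a managed table is created.
sqlContext.sql(
  """CREATE EXTERNAL TABLE ext_copy
    |LOCATION '/tmp/ext_copy'
    |AS SELECT * FROM src
  """.stripMargin)
{code}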



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14603) SessionCatalog needs to check if a metadata operation is valid

2016-04-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240608#comment-15240608
 ] 

Xiao Li commented on SPARK-14603:
-

To verify the error messages issued from SessionCatalog and HiveSessionCatalog, 
we need to unify them. Thus, I have to unify the exceptions first.
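
A minimal sketch of the kind of check being discussed (hypothetical names, not 
the actual SessionCatalog code):

{code}
// Illustration only: validate metadata ourselves instead of trusting the
// external catalog to throw on an invalid operation.
class NoSuchTableException(db: String, table: String)
  extends Exception(s"Table or view '$table' not found in database '$db'")

def requireTableExists(db: String, table: String)(exists: (String, String) => Boolean): Unit = {
  if (!exists(db, table)) throw new NoSuchTableException(db, table)
}
{code}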

> SessionCatalog needs to check if a metadata operation is valid
> --
>
> Key: SPARK-14603
> URL: https://issues.apache.org/jira/browse/SPARK-14603
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> Since we cannot really trust if the underlying external catalog can throw 
> exceptions when there is an invalid metadata operation, let's do it in 
> SessionCatalog. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-13 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240607#comment-15240607
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] [~josephkb]. Yes, I think wrapping RankingMetrics could be the 
first step, and reimplementing the RankingEvaluator methods in ML using 
DataFrames would be good after that. I will work on the reimplementation in 
several follow-up PRs.
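
A minimal sketch of that first step (a hypothetical helper, not the eventual 
{{ml.evaluation}} API), delegating to the existing 
{{mllib.evaluation.RankingMetrics}}:

{code}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.rdd.RDD

// Wrap RankingMetrics behind a single metric-name entry point, the way an
// ml Evaluator would. Item ids are Ints here purely for illustration.
def rankingMetric(predictionAndLabels: RDD[(Array[Int], Array[Int])],
                  metricName: String, k: Int = 10): Double = {
  val metrics = new RankingMetrics(predictionAndLabels)
  metricName match {
    case "map"          => metrics.meanAveragePrecision
    case "ndcgAtK"      => metrics.ndcgAt(k)
    case "precisionAtK" => metrics.precisionAt(k)
    case other          => throw new IllegalArgumentException(s"Unknown metric $other")
  }
}
{code}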

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3

2016-04-13 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240601#comment-15240601
 ] 

Yanbo Liang commented on SPARK-10574:
-

Sure, I will send a PR in a few days. Thanks!

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Yanbo Liang
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.
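
A quick REPL sketch of the comparison (both functions ship with the standard 
Scala library, so switching adds no dependency):

{code}
import scala.util.hashing.MurmurHash3

val s = "a reasonably long feature string"
// Per the Scala docs for Any, ## is platform specific for arbitrary values,
// which is the portability concern raised in the description.
val native = s.##
// MurmurHash3 gives a well-distributed, explicitly seeded alternative.
val murmur = MurmurHash3.stringHash(s)
println(s"native=$native murmur=$murmur")
{code}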



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14622) Retain lost executors status

2016-04-13 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240550#comment-15240550
 ] 

hujiayin commented on SPARK-14622:
--

I think it would also be better to show the number of lost executors; clicking 
the number would then reveal the detailed information.

> Retain lost executors status
> 
>
> Key: SPARK-14622
> URL: https://issues.apache.org/jira/browse/SPARK-14622
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Qingyang Hong
>Priority: Minor
> Fix For: 1.6.0
>
>
> In the 'Executors' dashboard, it is necessary to maintain a list of those 
> executors that have been lost. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14623) add label binarizer

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14623:


Assignee: (was: Apache Spark)

> add label binarizer 
> 
>
> Key: SPARK-14623
> URL: https://issues.apache.org/jira/browse/SPARK-14623
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: hujiayin
>Priority: Minor
> Fix For: 2.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It relates to https://issues.apache.org/jira/browse/SPARK-7445
> Map the labels to 0/1. 
> For example,
> Input:
> "yellow,green,red,green,0"
> The labels: "0, green, red, yellow"
> Output:
> 0, 0, 0, 0, 1, 
> 0, 1, 0, 1, 0, 
> 0, 0, 1, 0, 0, 
> 1, 0, 0, 0, 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14623) add label binarizer

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240543#comment-15240543
 ] 

Apache Spark commented on SPARK-14623:
--

User 'hujy' has created a pull request for this issue:
https://github.com/apache/spark/pull/12380

> add label binarizer 
> 
>
> Key: SPARK-14623
> URL: https://issues.apache.org/jira/browse/SPARK-14623
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: hujiayin
>Priority: Minor
> Fix For: 2.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It relates to https://issues.apache.org/jira/browse/SPARK-7445
> Map the labels to 0/1. 
> For example,
> Input:
> "yellow,green,red,green,0"
> The labels: "0, green, red, yellow"
> Output:
> 0, 0, 0, 0, 1, 
> 0, 1, 0, 1, 0, 
> 0, 0, 1, 0, 0, 
> 1, 0, 0, 0, 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14623) add label binarizer

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14623:


Assignee: Apache Spark

> add label binarizer 
> 
>
> Key: SPARK-14623
> URL: https://issues.apache.org/jira/browse/SPARK-14623
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: hujiayin
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It relates to https://issues.apache.org/jira/browse/SPARK-7445
> Map the labels to 0/1. 
> For example,
> Input:
> "yellow,green,red,green,0"
> The labels: "0, green, red, yellow"
> Output:
> 0, 0, 0, 0, 1, 
> 0, 1, 0, 1, 0, 
> 0, 0, 1, 0, 0, 
> 1, 0, 0, 0, 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14623) add label binarizer

2016-04-13 Thread hujiayin (JIRA)
hujiayin created SPARK-14623:


 Summary: add label binarizer 
 Key: SPARK-14623
 URL: https://issues.apache.org/jira/browse/SPARK-14623
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.6.1
Reporter: hujiayin
Priority: Minor
 Fix For: 2.0.0


It relates to https://issues.apache.org/jira/browse/SPARK-7445

Map the labels to 0/1. 
For example,
Input:
"yellow,green,red,green,0"
The labels: "0, green, red, yellow"
Output:
0, 0, 0, 0, 1, 
0, 1, 0, 1, 0, 
0, 0, 1, 0, 0, 
1, 0, 0, 0, 0
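
Read literally, the example output is one indicator row per label (sorted label 
set) over the five input positions. A plain-Scala sketch, not the proposed API, 
that reproduces it:

{code}
val input  = Seq("yellow", "green", "red", "green", "0")
val labels = input.distinct.sorted   // "0", "green", "red", "yellow"

// One 0/1 row per label, one column per input position.
labels.foreach { c =>
  println(input.map(l => if (l == c) 1 else 0).mkString(", "))
}
{code}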






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14622) Retain lost executors status

2016-04-13 Thread Qingyang Hong (JIRA)
Qingyang Hong created SPARK-14622:
-

 Summary: Retain lost executors status
 Key: SPARK-14622
 URL: https://issues.apache.org/jira/browse/SPARK-14622
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.6.0
Reporter: Qingyang Hong
Priority: Minor
 Fix For: 1.6.0


In the 'Executors' dashboard, it is necessary to maintain a list of those 
executors that have been lost. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-7445) StringIndexer should handle binary labels properly

2016-04-13 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-7445:

Comment: was deleted

(was: If no one works on it, I'd like to submit code for this issue.)

> StringIndexer should handle binary labels properly
> --
>
> Key: SPARK-7445
> URL: https://issues.apache.org/jira/browse/SPARK-7445
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> StringIndexer orders labels by their counts. However, for binary labels, we 
> should really map negatives to 0 and positives to 1. So we can add special 
> rules for binary labels:
> 1. "+1"/"-1", "1"/"-1", "1"/"0"
> 2. "yes"/"no"
> 3. "true"/"false"
> Another option is to allow users to provide a list of labels and we use that 
> ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14609) LOAD DATA

2016-04-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240509#comment-15240509
 ] 

Xiao Li commented on SPARK-14609:
-

It is not hard, but we need to handle partitions and a few options.  
{noformat}
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION 
(partcol1=val1, partcol2=val2 ...)]
{noformat}

I can take it. Thanks!
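
A hedged usage sketch of what the command would look like once implemented (the 
paths and table names are made up):

{code}
sqlContext.sql("LOAD DATA LOCAL INPATH '/tmp/kv1.txt' OVERWRITE INTO TABLE src")
sqlContext.sql(
  "LOAD DATA INPATH '/user/hive/kv2.txt' INTO TABLE srcpart PARTITION (ds='2016-04-13')")
{code}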

> LOAD DATA
> -
>
> Key: SPARK-14609
> URL: https://issues.apache.org/jira/browse/SPARK-14609
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> The LOAD command should be pretty easy to implement. We already call Hive to 
> load data when inserting into Hive tables, so we can follow that 
> implementation. For example, we load into a Hive table in the 
> InsertIntoHiveTable command at 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L221-L225.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14621) add oracle hint optimizer

2016-04-13 Thread Qingyang Hong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qingyang Hong updated SPARK-14621:
--
  Flags: Patch
   Priority: Minor  (was: Major)
Description: The current SQL parser in Spark SQL can't recognize optimizer 
hints in a query, e.g. SELECT /*+index(o IDX_BILLORDER_SEND_UPDATE)+*/ ID, 
BILL_CODE, DATE FROM BILL_TABLE. It would be useful to add such a feature to 
improve query efficiency.
 Issue Type: Improvement  (was: Wish)
Summary: add oracle hint optimizer  (was: add)

> add oracle hint optimizer
> -
>
> Key: SPARK-14621
> URL: https://issues.apache.org/jira/browse/SPARK-14621
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Qingyang Hong
>Priority: Minor
> Fix For: 1.6.0
>
>
> The current SQL parser in Spark SQL can't recognize optimizer hints in a 
> query, e.g. SELECT /*+index(o IDX_BILLORDER_SEND_UPDATE)+*/ ID, BILL_CODE, 
> DATE FROM BILL_TABLE. It would be useful to add such a feature to improve 
> query efficiency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12133) Support dynamic allocation in Spark Streaming

2016-04-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12133.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Support dynamic allocation in Spark Streaming
> -
>
> Key: SPARK-12133
> URL: https://issues.apache.org/jira/browse/SPARK-12133
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Reporter: Andrew Or
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
> Attachments: dynamic-allocation-streaming-design.pdf
>
>
> Dynamic allocation is a feature that allows your cluster resources to scale 
> up and down based on the workload. Currently it doesn't work well with Spark 
> Streaming for several reasons:
> (1) Your executors may never be idle since they run something every N seconds
> (2) You should have at least one receiver running always
> (3) The existing heuristics don't take into account length of batch queue
> ...
> The goal of this JIRA is to provide better support for using dynamic 
> allocation in streaming. A design doc will be posted shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14621) add

2016-04-13 Thread Qingyang Hong (JIRA)
Qingyang Hong created SPARK-14621:
-

 Summary: add
 Key: SPARK-14621
 URL: https://issues.apache.org/jira/browse/SPARK-14621
 Project: Spark
  Issue Type: Wish
  Components: SQL
Affects Versions: 1.6.0
Reporter: Qingyang Hong
 Fix For: 1.6.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14592) Create table like

2016-04-13 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-14592:

Comment: was deleted

(was: I am working on this...)

> Create table like
> -
>
> Key: SPARK-14592
> URL: https://issues.apache.org/jira/browse/SPARK-14592
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14592) Create table like

2016-04-13 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-14592:

Comment: was deleted

(was: Will submit PR soon.)

> Create table like
> -
>
> Key: SPARK-14592
> URL: https://issues.apache.org/jira/browse/SPARK-14592
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12133) Support dynamic allocation in Spark Streaming

2016-04-13 Thread WilliamZhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240453#comment-15240453
 ] 

WilliamZhu commented on SPARK-12133:


Here is a new design: http://www.jianshu.com/p/ae7fdd4746f6 

> Support dynamic allocation in Spark Streaming
> -
>
> Key: SPARK-12133
> URL: https://issues.apache.org/jira/browse/SPARK-12133
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Reporter: Andrew Or
>Assignee: Tathagata Das
> Attachments: dynamic-allocation-streaming-design.pdf
>
>
> Dynamic allocation is a feature that allows your cluster resources to scale 
> up and down based on the workload. Currently it doesn't work well with Spark 
> Streaming for several reasons:
> (1) Your executors may never be idle since they run something every N seconds
> (2) You should have at least one receiver running always
> (3) The existing heuristics don't take into account length of batch queue
> ...
> The goal of this JIRA is to provide better support for using dynamic 
> allocation in streaming. A design doc will be posted shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14516) Clustering evaluator

2016-04-13 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240434#comment-15240434
 ] 

zhengruifeng edited comment on SPARK-14516 at 4/14/16 1:56 AM:
---

[~akamal] In my opinion, both supervised and unsupervised metrics should be 
added. Silhouette should be added first. I will create an online document. Thanks.


was (Author: podongfeng):
[~akamal] In my opinion, both supervised and unsupervised metrics should be 
added. And among the unsupervised metrics, silhouette should be added first. I 
will create an online document. Thanks.

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14516) Clustering evaluator

2016-04-13 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240434#comment-15240434
 ] 

zhengruifeng commented on SPARK-14516:
--

[~akamal] In my opinion, both supervised and unsupervised metrics should be 
added. And among the unsupervised metrics, silhouette should be added first. I 
will create an online document. Thanks.
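
For reference, the per-point silhouette mentioned above, as a small sketch 
(this follows the standard formulation, not any committed Spark API):

{code}
// a = mean distance from the point to the other points in its own cluster,
// b = lowest mean distance from the point to the points of any other cluster,
// s = (b - a) / max(a, b), in [-1, 1]; higher means better-separated clusters.
def silhouette(a: Double, b: Double): Double =
  if (a == 0.0 && b == 0.0) 0.0 else (b - a) / math.max(a, b)
{code}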

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14620) Use/benchmark a better hash in AggregateHashMap

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240411#comment-15240411
 ] 

Apache Spark commented on SPARK-14620:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12379

> Use/benchmark a better hash in AggregateHashMap
> ---
>
> Key: SPARK-14620
> URL: https://issues.apache.org/jira/browse/SPARK-14620
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14620) Use/benchmark a better hash in AggregateHashMap

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14620:


Assignee: (was: Apache Spark)

> Use/benchmark a better hash in AggregateHashMap
> ---
>
> Key: SPARK-14620
> URL: https://issues.apache.org/jira/browse/SPARK-14620
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14620) Use/benchmark a better hash in AggregateHashMap

2016-04-13 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-14620:
--

 Summary: Use/benchmark a better hash in AggregateHashMap
 Key: SPARK-14620
 URL: https://issues.apache.org/jira/browse/SPARK-14620
 Project: Spark
  Issue Type: Sub-task
Reporter: Sameer Agarwal






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14582) Increase the parallelism for small tables

2016-04-13 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240389#comment-15240389
 ] 

Mark Hamstra commented on SPARK-14582:
--

The total absence of any description in both this JIRA and the accompanying PR 
is really annoying -- especially because I find that queries involving small 
tables frequently suffer from using too much parallelism, not too little.

> Increase the parallelism for small tables
> -
>
> Key: SPARK-14582
> URL: https://issues.apache.org/jira/browse/SPARK-14582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14619) Track internal accumulators (metrics) by stage attempt rather than stage

2016-04-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14619:

Description: 
When there are multiple attempts for a stage, we currently only reset internal 
accumulator values if all the tasks are resubmitted. It would make more sense 
to reset the accumulator values for each stage attempt. This will allow us to 
eventually get rid of the internal flag in the Accumulator class.



> Track internal accumulators (metrics) by stage attempt rather than stage
> 
>
> Key: SPARK-14619
> URL: https://issues.apache.org/jira/browse/SPARK-14619
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> When there are multiple attempts for a stage, we currently only reset 
> internal accumulator values if all the tasks are resubmitted. It would make 
> more sense to reset the accumulator values for each stage attempt. This will 
> allow us to eventually get rid of the internal flag in the Accumulator class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14619) Track internal accumulators (metrics) by stage attempt rather than stage

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240377#comment-15240377
 ] 

Apache Spark commented on SPARK-14619:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12378

> Track internal accumulators (metrics) by stage attempt rather than stage
> 
>
> Key: SPARK-14619
> URL: https://issues.apache.org/jira/browse/SPARK-14619
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14619) Track internal accumulators (metrics) by stage attempt rather than stage

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14619:


Assignee: Reynold Xin  (was: Apache Spark)

> Track internal accumulators (metrics) by stage attempt rather than stage
> 
>
> Key: SPARK-14619
> URL: https://issues.apache.org/jira/browse/SPARK-14619
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14619) Track internal accumulators (metrics) by stage attempt rather than stage

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14619:


Assignee: Apache Spark  (was: Reynold Xin)

> Track internal accumulators (metrics) by stage attempt rather than stage
> 
>
> Key: SPARK-14619
> URL: https://issues.apache.org/jira/browse/SPARK-14619
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14618) RegressionEvaluator doc out of date

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14618:


Assignee: Joseph K. Bradley  (was: Apache Spark)

> RegressionEvaluator doc out of date
> ---
>
> Key: SPARK-14618
> URL: https://issues.apache.org/jira/browse/SPARK-14618
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> In Spark 1.4, we negated some metrics from RegressionEvaluator since 
> CrossValidator always maximized metrics.  This was fixed in 1.5, but the docs 
> were not updated.  This issue is for updating the docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14618) RegressionEvaluator doc out of date

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240374#comment-15240374
 ] 

Apache Spark commented on SPARK-14618:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/12377

> RegressionEvaluator doc out of date
> ---
>
> Key: SPARK-14618
> URL: https://issues.apache.org/jira/browse/SPARK-14618
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> In Spark 1.4, we negated some metrics from RegressionEvaluator since 
> CrossValidator always maximized metrics.  This was fixed in 1.5, but the docs 
> were not updated.  This issue is for updating the docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14618) RegressionEvaluator doc out of date

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14618:


Assignee: Apache Spark  (was: Joseph K. Bradley)

> RegressionEvaluator doc out of date
> ---
>
> Key: SPARK-14618
> URL: https://issues.apache.org/jira/browse/SPARK-14618
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> In Spark 1.4, we negated some metrics from RegressionEvaluator since 
> CrossValidator always maximized metrics.  This was fixed in 1.5, but the docs 
> were not updated.  This issue is for updating the docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14618) RegressionEvaluator doc out of date

2016-04-13 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14618:
-

 Summary: RegressionEvaluator doc out of date
 Key: SPARK-14618
 URL: https://issues.apache.org/jira/browse/SPARK-14618
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 1.6.1, 1.5.2, 2.0.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor


In Spark 1.4, we negated some metrics from RegressionEvaluator since 
CrossValidator always maximized metrics.  This was fixed in 1.5, but the docs 
were not updated.  This issue is for updating the docs.
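
A short sketch of the current (1.5+) behavior the docs should describe, 
assuming a {{predictions}} DataFrame with "label" and "prediction" columns:

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")          // also: "mse", "r2", "mae"
  .setLabelCol("label")
  .setPredictionCol("prediction")

// Metrics are no longer negated; the direction is reported separately.
val rmse = evaluator.evaluate(predictions)
println(s"rmse = $rmse, isLargerBetter = ${evaluator.isLargerBetter}")
{code}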



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14619) Track internal accumulators (metrics) by stage attempt rather than stage

2016-04-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14619:
---

 Summary: Track internal accumulators (metrics) by stage attempt 
rather than stage
 Key: SPARK-14619
 URL: https://issues.apache.org/jira/browse/SPARK-14619
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240364#comment-15240364
 ] 

Joseph K. Bradley commented on SPARK-14489:
---

(Oh, I had not refreshed the page before commenting, but it looks like my 
comments mesh with Nick's.)

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user in the validation set is missing from the training 
> set, hence producing a few NaN estimates from the transform method, and NaN 
> RegressionEvaluator metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the rmse or 
> other metrics (i.e., remove users or items in the validation set that are 
> missing from the training set), and log a warning when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, 
> schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
> // TODO: duplicate evaluator to take extra params from input
> val metric = eval.evaluate(models(i).transform(validationDataset, 
> epm(i)))
> logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
> metrics(i) += metric
> i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2016-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240360#comment-15240360
 ] 

Joseph K. Bradley commented on SPARK-14489:
---

I'd like to try to separate a few issues here based on use cases and suggest 
the "right thing to do" in each case:
* Deploying an ALSModel to make predictions: The model should make best-effort 
predictions, even for new users.  I'd say new users should get recommendations 
based on the average user, for both the explicit and implicit settings.  
Providing a Param which makes the model output NaN for unknown users seems 
reasonable as an additional feature.
* Evaluating an ALSModel on a held-out dataset: This is the same as the first 
case; the model should behave the same way it will when deployed.
* Model tuning using CrossValidator: I'm less sure about this.  Both of your 
suggestions seem reasonable (either returning NaN for missing users and 
ignoring NaN in the evaluator, or making best-effort predictions for all 
users).  I also suspect it would be worthwhile to examine literature to find 
what tends to be best.  E.g., should CrossValidator handle ranking specially by 
doing stratified sampling to divide each user or item's ratings evenly across 
folds of CV?

If we want the evaluator to be able to ignore NaNs, then I'd prefer we keep the 
current behavior as the default and provide a Param which allows users to 
ignore NaNs.  I'd be afraid of linear models not having enough regularization, 
getting NaNs in the coefficients, having all of its predictions ignored by the 
evaluator, etc.

What do you think?
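
For what it's worth, a minimal sketch of the "evaluator ignores NaNs" option 
(illustration only, not a committed API):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Drop NaN predictions (e.g. from unknown users/items) before scoring.
def dropNaNPredictions(predictions: DataFrame): DataFrame =
  predictions.filter(!col("prediction").isNaN)
{code}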

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---
>
> Key: SPARK-14489
> URL: https://issues.apache.org/jira/browse/SPARK-14489
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: AWS EMR
>Reporter: Boris Clémençon 
>  Labels: patch
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the metrics 
> "rmse", "mse", "r2" and "mae" all return NaN. 
> The reason is in CrossValidator.scala line 109. The K-folds are randomly 
> generated. For large and sparse datasets, there is a significant probability 
> that at least one user in the validation set is missing from the training 
> set, hence producing a few NaN estimates from the transform method, and NaN 
> RegressionEvaluator metrics too. 
> Suggestion to fix the bug: remove the NaN values while computing the rmse or 
> other metrics (i.e., remove users or items in the validation set that are 
> missing from the training set), and log a warning when this happens.
> Issue SPARK-14153 seems to be the same problem.
> {code:title=Bar.scala|borderStyle=solid}
> val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
> splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>   val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>   val validationDataset = sqlCtx.createDataFrame(validation, 
> schema).cache()
>   // multi-model training
>   logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>   val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>   trainingDataset.unpersist()
>   var i = 0
>   while (i < numModels) {
> // TODO: duplicate evaluator to take extra params from input
> val metric = eval.evaluate(models(i).transform(validationDataset, 
> epm(i)))
> logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
> metrics(i) += metric
> i += 1
>   }
>   validationDataset.unpersist()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14614) Add `bround` function

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14614:


Assignee: (was: Apache Spark)

> Add `bround` function
> -
>
> Key: SPARK-14614
> URL: https://issues.apache.org/jira/browse/SPARK-14614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue aims to add a `bround` function (aka banker's rounding) by 
> extending the current `round` implementation.
> Hive supports `bround` since 1.3.0. [Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].
> {code}
> hive> select round(2.5), bround(2.5);
> OK
> 3.0   2.0
> {code}
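
One way to sanity-check the expected results, since banker's rounding (round 
half to even) is already available in plain Scala:

{code}
import scala.math.BigDecimal.RoundingMode

val b25 = BigDecimal("2.5").setScale(0, RoundingMode.HALF_EVEN)  // 2: ties go to the even neighbor
val b35 = BigDecimal("3.5").setScale(0, RoundingMode.HALF_EVEN)  // 4
{code}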



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14614) Add `bround` function

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240351#comment-15240351
 ] 

Apache Spark commented on SPARK-14614:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/12376

> Add `bround` function
> -
>
> Key: SPARK-14614
> URL: https://issues.apache.org/jira/browse/SPARK-14614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue aims to add a `bround` function (aka banker's rounding) by 
> extending the current `round` implementation.
> Hive supports `bround` since 1.3.0. [Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].
> {code}
> hive> select round(2.5), bround(2.5);
> OK
> 3.0   2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14614) Add `bround` function

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14614:


Assignee: Apache Spark

> Add `bround` function
> -
>
> Key: SPARK-14614
> URL: https://issues.apache.org/jira/browse/SPARK-14614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to add a `bround` function (aka banker's rounding) by 
> extending the current `round` implementation.
> Hive supports `bround` since 1.3.0. [Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].
> {code}
> hive> select round(2.5), bround(2.5);
> OK
> 3.0   2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14617) Remove deprecated APIs in TaskMetrics

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14617:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove deprecated APIs in TaskMetrics
> -
>
> Key: SPARK-14617
> URL: https://issues.apache.org/jira/browse/SPARK-14617
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14617) Remove deprecated APIs in TaskMetrics

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14617:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove deprecated APIs in TaskMetrics
> -
>
> Key: SPARK-14617
> URL: https://issues.apache.org/jira/browse/SPARK-14617
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14617) Remove deprecated APIs in TaskMetrics

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240331#comment-15240331
 ] 

Apache Spark commented on SPARK-14617:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12375

> Remove deprecated APIs in TaskMetrics
> -
>
> Key: SPARK-14617
> URL: https://issues.apache.org/jira/browse/SPARK-14617
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14617) Remove deprecated APIs in TaskMetrics

2016-04-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14617:

Summary: Remove deprecated APIs in TaskMetrics  (was: Remove deprecated 
APIs in accumulators)

> Remove deprecated APIs in TaskMetrics
> -
>
> Key: SPARK-14617
> URL: https://issues.apache.org/jira/browse/SPARK-14617
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14617) Remove deprecated APIs in accumulators

2016-04-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14617:
---

 Summary: Remove deprecated APIs in accumulators
 Key: SPARK-14617
 URL: https://issues.apache.org/jira/browse/SPARK-14617
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-13 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240322#comment-15240322
 ] 

Xiangrui Meng commented on SPARK-13944:
---

There are more production workflows using RDD-based APIs than DataFrame-based 
APIs since many users are still running Spark 1.4 or earlier. It would be nice 
if we can keep binary compatibility on RDD-based APIs in Spark 2.0. Using type 
alias is not a good solution because 1) it is not Java-compatible, 2) it 
introduces dependency from the RDD-based API to mllib-local, which means future 
development on mllib-local might cause behavior changes or break changes to the 
RDD-based API. Since we already decided that the RDD-based API would go into 
maintenance mode in Spark 2.0. Leaving some old code there won't increase 
maintenance cost, compared with the type alias.

We can provide a converter that converts all `mllib.linalg` types to 
`ml.linalg` types in Spark 2.0 to help users migrate to `ml.linalg`.
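A minimal sketch of what such a converter could look like (illustrative only; 
the object name and the final API are hypothetical at this point):

{code}
// Hypothetical converter sketch, not a final API:
import org.apache.spark.mllib.{linalg => old}
import org.apache.spark.ml.{linalg => newlinalg}

object LinalgConverters {
  def toML(v: old.Vector): newlinalg.Vector = v match {
    case dv: old.DenseVector  => newlinalg.Vectors.dense(dv.values)
    case sv: old.SparseVector => newlinalg.Vectors.sparse(sv.size, sv.indices, sv.values)
  }
}
{code}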

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> a vector is loaded from an mllib vector by Spark SQL, the vector will be 
> automatically converted into the one in the ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14607) Partition pruning is case sensitive even with HiveContext

2016-04-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-14607:
--

Assignee: Davies Liu

> Partition pruning is case sensitive even with HiveContext
> -
>
> Key: SPARK-14607
> URL: https://issues.apache.org/jira/browse/SPARK-14607
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> It should not be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14484) Fail to create parquet filter if the column name does not match exactly

2016-04-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14484.

Resolution: Fixed
  Assignee: Davies Liu

> Fail to create parquet filter if the column name does not match exactly
> ---
>
> Key: SPARK-14484
> URL: https://issues.apache.org/jira/browse/SPARK-14484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> There will be an exception about "no key found" from 
> ParquetFilters.createFilter()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14607) Partition pruning is case sensitive even with HiveContext

2016-04-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14607.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12371
[https://github.com/apache/spark/pull/12371]

> Partition pruning is case sensitive even with HiveContext
> -
>
> Key: SPARK-14607
> URL: https://issues.apache.org/jira/browse/SPARK-14607
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> It should not be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14559) Netty RPC didn't check channel is active before sending message

2016-04-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240299#comment-15240299
 ] 

Shixiong Zhu commented on SPARK-14559:
--

When does this happen? Is it when you are stopping the SparkContext?
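For reference, the guard the report asks about could look roughly like this (a 
sketch only, assuming TransportClient's isActive helper; method signatures vary 
slightly across Spark versions, and this is not an actual patch):

{code}
// Sketch: check connection liveness before sending, failing fast otherwise.
import java.io.IOException
import java.nio.ByteBuffer
import org.apache.spark.network.client.{RpcResponseCallback, TransportClient}

def sendRpcSafely(client: TransportClient, message: ByteBuffer,
    callback: RpcResponseCallback): Unit = {
  if (client.isActive) {
    client.sendRpc(message, callback)
  } else {
    callback.onFailure(
      new IOException("Connection to " + client.getSocketAddress + " is closed"))
  }
}
{code}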

> Netty RPC didn't check channel is active before sending message
> ---
>
> Key: SPARK-14559
> URL: https://issues.apache.org/jira/browse/SPARK-14559
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1
> Environment: spark1.6.1 hadoop2.2.0 jdk1.8.0_65
>Reporter: cen yuhai
>
> I have a long-running service. After running for several hours, it threw 
> these exceptions. I found that before sending an RPC request by calling the 
> sendRpc method in TransportClient, there is no check whether the channel is 
> still open or active.
> java.nio.channels.ClosedChannelException
>  4865 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 5635696155204230556 to 
> bigdata-arch-hdp407.bh.diditaxi.com/10.234.23.107:55197: java.nio.
>   channels.ClosedChannelException
>  4866 java.nio.channels.ClosedChannelException
>  4867 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 7319486003318455703 to 
> bigdata-arch-hdp1235.bh.diditaxi.com/10.168.145.239:36439: java.nio.
>   channels.ClosedChannelException
>  4868 java.nio.channels.ClosedChannelException
>  4869 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 9041854451893215954 to 
> bigdata-arch-hdp1398.bh.diditaxi.com/10.248.117.216:26801: java.nio.
>   channels.ClosedChannelException
>  4870 java.nio.channels.ClosedChannelException
>  4871 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 6046473497871624501 to 
> bigdata-arch-hdp948.bh.diditaxi.com/10.118.114.81:41903: java.nio.  
>   channels.ClosedChannelException
>  4872 java.nio.channels.ClosedChannelException
>  4873 16/04/12 11:24:00 ERROR TransportClient: Failed to send RPC 
> 9085605650438705047 to 
> bigdata-arch-hdp1126.bh.diditaxi.com/10.168.146.78:27023: java.nio.
>   channels.ClosedChannelException



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14614) Add `bround` function

2016-04-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240300#comment-15240300
 ] 

Dongjoon Hyun commented on SPARK-14614:
---

Since 1.3.0. :)

> Add `bround` function
> -
>
> Key: SPARK-14614
> URL: https://issues.apache.org/jira/browse/SPARK-14614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue aims to add the `bround` function (a.k.a. banker's rounding) by 
> extending the current `round` implementation.
> Hive supports `bround` since 1.3.0. [Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].
> {code}
> hive> select round(2.5), bround(2.5);
> OK
> 3.0   2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14614) Add `bround` function

2016-04-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240303#comment-15240303
 ] 

Dongjoon Hyun commented on SPARK-14614:
---

I'll send a PR soon. Actually, I tested Hive 2.0 today.

> Add `bround` function
> -
>
> Key: SPARK-14614
> URL: https://issues.apache.org/jira/browse/SPARK-14614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue aims to add the `bround` function (a.k.a. banker's rounding) by 
> extending the current `round` implementation.
> Hive supports `bround` since 1.3.0. [Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].
> {code}
> hive> select round(2.5), bround(2.5);
> OK
> 3.0   2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-04-13 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240291#comment-15240291
 ] 

DB Tsai commented on SPARK-13944:
-

Can you elaborate on the automatic conversion in VectorUDT?

We will add some utilities for converting the vectors. Implicit conversions will 
be provided to help users migrate to the new vector. Thanks.

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML 
> pipeline will be the one in ML package; however, the existing mllib code will 
> not be touched. As a result, this will potentially break the API. Also, when 
> a vector is loaded from an mllib vector by Spark SQL, the vector will be 
> automatically converted into the one in the ml package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14610:


Assignee: Apache Spark

> Remove superfluous split from random forest findSplitsForContinousFeature
> -
>
> Key: SPARK-14610
> URL: https://issues.apache.org/jira/browse/SPARK-14610
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>
> Currently, the method findSplitsForContinuousFeature in random forest 
> produces an unnecessary split. For example, if a continuous feature has 
> unique values: (1, 2, 3), then the possible splits generated by this method 
> are:
> * {1|2,3}
> * {1,2|3} 
> * {1,2,3|}
> The following unit test is quite clearly incorrect:
> {code:title=rf.scala|borderStyle=solid}
> val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
>   val splits = 
> RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
>   assert(splits.length === 3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14610:


Assignee: (was: Apache Spark)

> Remove superfluous split from random forest findSplitsForContinousFeature
> -
>
> Key: SPARK-14610
> URL: https://issues.apache.org/jira/browse/SPARK-14610
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>
> Currently, the method findSplitsForContinuousFeature in random forest 
> produces an unnecessary split. For example, if a continuous feature has 
> unique values: (1, 2, 3), then the possible splits generated by this method 
> are:
> * {1|2,3}
> * {1,2|3} 
> * {1,2,3|}
> The following unit test is quite clearly incorrect:
> {code:title=rf.scala|borderStyle=solid}
> val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
>   val splits = 
> RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
>   assert(splits.length === 3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240274#comment-15240274
 ] 

Apache Spark commented on SPARK-14610:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/12374

> Remove superfluous split from random forest findSplitsForContinousFeature
> -
>
> Key: SPARK-14610
> URL: https://issues.apache.org/jira/browse/SPARK-14610
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>
> Currently, the method findSplitsForContinuousFeature in random forest 
> produces an unnecessary split. For example, if a continuous feature has 
> unique values: (1, 2, 3), then the possible splits generated by this method 
> are:
> * {1|2,3}
> * {1,2|3} 
> * {1,2,3|}
> The following unit test is quite clearly incorrect:
> {code:title=rf.scala|borderStyle=solid}
> val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
>   val splits = 
> RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
>   assert(splits.length === 3)
> {code}
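With the superfluous {1,2,3|} split removed, only two thresholds would remain 
for the unique values (1, 2, 3), so presumably the corrected expectation is 
(sketch):

{code}
// Only {1|2,3} and {1,2|3} are valid splits after the fix:
assert(splits.length === 2)
{code}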



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14614) Add `bround` function

2016-04-13 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240259#comment-15240259
 ] 

Bo Meng commented on SPARK-14614:
-

I have tried it on Hive 1.2.1; this function seems to have been dropped.

> Add `bround` function
> -
>
> Key: SPARK-14614
> URL: https://issues.apache.org/jira/browse/SPARK-14614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue aims to add the `bround` function (a.k.a. banker's rounding) by 
> extending the current `round` implementation.
> Hive supports `bround` since 1.3.0. [Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].
> {code}
> hive> select round(2.5), bround(2.5);
> OK
> 3.0   2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240252#comment-15240252
 ] 

Joseph K. Bradley commented on SPARK-14409:
---

Thanks for writing this!  I just made a few comments too.  Wrapping 
RankingMetrics seems fine to me, though later on it would be worth 
re-implementing it using DataFrames and testing performance changes.  The 
initial PR should not add new metrics, but follow-up ones can.

Also, we'll need to follow up this issue with one to think about how to use ALS 
with CrossValidator.  I'll comment on the linked JIRA for that.
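As a sketch of the wrapping approach (illustrative only; the helper name is 
hypothetical, not a proposed API), an ml evaluator could delegate to the 
existing mllib class:

{code}
// Delegating a top-k ranking metric to mllib's RankingMetrics:
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.rdd.RDD

def meanAveragePrecision(predictionAndLabels: RDD[(Array[Int], Array[Int])]): Double =
  new RankingMetrics(predictionAndLabels).meanAveragePrecision
{code}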

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables

2016-04-13 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14616:
---
Description: 
{code:title=tpcds q44}
select asceding.rnk, i1.i_product_name best_performing, i2.i_product_name worst_performing
from (select *
      from (select item_sk, rank() over (order by rank_col asc) rnk
            from (select ss_item_sk item_sk, avg(ss_net_profit) rank_col
                  from store_sales ss1
                  where ss_store_sk = 4
                  group by ss_item_sk
                  having avg(ss_net_profit) > 0.9 * (select avg(ss_net_profit) rank_col
                                                     from store_sales
                                                     where ss_store_sk = 4
                                                       and ss_addr_sk is null
                                                     group by ss_store_sk)) V1) V11
      where rnk < 11) asceding,
     (select *
      from (select item_sk, rank() over (order by rank_col desc) rnk
            from (select ss_item_sk item_sk, avg(ss_net_profit) rank_col
                  from store_sales ss1
                  where ss_store_sk = 4
                  group by ss_item_sk
                  having avg(ss_net_profit) > 0.9 * (select avg(ss_net_profit) rank_col
                                                     from store_sales
                                                     where ss_store_sk = 4
                                                       and ss_addr_sk is null
                                                     group by ss_store_sk)) V2) V21
      where rnk < 11) descending,
     item i1,
     item i2
where asceding.rnk = descending.rnk
  and i1.i_item_sk = asceding.item_sk
  and i2.i_item_sk = descending.item_sk
order by asceding.rnk
limit 100;

{code}

{noformat}
bin/spark-sql  --driver-memory 10g --verbose --master yarn-client  --packages 
com.databricks:spark-csv_2.10:1.3.0 --executor-memory 4g --num-executors 80 
--executor-cores 2 --database hadoopds1g  -f q44.sql
{noformat}

{noformat}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition, None
+- WholeStageCodegen
   :  +- Project [item_sk#0,rank_col#1]
   : +- Filter havingCondition#219: boolean
   :+- TungstenAggregate(key=[ss_item_sk#12], 
functions=[(avg(ss_net_profit#32),mode=Final,isDistinct=false)], 
output=[havingCondition#219,item_sk#0,rank_col#1])
   :   +- INPUT
   +- Exchange hashpartitioning(ss_item_sk#12,200), None
  +- WholeStageCodegen
 :  +- TungstenAggregate(key=[ss_item_sk#12], 
functions=[(avg(ss_net_profit#32),mode=Partial,isDistinct=false)], 
output=[ss_item_sk#12,sum#612,count#613L])
 : +- Project [ss_item_sk#12,ss_net_profit#32]
 :+- Filter (ss_store_sk#17 = 4)
 :   +- INPUT
 +- Scan ParquetRelation: 
hadoopds1g.store_sales[ss_item_sk#12,ss_net_profit#32,ss_store_sk#17] 
InputPaths: 
hdfs://bigaperf116.svl.ibm.com:8020/apps/hive/warehouse/hadoopds1g.db/store_sales,
 PushedFilters: [EqualTo(ss_store_sk,4)]

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:105)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:60)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.Window.doExecute(Window.scala:288)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:116)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at 
org.apache.spark.sql.execution.InputAdapter.upstream(WholeStageCodegen.scala:176)
at 
org.apache.spark.sql.execution.Filter.upstream(basicOperators.scala:73)
at 
org.apache.spark.sql.execution.Project.upstream(basicOperators.scala:35)
at 
org.apache.spark.sql.execution.WholeStageCodegen.doExecute(WholeStageCodegen.scala:279)
at 

[jira] [Updated] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables

2016-04-13 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14616:
---
Environment: (was: spark 1.5.1 (official binary distribution) running 
on hadoop yarn 2.6 with parquet 1.5.0 (both from cdh5.4.8))

> TreeNodeException running Q44 and 58 on Parquet tables
> --
>
> Key: SPARK-14616
> URL: https://issues.apache.org/jira/browse/SPARK-14616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>
> {code:title=/tmp/bug.py}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext()
> sqlc = SQLContext(sc)
> R = Row('id', 'foo')
> r = sqlc.createDataFrame(sc.parallelize([R('abc', 'foo')]))
> q = sqlc.createDataFrame(sc.parallelize([R('', 
> 'bar')]))
> q.write.parquet('/tmp/1.parq')
> q = sqlc.read.parquet('/tmp/1.parq')
> j = r.join(q, r.id == q.id)
> print j.count()
> {code}
> {noformat}
> [user@sandbox test]$ spark-submit --executor-memory=32g /tmp/bug.py
> [user@sandbox test]$ hadoop fs -rmr /tmp/1.parq
> {noformat}
> {noformat}
> 15/11/04 04:28:38 INFO codegen.GenerateUnsafeProjection: Code generated in 
> 119.90324 ms
> Traceback (most recent call last):
>   File "/tmp/bug.py", line 13, in <module>
> print j.count()
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 
> 268, in count
>   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", 
> line 538, in __call__
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, 
> in deco
>   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o148.count.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, 
> tree:
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#10L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#13L])
>TungstenProject
> BroadcastHashJoin [id#0], [id#8], BuildRight
>  TungstenProject [id#0]
>   Scan PhysicalRDD[id#0,foo#1]
>  ConvertToUnsafe
>   Scan ParquetRelation[hdfs:///tmp/1.parq][id#8]
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:174)
> at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
> at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Note this happens only under the following conditions:
> # executor memory >= 32GB (doesn't fail with up to 31 GB)
> # the ID in the q dataframe has exactly 24 chars (doesn't fail with less or 
> more than 24 chars)
> # q is read from parquet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables

2016-04-13 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14616:
---
Affects Version/s: (was: 1.5.1)
   2.0.0

> TreeNodeException running Q44 and 58 on Parquet tables
> --
>
> Key: SPARK-14616
> URL: https://issues.apache.org/jira/browse/SPARK-14616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 1.5.1 (official binary distribution) running on 
> hadoop yarn 2.6 with parquet 1.5.0 (both from cdh5.4.8)
>Reporter: JESSE CHEN
>
> {code:title=/tmp/bug.py}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext()
> sqlc = SQLContext(sc)
> R = Row('id', 'foo')
> r = sqlc.createDataFrame(sc.parallelize([R('abc', 'foo')]))
> q = sqlc.createDataFrame(sc.parallelize([R('', 
> 'bar')]))
> q.write.parquet('/tmp/1.parq')
> q = sqlc.read.parquet('/tmp/1.parq')
> j = r.join(q, r.id == q.id)
> print j.count()
> {code}
> {noformat}
> [user@sandbox test]$ spark-submit --executor-memory=32g /tmp/bug.py
> [user@sandbox test]$ hadoop fs -rmr /tmp/1.parq
> {noformat}
> {noformat}
> 15/11/04 04:28:38 INFO codegen.GenerateUnsafeProjection: Code generated in 
> 119.90324 ms
> Traceback (most recent call last):
>   File "/tmp/bug.py", line 13, in <module>
> print j.count()
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 
> 268, in count
>   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", 
> line 538, in __call__
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, 
> in deco
>   File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o148.count.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, 
> tree:
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#10L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#13L])
>TungstenProject
> BroadcastHashJoin [id#0], [id#8], BuildRight
>  TungstenProject [id#0]
>   Scan PhysicalRDD[id#0,foo#1]
>  ConvertToUnsafe
>   Scan ParquetRelation[hdfs:///tmp/1.parq][id#8]
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:174)
> at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
> at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
> at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Note this happens only under the following conditions:
> # executor memory >= 32GB (doesn't fail with up to 31 GB)
> # the ID in the q dataframe has exactly 24 chars (doesn't fail with less or 
> more than 24 chars)
> # q is read from parquet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (SPARK-14616) TreeNodeException running Q44 and 58 on Parquet tables

2016-04-13 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-14616:
--

 Summary: TreeNodeException running Q44 and 58 on Parquet tables
 Key: SPARK-14616
 URL: https://issues.apache.org/jira/browse/SPARK-14616
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
 Environment: spark 1.5.1 (official binary distribution) running on 
hadoop yarn 2.6 with parquet 1.5.0 (both from cdh5.4.8)
Reporter: JESSE CHEN


{code:title=/tmp/bug.py}
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext()
sqlc = SQLContext(sc)

R = Row('id', 'foo')
r = sqlc.createDataFrame(sc.parallelize([R('abc', 'foo')]))
q = sqlc.createDataFrame(sc.parallelize([R('', 'bar')]))
q.write.parquet('/tmp/1.parq')
q = sqlc.read.parquet('/tmp/1.parq')
j = r.join(q, r.id == q.id)
print j.count()
{code}

{noformat}
[user@sandbox test]$ spark-submit --executor-memory=32g /tmp/bug.py
[user@sandbox test]$ hadoop fs -rmr /tmp/1.parq
{noformat}

{noformat}
15/11/04 04:28:38 INFO codegen.GenerateUnsafeProjection: Code generated in 
119.90324 ms
Traceback (most recent call last):
  File "/tmp/bug.py", line 13, in <module>
print j.count()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 
268, in count
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", 
line 538, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, 
in deco
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o148.count.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[count#10L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#13L])
   TungstenProject
BroadcastHashJoin [id#0], [id#8], BuildRight
 TungstenProject [id#0]
  Scan PhysicalRDD[id#0,foo#1]
 ConvertToUnsafe
  Scan ParquetRelation[hdfs:///tmp/1.parq][id#8]

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:69)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:174)
at 
org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at 
org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{noformat}

Note this happens only under the following conditions:
# executor memory >= 32GB (doesn't fail with up to 31 GB)
# the ID in the q dataframe has exactly 24 chars (doesn't fail with less or 
more than 24 chars)
# q is read from parquet



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14615) Use the new ML Vector and Matrix in the ML pipeline based algorithms

2016-04-13 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-14615:
---

Assignee: DB Tsai

> Use the new ML Vector and Matrix in the ML pipeline based algorithms 
> -
>
> Key: SPARK-14615
> URL: https://issues.apache.org/jira/browse/SPARK-14615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, ML
>Reporter: DB Tsai
>Assignee: DB Tsai
>
> Once SPARK-14487 and SPARK-14549 are merged, we will migrate to using the new 
> vector and matrix types in the new ML pipeline based APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14615) Use the new ML Vector and Matrix in the ML pipeline based algorithms

2016-04-13 Thread DB Tsai (JIRA)
DB Tsai created SPARK-14615:
---

 Summary: Use the new ML Vector and Matrix in the ML pipeline based 
algorithms 
 Key: SPARK-14615
 URL: https://issues.apache.org/jira/browse/SPARK-14615
 Project: Spark
  Issue Type: Sub-task
Reporter: DB Tsai


Once SPARK-14487 and SPARK-14549 are merged, we will migrate to using the new 
vector and matrix types in the new ML pipeline based APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14541) SQL function: IFNULL, NULLIF, NVL and NVL2

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14541:


Assignee: Apache Spark

> SQL function: IFNULL, NULLIF, NVL and NVL2
> --
>
> Key: SPARK-14541
> URL: https://issues.apache.org/jira/browse/SPARK-14541
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> It would be great to have these SQL functions:
> IFNULL, NULLIF, NVL, NVL2
> The meaning of these functions can be found in the Oracle docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14541) SQL function: IFNULL, NULLIF, NVL and NVL2

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14541:


Assignee: (was: Apache Spark)

> SQL function: IFNULL, NULLIF, NVL and NVL2
> --
>
> Key: SPARK-14541
> URL: https://issues.apache.org/jira/browse/SPARK-14541
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> It would be great to have these SQL functions:
> IFNULL, NULLIF, NVL, NVL2
> The meaning of these functions can be found in the Oracle docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14541) SQL function: IFNULL, NULLIF, NVL and NVL2

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240206#comment-15240206
 ] 

Apache Spark commented on SPARK-14541:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12373

> SQL function: IFNULL, NULLIF, NVL and NVL2
> --
>
> Key: SPARK-14541
> URL: https://issues.apache.org/jira/browse/SPARK-14541
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> It would be great to have these SQL functions:
> IFNULL, NULLIF, NVL, NVL2
> The meaning of these functions can be found in the Oracle docs.
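For reference, a sketch of the intended semantics, following the Oracle 
definitions (illustrative only, not Spark's implementation):

{code}
// Semantics sketch, following Oracle's definitions:
def ifnull[T >: Null](a: T, b: T): T = if (a == null) b else a  // IFNULL(a, b)
def nvl[T >: Null](a: T, b: T): T    = if (a == null) b else a  // NVL(a, b) is a synonym of IFNULL
def nullif[T >: Null](a: T, b: T): T = if (a == b) null else a  // NULLIF(a, b)
def nvl2[T](a: Any, b: T, c: T): T   = if (a != null) b else c  // NVL2(a, b, c)
{code}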



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14614) Add `bound` function

2016-04-13 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-14614:
-

 Summary: Add `bound` function
 Key: SPARK-14614
 URL: https://issues.apache.org/jira/browse/SPARK-14614
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Dongjoon Hyun


This issue aims to add the `bround` function (a.k.a. banker's rounding) by 
extending the current `round` implementation.

Hive supports `bround` since 1.3.0. [Language 
Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].

{code}
hive> select round(2.5), bround(2.5);
OK
3.0 2.0
{code}
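To illustrate the tie-breaking behavior (a minimal sketch using 
java.math.BigDecimal, not Spark's implementation):

{code}
// HALF_UP is plain rounding; HALF_EVEN is banker's rounding (ties go to the even neighbor).
import java.math.{BigDecimal, RoundingMode}

new BigDecimal("2.5").setScale(0, RoundingMode.HALF_UP)    // 3, like round(2.5)
new BigDecimal("2.5").setScale(0, RoundingMode.HALF_EVEN)  // 2, like bround(2.5)
new BigDecimal("3.5").setScale(0, RoundingMode.HALF_EVEN)  // 4, ties round to even
{code}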



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14614) Add `bround` function

2016-04-13 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14614:
--
Summary: Add `bround` function  (was: Add `bound` function)

> Add `bround` function
> -
>
> Key: SPARK-14614
> URL: https://issues.apache.org/jira/browse/SPARK-14614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue aims to add the `bround` function (a.k.a. banker's rounding) by 
> extending the current `round` implementation.
> Hive supports `bround` since 1.3.0. [Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].
> {code}
> hive> select round(2.5), bround(2.5);
> OK
> 3.0   2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14613) Add @Since into the matrix and vector classes in spark-mllib-local

2016-04-13 Thread DB Tsai (JIRA)
DB Tsai created SPARK-14613:
---

 Summary: Add @Since into the matrix and vector classes in 
spark-mllib-local
 Key: SPARK-14613
 URL: https://issues.apache.org/jira/browse/SPARK-14613
 Project: Spark
  Issue Type: Sub-task
Reporter: DB Tsai


In spark-mllib-local, we're no longer able to use the @Since annotation. As a 
result, we will switch to the standard Scaladoc style, using /** @since */. This 
task will mark the new APIs as @since 2.0.
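For example, the version tag would move into the doc comment itself (an 
illustrative sketch; the trait shown is a placeholder):

{code}
/**
 * Represents a numeric vector. (Placeholder declaration for illustration.)
 *
 * @since 2.0.0
 */
sealed trait Vector extends Serializable
{code}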



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14612) Consolidate the version of dependencies in mllib and mllib-local into one place

2016-04-13 Thread DB Tsai (JIRA)
DB Tsai created SPARK-14612:
---

 Summary: Consolidate the version of dependencies in mllib and 
mllib-local into one place 
 Key: SPARK-14612
 URL: https://issues.apache.org/jira/browse/SPARK-14612
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: DB Tsai


Both spark-mllib-local and spark-mllib depend on breeze, but we specify the 
version of breeze in both pom files. Also, org.json4s has the same issue. For 
maintainability, we should define the versions of these dependencies in one place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14457) Write a end to end test for DataSet with UDT

2016-04-13 Thread Joan Goyeau (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joan Goyeau closed SPARK-14457.
---
Resolution: Fixed

> Write a end to end test for DataSet with UDT
> 
>
> Key: SPARK-14457
> URL: https://issues.apache.org/jira/browse/SPARK-14457
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Reporter: Joan Goyeau
>Priority: Minor
>
> I don't know if UDTs are supported by DataSets yet, but if so we should write 
> at least an end-to-end test for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7861) Python wrapper for OneVsRest

2016-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7861:
-
   Shepherd: Joseph K. Bradley
   Assignee: Xusen Yin  (was: Ram Sriharsha)
Component/s: ML
 Issue Type: New Feature  (was: Improvement)

> Python wrapper for OneVsRest
> 
>
> Key: SPARK-7861
> URL: https://issues.apache.org/jira/browse/SPARK-7861
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Ram Sriharsha
>Assignee: Xusen Yin
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14611) Second attempt observed after AM fails due to max number of executor failure in first attempt

2016-04-13 Thread Kshitij Badani (JIRA)
Kshitij Badani created SPARK-14611:
--

 Summary: Second attempt observed after AM fails due to max number 
of executor failure in first attempt
 Key: SPARK-14611
 URL: https://issues.apache.org/jira/browse/SPARK-14611
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1
 Environment: RHEL7 64 bit
Reporter: Kshitij Badani


I submitted a Spark application in yarn-cluster mode. My cluster has two 
NodeManagers. After submitting the application, I tried to restart the 
NodeManager on node1, which was actively running a few executors; this node was 
not running the AM.

While the NodeManager on node1 was restarting, 3 of the executors running 
on node2 failed with 'failed to connect to external shuffle server' as follows:

java.io.IOException: Failed to connect to node1
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
at 
org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:211)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:208)
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:194)
at org.apache.spark.executor.Executor.(Executor.scala:86)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: node1
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)

Each of the 3 executors tried to connect to the external shuffle service 2 more 
times, all while the NM on node1 was restarting, and eventually failed.

Since 3 executors failed, the AM exited with FAILURE status, and I can see the 
following message in the application logs:

INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max 
number of executor failures (3) reached)

After this, we saw a 2nd application attempt, which succeeded as the NM had come 
back up.

Should we see a 2nd attempt in scenarios where multiple executors have failed in 
the 1st attempt because they could not connect to the external shuffle service? 
What if the 2nd attempt also fails for a similar reason? In that case it would 
be a heavy penalty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14541) SQL function: IFNULL, NULLIF, NVL and NVL2

2016-04-13 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240157#comment-15240157
 ] 

Bo Meng commented on SPARK-14541:
-

I will try to do it one by one. 

> SQL function: IFNULL, NULLIF, NVL and NVL2
> --
>
> Key: SPARK-14541
> URL: https://issues.apache.org/jira/browse/SPARK-14541
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>
> It would be great to have these SQL functions:
> IFNULL, NULLIF, NVL, NVL2
> The meaning of these functions can be found in the Oracle docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14607) Partition pruning is case sensitive even with HiveContext

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240130#comment-15240130
 ] 

Apache Spark commented on SPARK-14607:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12371

> Partition pruning is case sensitive even with HiveContext
> -
>
> Key: SPARK-14607
> URL: https://issues.apache.org/jira/browse/SPARK-14607
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>
> It should not be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14484) Fail to create parquet filter if the column name does not match exactly

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14484:


Assignee: Apache Spark

> Fail to create parquet filter if the column name does not match exactly
> ---
>
> Key: SPARK-14484
> URL: https://issues.apache.org/jira/browse/SPARK-14484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> There will be an exception about "no key found" from 
> ParquetFilters.createFilter()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14484) Fail to create parquet filter if the column name does not match exactly

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14484:


Assignee: (was: Apache Spark)

> Fail to create parquet filter if the column name does not match exactly
> ---
>
> Key: SPARK-14484
> URL: https://issues.apache.org/jira/browse/SPARK-14484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> There will be an exception about "no key found" from 
> ParquetFilters.createFilter()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14607) Partition pruning is case sensitive even with HiveContext

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14607:


Assignee: Apache Spark

> Partition pruning is case sensitive even with HiveContext
> -
>
> Key: SPARK-14607
> URL: https://issues.apache.org/jira/browse/SPARK-14607
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> It should not be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14484) Fail to create parquet filter if the column name does not match exactly

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240131#comment-15240131
 ] 

Apache Spark commented on SPARK-14484:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12371

> Fail to create parquet filter if the column name does not match exactly
> ---
>
> Key: SPARK-14484
> URL: https://issues.apache.org/jira/browse/SPARK-14484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> There will be an exception about "no key found" from 
> ParquetFilters.createFilter()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2016-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240127#comment-15240127
 ] 

Joseph K. Bradley commented on SPARK-7146:
--

I just did an audit of our current shared params.  Does the updated proposal 
sound good?

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Discussion: Should the Param traits in sharedParams.scala be public?
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7146) Should ML sharedParams be a public API?

2016-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7146:
-
Description: 
Discussion: Should the Param traits in sharedParams.scala be public?

Pros:
* Sharing the Param traits helps to encourage standardized Param names and 
documentation.

Cons:
* Users have to be careful since parameters can have different meanings for 
different algorithms.
* If the shared Params are public, then implementations could test for the 
traits.  It is unclear if we want users to rely on these traits, which are 
somewhat experimental.

Currently, the shared params are private.

h3. UPDATED proposal
* Some Params are clearly safe to make public.  We will do so.
* Some Params could be made public but may require caveats in the trait doc.
* Some Params have turned out not to be shared in practice.  We can move those 
Params to the classes which use them.

*Public shared params*:

* I/O column params
** HasFeaturesCol
** HasInputCol
** HasInputCols
** HasLabelCol
** HasOutputCol
** HasPredictionCol
** HasProbabilityCol
** HasRawPredictionCol
** HasVarianceCol
** HasWeightCol

* Algorithm settings
** HasCheckpointInterval
** HasElasticNetParam
** HasFitIntercept
** HasMaxIter
** HasRegParam
** HasSeed
** HasStandardization (less common)
** HasStepSize
** HasTol

*Questionable params*:
* HasHandleInvalid (only used in StringIndexer, but might be more widely used 
later on)
* HasSolver (used in LinearRegression and GeneralizedLinearRegression, but same 
meaning as Optimizer in LDA)

*Params to be removed from sharedParams*:
* HasThreshold (only used in LogisticRegression)
* HasThresholds (only used in ProbabilisticClassifier)


  was:
Discussion: Should the Param traits in sharedParams.scala be public?

Pros:
* Sharing the Param traits helps to encourage standardized Param names and 
documentation.

Cons:
* Users have to be careful since parameters can have different meanings for 
different algorithms.
* If the shared Params are public, then implementations could test for the 
traits.  It is unclear if we want users to rely on these traits, which are 
somewhat experimental.

Currently, the shared params are private.

Proposal: Either
(a) make the shared params private to encourage users to write specialized 
documentation and value checks for parameters, or
(b) design a better way to encourage overriding documentation and parameter 
value checks


> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Discussion: Should the Param traits in sharedParams.scala be public?
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14599) BaggedPoint should support weighted instances.

2016-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240115#comment-15240115
 ] 

Apache Spark commented on SPARK-14599:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/12370

> BaggedPoint should support weighted instances.
> --
>
> Key: SPARK-14599
> URL: https://issues.apache.org/jira/browse/SPARK-14599
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>
> This JIRA addresses a TODO in BaggedPoint to support individual sample 
> weights. This is a blocker for 
> [SPARK-9478|https://issues.apache.org/jira/browse/SPARK-9478]. 
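For context, a rough sketch of the requested change (the names below are 
hypothetical simplifications of the internal BaggedPoint structure, not the 
actual patch):

{code:title=WeightedBaggedPointSketch.scala|borderStyle=solid}
import scala.util.Random

// Hypothetical simplification: one bootstrap count per subsample plus a
// per-instance weight. Tree aggregation would use
// subsampleCounts(i) * sampleWeight as the effective weight of the point.
case class BaggedPoint[D](datum: D, subsampleCounts: Array[Int], sampleWeight: Double)

object BaggedPoint {

  // Knuth's Poisson(lambda) sampler; adequate for the lambda = 1 bootstrap.
  private def poisson(lambda: Double, rng: Random): Int = {
    val l = math.exp(-lambda)
    var k = 0
    var p = 1.0
    do { k += 1; p *= rng.nextDouble() } while (p > l)
    k - 1
  }

  // Bagging that carries user-supplied instance weights through unchanged.
  def bag[D](points: Seq[(D, Double)], numSubsamples: Int, seed: Long): Seq[BaggedPoint[D]] = {
    val rng = new Random(seed)
    points.map { case (datum, weight) =>
      BaggedPoint(datum, Array.fill(numSubsamples)(poisson(1.0, rng)), weight)
    }
  }
}
{code}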



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14599) BaggedPoint should support weighted instances.

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14599:


Assignee: (was: Apache Spark)

> BaggedPoint should support weighted instances.
> --
>
> Key: SPARK-14599
> URL: https://issues.apache.org/jira/browse/SPARK-14599
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>
> This JIRA addresses a TODO in BaggedPoint to support individual sample 
> weights. This is a blocker for 
> [SPARK-9478|https://issues.apache.org/jira/browse/SPARK-9478]. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14599) BaggedPoint should support weighted instances.

2016-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14599:


Assignee: Apache Spark

> BaggedPoint should support weighted instances.
> --
>
> Key: SPARK-14599
> URL: https://issues.apache.org/jira/browse/SPARK-14599
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>
> This JIRA addresses a TODO in BaggedPoint to support individual sample 
> weights. This is a blocker for 
> [SPARK-9478|https://issues.apache.org/jira/browse/SPARK-9478]. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature

2016-04-13 Thread Seth Hendrickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Hendrickson updated SPARK-14610:
-
Description: 
Currently, the method findSplitsForContinuousFeature in random forest produces 
an unnecessary split. For example, if a continuous feature has unique values: 
(1, 2, 3), then the possible splits generated by this method are:
* {1|2,3}
* {1,2|3} 
* {1,2,3|}

The following unit test is quite clearly incorrect, since three unique values 
admit only two distinct thresholds (so there should be two splits, not three):

{code:title=rf.scala|borderStyle=solid}
val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
assert(splits.length === 3)
{code}

  was:
Currently, the method findSplitsForContinuousFeature in random forest produces 
an unnecessary split. For example, if a continuous feature has unique values: 
{1, 2, 3}, then the possible splits generated by this method are:
{1|2,3}, {1,2|3} and {1,2,3|}. The following unit test is quite clearly 
incorrect:

{code:title=rf.scala|borderStyle=solid}
val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
assert(splits.length === 3)
{code}


> Remove superfluous split from random forest findSplitsForContinousFeature
> -
>
> Key: SPARK-14610
> URL: https://issues.apache.org/jira/browse/SPARK-14610
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>
> Currently, the method findSplitsForContinuousFeature in random forest 
> produces an unnecessary split. For example, if a continuous feature has 
> unique values: (1, 2, 3), then the possible splits generated by this method 
> are:
> * {1|2,3}
> * {1,2|3} 
> * {1,2,3|}
> The following unit test is quite clearly incorrect, since three unique values 
> admit only two distinct thresholds (so there should be two splits, not three):
> {code:title=rf.scala|borderStyle=solid}
> val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
> val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
> assert(splits.length === 3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature

2016-04-13 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240105#comment-15240105
 ] 

Seth Hendrickson commented on SPARK-14610:
--

One thing to note is that fixing this actually uncovers a bug of sorts. There 
is an assertion in this method to verify that there are more than zero splits. 
However, because of the extra split that was previously returned, this 
assertion did nothing; with the fix, training would fail on a constant 
continuous feature. So, this PR will also remove the assertion and handle 
constant continuous features appropriately.

I can submit a PR for this soon.
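To make the intended behavior concrete, here is an illustrative sketch (not 
the actual patch): thresholds go between adjacent distinct values, so n 
distinct values yield n - 1 splits, and a constant feature naturally yields 
zero splits instead of tripping an assertion.

{code:title=FindSplitsSketch.scala|borderStyle=solid}
// Illustrative helper: split thresholds as midpoints between adjacent
// distinct values, so a constant feature yields an empty result.
def candidateThresholds(featureSamples: Array[Double]): Array[Double] = {
  val distinct = featureSamples.distinct.sorted
  distinct.sliding(2).collect { case Array(lo, hi) => (lo + hi) / 2.0 }.toArray
}

// candidateThresholds(Array(1, 1, 2, 2, 3).map(_.toDouble)) == Array(1.5, 2.5)
// candidateThresholds(Array(2, 2, 2).map(_.toDouble)) is empty (constant feature)
{code}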

> Remove superfluous split from random forest findSplitsForContinousFeature
> -
>
> Key: SPARK-14610
> URL: https://issues.apache.org/jira/browse/SPARK-14610
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>
> Currently, the method findSplitsForContinuousFeature in random forest 
> produces an unnecessary split. For example, if a continuous feature has 
> unique values: {1, 2, 3}, then the possible splits generated by this method 
> are:
> {1|2,3}, {1,2|3} and {1,2,3|}. The following unit test is quite clearly 
> incorrect:
> {code:title=rf.scala|borderStyle=solid}
> val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
> val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
> assert(splits.length === 3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14610) Remove superfluous split from random forest findSplitsForContinousFeature

2016-04-13 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-14610:


 Summary: Remove superfluous split from random forest 
findSplitsForContinousFeature
 Key: SPARK-14610
 URL: https://issues.apache.org/jira/browse/SPARK-14610
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Seth Hendrickson


Currently, the method findSplitsForContinuousFeature in random forest produces 
an unnecessary split. For example, if a continuous feature has unique values: 
{1, 2, 3}, then the possible splits generated by this method are:
{1|2,3}, {1,2|3} and {1,2,3|}. The following unit test is quite clearly 
incorrect:

{code:title=rf.scala|borderStyle=solid}
val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
assert(splits.length === 3)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14574) Pure Java modules should not have _2.xx suffixes in their package names

2016-04-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-14574.

Resolution: Later

Resolving as "later" since this is prohibitively costly to fix and I don't have 
time to finish it.

> Pure Java modules should not have _2.xx suffixes in their package names
> ---
>
> Key: SPARK-14574
> URL: https://issues.apache.org/jira/browse/SPARK-14574
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark has a few modules which do not depend on Scala, such as spark-launcher, 
> unsafe, sketch, and the network libraries. However, we currently cross-build 
> and publish these artifacts for different Scala versions.
> We should refactor our build so that pure-Java modules can be published 
> without Scala versions appearing in their artifact names.
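For what it's worth, in an sbt build the per-module change is small; the 
sketch below is illustrative, not Spark's actual build definition:

{code:title=build.sbt|borderStyle=solid}
// Sketch for a pure-Java module such as spark-launcher.
lazy val launcher = (project in file("launcher"))
  .settings(
    name := "spark-launcher",
    // Publish spark-launcher-<version>.jar with no _2.xx suffix.
    crossPaths := false,
    // Do not add a scala-library dependency to a pure-Java module.
    autoScalaLibrary := false
  )
{code}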



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14609) LOAD DATA

2016-04-13 Thread Yin Huai (JIRA)
Yin Huai created SPARK-14609:


 Summary: LOAD DATA
 Key: SPARK-14609
 URL: https://issues.apache.org/jira/browse/SPARK-14609
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai


The LOAD DATA command should be pretty easy to implement. We already call Hive 
to load data when inserting into Hive tables, so we can follow that 
implementation. For example, we load into a Hive table in the 
InsertIntoHiveTable command at 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L221-L225.
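For reference, these are the HiveQL forms the command would need to cover 
(assuming a Hive-enabled context; paths and table names below are 
placeholders):

{code:title=LoadDataExample.scala|borderStyle=solid}
sqlContext.sql("CREATE TABLE t (id INT, name STRING)")

// Copy a local file into the table's warehouse directory.
sqlContext.sql("LOAD DATA LOCAL INPATH '/tmp/data.txt' INTO TABLE t")

// Move an HDFS file, replacing the table's existing contents.
sqlContext.sql("LOAD DATA INPATH 'hdfs:///tmp/data.txt' OVERWRITE INTO TABLE t")
{code}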



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields

2016-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9961:
-
Target Version/s: 2.0.0  (was: )

> ML prediction abstractions should have defaultEvaluator fields
> --
>
> Key: SPARK-9961
> URL: https://issues.apache.org/jira/browse/SPARK-9961
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Predictor and PredictionModel should have abstract defaultEvaluator methods 
> which return Evaluators.  Subclasses like Regressor, Classifier, etc. should 
> all provide natural evaluators, set to use the correct input columns and 
> metrics.  Concrete classes may later be modified to use other evaluators or 
> evaluator options.
> The initial implementation should be marked as DeveloperApi since we may need 
> to change the defaults later on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields

2016-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9961:
-
Description: 
Predictor and PredictionModel should have abstract defaultEvaluator methods 
which return Evaluators.  Subclasses like Regressor, Classifier, etc. should 
all provide natural evaluators, set to use the correct input columns and 
metrics.  Concrete classes may later be modified to use other evaluators or 
evaluator options.

The initial implementation should be marked as DeveloperApi since we may need 
to change the defaults later on.

  was:
Predictor and PredictionModel should have abstract defaultEvaluator methods 
which return Evaluators.  Subclasses like Regressor, Classifier, etc. should 
all provide natural evaluators, set to use the correct input columns and 
metrics.  Concrete classes may later be modified to 

The initial implementation should be marked as DeveloperApi since we may need 
to change the defaults later on.


> ML prediction abstractions should have defaultEvaluator fields
> --
>
> Key: SPARK-9961
> URL: https://issues.apache.org/jira/browse/SPARK-9961
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Predictor and PredictionModel should have abstract defaultEvaluator methods 
> which return Evaluators.  Subclasses like Regressor, Classifier, etc. should 
> all provide natural evaluators, set to use the correct input columns and 
> metrics.  Concrete classes may later be modified to use other evaluators or 
> evaluator options.
> The initial implementation should be marked as DeveloperApi since we may need 
> to change the defaults later on.
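An illustrative sketch of the proposed hook (HasDefaultEvaluator and 
MyRegressor are hypothetical names, not current Spark API; RegressionEvaluator 
is Spark ML's):

{code:title=DefaultEvaluatorSketch.scala|borderStyle=solid}
import org.apache.spark.ml.evaluation.{Evaluator, RegressionEvaluator}

// Hypothetical abstraction: each predictor exposes a natural evaluator.
trait HasDefaultEvaluator {
  def defaultEvaluator: Evaluator
}

// A regressor would default to RMSE, wired to its own output columns.
class MyRegressor extends HasDefaultEvaluator {
  override def defaultEvaluator: Evaluator =
    new RegressionEvaluator().setMetricName("rmse")
}
{code}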



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14606) Different maxBins value for categorical and continuous features in RandomForest implementation.

2016-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240058#comment-15240058
 ] 

Joseph K. Bradley commented on SPARK-14606:
---

We should choose a good way to support this without breaking the current API.  
Here's what I propose:
* maxBins: Keep this Param and stay close to the current behavior.  We can 
still use it to control discretization of continuous features, as well as 
to decide when to treat categorical features as ordered vs. unordered.
* maxCategories: New Param which sets the maximum number of categories, merely 
as a limit to keep the algorithm from blowing up.

Both can be marked as expertParams so that most users will know not to bother 
with them.

How does that sound?
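A sketch of what the two Params could look like (maxCategories is 
hypothetical, not an existing tree Param; the param API types are Spark ML's):

{code:title=MaxCategoriesSketch.scala|borderStyle=solid}
import org.apache.spark.ml.param.{IntParam, ParamValidators, Params}

trait TreeSplitParams extends Params {

  // Existing behavior: caps bins for continuous-feature discretization and
  // the ordered/unordered decision for categorical features.
  final val maxBins: IntParam =
    new IntParam(this, "maxBins", "max number of bins (>= 2)",
      ParamValidators.gtEq(2))

  // Proposed: a hard ceiling on category counts so that one very wide
  // categorical column cannot blow up training. Expert param.
  final val maxCategories: IntParam =
    new IntParam(this, "maxCategories", "max categories per feature (>= 2)",
      ParamValidators.gtEq(2))
}
{code}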

> Different maxBins value for categorical and continuous features in 
> RandomForest implementation.
> ---
>
> Key: SPARK-14606
> URL: https://issues.apache.org/jira/browse/SPARK-14606
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Rahul Tanwani
>Priority: Minor
>
> Currently the RandomForest algo takes a single maxBins value to decide the 
> number of splits to take. This sometimes causes training time to become very 
> high when there is a single categorical column with a sufficiently large 
> number of unique values. This single column impacts all the numeric 
> (continuous) columns even though such a high number of splits is not 
> required. 
> Encoding the categorical column into features makes the data very wide, 
> which requires us to increase maxMemoryInMB and puts more pressure on the 
> GC as well. 
> Keeping separate maxBins values for categorical and continuous features 
> would be useful in this regard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14606) Different maxBins value for categorical and continuous features in RandomForest implementation.

2016-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14606:
--
Fix Version/s: (was: 2.0.0)

> Different maxBins value for categorical and continuous features in 
> RandomForest implementation.
> ---
>
> Key: SPARK-14606
> URL: https://issues.apache.org/jira/browse/SPARK-14606
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Rahul Tanwani
>Priority: Minor
>
> Currently the RandomForest algo takes a single maxBins value to decide the 
> number of splits to take. This sometimes causes training time to become very 
> high when there is a single categorical column with a sufficiently large 
> number of unique values. This single column impacts all the numeric 
> (continuous) columns even though such a high number of splits is not 
> required. 
> Encoding the categorical column into features makes the data very wide, 
> which requires us to increase maxMemoryInMB and puts more pressure on the 
> GC as well. 
> Keeping separate maxBins values for categorical and continuous features 
> would be useful in this regard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14606) Different maxBins value for categorical and continuous features in RandomForest implementation.

2016-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14606:
--
Affects Version/s: (was: 1.6.1)
   (was: 1.5.2)
   (was: 1.6.0)

> Different maxBins value for categorical and continuous features in 
> RandomForest implementation.
> ---
>
> Key: SPARK-14606
> URL: https://issues.apache.org/jira/browse/SPARK-14606
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Rahul Tanwani
>Priority: Minor
>
> Currently the RandomForest algo takes a single maxBins value to decide the 
> number of splits to take. This sometimes causes training time to become very 
> high when there is a single categorical column with a sufficiently large 
> number of unique values. This single column impacts all the numeric 
> (continuous) columns even though such a high number of splits is not 
> required. 
> Encoding the categorical column into features makes the data very wide, 
> which requires us to increase maxMemoryInMB and puts more pressure on the 
> GC as well. 
> Keeping separate maxBins values for categorical and continuous features 
> would be useful in this regard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3

2016-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240048#comment-15240048
 ] 

Joseph K. Bradley commented on SPARK-10574:
---

[~yanboliang] Will you have time to work on this?  Thanks!

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Yanbo Liang
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that, following a hashing transform, 
> features lose human-tractable meaning, so a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts (e.g., a saved model) produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.
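To make the difference concrete, a small sketch of the two choices, reduced to 
a feature index the way HashingTF does (nonNegativeMod is inlined here for 
illustration):

{code:title=HashChoiceSketch.scala|borderStyle=solid}
import scala.util.hashing.MurmurHash3

// Map a raw hash to a bucket in [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

val numFeatures = 1 << 20
val term = "spark"

// Current choice: Scala's native ##, which delegates to hashCode and is
// not guaranteed stable across platforms.
val nativeIdx = nonNegativeMod(term.##, numFeatures)

// Proposed choice: MurmurHash3 from the standard Scala library, stable
// across platforms and with better collision behavior on long strings.
val murmurIdx = nonNegativeMod(MurmurHash3.stringHash(term), numFeatures)
{code}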



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14606) Different maxBins value for categorical and continuous features in RandomForest implementation.

2016-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14606:
--
Shepherd:   (was: Xiangrui Meng)

> Different maxBins value for categorical and continuous features in 
> RandomForest implementation.
> ---
>
> Key: SPARK-14606
> URL: https://issues.apache.org/jira/browse/SPARK-14606
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0, 1.6.1
>Reporter: Rahul Tanwani
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently the RandomForest algo takes a single maxBins value to decide the 
> number of splits to take. This sometimes causes training time to become very 
> high when there is a single categorical column with a sufficiently large 
> number of unique values. This single column impacts all the numeric 
> (continuous) columns even though such a high number of splits is not 
> required. 
> Encoding the categorical column into features makes the data very wide, 
> which requires us to increase maxMemoryInMB and puts more pressure on the 
> GC as well. 
> Keeping separate maxBins values for categorical and continuous features 
> would be useful in this regard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


