[jira] [Commented] (SPARK-18872) New test cases for EXISTS subquery
[ https://issues.apache.org/jira/browse/SPARK-18872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852674#comment-15852674 ] Apache Spark commented on SPARK-18872: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/16802 > New test cases for EXISTS subquery > -- > > Key: SPARK-18872 > URL: https://issues.apache.org/jira/browse/SPARK-18872 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Reporter: Nattavut Sutyanyong >Assignee: Dilip Biswal > Fix For: 2.2.0 > > > This JIRA is for submitting a PR for new EXISTS/NOT EXISTS subquery test > cases. It follows the same idea as the IN subquery test cases, which start with > simple patterns and then build more complex constructs on both the parent and > subquery sides. This batch of test cases is mostly, if not all, positive > test cases that do not return any syntax errors or unsupported functionality. > We make an effort to have test cases returning rows in the result set so that > they can indirectly detect incorrect-result problems. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
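The test pattern described in this issue (simple EXISTS/NOT EXISTS pairs whose results contain rows, so a correctness bug shows up as a wrong row count) can be illustrated with a minimal sketch using Python's sqlite3. The table and column names here are hypothetical, not taken from the Spark test suite.

```python
import sqlite3

# A simple parent/subquery pair of the kind the new test cases build on:
# EXISTS keeps parent rows with a matching child row, NOT EXISTS keeps the rest.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER, val TEXT);
    CREATE TABLE child  (pid INTEGER);
    INSERT INTO parent VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO child  VALUES (1), (3);
""")

exists_rows = conn.execute(
    "SELECT id FROM parent p WHERE EXISTS "
    "(SELECT 1 FROM child c WHERE c.pid = p.id) ORDER BY id"
).fetchall()

not_exists_rows = conn.execute(
    "SELECT id FROM parent p WHERE NOT EXISTS "
    "(SELECT 1 FROM child c WHERE c.pid = p.id) ORDER BY id"
).fetchall()

print(exists_rows)      # ids with a matching child: [(1,), (3,)]
print(not_exists_rows)  # ids without one: [(2,)]
```

Because both queries return rows, an incorrect-result bug in either rewrite would surface directly in the fetched lists, which is the point the issue description makes.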
[jira] [Assigned] (SPARK-19446) Remove unused findTightestCommonType in TypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-19446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-19446: --- Assignee: Hyukjin Kwon > Remove unused findTightestCommonType in TypeCoercion > > > Key: SPARK-19446 > URL: https://issues.apache.org/jira/browse/SPARK-19446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.2.0 > > > It seems the code below is no longer used. > {code} > /** >* Find the tightest common type of a set of types by continuously applying >* `findTightestCommonTypeOfTwo` on these types. >*/ > private def findTightestCommonType(types: Seq[DataType]): Option[DataType] = { > types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match { > case None => None > case Some(d) => findTightestCommonTypeOfTwo(d, c) > }) > } > {code}
[jira] [Resolved] (SPARK-19446) Remove unused findTightestCommonType in TypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-19446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19446. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16786 [https://github.com/apache/spark/pull/16786] > Remove unused findTightestCommonType in TypeCoercion > > > Key: SPARK-19446 > URL: https://issues.apache.org/jira/browse/SPARK-19446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Trivial > Fix For: 2.2.0 > > > It seems the code below is no longer used. > {code} > /** >* Find the tightest common type of a set of types by continuously applying >* `findTightestCommonTypeOfTwo` on these types. >*/ > private def findTightestCommonType(types: Seq[DataType]): Option[DataType] = { > types.foldLeft[Option[DataType]](Some(NullType))((r, c) => r match { > case None => None > case Some(d) => findTightestCommonTypeOfTwo(d, c) > }) > } > {code}
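The removed Scala helper folds a pairwise "tightest common type" function over a list of types, short-circuiting to None as soon as one pair has no common type. A minimal Python sketch of the same fold shape follows; the toy widening table stands in for Spark's actual `findTightestCommonTypeOfTwo` coercion rules, which are more elaborate.

```python
from functools import reduce

# Toy widening hierarchy standing in for Spark's type coercion table
# (an assumption for illustration only): "null" widens to anything,
# numeric types widen left-to-right.
WIDENING_ORDER = ["null", "int", "long", "double"]

def tightest_common_type_of_two(a, b):
    # Return the wider of the two types, or None if either is unknown
    # (mirroring the Option[DataType] result in the Scala code).
    if a not in WIDENING_ORDER or b not in WIDENING_ORDER:
        return None
    return max(a, b, key=WIDENING_ORDER.index)

def tightest_common_type(types):
    # Mirrors types.foldLeft[Option[DataType]](Some(NullType))(...):
    # once the accumulator is None, it stays None.
    return reduce(
        lambda acc, t: None if acc is None else tightest_common_type_of_two(acc, t),
        types,
        "null",
    )

print(tightest_common_type(["int", "long", "double"]))  # "double"
print(tightest_common_type(["int", "string"]))          # None
```

The fold structure is the part the Scala snippet defines; everything else here is a simplified stand-in.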
[jira] [Commented] (SPARK-19428) Ability to select first row of groupby
[ https://issues.apache.org/jira/browse/SPARK-19428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852582#comment-15852582 ] Takeshi Yamamuro commented on SPARK-19428: -- Thanks for the explanation! "df.select($"group").distinct()" is not enough for your case? > Ability to select first row of groupby > -- > > Key: SPARK-19428 > URL: https://issues.apache.org/jira/browse/SPARK-19428 > Project: Spark > Issue Type: Brainstorming > Components: SQL >Affects Versions: 2.1.0 >Reporter: Luke Miner >Priority: Minor > > It would be nice to be able to select the first row from {{GroupedData}}. > Pandas has something like this: > {{df.groupby('group').first()}} > It's especially handy if you can order the group as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
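The semantics the report asks for, pandas' `df.groupby('group').first()` with an ordering applied within each group, can be sketched in plain Python without Spark. The field names below are hypothetical.

```python
# "First row of each group, after ordering": sort by the ordering column,
# then keep the first row seen for each group key.
rows = [
    {"group": "a", "ts": 3, "val": "x"},
    {"group": "a", "ts": 1, "val": "y"},
    {"group": "b", "ts": 2, "val": "z"},
]

def first_row_per_group(rows, key, order):
    first = {}
    for row in sorted(rows, key=lambda r: r[order]):
        # setdefault only stores the first row encountered for each key.
        first.setdefault(row[key], row)
    return sorted(first.values(), key=lambda r: r[key])

print(first_row_per_group(rows, key="group", order="ts"))
# [{'group': 'a', 'ts': 1, 'val': 'y'}, {'group': 'b', 'ts': 2, 'val': 'z'}]
```

Note this keeps whole rows, which is why the suggested `df.select($"group").distinct()` is weaker: distinct returns only the group keys, not the other columns of the first row.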
[jira] [Commented] (SPARK-19353) Support binary I/O in PipedRDD
[ https://issues.apache.org/jira/browse/SPARK-19353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852498#comment-15852498 ] Sergei Lebedev commented on SPARK-19353: For reference: we have a fully backward-compatible [implementation|https://github.com/criteo-forks/spark/pull/26] of binary PipedRDD in our GitHub fork. > Support binary I/O in PipedRDD > -- > > Key: SPARK-19353 > URL: https://issues.apache.org/jira/browse/SPARK-19353 > Project: Spark > Issue Type: Improvement >Reporter: Sergei Lebedev >Priority: Minor > > The current design of RDD.pipe is very restrictive. > It is line-based: each element of the input RDD [gets > serialized|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala#L143] > into one or more lines. Similarly for the output of the child process, one > line corresponds to a single element of the output RDD. > It allows customizing the output format via {{printRDDElement}}, but not the > input format. > It is not designed for extensibility. The only way to get a "BinaryPipedRDD" > is to copy/paste most of it and change the relevant parts. > These limitations have been discussed on > [SO|http://stackoverflow.com/questions/27986830/how-to-pipe-binary-data-in-apache-spark] > and the mailing list, but alas no issue has been created. > A possible solution to at least the first two limitations is to factor out > the format into a separate object (or objects). For instance, {{InputWriter}} > and {{OutputReader}}, following the Hadoop streaming API. > {code} > trait InputWriter[T] { > def write(os: OutputStream, elem: T) > } > trait OutputReader[T] { > def read(is: InputStream): T > } > {code} > The default configuration would be to write and read in line-based format, > but users will also be able to swap in the appropriate implementations.
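The proposed InputWriter/OutputReader split can be sketched outside Spark with a small pipe helper. The class shapes mirror the Scala traits in the proposal; the concrete line-based implementations and the use of `cat` as the child process (a Unix assumption) are only illustrations.

```python
import subprocess

# The pipe format lives in two small objects instead of being hard-coded
# as line-based I/O; a binary variant would just swap in different
# writer/reader implementations.
class LineInputWriter:
    def write(self, stream, elem):
        stream.write((elem + "\n").encode("utf-8"))

class LineOutputReader:
    def read_all(self, stream):
        return [line.decode("utf-8").rstrip("\n") for line in stream]

def pipe(elements, command, writer, reader):
    # Feed every element to the child process, then read back its output.
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    for elem in elements:
        writer.write(proc.stdin, elem)
    proc.stdin.close()
    out = reader.read_all(proc.stdout)
    proc.wait()
    return out

result = pipe(["a", "b", "c"], ["cat"], LineInputWriter(), LineOutputReader())
print(result)  # ["a", "b", "c"]
```

The point, as in the issue, is that only the writer/reader pair would need to change to support binary framing; the piping machinery stays the same.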
[jira] [Assigned] (SPARK-13619) Jobs page UI shows wrong number of failed tasks
[ https://issues.apache.org/jira/browse/SPARK-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13619: Assignee: Apache Spark > Jobs page UI shows wrong number of failed tasks > --- > > Key: SPARK-13619 > URL: https://issues.apache.org/jira/browse/SPARK-13619 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Devaraj K >Assignee: Apache Spark >Priority: Minor > > In the Master and History Server UIs, the Jobs page shows the wrong number of failed > tasks. > http://X.X.X.X:8080/history/app-20160303024135-0001/jobs/ > h3. Completed Jobs (1) > ||Job Id||Description|| Submitted|| Duration|| Stages: > Succeeded/Total|| Tasks (for all stages): Succeeded/Total|| > |0 | saveAsTextFile at PipeLineTest.java:52| 2016/03/03 02:41:36 |16 s | > 2/2 | 100/100 (2 failed)| > \\ > \\ > When we go to the Job details page, we can see a different number of failed > tasks, and it is the correct number based on the actual failed tasks. > http://x.x.x.x:8080/history/app-20160303024135-0001/jobs/job/?id=0 > h3. Completed Stages (2) > ||Stage Id|| Description|| Submitted|| Duration|| Tasks: > Succeeded/Total||Input|| Output||Shuffle Read|| Shuffle > Write|| > |1| saveAsTextFile at PipeLineTest.java:52 +details|2016/03/03 02:41:51| > 1 s|50/50 (6 failed)| |7.6 KB|371.0 KB| | > |0| mapToPair at PipeLineTest.java:29 +details|2016/03/03 02:41:36| 15 s| > 50/50| 1521.7 MB| | | 371.0 KB|
[jira] [Commented] (SPARK-13619) Jobs page UI shows wrong number of failed tasks
[ https://issues.apache.org/jira/browse/SPARK-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852470#comment-15852470 ] Apache Spark commented on SPARK-13619: -- User 'devaraj-kavali' has created a pull request for this issue: https://github.com/apache/spark/pull/16801 > Jobs page UI shows wrong number of failed tasks > --- > > Key: SPARK-13619 > URL: https://issues.apache.org/jira/browse/SPARK-13619 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Devaraj K >Priority: Minor > > In Master and History Server UI's, Jobs page shows the wrong number of failed > tasks. > http://X.X.X.X:8080/history/app-20160303024135-0001/jobs/ > h3. Completed Jobs (1) > ||Job Id||Description|| Submitted|| Duration|| Stages: > Succeeded/Total|| Tasks (for all stages): Succeeded/Total|| > |0 | saveAsTextFile at PipeLineTest.java:52| 2016/03/03 02:41:36 |16 s | > 2/2 | 100/100 (2 failed)| > \\ > \\ > When we go to the Job details page, we can see different number for failed > tasks and It is the correct number based on the failed tasks. > http://x.x.x.x:8080/history/app-20160303024135-0001/jobs/job/?id=0 > h3. Completed Stages (2) > ||Stage Id|| Description|| Submitted|| Duration|| Tasks: > Succeeded/Total||Input|| Output||Shuffle Read|| Shuffle > Write|| > |1| saveAsTextFile at PipeLineTest.java:52 +details|2016/03/03 02:41:51| > 1 s|50/50 (6 failed)| |7.6 KB|371.0 KB| | > |0| mapToPair at PipeLineTest.java:29 +details|2016/03/03 02:41:36| 15 s| > 50/50| 1521.7 MB| | | 371.0 KB| -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13619) Jobs page UI shows wrong number of failed tasks
[ https://issues.apache.org/jira/browse/SPARK-13619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13619: Assignee: (was: Apache Spark) > Jobs page UI shows wrong number of failed tasks > --- > > Key: SPARK-13619 > URL: https://issues.apache.org/jira/browse/SPARK-13619 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Devaraj K >Priority: Minor > > In Master and History Server UI's, Jobs page shows the wrong number of failed > tasks. > http://X.X.X.X:8080/history/app-20160303024135-0001/jobs/ > h3. Completed Jobs (1) > ||Job Id||Description|| Submitted|| Duration|| Stages: > Succeeded/Total|| Tasks (for all stages): Succeeded/Total|| > |0 | saveAsTextFile at PipeLineTest.java:52| 2016/03/03 02:41:36 |16 s | > 2/2 | 100/100 (2 failed)| > \\ > \\ > When we go to the Job details page, we can see different number for failed > tasks and It is the correct number based on the failed tasks. > http://x.x.x.x:8080/history/app-20160303024135-0001/jobs/job/?id=0 > h3. Completed Stages (2) > ||Stage Id|| Description|| Submitted|| Duration|| Tasks: > Succeeded/Total||Input|| Output||Shuffle Read|| Shuffle > Write|| > |1| saveAsTextFile at PipeLineTest.java:52 +details|2016/03/03 02:41:51| > 1 s|50/50 (6 failed)| |7.6 KB|371.0 KB| | > |0| mapToPair at PipeLineTest.java:29 +details|2016/03/03 02:41:36| 15 s| > 50/50| 1521.7 MB| | | 371.0 KB| -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19456) Add LinearSVC R API
[ https://issues.apache.org/jira/browse/SPARK-19456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852450#comment-15852450 ] Apache Spark commented on SPARK-19456: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/16800 > Add LinearSVC R API > --- > > Key: SPARK-19456 > URL: https://issues.apache.org/jira/browse/SPARK-19456 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Miao Wang > > Linear SVM classifier is newly added into ML and python API has been added. > This JIRA is to add R side API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19386) Bisecting k-means in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-19386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852449#comment-15852449 ] Apache Spark commented on SPARK-19386: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/16799 > Bisecting k-means in SparkR documentation > - > > Key: SPARK-19386 > URL: https://issues.apache.org/jira/browse/SPARK-19386 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Krishna Kalyan > Fix For: 2.2.0 > > > we need updates to programming guide, example and vignettes -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19456) Add LinearSVC R API
[ https://issues.apache.org/jira/browse/SPARK-19456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19456: Assignee: (was: Apache Spark) > Add LinearSVC R API > --- > > Key: SPARK-19456 > URL: https://issues.apache.org/jira/browse/SPARK-19456 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Miao Wang > > Linear SVM classifier is newly added into ML and python API has been added. > This JIRA is to add R side API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19456) Add LinearSVC R API
[ https://issues.apache.org/jira/browse/SPARK-19456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19456: Assignee: Apache Spark > Add LinearSVC R API > --- > > Key: SPARK-19456 > URL: https://issues.apache.org/jira/browse/SPARK-19456 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Miao Wang >Assignee: Apache Spark > > Linear SVM classifier is newly added into ML and python API has been added. > This JIRA is to add R side API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19456) Add LinearSVC R API
Miao Wang created SPARK-19456: - Summary: Add LinearSVC R API Key: SPARK-19456 URL: https://issues.apache.org/jira/browse/SPARK-19456 Project: Spark Issue Type: New Feature Components: SparkR Affects Versions: 2.2.0 Reporter: Miao Wang The linear SVM classifier was newly added to ML and a Python API has been added. This JIRA is to add the R-side API.
[jira] [Commented] (SPARK-18873) New test cases for scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-18873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852439#comment-15852439 ] Apache Spark commented on SPARK-18873: -- User 'nsyca' has created a pull request for this issue: https://github.com/apache/spark/pull/16798 > New test cases for scalar subquery > -- > > Key: SPARK-18873 > URL: https://issues.apache.org/jira/browse/SPARK-18873 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Reporter: Nattavut Sutyanyong > > This JIRA is for submitting a PR for new test cases on scalar subquery. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
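The scalar-subquery shape these new test cases target (a subquery returning exactly one value, used as an expression in the parent query) can be illustrated with a minimal sqlite3 sketch; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, 30), (3, 20);
""")

# The parenthesized subquery yields a single scalar (the average amount),
# which the parent query compares each row against.
above_avg = conn.execute(
    "SELECT id FROM orders "
    "WHERE amount > (SELECT AVG(amount) FROM orders) ORDER BY id"
).fetchall()
print(above_avg)  # the average is 20, so only id 2 qualifies
```

As with the EXISTS batch, a query that returns rows lets the test indirectly detect incorrect-result bugs rather than only syntax or support errors.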
[jira] [Updated] (SPARK-19418) Dataset generated java code fails to compile as java.lang.Long does not accept UTF8String in constructor
[ https://issues.apache.org/jira/browse/SPARK-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suresh Avadhanula updated SPARK-19418: -- Priority: Minor (was: Major) > Dataset generated java code fails to compile as java.lang.Long does not > accept UTF8String in constructor > > > Key: SPARK-19418 > URL: https://issues.apache.org/jira/browse/SPARK-19418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Suresh Avadhanula >Priority: Minor > Attachments: encodertest.zip > > > I have the following in Java spark driver. > DealerPerson module object is > {code:title=DealerPerson.java|borderStyle=solid} > public class DealerPerson > { > Long schemaOrgUnitId ; > List personList > } > {code} > I populate it using group by as follows. > {code} > Dataset dps = persondds.groupByKey(new MapFunctionLong>() { > @Override > public Long call(Person person) throws Exception { > return person.getSchemaOrgUnitId(); > } > }, Encoders.LONG()). > mapGroups(new MapGroupsFunction () > { > @Override > public DealerPerson call(Long dp, > java.util.Iterator iterator) throws Exception { > DealerPerson retDp = new DealerPerson(); > retDp.setSchemaOrgUnitId(dp); > ArrayList persons = new ArrayList(); > while (iterator.hasNext()) > persons.add(iterator.next()); > retDp.setPersons(persons); > return retDp; > } > }, Encoders.bean(DealerPerson.class)); > {code} > The generated code throws compiler exception since UTF8String is > java.lang.Long() > {noformat} > 7/01/31 20:32:28 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 5, > localhost, executor driver, partition 0, PROCESS_LOCAL, 6442 bytes) > 17/01/31 20:32:28 INFO Executor: Running task 0.0 in stage 5.0 (TID 5) > 17/01/31 20:32:28 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 58: No applicable constructor/method found for actual parameters > "org.apache.spark.unsafe.types.UTF8String"; 
candidates are: > "java.lang.Long(long)", "java.lang.Long(java.lang.String)" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private InternalRow mutableRow; > /* 009 */ private com.xtime.spark.model.Person javaBean; > /* 010 */ private boolean resultIsNull; > /* 011 */ private UTF8String argValue; > /* 012 */ private boolean resultIsNull1; > /* 013 */ private UTF8String argValue1; > /* 014 */ private boolean resultIsNull2; > /* 015 */ private UTF8String argValue2; > /* 016 */ private boolean resultIsNull3; > /* 017 */ private UTF8String argValue3; > /* 018 */ private boolean resultIsNull4; > /* 019 */ private UTF8String argValue4; > /* 020 */ > /* 021 */ public SpecificSafeProjection(Object[] references) { > /* 022 */ this.references = references; > /* 023 */ mutableRow = (InternalRow) references[references.length - 1]; > /* 024 */ > /* 025 */ > /* 026 */ > /* 027 */ > /* 028 */ > /* 029 */ > /* 030 */ > /* 031 */ > /* 032 */ > /* 033 */ > /* 034 */ > /* 035 */ > /* 036 */ } > /* 037 */ > /* 038 */ public void initialize(int partitionIndex) { > /* 039 */ > /* 040 */ } > /* 041 */ > /* 042 */ > /* 043 */ private void apply_4(InternalRow i) { > /* 044 */ > /* 045 */ > /* 046 */ resultIsNull1 = false; > /* 047 */ if (!resultIsNull1) { > /* 048 */ > /* 049 */ boolean isNull21 = i.isNullAt(17); > /* 050 */ UTF8String value21 = isNull21 ? null : (i.getUTF8String(17)); > /* 051 */ resultIsNull1 = isNull21; > /* 052 */ argValue1 = value21; > /* 053 */ } > /* 054 */ > /* 055 */ > /* 056 */ final java.lang.Long value20 = resultIsNull1 ? 
null : new > java.lang.Long(argValue1); > /* 057 */ javaBean.setSchemaOrgUnitId(value20); > /* 058 */ > /* 059 */ > /* 060 */ resultIsNull2 = false; > /* 061 */ if (!resultIsNull2) { > /* 062 */ > /* 063 */ boolean isNull23 = i.isNullAt(0); > /* 064 */ UTF8String value23 = isNull23 ? null : (i.getUTF8String(0)); > /* 065 */ resultIsNull2 = isNull23; > /* 066 */ argValue2 = value23; > /* 067 */ } > /*
[jira] [Commented] (SPARK-19418) Dataset generated java code fails to compile as java.lang.Long does not accept UTF8String in constructor
[ https://issues.apache.org/jira/browse/SPARK-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852416#comment-15852416 ] Suresh Avadhanula commented on SPARK-19418: --- Figured out the "workaround". The solution is based on a [StackOverflow thread|http://stackoverflow.com/questions/41995843/spark-compileexception-in-dataset-groupbykey]. The following works: {code:title=Reading from CSV} Dataset personDs = sqlContext .read() .option("header", true) .option("inferSchema", true) .csv("person10k.csv") .as(Encoders.bean(Person.class)); personDs.printSchema(); personDs.show(10); {code} option("inferSchema", true) and Encoders.bean() seem counterintuitive. > Dataset generated java code fails to compile as java.lang.Long does not > accept UTF8String in constructor > > > Key: SPARK-19418 > URL: https://issues.apache.org/jira/browse/SPARK-19418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Suresh Avadhanula > Attachments: encodertest.zip > > > I have the following in Java spark driver. > DealerPerson module object is > {code:title=DealerPerson.java|borderStyle=solid} > public class DealerPerson > { > Long schemaOrgUnitId ; > List personList > } > {code} > I populate it using group by as follows. > {code} > Dataset dps = persondds.groupByKey(new MapFunctionLong>() { > @Override > public Long call(Person person) throws Exception { > return person.getSchemaOrgUnitId(); > } > }, Encoders.LONG()).
> mapGroups(new MapGroupsFunction () > { > @Override > public DealerPerson call(Long dp, > java.util.Iterator iterator) throws Exception { > DealerPerson retDp = new DealerPerson(); > retDp.setSchemaOrgUnitId(dp); > ArrayList persons = new ArrayList(); > while (iterator.hasNext()) > persons.add(iterator.next()); > retDp.setPersons(persons); > return retDp; > } > }, Encoders.bean(DealerPerson.class)); > {code} > The generated code throws compiler exception since UTF8String is > java.lang.Long() > {noformat} > 7/01/31 20:32:28 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 5, > localhost, executor driver, partition 0, PROCESS_LOCAL, 6442 bytes) > 17/01/31 20:32:28 INFO Executor: Running task 0.0 in stage 5.0 (TID 5) > 17/01/31 20:32:28 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 58: No applicable constructor/method found for actual parameters > "org.apache.spark.unsafe.types.UTF8String"; candidates are: > "java.lang.Long(long)", "java.lang.Long(java.lang.String)" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private InternalRow mutableRow; > /* 009 */ private com.xtime.spark.model.Person javaBean; > /* 010 */ private boolean resultIsNull; > /* 011 */ private UTF8String argValue; > /* 012 */ private boolean resultIsNull1; > /* 013 */ private UTF8String argValue1; > /* 014 */ private boolean resultIsNull2; > /* 015 */ private UTF8String argValue2; > /* 016 */ private boolean resultIsNull3; > /* 017 */ private UTF8String argValue3; > /* 018 */ private boolean resultIsNull4; > /* 019 */ private UTF8String argValue4; > /* 020 */ > /* 021 */ public SpecificSafeProjection(Object[] references) { 
> /* 022 */ this.references = references; > /* 023 */ mutableRow = (InternalRow) references[references.length - 1]; > /* 024 */ > /* 025 */ > /* 026 */ > /* 027 */ > /* 028 */ > /* 029 */ > /* 030 */ > /* 031 */ > /* 032 */ > /* 033 */ > /* 034 */ > /* 035 */ > /* 036 */ } > /* 037 */ > /* 038 */ public void initialize(int partitionIndex) { > /* 039 */ > /* 040 */ } > /* 041 */ > /* 042 */ > /* 043 */ private void apply_4(InternalRow i) { > /* 044 */ > /* 045 */ > /* 046 */ resultIsNull1 = false; > /* 047 */ if (!resultIsNull1) { > /* 048 */ > /* 049 */ boolean isNull21 = i.isNullAt(17); > /* 050 */ UTF8String value21 = isNull21 ? null : (i.getUTF8String(17)); > /* 051 */ resultIsNull1 = isNull21; > /* 052 */
[jira] [Commented] (SPARK-19455) Add option for case-insensitive Parquet field resolution
[ https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852242#comment-15852242 ] Apache Spark commented on SPARK-19455: -- User 'budde' has created a pull request for this issue: https://github.com/apache/spark/pull/16797 > Add option for case-insensitive Parquet field resolution > > > Key: SPARK-19455 > URL: https://issues.apache.org/jira/browse/SPARK-19455 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde > > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization as the underlying file status no longer needs to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > This change initially included a [patch to > ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284] > that attempted to remedy this conflict by using a case-insensitive fallback > mapping when resolving field names during the schema clipping step. > [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed > this patch after > [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support > for embedding a case-sensitive schema as a Hive Metastore table property.
> AFAIK the assumption here was that the data schema obtained from the > Metastore table property will be case-sensitive and should match the Parquet > schema exactly. > The problem arises when dealing with Parquet-backed tables for which this > schema has not been embedded as a table attribute and for which the > underlying files contain case-sensitive field names. This will happen for any > Hive table that was not created by Spark or was created by a version prior to > 2.1.0. We've seen Spark SQL return no results for any query containing a > case-sensitive field name for such tables. > The change we're proposing is to introduce a configuration parameter that > will re-enable case-insensitive field name resolution in ParquetReadSupport. > This option will also disable filter push-down for Parquet, as the filter > predicate constructed by Spark SQL contains the case-insensitive field names, > for which Parquet will return 0 records when filtering against a > case-sensitive column name. I was hoping to find a way to construct the > filter on-the-fly in ParquetReadSupport but Parquet doesn't propagate the > Configuration object passed to this class to the underlying > InternalParquetRecordReader class.
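The case-insensitive fallback mapping described in this issue can be sketched as a small resolution function: try an exact (case-sensitive) match against the Parquet field names first, then fall back to a case-insensitive lookup, giving up when the fallback is ambiguous. The function name and signature are hypothetical, not Spark's actual API.

```python
def resolve_field(requested, parquet_fields):
    # Exact, case-sensitive match wins (the behaviour without the option).
    if requested in parquet_fields:
        return requested
    # Case-insensitive fallback, as the proposed option would enable.
    matches = [f for f in parquet_fields if f.lower() == requested.lower()]
    if len(matches) == 1:
        return matches[0]
    # Zero matches, or an ambiguous pair like "userId"/"userid": unresolved.
    return None

parquet_fields = ["userId", "eventTime", "payload"]
print(resolve_field("userid", parquet_fields))   # "userId"
print(resolve_field("payload", parquet_fields))  # "payload"
print(resolve_field("missing", parquet_fields))  # None
```

This also illustrates why the option must disable Parquet filter push-down: a pushed-down predicate built from the lower-cased metastore name ("userid") would match zero records in a file whose column is actually "userId".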
[jira] [Assigned] (SPARK-19455) Add option for case-insensitive Parquet field resolution
[ https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19455: Assignee: (was: Apache Spark) > Add option for case-insensitive Parquet field resolution > > > Key: SPARK-19455 > URL: https://issues.apache.org/jira/browse/SPARK-19455 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde > > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization as the underlying file status no longer needs to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > This change initially included a [patch to > ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284] > that attempted to remedy this conflict by using a case-insensitive fallback > mapping when resolving field names during the schema clipping step. > [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed > this patch after > [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support > for embedding a case-sensitive schema as a Hive Metastore table property. > AFAIK the assumption here was that the data schema obtained from the > Metastore table property will be case-sensitive and should match the Parquet > schema exactly.
> The problem arises when dealing with Parquet-backed tables for which this > schema has not been embedded as a table attribute and for which the > underlying files contain case-sensitive field names. This will happen for any > Hive table that was not created by Spark or was created by a version prior to > 2.1.0. We've seen Spark SQL return no results for any query containing a > case-sensitive field name for such tables. > The change we're proposing is to introduce a configuration parameter that > will re-enable case-insensitive field name resolution in ParquetReadSupport. > This option will also disable filter push-down for Parquet, as the filter > predicate constructed by Spark SQL contains the case-insensitive field names, > for which Parquet will return 0 records when filtering against a > case-sensitive column name. I was hoping to find a way to construct the > filter on-the-fly in ParquetReadSupport but Parquet doesn't propagate the > Configuration object passed to this class to the underlying > InternalParquetRecordReader class.
[jira] [Assigned] (SPARK-19455) Add option for case-insensitive Parquet field resolution
[ https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19455: Assignee: Apache Spark > Add option for case-insensitive Parquet field resolution > > > Key: SPARK-19455 > URL: https://issues.apache.org/jira/browse/SPARK-19455 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde >Assignee: Apache Spark > > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This is an > optimization, as the underlying file statuses no longer need to be resolved > until after the partition pruning step, significantly reducing the number of > files touched in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > This change initially included a [patch to > ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284] > that attempted to remedy this conflict by using a case-insensitive fallback > mapping when resolving field names during the schema clipping step. > [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed > this patch after > [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support > for embedding a case-sensitive schema as a Hive Metastore table property. > AFAIK the assumption here was that the data schema obtained from the > Metastore table property will be case sensitive and should match the Parquet > schema exactly. 
> The problem arises when dealing with Parquet-backed tables for which this > schema has not been embedded as a table attribute and for which the > underlying files contain case-sensitive field names. This will happen for any > Hive table that was not created by Spark or was created by a version prior to > 2.1.0. We've seen Spark SQL return no results for any query containing a > case-sensitive field name for such tables. > The change we're proposing is to introduce a configuration parameter that > will re-enable case-insensitive field name resolution in ParquetReadSupport. > This option will also disable filter push-down for Parquet, as the filter > predicate constructed by Spark SQL contains the case-insensitive field names, > for which Parquet will return 0 records when filtering against a > case-sensitive column name. I was hoping to find a way to construct the > filter on-the-fly in ParquetReadSupport, but Parquet doesn't propagate the > Configuration object passed to this class to the underlying > InternalParquetRecordReader class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19455) Add option for case-insensitive Parquet field resolution
Adam Budde created SPARK-19455: -- Summary: Add option for case-insensitive Parquet field resolution Key: SPARK-19455 URL: https://issues.apache.org/jira/browse/SPARK-19455 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Adam Budde [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the schema inference from the HiveMetastoreCatalog class when converting a MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in favor of simply using the schema returned by the metastore. This is an optimization, as the underlying file statuses no longer need to be resolved until after the partition pruning step, significantly reducing the number of files touched in some cases. The downside is that the data schema used may no longer match the underlying file schema for case-sensitive formats such as Parquet. This change initially included a [patch to ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284] that attempted to remedy this conflict by using a case-insensitive fallback mapping when resolving field names during the schema clipping step. [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed this patch after [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support for embedding a case-sensitive schema as a Hive Metastore table property. AFAIK the assumption here was that the data schema obtained from the Metastore table property will be case sensitive and should match the Parquet schema exactly. The problem arises when dealing with Parquet-backed tables for which this schema has not been embedded as a table attribute and for which the underlying files contain case-sensitive field names. This will happen for any Hive table that was not created by Spark or was created by a version prior to 2.1.0. 
We've seen Spark SQL return no results for any query containing a case-sensitive field name for such tables. The change we're proposing is to introduce a configuration parameter that will re-enable case-insensitive field name resolution in ParquetReadSupport. This option will also disable filter push-down for Parquet, as the filter predicate constructed by Spark SQL contains the case-insensitive field names, for which Parquet will return 0 records when filtering against a case-sensitive column name. I was hoping to find a way to construct the filter on-the-fly in ParquetReadSupport, but Parquet doesn't propagate the Configuration object passed to this class to the underlying InternalParquetRecordReader class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
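The case-insensitive fallback mapping described above can be illustrated with a small, self-contained sketch. This is plain Python, not the actual ParquetReadSupport code; the function and its behavior on ambiguous names are illustrative only:

```python
def resolve_field(requested, file_fields):
    """Resolve a requested (typically lower-cased) field name against the
    case-sensitive field names found in a Parquet file."""
    # An exact match wins, mirroring ordinary case-sensitive resolution.
    if requested in file_fields:
        return requested
    # Case-insensitive fallback: index the file's fields by lower-cased name.
    index = {}
    for name in file_fields:
        index.setdefault(name.lower(), []).append(name)
    matches = index.get(requested.lower(), [])
    if len(matches) > 1:
        # Two file columns that differ only by case cannot be disambiguated.
        raise ValueError("ambiguous field %r: %s" % (requested, matches))
    # Fall back to the requested name if the file has no such column at all.
    return matches[0] if matches else requested

print(resolve_field("userid", ["userId", "name"]))  # userId
```

This shows why the fallback only helps when the file schema has no case-only duplicates: Parquet itself is case-sensitive, so the resolver must pick exactly one concrete file column.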
[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter
[ https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852170#comment-15852170 ] Apache Spark commented on SPARK-10063: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/16796 > Remove DirectParquetOutputCommitter > --- > > Key: SPARK-10063 > URL: https://issues.apache.org/jira/browse/SPARK-10063 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Reynold Xin >Priority: Critical > Fix For: 2.0.0 > > > When we use DirectParquetOutputCommitter on S3 and speculation is enabled, > there is a chance that we can lose data. > Here is the code to reproduce the problem. > {code} > import org.apache.spark.sql.functions._ > val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: > Int, partitionId: Int, attemptNumber: Int) => { > if (partitionId == 0 && i == 5) { > if (attemptNumber > 0) { > Thread.sleep(15000) > throw new Exception("new exception") > } else { > Thread.sleep(1) > } > } > > i > }) > val df = sc.parallelize((1 to 100), 20).mapPartitions { iter => > val context = org.apache.spark.TaskContext.get() > val partitionId = context.partitionId > val attemptNumber = context.attemptNumber > iter.map(i => (i, partitionId, attemptNumber)) > }.toDF("i", "partitionId", "attemptNumber") > df > .select(failSpeculativeTask($"i", $"partitionId", > $"attemptNumber").as("i"), $"partitionId", $"attemptNumber") > .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter") > sqlContext.read.load("/home/yin/outputCommitter").count > // The result is 99 and 5 is missing from the output. > {code} > What happened is that the original task finishes first and uploads its output > file to S3, then the speculative task somehow fails. 
Because we have to call the > output stream's close method, which uploads data to S3, we actually upload > the partial result generated by the failed speculative task to S3, and this > file overwrites the correct file generated by the original task. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
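The race described above can be modeled without S3 at all: with a direct committer, both the original and the speculative attempt write straight to the same final key, so whichever attempt's close() lands last wins. This is a toy model of the failure mode, not the committer's real code:

```python
# Toy model of an object store: final output keys map to file contents.
store = {}

def direct_commit(key, data):
    # A "direct" committer has no staging directory or rename step:
    # closing the stream uploads whatever bytes the attempt produced,
    # directly to the final location.
    store[key] = data

# The original attempt finishes first with the complete partition.
direct_commit("part-00000", [1, 2, 3, 4, 5])

# The speculative attempt fails mid-write, but its close() still
# uploads a partial result to the very same key, clobbering the good file.
direct_commit("part-00000", [1, 2])

print(store["part-00000"])  # [1, 2] -- the partial result survived
```

A rename-based committer avoids this because each attempt writes to an attempt-specific staging path and only the winning attempt's file is renamed into place.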
[jira] [Assigned] (SPARK-17161) Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
[ https://issues.apache.org/jira/browse/SPARK-17161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-17161: --- Assignee: Bryan Cutler Affects Version/s: 2.2.0 > Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays > - > > Key: SPARK-17161 > URL: https://issues.apache.org/jira/browse/SPARK-17161 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > Fix For: 2.2.0 > > > Often in Spark ML, there are classes that use a Scala Array in a constructor. > In order to add the same API to Python, a Java-friendly alternate > constructor needs to exist to be compatible with py4j when converting from a > list. This is because the current conversion in PySpark _py2java creates a > java.util.ArrayList, as shown in this error message > {noformat} > Py4JError: An error occurred while calling > None.org.apache.spark.ml.feature.CountVectorizerModel. Trace: > py4j.Py4JException: Constructor > org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) > does not exist > at > py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179) > at > py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196) > at py4j.Gateway.invoke(Gateway.java:235) > {noformat} > Creating an alternate constructor can be avoided by creating a py4j JavaArray > using {{new_array}}. This type is compatible with the Scala Array currently > used in classes like {{CountVectorizerModel}} and {{StringIndexerModel}}. > Most of the boilerplate Python code to do this can be put in a convenience > function inside of ml.JavaWrapper to give a clean way of constructing ML > objects without adding special constructors. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
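The {{new_array}} pattern can be sketched as a small helper. The stub gateway below only imitates the one slice of the py4j API the helper needs; real code would use the JVM gateway held by the SparkContext and a real Java class reference, so both `FakeGateway` and the helper name are illustrative:

```python
class FakeGateway:
    """Stub imitating py4j's gateway.new_array(java_class, length),
    which allocates a fixed-size, typed Java array."""
    def new_array(self, java_class, length):
        return [None] * length

def to_java_array(gateway, java_class, values):
    # Allocate a typed, fixed-size Java array and copy the Python list in,
    # instead of letting py4j auto-convert the list to a java.util.ArrayList
    # (which constructors taking Array[...] cannot accept).
    arr = gateway.new_array(java_class, len(values))
    for i, v in enumerate(values):
        arr[i] = v
    return arr

arr = to_java_array(FakeGateway(), "java.lang.String", ["a", "b", "c"])
print(arr)  # ['a', 'b', 'c']
```

The point of centralizing this in JavaWrapper is that each ML class can keep its existing Array-taking constructor while Python callers still pass ordinary lists.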
[jira] [Commented] (SPARK-19409) Upgrade Parquet to 1.8.2
[ https://issues.apache.org/jira/browse/SPARK-19409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852090#comment-15852090 ] Apache Spark commented on SPARK-19409: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/16795 > Upgrade Parquet to 1.8.2 > > > Key: SPARK-19409 > URL: https://issues.apache.org/jira/browse/SPARK-19409 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.2.0 > > > Apache Parquet 1.8.2 was officially released last week, on 26 Jan. > This issue aims to bump the Parquet version to 1.8.2 since it includes many fixes. > https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142@%3Cdev.parquet.apache.org%3E -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios
[ https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852085#comment-15852085 ] Kay Ousterhout commented on SPARK-19326: I see, that makes sense; thanks for the additional explanation. [~andrewor14] did you think about this issue when implementing dynamic allocation originally? I noticed there's a [comment saying that speculation is not considered for simplicity](https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L579), but it does seem like this functionality can prevent speculation from occurring. > Speculated task attempts do not get launched in few scenarios > - > > Key: SPARK-19326 > URL: https://issues.apache.org/jira/browse/SPARK-19326 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.2, 2.1.0 >Reporter: Tejas Patil > > Speculated copies of tasks do not get launched in some cases. > Examples: > - All the running executors have no CPU slots left to accommodate a > speculated copy of the task(s). If all the running executors reside on a > set of slow / bad hosts, they will keep the job running for a long time > - `spark.task.cpus` > 1 and the running executor has not filled up all its > CPU slots, since the [speculated copies of tasks should run on a different > host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283] > and not on the host where the first copy was launched. > In both these cases, `ExecutorAllocationManager` does not know about pending > speculative task attempts and thinks that all the resource demands are well > taken care of. 
([relevant > code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265]) > This adds variation in job completion times and, more importantly, SLA > misses :( In prod, with a large number of jobs, I see this happening more > often than one would think. Chasing the bad hosts or the reason for slowness > doesn't scale. > Here is a tiny repro. Note that you need to launch this with (Mesos or YARN > or standalone deploy mode) along with `--conf spark.speculation=true --conf > spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100` > {code} > val n = 100 > val someRDD = sc.parallelize(1 to n, n) > someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { > if (index == 1) { > Thread.sleep(Long.MaxValue) // fake long running task(s) > } > it.toList.map(x => index + ", " + x).iterator > }).collect > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
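The gap can be seen in a back-of-the-envelope version of the allocation math. This is illustrative only; the real logic lives in ExecutorAllocationManager, and the function below is not Spark code:

```python
def max_executors_needed(pending, running, speculative, tasks_per_executor):
    """Ceiling of total outstanding task attempts over slots per executor.

    If speculative attempts are counted, the manager asks for extra capacity;
    if they are ignored (speculative=0), demand already looks fully satisfied
    and no new executors are requested, starving the speculative copies.
    """
    total = pending + running + speculative
    return -(-total // tasks_per_executor)  # ceiling division

# One slow task left, one speculative copy wanted, one task slot per executor.
print(max_executors_needed(0, 1, 0, 1))  # 1: demand looks satisfied
print(max_executors_needed(0, 1, 1, 1))  # 2: counting the speculative copy
```

With the first formula the manager never scales up, so when existing slots are full (or the only free slots are on the same host as the original attempt), the speculative copy has nowhere to run.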
[jira] [Assigned] (SPARK-19452) Fix bug in the name assignment method in SparkR
[ https://issues.apache.org/jira/browse/SPARK-19452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19452: Assignee: (was: Apache Spark) > Fix bug in the name assignment method in SparkR > --- > > Key: SPARK-19452 > URL: https://issues.apache.org/jira/browse/SPARK-19452 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0, 2.2.0 >Reporter: Wayne Zhang > > The names method fails to check the validity of the assignment values. This > can be fixed by calling colnames within names. See the example below. > {code} > df <- suppressWarnings(createDataFrame(iris)) > # this is an error > colnames(df) <- NULL > # this should report an error > names(df) <- NULL > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19452) Fix bug in the name assignment method in SparkR
[ https://issues.apache.org/jira/browse/SPARK-19452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19452: Assignee: Apache Spark > Fix bug in the name assignment method in SparkR > --- > > Key: SPARK-19452 > URL: https://issues.apache.org/jira/browse/SPARK-19452 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0, 2.2.0 >Reporter: Wayne Zhang >Assignee: Apache Spark > > The names method fails to check the validity of the assignment values. This > can be fixed by calling colnames within names. See the example below. > {code} > df <- suppressWarnings(createDataFrame(iris)) > # this is an error > colnames(df) <- NULL > # this should report an error > names(df) <- NULL > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19452) Fix bug in the name assignment method in SparkR
[ https://issues.apache.org/jira/browse/SPARK-19452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852069#comment-15852069 ] Apache Spark commented on SPARK-19452: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/16794 > Fix bug in the name assignment method in SparkR > --- > > Key: SPARK-19452 > URL: https://issues.apache.org/jira/browse/SPARK-19452 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0, 2.2.0 >Reporter: Wayne Zhang > > The names method fails to check the validity of the assignment values. This > can be fixed by calling colnames within names. See the example below. > {code} > df <- suppressWarnings(createDataFrame(iris)) > # this is an error > colnames(df) <- NULL > # this should report an error > names(df) <- NULL > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19386) Bisecting k-means in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-19386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung reassigned SPARK-19386: Assignee: Krishna Kalyan (was: Miao Wang) > Bisecting k-means in SparkR documentation > - > > Key: SPARK-19386 > URL: https://issues.apache.org/jira/browse/SPARK-19386 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Krishna Kalyan > Fix For: 2.2.0 > > > we need updates to the programming guide, examples, and vignettes -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19386) Bisecting k-means in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-19386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-19386. -- Resolution: Fixed Fix Version/s: 2.2.0 > Bisecting k-means in SparkR documentation > - > > Key: SPARK-19386 > URL: https://issues.apache.org/jira/browse/SPARK-19386 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Krishna Kalyan > Fix For: 2.2.0 > > > we need updates to the programming guide, examples, and vignettes -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19454) Improve DataFrame.replace API
[ https://issues.apache.org/jira/browse/SPARK-19454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19454: Assignee: (was: Apache Spark) > Improve DataFrame.replace API > - > > Key: SPARK-19454 > URL: https://issues.apache.org/jira/browse/SPARK-19454 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > The current implementation suffers from the following issues: > - It is possible to use {{dict}} as {{to_replace}}, but we cannot skip the > {{value}} argument or pass {{None}} for it (although it is ignored). This requires > passing "magic" values: > {code} > df = sc.parallelize([("Alice", 1, 3.0)]).toDF() > df.replace({"Alice": "Bob"}, 1) > {code} > - The code doesn't check whether the provided types are correct. This can lead to > an exception in Py4j (harder to diagnose): > {code} > df.replace({"Alice": 1}, 1) > {code} > or silent failures (with the bundled Py4j version): > {code} > df.replace({1: 2, 3.0: 4.1, "a": "b"}, 1) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19454) Improve DataFrame.replace API
[ https://issues.apache.org/jira/browse/SPARK-19454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19454: Assignee: Apache Spark > Improve DataFrame.replace API > - > > Key: SPARK-19454 > URL: https://issues.apache.org/jira/browse/SPARK-19454 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark > > The current implementation suffers from the following issues: > - It is possible to use {{dict}} as {{to_replace}}, but we cannot skip the > {{value}} argument or pass {{None}} for it (although it is ignored). This requires > passing "magic" values: > {code} > df = sc.parallelize([("Alice", 1, 3.0)]).toDF() > df.replace({"Alice": "Bob"}, 1) > {code} > - The code doesn't check whether the provided types are correct. This can lead to > an exception in Py4j (harder to diagnose): > {code} > df.replace({"Alice": 1}, 1) > {code} > or silent failures (with the bundled Py4j version): > {code} > df.replace({1: 2, 3.0: 4.1, "a": "b"}, 1) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19454) Improve DataFrame.replace API
[ https://issues.apache.org/jira/browse/SPARK-19454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852046#comment-15852046 ] Apache Spark commented on SPARK-19454: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/16793 > Improve DataFrame.replace API > - > > Key: SPARK-19454 > URL: https://issues.apache.org/jira/browse/SPARK-19454 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > The current implementation suffers from the following issues: > - It is possible to use {{dict}} as {{to_replace}}, but we cannot skip the > {{value}} argument or pass {{None}} for it (although it is ignored). This requires > passing "magic" values: > {code} > df = sc.parallelize([("Alice", 1, 3.0)]).toDF() > df.replace({"Alice": "Bob"}, 1) > {code} > - The code doesn't check whether the provided types are correct. This can lead to > an exception in Py4j (harder to diagnose): > {code} > df.replace({"Alice": 1}, 1) > {code} > or silent failures (with the bundled Py4j version): > {code} > df.replace({1: 2, 3.0: 4.1, "a": "b"}, 1) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19454) Improve DataFrame.replace API
Maciej Szymkiewicz created SPARK-19454: -- Summary: Improve DataFrame.replace API Key: SPARK-19454 URL: https://issues.apache.org/jira/browse/SPARK-19454 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 2.1.0, 2.0.0, 1.6.0, 1.5.0, 2.2.0 Reporter: Maciej Szymkiewicz The current implementation suffers from the following issues: - It is possible to use {{dict}} as {{to_replace}}, but we cannot skip the {{value}} argument or pass {{None}} for it (although it is ignored). This requires passing "magic" values: {code} df = sc.parallelize([("Alice", 1, 3.0)]).toDF() df.replace({"Alice": "Bob"}, 1) {code} - The code doesn't check whether the provided types are correct. This can lead to an exception in Py4j (harder to diagnose): {code} df.replace({"Alice": 1}, 1) {code} or silent failures (with the bundled Py4j version): {code} df.replace({1: 2, 3.0: 4.1, "a": "b"}, 1) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
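A defensive version of the argument validation the ticket asks for might look like the sketch below. This is plain Python, not the actual PySpark patch; the helper name and the exact rules (e.g. rejecting all mixed-type dicts, while real Spark permits some numeric widening) are made up for illustration:

```python
def validate_to_replace(to_replace):
    """Reject the mixed-type dicts that currently fail deep inside Py4j."""
    if isinstance(to_replace, dict):
        key_types = {type(k) for k in to_replace}
        val_types = {type(v) for v in to_replace.values()}
        # All keys must share one type, and all values must share one type,
        # so the replacement can be dispatched to a single typed Java map.
        if len(key_types) > 1 or len(val_types) > 1:
            raise ValueError("to_replace keys/values must each share one type")
        for k, v in to_replace.items():
            # Replacing a value of one type with a value of another type
            # is the case that currently surfaces as an opaque Py4JError.
            if type(k) is not type(v):
                raise ValueError("cannot replace %s with %s"
                                 % (type(k).__name__, type(v).__name__))
    return True

print(validate_to_replace({"Alice": "Bob"}))  # True
try:
    validate_to_replace({1: 2, 3.0: 4.1, "a": "b"})  # fails silently today
except ValueError as e:
    print("rejected:", e)
```

Failing fast in Python with a clear message is the point: both pathological calls in the description would then raise before any Py4j round trip.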
[jira] [Assigned] (SPARK-19453) Correct DataFrame.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19453: Assignee: Apache Spark > Correct DataFrame.replace docs > -- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark > > The current docstring provides an incorrect description of the {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from the `na.fill` docs. In fact, a {{dict}} should > provide a mapping from value to replacement value. > Moreover, the docs fail to explain some fundamental limitations (like the lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19453) Correct DataFrame.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19453: Assignee: (was: Apache Spark) > Correct DataFrame.replace docs > -- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > The current docstring provides an incorrect description of the {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from the `na.fill` docs. In fact, a {{dict}} should > provide a mapping from value to replacement value. > Moreover, the docs fail to explain some fundamental limitations (like the lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19453) Correct DataFrame.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19453: Assignee: Apache Spark > Correct DataFrame.replace docs > -- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark > > The current docstring provides an incorrect description of the {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from the `na.fill` docs. In fact, a {{dict}} should > provide a mapping from value to replacement value. > Moreover, the docs fail to explain some fundamental limitations (like the lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19453) Correct DataFrame.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852017#comment-15852017 ] Apache Spark commented on SPARK-19453: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/16792 > Correct DataFrame.replace docs > -- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > The current docstring provides an incorrect description of the {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from the `na.fill` docs. In fact, a {{dict}} should > provide a mapping from value to replacement value. > Moreover, the docs fail to explain some fundamental limitations (like the lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19453) Correct DataFrame.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19453: Assignee: (was: Apache Spark) > Correct DataFrame.replace docs > -- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > The current docstring provides an incorrect description of the {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from the `na.fill` docs. In fact, a {{dict}} should > provide a mapping from value to replacement value. > Moreover, the docs fail to explain some fundamental limitations (like the lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19453) Correct DataFrame.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-19453: --- Summary: Correct DataFrame.replace docs (was: Correct Column.replace docs) > Correct DataFrame.replace docs > -- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > The current docstring provides an incorrect description of the {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from the `na.fill` docs. In fact, a {{dict}} should > provide a mapping from value to replacement value. > Moreover, the docs fail to explain some fundamental limitations (like the lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19453) Correct Column.replace docs
[ https://issues.apache.org/jira/browse/SPARK-19453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-19453: --- Summary: Correct Column.replace docs (was: Correct ) > Correct Column.replace docs > --- > > Key: SPARK-19453 > URL: https://issues.apache.org/jira/browse/SPARK-19453 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark, SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > Current docstring provides incorrect description of {{to_replace}} argument: > {quote} > If the value is a dict, then `value` is ignored and `to_replace` must be a > mapping from column name (string) to replacement value. > {quote} > It looks like it has been copied from `na.fill` docs. In fact {{dict}} should > provide mapping from value to replacement value. > Moreover docs fail to explain some fundamental limitations (like lack of > support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19453) Correct
Maciej Szymkiewicz created SPARK-19453: -- Summary: Correct Key: SPARK-19453 URL: https://issues.apache.org/jira/browse/SPARK-19453 Project: Spark Issue Type: Improvement Components: Documentation, PySpark, SQL Affects Versions: 2.1.0, 2.0.0, 1.6.0, 1.5.0, 2.2.0 Reporter: Maciej Szymkiewicz Current docstring provides incorrect description of {{to_replace}} argument: {quote} If the value is a dict, then `value` is ignored and `to_replace` must be a mapping from column name (string) to replacement value. {quote} It looks like it has been copied from `na.fill` docs. In fact {{dict}} should provide mapping from value to replacement value. Moreover docs fail to explain some fundamental limitations (like lack of support for heterogeneous values) and some usage scenarios. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
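The corrected semantics described above can be sketched in plain Python (an illustration of the intended behavior, not PySpark's implementation). Note the mapping is kept homogeneous, since the ticket points out that heterogeneous values are a fundamental limitation:

```python
# Plain-Python sketch of the *corrected* to_replace semantics:
# a dict maps existing value -> replacement value (not column -> value),
# and the `value` argument is ignored in that case.
rows = [{"name": "Alice", "age": 10}, {"name": "Bob", "age": 20}]
to_replace = {10: 11, 20: 21}  # value -> replacement, same (numeric) type

def replace_values(rows, mapping):
    """Swap every matching cell value according to `mapping`."""
    return [{col: mapping.get(v, v) for col, v in row.items()} for row in rows]

print(replace_values(rows, to_replace))
```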
[jira] [Created] (SPARK-19452) Fix bug in the name assignment method in SparkR
Wayne Zhang created SPARK-19452: --- Summary: Fix bug in the name assignment method in SparkR Key: SPARK-19452 URL: https://issues.apache.org/jira/browse/SPARK-19452 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0, 2.2.0 Reporter: Wayne Zhang The names method fails to check for validity of the assignment values. This can be fixed by calling colnames within names. See example below. {code} df <- suppressWarnings(createDataFrame(iris)) # this is error colnames(df) <- NULL # this should report error names(df) <- NULL {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
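The proposed fix above — have the permissive `names` setter delegate to the validating `colnames` setter — can be sketched in Python (a hypothetical analogue for illustration, not the SparkR code):

```python
# Hypothetical analogue of the SparkR fix: `set_names` delegates to
# `set_colnames`, so both setters share the same validity checks.
class Frame:
    def __init__(self, cols):
        self._cols = list(cols)

    def set_colnames(self, value):
        # The validating setter: rejects None and wrong-length assignments.
        if value is None or len(value) != len(self._cols):
            raise ValueError("invalid column names")
        self._cols = list(value)

    def set_names(self, value):
        # The fix: delegate instead of assigning unchecked.
        self.set_colnames(value)

f = Frame(["a", "b"])
f.set_names(["x", "y"])   # valid rename succeeds
print(f._cols)
```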
[jira] [Resolved] (SPARK-18539) Cannot filter by nonexisting column in parquet file
[ https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-18539. Resolution: Fixed Assignee: Dongjoon Hyun Target Version/s: 2.2.0 > Cannot filter by nonexisting column in parquet file > --- > > Key: SPARK-18539 > URL: https://issues.apache.org/jira/browse/SPARK-18539 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.0.2 >Reporter: Vitaly Gerasimov >Assignee: Dongjoon Hyun >Priority: Critical > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.DataTypes._ > import org.apache.spark.sql.types.{StructField, StructType} > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}""")) > sc.read > .schema(StructType(Seq(StructField("a", IntegerType > .json(jsonRDD) > .write > .parquet("/tmp/test") > sc.read > .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", > IntegerType, nullable = true > .load("/tmp/test") > .createOrReplaceTempView("table") > sc.sql("select b from table where b is not null").show() > {code} > returns: > {code} > 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalArgumentException: Column [b] was not found in schema! 
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59) > at > org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40) > at > org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126) > at > org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at
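Before the Parquet upgrade referenced below this report, one possible workaround (an untested sketch, assuming a live `spark` session) is to stop pushing filters down to parquet-mr, which is the layer rejecting columns absent from the file:

```python
# Sketch of a workaround: evaluate the predicate in Spark instead of
# pushing it down to parquet-mr. `spark.sql.parquet.filterPushdown` is a
# standard Spark SQL configuration key; default is true.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.sql("select b from table where b is not null").show()
```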
[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file
[ https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851965#comment-15851965 ] Cheng Lian commented on SPARK-18539: SPARK-19409 upgrades parquet-mr to 1.8.2 and fixed this issue. > Cannot filter by nonexisting column in parquet file > --- > > Key: SPARK-18539 > URL: https://issues.apache.org/jira/browse/SPARK-18539 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1, 2.0.2 >Reporter: Vitaly Gerasimov >Priority: Critical > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.DataTypes._ > import org.apache.spark.sql.types.{StructField, StructType} > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}""")) > sc.read > .schema(StructType(Seq(StructField("a", IntegerType > .json(jsonRDD) > .write > .parquet("/tmp/test") > sc.read > .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", > IntegerType, nullable = true > .load("/tmp/test") > .createOrReplaceTempView("table") > sc.sql("select b from table where b is not null").show() > {code} > returns: > {code} > 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalArgumentException: Column [b] was not found in schema! 
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59) > at > org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40) > at > org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126) > at > org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at
[jira] [Commented] (SPARK-19409) Upgrade Parquet to 1.8.2
[ https://issues.apache.org/jira/browse/SPARK-19409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851959#comment-15851959 ] Apache Spark commented on SPARK-19409: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/16791 > Upgrade Parquet to 1.8.2 > > > Key: SPARK-19409 > URL: https://issues.apache.org/jira/browse/SPARK-19409 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.2.0 > > > Apache Parquet 1.8.2 was officially released last week, on 26 Jan. > This issue aims to bump the Parquet version to 1.8.2 since it includes many fixes. > https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142@%3Cdev.parquet.apache.org%3E -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
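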
[jira] [Commented] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken
[ https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851960#comment-15851960 ] Apache Spark commented on SPARK-17213: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/16791 > Parquet String Pushdown for Non-Eq Comparisons Broken > - > > Key: SPARK-17213 > URL: https://issues.apache.org/jira/browse/SPARK-17213 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Andrew Duffy >Assignee: Cheng Lian > Fix For: 2.1.0 > > > Spark defines ordering over strings based on comparison of UTF8 byte arrays, > which compare bytes as unsigned integers. Currently however Parquet does not > respect this ordering. This is currently in the process of being fixed in > Parquet, JIRA and PR link below, but currently all filters are broken over > strings, with there actually being a correctness issue for {{>}} and {{<}}. > *Repro:* > Querying directly from in-memory DataFrame: > {code} > > Seq("a", "é").toDF("name").where("name > 'a'").count > 1 > {code} > Querying from a parquet dataset: > {code} > > Seq("a", "é").toDF("name").write.parquet("/tmp/bad") > > spark.read.parquet("/tmp/bad").where("name > 'a'").count > 0 > {code} > This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's > implementation of comparison of strings is based on signed byte array > comparison, so it will actually create 1 row group with statistics > {{min=é,max=a}}, and so the row group will be dropped by the query. > Based on the way Parquet pushes down Eq, it will not be affecting correctness > but it will force you to read row groups you should be able to skip. 
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686 > Link to PR: https://github.com/apache/parquet-mr/pull/362 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
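The root cause described above — Spark orders strings by *unsigned* UTF-8 bytes while old parquet-mr compared them as *signed* bytes — can be reproduced in plain Python (an illustration, not Parquet's code):

```python
# 'a' encodes to 0x61; 'é' encodes to 0xC3 0xA9. Compared unsigned,
# 0xC3 (195) > 0x61 (97); reinterpreted signed, 0xC3 becomes -61 < 97.
a = list("a".encode("utf-8"))   # unsigned byte values: [97]
e = list("é".encode("utf-8"))   # unsigned byte values: [195, 169]

def as_signed(bs):
    """Reinterpret each byte as a signed 8-bit integer."""
    return [b - 256 if b > 127 else b for b in bs]

print(a < e)                        # Spark's (unsigned) order: 'a' < 'é'
print(as_signed(a) < as_signed(e))  # the signed order flips it: 'é' < 'a'
```

Under the signed order 'é' sorts *below* 'a', which is exactly why the row-group statistics come out as {{min=é,max=a}} and the {{> 'a'}} filter wrongly drops the group.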
[jira] [Updated] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Champ updated SPARK-19451: - Description: Hi there, there seems to be a major limitation in Spark window functions and the rangeBetween method. If I have the following code : {code:title=Example|borderStyle=solid} val tw = Window.orderBy("date") .partitionBy("id") .rangeBetween( from , 0) {code} Everything seems ok as long as the *from* value is not too large, even though the rangeBetween() method supports Long parameters. But if I set *from* to *-216000L*, it does not work! It is probably related to this part of the between() method of the WindowSpec class, called by rangeBetween(): {code:title=between() method|borderStyle=solid} val boundaryStart = start match { case 0 => CurrentRow case Long.MinValue => UnboundedPreceding case x if x < 0 => ValuePreceding(-start.toInt) case x if x > 0 => ValueFollowing(start.toInt) } {code} ( look at this *.toInt* ) Does anybody know if there's a way to solve / patch this behavior? Any help will be appreciated Thx was: Hi there, there seems to be a major limitation in spark window functions and rangeBetween method. If I have the following code : ``` val tw = Window.orderBy("date") .partitionBy("id") .rangeBetween( from , 0) ``` Everything seems ok, while "from" value is not too large... Even if the rangeBetween() method supports Long parameters. But If i set "-216000L" value to "from" it does not work ! It is probably related to this part of code in the between() method, of the WindowSpec class, called by rangeBetween() ``` val boundaryStart = start match { case 0 => CurrentRow case Long.MinValue => UnboundedPreceding case x if x < 0 => ValuePreceding(-start.toInt) case x if x > 0 => ValueFollowing(start.toInt) } ``` ( look at this " .toInt " ) Does anybody know it there's a way to solve / patch this behavior ?
Any help will be appreciated Thx > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in Spark window functions and the > rangeBetween method. > If I have the following code : > {code:title=Example|borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok as long as the *from* value is not too large, even though the > rangeBetween() method supports Long parameters. > But if I set *from* to *-216000L*, it does not work! > It is probably related to this part of the between() method of the > WindowSpec class, called by rangeBetween(): > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know if there's a way to solve / patch this behavior? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
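The `.toInt` narrowing the reporter highlights can be mimicked with `ctypes.c_int32`, which keeps only the low 32 bits exactly as Scala's `Long.toInt` does (an illustration, not Spark code): any window bound beyond the Int range is silently corrupted.

```python
import ctypes

def scala_to_int(long_value):
    # Emulates Scala's Long.toInt: keep the low 32 bits, reinterpret signed.
    return ctypes.c_int32(long_value).value

print(scala_to_int(-10_000_000_000))  # out of Int range: silently corrupted
```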
[jira] [Commented] (SPARK-19233) Inconsistent Behaviour of Spark Streaming Checkpoint
[ https://issues.apache.org/jira/browse/SPARK-19233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851787#comment-15851787 ] Nan Zhu commented on SPARK-19233: - ping > Inconsistent Behaviour of Spark Streaming Checkpoint > > > Key: SPARK-19233 > URL: https://issues.apache.org/jira/browse/SPARK-19233 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nan Zhu > > When checking one of our application logs, we found the following behavior > (simplified) > 1. Spark application recovers from checkpoint constructed at timestamp 1000ms > 2. The log shows that Spark application can recover RDDs generated at > timestamp 2000, 3000 > The root cause is that generateJobs event is pushed to the queue by a > separate thread (RecurTimer), before doCheckpoint event is pushed to the > queue, there might have been multiple generatedJobs being processed. As a > result, when doCheckpoint for timestamp 1000 is processed, the generatedRDDs > data structure containing RDDs generated at 2000, 3000 is serialized as part > of checkpoint of 1000. > It brings overhead for debugging and coordinate our offset management > strategy with Spark Streaming's checkpoint strategy when we are developing a > new type of DStream which integrates Spark Streaming with an internal message > middleware. > The proposed fix is to filter generatedRDDs according to checkpoint timestamp > when serializing it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
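The fix proposed at the end of this report — filter `generatedRDDs` by the checkpoint's own timestamp before serializing — can be reduced to a tiny sketch (plain-Python illustration with made-up state, not the DStream code):

```python
# State accumulated by the time the DoCheckpoint(1000) event is processed:
# generateJobs events for later batches have already run on another thread.
generated_rdds = {1000: "rdd@1000", 2000: "rdd@2000", 3000: "rdd@3000"}
checkpoint_time = 1000

# Proposed fix: serialize only entries at or before the checkpoint's own
# timestamp, so recovery never sees RDDs "from the future".
to_serialize = {t: r for t, r in generated_rdds.items() if t <= checkpoint_time}
print(to_serialize)
```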
[jira] [Assigned] (SPARK-19450) Replace askWithRetry with askSync.
[ https://issues.apache.org/jira/browse/SPARK-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19450: Assignee: (was: Apache Spark) > Replace askWithRetry with askSync. > -- > > Key: SPARK-19450 > URL: https://issues.apache.org/jira/browse/SPARK-19450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing > > *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and > https://github.com/apache/spark/pull/16690#issuecomment-276850068) and > *askWithRetry* is marked as deprecated. > As mentioned > SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): > ??askWithRetry is basically an unneeded API, and a leftover from the akka > days that doesn't make sense anymore. It's prone to cause deadlocks (exactly > because it's blocking), it imposes restrictions on the caller (e.g. > idempotency) and other things that people generally don't pay that much > attention to when using it.?? > Since *askWithRetry* is just used inside spark and not in user logic. It > might make sense to replace all of them with *askSync*. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19450) Replace askWithRetry with askSync.
[ https://issues.apache.org/jira/browse/SPARK-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19450: Assignee: Apache Spark > Replace askWithRetry with askSync. > -- > > Key: SPARK-19450 > URL: https://issues.apache.org/jira/browse/SPARK-19450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: Apache Spark > > *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and > https://github.com/apache/spark/pull/16690#issuecomment-276850068) and > *askWithRetry* is marked as deprecated. > As mentioned > SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): > ??askWithRetry is basically an unneeded API, and a leftover from the akka > days that doesn't make sense anymore. It's prone to cause deadlocks (exactly > because it's blocking), it imposes restrictions on the caller (e.g. > idempotency) and other things that people generally don't pay that much > attention to when using it.?? > Since *askWithRetry* is just used inside spark and not in user logic. It > might make sense to replace all of them with *askSync*. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19280) Failed Recovery from checkpoint caused by the multi-threads issue in Spark Streaming scheduler
[ https://issues.apache.org/jira/browse/SPARK-19280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851786#comment-15851786 ] Nan Zhu commented on SPARK-19280: - ping > Failed Recovery from checkpoint caused by the multi-threads issue in Spark > Streaming scheduler > -- > > Key: SPARK-19280 > URL: https://issues.apache.org/jira/browse/SPARK-19280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nan Zhu >Priority: Critical > > In one of our applications, we found the following issue, the application > recovering from a checkpoint file named "checkpoint-***16670" but with > the timestamp ***16650 will recover from the very beginning of the stream > and because our application relies on the external & periodically-cleaned > data (syncing with checkpoint cleanup), the recovery just failed > We identified a potential issue in Spark Streaming checkpoint and will > describe it with the following example. We will propose a fix in the end of > this JIRA. > 1. The application properties: Batch Duration: 2, Functionality: Single > Stream calling ReduceByKeyAndWindow and print, Window Size: 6, > SlideDuration, 2 > 2. RDD at 16650 is generated and the corresponding job is submitted to > the execution ThreadPool. Meanwhile, a DoCheckpoint message is sent to the > queue of JobGenerator > 3. Job at 16650 is finished and JobCompleted message is sent to > JobScheduler's queue, and meanwhile, Job at 16652 is submitted to the > execution ThreadPool and similarly, a DoCheckpoint is sent to the queue of > JobGenerator > 4. JobScheduler's message processing thread (I will use JS-EventLoop to > identify it) is not scheduled by the operating system for a long time, and > during this period, Jobs generated from 16652 - 16670 are generated > and completed. > 5. 
the message processing thread of JobGenerator (JG-EventLoop) is scheduled > and processed all DoCheckpoint messages for jobs ranging from 16652 - > 16670 and checkpoint files are successfully written. CRITICAL: at this > moment, the lastCheckpointTime would be 16670. > 6. JS-EventLoop is scheduled, and process all JobCompleted messages for jobs > ranging from 16652 - 16670. CRITICAL: a ClearMetadata message is > pushed to JobGenerator's message queue for EACH JobCompleted. > 7. The current message queue contains 20 ClearMetadata messages and > JG-EventLoop is scheduled to process them. CRITICAL: ClearMetadata will > remove all RDDs out of rememberDuration window. In our case, > ReduceyKeyAndWindow will set rememberDuration to 10 (rememberDuration of > ReducedWindowDStream (4) + windowSize) resulting that only RDDs <- > (16660, 16670] are kept. And ClearMetaData processing logic will push > a DoCheckpoint to JobGenerator's thread > 8. JG-EventLoop is scheduled again to process DoCheckpoint for 1665, VERY > CRITICAL: at this step, RDD no later than 16660 has been removed, and > checkpoint data is updated as > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala#L53 > and > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala#L59. > 9. After 8, a checkpoint named /path/checkpoint-16670 is created but with > the timestamp 16650. and at this moment, Application crashed > 10. Application recovers from /path/checkpoint-16670 and try to get RDD > with validTime 16650. Of course it will not find it and has to recompute. > In the case of ReduceByKeyAndWindow, it needs to recursively slice RDDs until > to the start of the stream. When the stream depends on the external data, it > will not successfully recover. 
In the case of Kafka, the recovered RDDs would > not be the same as the original one, as the currentOffsets has been updated > to the value at the moment of 16670 > The proposed fix: > 0. a hot-fix would be setting timestamp Checkpoint File to lastCheckpointTime > instead of using the timestamp of Checkpoint instance (any side-effect?) > 1. ClearMetadata shall be ClearMedataAndCheckpoint > 2. a long-term fix would be merge JobScheduler and JobGenerator, I didn't see > any necessary to have two threads here -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
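The mismatch in steps 9–10 can be condensed into a small sketch (illustrative numbers taken from the description, not Spark code): the file name carries time 16670, the serialized timestamp says 16650, and ClearMetadata has already dropped every RDD at or before 16660.

```python
file_name_time = 16670         # /path/checkpoint-***16670 on disk
serialized_time = 16650        # timestamp stored inside the Checkpoint
remembered_rdds = {16662, 16664, 16666, 16668, 16670}  # kept after cleanup

# Recovery looks up the RDD for the serialized timestamp and misses,
# forcing recursive recomputation back to the start of the stream.
print(serialized_time in remembered_rdds)
```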
[jira] [Commented] (SPARK-19450) Replace askWithRetry with askSync.
[ https://issues.apache.org/jira/browse/SPARK-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851785#comment-15851785 ] Apache Spark commented on SPARK-19450: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/16790 > Replace askWithRetry with askSync. > -- > > Key: SPARK-19450 > URL: https://issues.apache.org/jira/browse/SPARK-19450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing > > *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and > https://github.com/apache/spark/pull/16690#issuecomment-276850068) and > *askWithRetry* is marked as deprecated. > As mentioned > SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): > ??askWithRetry is basically an unneeded API, and a leftover from the akka > days that doesn't make sense anymore. It's prone to cause deadlocks (exactly > because it's blocking), it imposes restrictions on the caller (e.g. > idempotency) and other things that people generally don't pay that much > attention to when using it.?? > Since *askWithRetry* is just used inside spark and not in user logic. It > might make sense to replace all of them with *askSync*. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19428) Ability to select first row of groupby
[ https://issues.apache.org/jira/browse/SPARK-19428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851784#comment-15851784 ] Luke Miner edited comment on SPARK-19428 at 2/3/17 5:33 PM: Couple of things. Sometimes I just want a random row from each group. However, if sorting is allowed, then it is useful to get a single row from each group based on the maximum or minimum of some column. Even more useful would be something like {{df.groupby('group').orderBy('foo').limit(10)}} Then it would be easy to get the top 10 observations for each group based on some criteria. was (Author: lminer): Couple of things. Sometimes I just want a random row from each group. However, if sorting is allowed, then it is useful to get a single row from each group based on the maximum or minimum of some column. Even more useful would be something like {{df.groupby('group').orderBy('foo').limit(n)}} Then it would be easy to get the top {{n}} observations for each group based on some criteria. > Ability to select first row of groupby > -- > > Key: SPARK-19428 > URL: https://issues.apache.org/jira/browse/SPARK-19428 > Project: Spark > Issue Type: Brainstorming > Components: SQL >Affects Versions: 2.1.0 >Reporter: Luke Miner >Priority: Minor > > It would be nice to be able to select the first row from {{GroupedData}}. > Pandas has something like this: > {{df.groupby('group').first()}} > It's especially handy if you can order the group as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19428) Ability to select first row of groupby
[ https://issues.apache.org/jira/browse/SPARK-19428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851784#comment-15851784 ] Luke Miner commented on SPARK-19428: Couple of things. Sometimes I just want a random row from each group. However, if sorting is allowed, then it is useful to get a single row from each group based on the maximum or minimum of some column. Even more useful would be something like {{df.groupby('group').orderBy('foo').limit(n)}} Then it would be easy to get the top {{n}} observations for each group based on some criteria. > Ability to select first row of groupby > -- > > Key: SPARK-19428 > URL: https://issues.apache.org/jira/browse/SPARK-19428 > Project: Spark > Issue Type: Brainstorming > Components: SQL >Affects Versions: 2.1.0 >Reporter: Luke Miner >Priority: Minor > > It would be nice to be able to select the first row from {{GroupedData}}. > Pandas has something like this: > {{df.groupby('group').first()}} > It's especially handy if you can order the group as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
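The requested semantics — first row per group under an ordering — can be sketched in plain Python (the `df.groupby('group').first()` call above is the Pandas API; this sketch is illustration only). In Spark itself this is commonly approximated with a `row_number()` window over the group partition, filtered to row number 1.

```python
from itertools import groupby

rows = [("a", 3), ("b", 1), ("a", 1), ("b", 2)]   # (group, foo) pairs

def first_per_group(rows):
    # Order by (group, foo), then keep the first row of each group run.
    ordered = sorted(rows)
    return [next(g) for _, g in groupby(ordered, key=lambda r: r[0])]

print(first_per_group(rows))
```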
[jira] [Created] (SPARK-19451) Long values in Window function
Julien Champ created SPARK-19451: Summary: Long values in Window function Key: SPARK-19451 URL: https://issues.apache.org/jira/browse/SPARK-19451 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2, 1.6.1 Reporter: Julien Champ Hi there, there seems to be a major limitation in Spark window functions and the rangeBetween() method. If I have the following code: ``` val tw = Window.orderBy("date") .partitionBy("id") .rangeBetween( from , 0) ``` everything seems OK while the "from" value is not too large, even though the rangeBetween() method accepts Long parameters. But if I set "from" to -216000L it does not work! It is probably related to this part of the between() method of the WindowSpec class, called by rangeBetween(): ``` val boundaryStart = start match { case 0 => CurrentRow case Long.MinValue => UnboundedPreceding case x if x < 0 => ValuePreceding(-start.toInt) case x if x > 0 => ValueFollowing(start.toInt) } ``` (note the ".toInt") Does anybody know if there's a way to solve / patch this behavior? Any help will be appreciated. Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
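The ".toInt" in between() narrows a Long boundary to 32 bits. Note that -216000 itself fits comfortably in an Int, so the narrowing alone may not fully explain the reported failure, but any range expressed in milliseconds over a long period silently wraps. A sketch emulating the JVM's Long-to-Int narrowing (the to_int32 helper is illustrative, not Spark code):

```python
def to_int32(n: int) -> int:
    """Emulate JVM/Scala `Long.toInt`: keep the low 32 bits, reinterpret as signed."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

print(to_int32(-216000))                 # fits in an Int: survives unchanged
print(to_int32(30 * 24 * 3600 * 1000))   # 30 days in ms: wraps to -1702967296
```

So a rangeBetween boundary of "30 days in milliseconds" would come out not merely wrong but positive after narrowing, which is consistent with large Long boundaries misbehaving.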
[jira] [Updated] (SPARK-19450) Replace askWithRetry with askSync.
[ https://issues.apache.org/jira/browse/SPARK-19450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jin xing updated SPARK-19450: - Description: *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and https://github.com/apache/spark/pull/16690#issuecomment-276850068) and *askWithRetry* is marked as deprecated. As mentioned in SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): ??askWithRetry is basically an unneeded API, and a leftover from the akka days that doesn't make sense anymore. It's prone to cause deadlocks (exactly because it's blocking), it imposes restrictions on the caller (e.g. idempotency) and other things that people generally don't pay that much attention to when using it.?? Since *askWithRetry* is only used inside Spark and not in user logic, it might make sense to replace all of them with *askSync*. was: *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and https://github.com/apache/spark/pull/16690#issuecomment-276850068) and *askWithRetry* is marked as deprecated. As mentioned SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): ??askWithRetry is that it's basically an unneeded API, and a leftover from the akka days that doesn't make sense anymore. It's prone to cause deadlocks (exactly because it's blocking), it imposes restrictions on the caller (e.g. idempotency) and other things that people generally don't pay that much attention to when using it.?? Since *askWithRetry* is just used inside spark and not in user logic. It might make sense to replace all of them with *askSync*. > Replace askWithRetry with askSync. 
> -- > > Key: SPARK-19450 > URL: https://issues.apache.org/jira/browse/SPARK-19450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing > > *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and > https://github.com/apache/spark/pull/16690#issuecomment-276850068) and > *askWithRetry* is marked as deprecated. > As mentioned in > SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): > ??askWithRetry is basically an unneeded API, and a leftover from the akka > days that doesn't make sense anymore. It's prone to cause deadlocks (exactly > because it's blocking), it imposes restrictions on the caller (e.g. > idempotency) and other things that people generally don't pay that much > attention to when using it.?? > Since *askWithRetry* is only used inside Spark and not in user logic, it > might make sense to replace all of them with *askSync*. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19450) Replace askWithRetry with askSync.
jin xing created SPARK-19450: Summary: Replace askWithRetry with askSync. Key: SPARK-19450 URL: https://issues.apache.org/jira/browse/SPARK-19450 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: jin xing *askSync* is already added in *RpcEndpointRef* (see SPARK-19347 and https://github.com/apache/spark/pull/16690#issuecomment-276850068) and *askWithRetry* is marked as deprecated. As mentioned in SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): ??askWithRetry is basically an unneeded API, and a leftover from the akka days that doesn't make sense anymore. It's prone to cause deadlocks (exactly because it's blocking), it imposes restrictions on the caller (e.g. idempotency) and other things that people generally don't pay that much attention to when using it.?? Since *askWithRetry* is only used inside Spark and not in user logic, it might make sense to replace all of them with *askSync*. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
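The idempotency caveat quoted above can be made concrete: with a blocking ask-plus-retry, losing the *reply* (after the handler already ran) makes the caller resend, so any non-idempotent side effect executes twice. A toy sketch (hypothetical names, not Spark's actual RPC code):

```python
class Endpoint:
    """A toy RPC endpoint whose handler has a side effect."""
    def __init__(self):
        self.applied = 0

    def handle(self):
        self.applied += 1  # side effect: runs once per *delivered* request
        return "ack"

def ask_with_retry(endpoint, reply_lost_once=True):
    # First attempt: the handler runs, but the reply is "lost" in transit,
    # so the caller times out and transparently resends the same request.
    endpoint.handle()
    if reply_lost_once:
        return endpoint.handle()  # retry re-applies the side effect
    return "ack"

ep = Endpoint()
ask_with_retry(ep)
print(ep.applied)  # 2: one logical request, applied twice -> handler must be idempotent
```

A single-shot ask (as askSync behaves) would instead surface the timeout to the caller, leaving the decision to retry, and the idempotency burden, explicit at the call site.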
[jira] [Commented] (SPARK-19448) unify some duplication function in MetaStoreRelation
[ https://issues.apache.org/jira/browse/SPARK-19448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851631#comment-15851631 ] Apache Spark commented on SPARK-19448: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/16787 > unify some duplication function in MetaStoreRelation > > > Key: SPARK-19448 > URL: https://issues.apache.org/jira/browse/SPARK-19448 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > 1. MetaStoreRelation's hiveQlTable can be replaced by calling HiveClientImpl's > toHiveTable > 2. MetaStoreRelation's toHiveColumn can be replaced by calling > HiveClientImpl's toHiveColumn > 3. process another TODO > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19448) unify some duplication function in MetaStoreRelation
[ https://issues.apache.org/jira/browse/SPARK-19448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19448: Assignee: (was: Apache Spark) > unify some duplication function in MetaStoreRelation > > > Key: SPARK-19448 > URL: https://issues.apache.org/jira/browse/SPARK-19448 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > 1. MetaStoreRelation's hiveQlTable can be replaced by calling HiveClientImpl's > toHiveTable > 2. MetaStoreRelation's toHiveColumn can be replaced by calling > HiveClientImpl's toHiveColumn > 3. process another TODO > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19448) unify some duplication function in MetaStoreRelation
[ https://issues.apache.org/jira/browse/SPARK-19448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19448: Assignee: Apache Spark > unify some duplication function in MetaStoreRelation > > > Key: SPARK-19448 > URL: https://issues.apache.org/jira/browse/SPARK-19448 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Assignee: Apache Spark >Priority: Minor > > 1. MetaStoreRelation's hiveQlTable can be replaced by calling HiveClientImpl's > toHiveTable > 2. MetaStoreRelation's toHiveColumn can be replaced by calling > HiveClientImpl's toHiveColumn > 3. process another TODO > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
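The items in the SPARK-19448 description all amount to one refactor: drop MetastoreRelation's private copies of the conversion logic and delegate to the single implementation in HiveClientImpl. The shape of that change, sketched in Python with made-up fields (the real conversions live in HiveClientImpl's toHiveTable / toHiveColumn, in Scala):

```python
class HiveClientImpl:
    """Stand-in for the object that owns the one canonical conversion."""
    @staticmethod
    def to_hive_table(catalog_table: dict) -> dict:
        # the single place where conversion rules are maintained
        return {"name": catalog_table["name"], "cols": list(catalog_table["cols"])}

class MetastoreRelation:
    def __init__(self, catalog_table: dict):
        self.catalog_table = catalog_table

    @property
    def hive_ql_table(self) -> dict:
        # after the refactor: delegate instead of keeping a duplicate converter
        return HiveClientImpl.to_hive_table(self.catalog_table)

rel = MetastoreRelation({"name": "t1", "cols": ["a", "b"]})
print(rel.hive_ql_table)  # identical to calling HiveClientImpl directly
```

The payoff is the usual one for de-duplication: future changes to the conversion rules happen in exactly one place.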
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851593#comment-15851593 ] Sean Owen commented on SPARK-19449: --- I don't think this can be made fully deterministic even when setting seeds in certain places. If the results are quite similar and sensible I don't think this is a problem > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. Can run this as a simple > Java app as long as you have spark dependencies set up properly. 
> {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new 
RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} > }, labels, features) + "\n" + > "\n" + > "** RandomForestClassifier\n" + > "** Original **" + getInfo(model2, testData) + "\n" + > "** New **" + getInfo(new Predictor() {double > predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, > features) + "\n" + > "\n" + > ""); > } > static Dataset getDummyData(SparkSession spark, int numberRows, int > numberFeatures, int labelUpperBound) { > StructType schema = new StructType(new StructField[]{ > new StructField("label", DataTypes.DoubleType, false, > Metadata.empty()), > new StructField("features", new VectorUDT(), false, > Metadata.empty()) >
[jira] [Assigned] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19444: Assignee: Apache Spark > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Assignee: Apache Spark >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851589#comment-15851589 ] Apache Spark commented on SPARK-19444: -- User 'anshbansal' has created a pull request for this issue: https://github.com/apache/spark/pull/16789 > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851588#comment-15851588 ] Aseem Bansal commented on SPARK-19444: -- https://github.com/apache/spark/pull/16789 > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19444: Assignee: (was: Apache Spark) > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851576#comment-15851576 ] Aseem Bansal commented on SPARK-19449: -- Isn't the decision tree debug string print it as a series of IF-ELSE? I printed the debug string for the 2 random forest models and it was exactly the same. In other words the 2 implementations should be mathematically equivalent. The random processes for selecting data should not cause any issues as I ensured that the exact same data is going to both versions. It works for decision trees and random forest classifier is just majority vote of bunch of decision trees classifiers so I cannot see how that could be different. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. Can run this as a simple > Java app as long as you have spark dependencies set up properly. 
> {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new 
RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} > }, labels, features) + "\n" + > "\n" + > "** RandomForestClassifier\n" + > "** Original **" + getInfo(model2, testData) + "\n" + > "** New **" + getInfo(new Predictor() {double > predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, > features) + "\n" + > "\n" + > ""); > } >
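The comment above reasons that a random forest classifier "is just majority vote" of its trees, so identical trees must give identical predictions. One candidate explanation for the discrepancy (an assumption about the cause, not something confirmed in this thread) is that hard majority voting and probability-averaging ("soft" voting) can disagree even over identical trees:

```python
# Two aggregation rules over the *same* trees can pick different classes.
# Each tree reports a class-probability distribution [P(class0), P(class1)].
trees = [
    [0.6, 0.4],   # tree 1 leans class 0
    [0.6, 0.4],   # tree 2 leans class 0
    [0.0, 1.0],   # tree 3 is certain of class 1
]

def hard_vote(trees):
    # each tree casts one vote for its argmax class; most votes wins
    votes = [max(range(len(t)), key=t.__getitem__) for t in trees]
    return max(set(votes), key=votes.count)

def soft_vote(trees):
    # sum the per-class probabilities across trees, then take the argmax
    totals = [sum(t[c] for t in trees) for c in range(len(trees[0]))]
    return max(range(len(totals)), key=totals.__getitem__)

print(hard_vote(trees))  # 0: two of three trees vote class 0
print(soft_vote(trees))  # 1: summed probabilities (1.2 vs 1.8) favor class 1
```

If the two implementations aggregate differently in this sense, equal debug strings for the underlying trees would not be enough to guarantee equal forest-level predictions.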
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851568#comment-15851568 ] Aseem Bansal commented on SPARK-19449: -- [~srowen] I removed some extra code. The part where I did the conversion is at the end in convertRandomForestModel method. Basically the above code does this - Prepare 1000 rows of data with 3 features randomly. Prepare 1000 labels randomly. I am not working on creating the model but the conversion. So having random data is not an issue. It will just be a horrible model. - Split the data in 80/20 ratio for training/test - train ml version of decision tree model and random forest model using the training set. Let's call them DT1 and RF1 - convert these to mllib version of the models. Let's call them DT2 and RF2 - Use the test set to predict labels using DT1, DT2, RF1, RF2. - Compare predicted labels DT1 with DT2. Same results - Compare predicted labels RF1 with RF2. Different results. There should not be any random results here as I have used seeds for random number generators everywhere and then used the exact same data for doing predictions using all 4 models. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. 
Can run this as a simple > Java app as long as you have spark dependencies set up properly. > {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > 
convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} >
[jira] [Updated] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19449: - Description: I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. So the behavior between the 2 implementations is inconsistent which should not be the case. The below code can be used to reproduce the issue. Can run this as a simple Java app as long as you have spark dependencies set up properly. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); 
Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + ""); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()) }); double[][] vectors = prepareData(numberRows, numberFeatures); Random random = new Random(seed); List dataTest = new ArrayList<>(); for (double[] vector : vectors) { double label = (double) random.nextInt(2); dataTest.add(RowFactory.create(label, Vectors.dense(vector))); } return spark.createDataFrame(dataTest, schema); } static double[][] prepareData(int numRows, int numFeatures) { Random random = new Random(seed); double[][] 
result = new double[numRows][numFeatures]; for (int row = 0; row < numRows; row++) { for (int feature = 0; feature < numFeatures; feature++) { result[row][feature] = random.nextDouble(); } } return result; } static
[jira] [Updated] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19449: - Description: I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. The below code can be used to reproduce the issue. Can run this as a simple Java app as long as you have spark dependencies set up properly. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); //Dataset data = getData(spark, "libsvm", "/opt/spark2/data/mllib/sample_libsvm_data.txt"); 
Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); LogisticRegression lr = new LogisticRegression(); LogisticRegressionModel model3 = lr.fit(trainingData); final org.apache.spark.mllib.classification.LogisticRegressionModel convertedModel3 = convertLogisticRegressionModel(model3); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + "** LogisticRegression\n" + "** Original **" + getInfo(model3, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) { return convertedModel3.predict(vector);}}, labels, features) + "\n" + ""); } static Dataset getData(SparkSession spark, String format, String location) { return spark.read() .format(format) .load(location); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new 
StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()) }); double[][] vectors = prepareData(numberRows, numberFeatures);
[jira] [Updated] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase
[ https://issues.apache.org/jira/browse/SPARK-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nattavut Sutyanyong updated SPARK-18874: Attachment: SPARK-18874-3.pdf Design document version 1.1 dated February 3, 2017. > First phase: Deferring the correlated predicate pull up to Optimizer phase > -- > > Key: SPARK-18874 > URL: https://issues.apache.org/jira/browse/SPARK-18874 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Nattavut Sutyanyong > Attachments: SPARK-18874-3.pdf > > > This JIRA implements the first phase of SPARK-18455 by deferring the > correlated predicate pull up from Analyzer to Optimizer. The goal is to > preserve the current functionality of subqueries in Spark 2.0 (if a query works, it > continues to work after this JIRA; if it does not, it won't). The performance > of subquery processing is expected to be on par with Spark 2.0. > The representation of the LogicalPlan after the Analyzer will be different after > this JIRA, in that it will preserve the original positions of correlated > predicates in a subquery. This new representation is preparation work for > the second phase, extending the support of correlated subqueries to cases > Spark 2.0 does not support, such as deep correlation and outer references in the > SELECT clause. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase
[ https://issues.apache.org/jira/browse/SPARK-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851550#comment-15851550 ] Nattavut Sutyanyong commented on SPARK-18874: - I have published a design document as a reference for reviewing the code. https://issues.apache.org/jira/secure/attachment/12850832/SPARK-18874-3.pdf > First phase: Deferring the correlated predicate pull up to Optimizer phase > -- > > Key: SPARK-18874 > URL: https://issues.apache.org/jira/browse/SPARK-18874 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Nattavut Sutyanyong > Attachments: SPARK-18874-3.pdf > > > This JIRA implements the first phase of SPARK-18455 by deferring the > correlated predicate pull up from Analyzer to Optimizer. The goal is to > preserve the current functionality of subqueries in Spark 2.0 (if a query works, it > continues to work after this JIRA; if it does not, it won't). The performance > of subquery processing is expected to be on par with Spark 2.0. > The representation of the LogicalPlan after the Analyzer will be different after > this JIRA, in that it will preserve the original positions of correlated > predicates in a subquery. This new representation is preparation work for > the second phase, extending the support of correlated subqueries to cases > Spark 2.0 does not support, such as deep correlation and outer references in the > SELECT clause. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19449: - Description: I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. So the behavior between the 2 implementations is inconsistent which should not be the case. The below code can be used to reproduce the issue. Can run this as a simple Java app as long as you have spark dependencies set up properly. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); 
//Dataset data = getData(spark, "libsvm", "/opt/spark2/data/mllib/sample_libsvm_data.txt"); Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); LogisticRegression lr = new LogisticRegression(); LogisticRegressionModel model3 = lr.fit(trainingData); final org.apache.spark.mllib.classification.LogisticRegressionModel convertedModel3 = convertLogisticRegressionModel(model3); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + "** LogisticRegression\n" + "** Original **" + getInfo(model3, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) { return convertedModel3.predict(vector);}}, labels, features) + "\n" + ""); } static Dataset getData(SparkSession spark, String format, String location) { return spark.read() .format(format) .load(location); } static Dataset getDummyData(SparkSession spark, int numberRows, 
int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false,
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851540#comment-15851540 ] Sean Owen commented on SPARK-19449: --- Can you boil this down? This is a lot of code to look at. I would not necessarily expect the exact same results, even though a lot of code is shared, because of randomness and differences in ancillary processes like the pipeline elements that select training data and perform evaluation. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert the ml package RandomForestClassificationModel > to the mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModels are exactly the same. > The code below reproduces the issue. You can run it as a simple > Java app as long as you have the Spark dependencies set up properly. 
> {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > //Dataset data = getData(spark, "libsvm", > "/opt/spark2/data/mllib/sample_libsvm_data.txt"); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > 
convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > LogisticRegression lr = new LogisticRegression(); > LogisticRegressionModel model3 = lr.fit(trainingData); > final org.apache.spark.mllib.classification.LogisticRegressionModel > convertedModel3 = convertLogisticRegressionModel(model3); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} > }, labels, features) + "\n" + > "\n" + > "** RandomForestClassifier\n" + > "** Original **" + getInfo(model2, testData) + "\n" + > "** New **" + getInfo(new Predictor() {double > predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, > features) + "\n" + > "\n" + >
[jira] [Assigned] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16742: Assignee: Apache Spark > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt >Assignee: Apache Spark > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16742: Assignee: (was: Apache Spark) > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851539#comment-15851539 ] Apache Spark commented on SPARK-16742: -- User 'arinconstrio' has created a pull request for this issue: https://github.com/apache/spark/pull/16788 > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
Aseem Bansal created SPARK-19449: Summary: Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel Key: SPARK-19449 URL: https://issues.apache.org/jira/browse/SPARK-19449 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.1.0 Reporter: Aseem Bansal I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. The below code can be used to reproduce the issue. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); //Dataset data = getData(spark, 
"libsvm", "/opt/spark2/data/mllib/sample_libsvm_data.txt"); Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); LogisticRegression lr = new LogisticRegression(); LogisticRegressionModel model3 = lr.fit(trainingData); final org.apache.spark.mllib.classification.LogisticRegressionModel convertedModel3 = convertLogisticRegressionModel(model3); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + "** LogisticRegression\n" + "** Original **" + getInfo(model3, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) { return convertedModel3.predict(vector);}}, labels, features) + "\n" + ""); } static Dataset getData(SparkSession spark, String format, String location) { return spark.read() .format(format) .load(location); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int 
labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
[jira] [Assigned] (SPARK-19244) Sort MemoryConsumers according to their memory usage when spilling
[ https://issues.apache.org/jira/browse/SPARK-19244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-19244: --- Assignee: Liang-Chi Hsieh > Sort MemoryConsumers according to their memory usage when spilling > -- > > Key: SPARK-19244 > URL: https://issues.apache.org/jira/browse/SPARK-19244 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.2.0 > > > In `TaskMemoryManager`, when we acquire memory by calling > `acquireExecutionMemory` and cannot get the required amount, we try to > spill other memory consumers. > Currently, we simply iterate over the memory consumers in a hash set, normally > in the same order each time. > The first issue is that we might spill more consumers than necessary. For example, if > consumer 1 uses 10MB, consumer 2 uses 50MB, and consumer 3 then tries to acquire 100MB > when only 60MB is free, spilling is needed. We might spill both consumer > 1 and consumer 2, but spilling consumer 2 alone is enough to satisfy the > required 100MB. > The second issue: suppose we spill consumer 1 the first time spilling occurs. After > a while, consumer 1 uses only 5MB. Then consumer 4 may acquire some memory and > spilling is needed again. Because we iterate over the memory consumers in the same > order, we will spill consumer 1 again, producing > many small spill files for consumer 1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19244) Sort MemoryConsumers according to their memory usage when spilling
[ https://issues.apache.org/jira/browse/SPARK-19244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-19244. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16603 [https://github.com/apache/spark/pull/16603] > Sort MemoryConsumers according to their memory usage when spilling > -- > > Key: SPARK-19244 > URL: https://issues.apache.org/jira/browse/SPARK-19244 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Liang-Chi Hsieh > Fix For: 2.2.0 > > > In `TaskMemoryManager`, when we acquire memory by calling > `acquireExecutionMemory` and cannot get the required amount, we try to > spill other memory consumers. > Currently, we simply iterate over the memory consumers in a hash set, normally > in the same order each time. > The first issue is that we might spill more consumers than necessary. For example, if > consumer 1 uses 10MB, consumer 2 uses 50MB, and consumer 3 then tries to acquire 100MB > when only 60MB is free, spilling is needed. We might spill both consumer > 1 and consumer 2, but spilling consumer 2 alone is enough to satisfy the > required 100MB. > The second issue: suppose we spill consumer 1 the first time spilling occurs. After > a while, consumer 1 uses only 5MB. Then consumer 4 may acquire some memory and > spilling is needed again. Because we iterate over the memory consumers in the same > order, we will spill consumer 1 again, producing > many small spill files for consumer 1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
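The core idea of the fix, choosing spill victims by size instead of by hash-set iteration order, can be sketched in plain Java. This is an illustrative greedy policy, not Spark's actual `TaskMemoryManager` code, and the names (`SpillChooser`, `chooseConsumersToSpill`) are hypothetical: sort consumers by current usage and spill the largest ones until the request is covered, so a small consumer like consumer 1 is left alone.

```java
import java.util.*;

// Illustrative sketch (not Spark's actual TaskMemoryManager): pick which
// memory consumers to spill by sorting them by current usage, freeing the
// requested amount with as few spills as possible.
public class SpillChooser {

    // Returns the consumers to spill, largest first, stopping once the
    // freed memory covers the request.
    static List<String> chooseConsumersToSpill(Map<String, Long> usageByConsumer,
                                               long required) {
        List<Map.Entry<String, Long>> consumers =
            new ArrayList<>(usageByConsumer.entrySet());
        // Largest consumers first: each spill frees the most memory possible.
        consumers.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> toSpill = new ArrayList<>();
        long freed = 0;
        for (Map.Entry<String, Long> c : consumers) {
            if (freed >= required) break;
            toSpill.add(c.getKey());
            freed += c.getValue();
        }
        return toSpill;
    }

    public static void main(String[] args) {
        Map<String, Long> usage = new LinkedHashMap<>();
        usage.put("consumer1", 10L);  // MB
        usage.put("consumer2", 50L);
        // 40MB still needed beyond free memory: only consumer2 is spilled.
        System.out.println(chooseConsumersToSpill(usage, 40L));  // prints [consumer2]
    }
}
```

With the ticket's example (consumer 1 at 10MB, consumer 2 at 50MB, 40MB still needed after free memory is exhausted), only consumer 2 is chosen, matching the behavior the issue asks for.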
[jira] [Commented] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851515#comment-15851515 ] Sean Owen commented on SPARK-19444: --- You're right, I think this may be a copy-and-paste problem. I don't know much about the example include mechanism, but it looks like this is a way to tag code as part of only a certain example. In this case, we don't want that tag, even though it might have been relevant in its source. You can just make all of the imports part of one "example on" block. > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import: > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-16043. -- Resolution: Fixed Fix Version/s: 2.2.0 > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > Fix For: 2.2.0 > > > There is a ToDo in the GenericArrayData class to eliminate > boxing/unboxing for primitive arrays (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]). > It would be good to prepare a GenericArrayData implementation specialized for > primitive arrays to eliminate boxing/unboxing, for the sake of both runtime memory > footprint and performance. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
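The boxing cost the ticket targets can be illustrated with a minimal, self-contained sketch. The classes below are hypothetical stand-ins, not Spark's actual GenericArrayData: an Object[]-backed array unboxes a java.lang.Double on every read, while a specialization backed by double[] reads the primitive directly.

```java
// Illustrative sketch of the idea behind the fix (hypothetical classes, not
// Spark's actual GenericArrayData): a generic array backed by Object[] boxes
// every primitive, while a specialized double[]-backed version does not.
public class PrimitiveArraySketch {

    interface ArrayData {
        double getDouble(int i);
        int numElements();
    }

    // Generic version: every element is a boxed java.lang.Double.
    static final class GenericArray implements ArrayData {
        private final Object[] values;
        GenericArray(Object[] values) { this.values = values; }
        public double getDouble(int i) { return (Double) values[i]; }  // unboxes on each read
        public int numElements() { return values.length; }
    }

    // Specialized version: primitives are stored and read with no boxing at all.
    static final class DoubleArray implements ArrayData {
        private final double[] values;
        DoubleArray(double[] values) { this.values = values; }
        public double getDouble(int i) { return values[i]; }
        public int numElements() { return values.length; }
    }

    static double sum(ArrayData a) {
        double s = 0;
        for (int i = 0; i < a.numElements(); i++) s += a.getDouble(i);
        return s;
    }

    public static void main(String[] args) {
        // Same result either way; the specialized path avoids per-element allocation.
        System.out.println(sum(new DoubleArray(new double[]{1.0, 2.0, 3.0})));  // prints 6.0
    }
}
```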
[jira] [Resolved] (SPARK-16042) Eliminate nullcheck code at projection for an array type
[ https://issues.apache.org/jira/browse/SPARK-16042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16042. --- Resolution: Duplicate Subsumed by another issue, according to the PR > Eliminate nullcheck code at projection for an array type > > > Key: SPARK-16042 > URL: https://issues.apache.org/jira/browse/SPARK-16042 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > When we run a Spark program with a projection over an array type, a null check is > generated at each call that writes an array element. If we know at compilation time that > none of the elements is {{null}}, we can eliminate the null-check code. > {code} > val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v") > df.selectExpr("Array(v + 2.2, v + 3.3)").collect > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16094) Support HashAggregateExec for non-partial aggregates
[ https://issues.apache.org/jira/browse/SPARK-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16094. --- Resolution: Won't Fix > Support HashAggregateExec for non-partial aggregates > > > Key: SPARK-16094 > URL: https://issues.apache.org/jira/browse/SPARK-16094 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > Spark currently cannot use `HashAggregateExec` for non-partial aggregates > because `Collect` (`CollectSet`/`CollectList`) has a single shared buffer > inside. Since SortAggregateExec is expensive for larger data, we'd be better off > fixing this. > This ticket is intended to change plans from > {code} > SortAggregate(key=[key#3077], functions=[collect_set(value#3078, 0, 0)], > output=[key#3077,collect_set(value)#3088]) > +- *Sort [key#3077 ASC], false, 0 >+- Exchange hashpartitioning(key#3077, 5) > +- Scan ExistingRDD[key#3077,value#3078] > {code} > into > {code} > HashAggregate(keys=[key#3077], functions=[collect_set(value#3078, 0, 0)], > output=[key#3077, collect_set(value)#3088]) > +- Exchange hashpartitioning(key#3077, 5) >+- Scan ExistingRDD[key#3077,value#3078] > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
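Why a single shared buffer blocks hash aggregation can be sketched outside Spark: a hash-based collect_set must keep one independent buffer per group key, because rows arrive in arbitrary order; a single shared buffer only works when an upstream sort has already grouped the rows (the SortAggregateExec case). The helper below is a hypothetical illustration, not Spark's CollectSet code.

```java
import java.util.*;

// Illustrative sketch (not Spark's CollectSet implementation): hash-based
// collect_set keeps one buffer per group key, since rows for different keys
// can be interleaved in any order.
public class HashCollectSet {

    // For each key, collect the distinct values seen, in an independent
    // per-key buffer (a TreeSet here, for deterministic ordering).
    static Map<String, Set<Integer>> collectSet(String[] keys, int[] values) {
        Map<String, Set<Integer>> buffers = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            // computeIfAbsent allocates a fresh buffer the first time a key appears.
            buffers.computeIfAbsent(keys[i], k -> new TreeSet<>()).add(values[i]);
        }
        return buffers;
    }

    public static void main(String[] args) {
        // Keys arrive interleaved, which a single shared buffer could not handle.
        Map<String, Set<Integer>> result =
            collectSet(new String[]{"a", "b", "a", "a"}, new int[]{1, 2, 1, 3});
        System.out.println(result.get("a"));  // prints [1, 3]
    }
}
```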
[jira] [Resolved] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy`
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16041. --- Resolution: Duplicate This was apparently subsumed by another issue. > Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy` > > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns should not be allowed in `partitionBy`, `bucketBy`, or `sortBy` in > DataFrameWriter. Duplicate columns can cause unpredictable results, for > example resolution failures. > We should detect the duplicates and throw exceptions with appropriate > messages. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
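The requested check amounts to detecting duplicate names before writing. The helper below is a hypothetical sketch (the class and method names are invented, not the actual DataFrameWriter code) of how the duplicates could be found and reported with a clear message.

```java
import java.util.*;

// Hypothetical sketch of the check the ticket asks for (not the actual
// DataFrameWriter code): find duplicate column names and fail loudly.
public class ColumnChecks {

    // Returns the duplicated names, in first-seen order, so the caller can
    // name the offenders in the error message.
    static List<String> findDuplicates(List<String> columns) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String c : columns) {
            if (!seen.add(c) && !dups.contains(c)) dups.add(c);
        }
        return dups;
    }

    static void requireNoDuplicates(String operation, List<String> columns) {
        List<String> dups = findDuplicates(columns);
        if (!dups.isEmpty()) {
            throw new IllegalArgumentException(
                "Found duplicate column(s) in " + operation + ": " + dups);
        }
    }

    public static void main(String[] args) {
        try {
            requireNoDuplicates("partitionBy", Arrays.asList("year", "month", "year"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());  // names the duplicated column
        }
    }
}
```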
[jira] [Closed] (SPARK-16200) Rename AggregateFunction#supportsPartial
[ https://issues.apache.org/jira/browse/SPARK-16200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro closed SPARK-16200. Resolution: Won't Fix > Rename AggregateFunction#supportsPartial > > > Key: SPARK-16200 > URL: https://issues.apache.org/jira/browse/SPARK-16200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > We'd also be better off renaming this variable from supportsPartial to > something clearer, because the current name is kind of misleading. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16200) Rename AggregateFunction#supportsPartial
[ https://issues.apache.org/jira/browse/SPARK-16200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851501#comment-15851501 ] Takeshi Yamamuro edited comment on SPARK-16200 at 2/3/17 1:54 PM: -- okay, thanks for letting me know! It's okay to set "Won't Fix". was (Author: maropu): okay, thanks for letting me know! It's okay to set "Resolved". > Rename AggregateFunction#supportsPartial > > > Key: SPARK-16200 > URL: https://issues.apache.org/jira/browse/SPARK-16200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > We'd also be better off renaming the `supportsPartial` variable, because the > current name is kind of misleading.
[jira] [Commented] (SPARK-16200) Rename AggregateFunction#supportsPartial
[ https://issues.apache.org/jira/browse/SPARK-16200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851501#comment-15851501 ] Takeshi Yamamuro commented on SPARK-16200: -- okay, thanks for letting me know! It's okay to set "Resolved". > Rename AggregateFunction#supportsPartial > > > Key: SPARK-16200 > URL: https://issues.apache.org/jira/browse/SPARK-16200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > We'd also be better off renaming the `supportsPartial` variable, because the > current name is kind of misleading.
[jira] [Commented] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851500#comment-15851500 ] Takeshi Yamamuro commented on SPARK-16043: -- I think the issue this ticket describes has been mostly resolved by SPARK-14850. How about setting this to `Resolved`? Then, if any related issues remain, it'd be better to open new JIRAs for them. Thoughts? > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > There is a TODO in the GenericArrayData class to eliminate > boxing/unboxing for a primitive array (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]). > It would be good to prepare a GenericArrayData implementation specialized for > primitive arrays to eliminate boxing/unboxing, for both runtime memory > footprint and performance.
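The footprint cost of boxing that SPARK-16043 targets can be illustrated with a small Python analogy: a container specialized for a primitive type stores raw values contiguously, while a generic container stores a pointer to a boxed object per element. This is only a sketch of the trade-off the ticket describes, not JVM or Spark code:

```python
# Sketch: generic (boxed) storage vs specialized (primitive) storage.
# Python's array module stores raw doubles contiguously; a list stores
# one heap-allocated float object per element, analogous to JVM boxing.
import sys
from array import array

n = 1000
boxed = [float(i) for i in range(n)]   # generic: one boxed object per element
primitive = array("d", range(n))       # specialized: contiguous raw doubles

boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
primitive_bytes = sys.getsizeof(primitive)
print(boxed_bytes > primitive_bytes)  # the specialized layout is far smaller
```

Beyond memory, the specialized layout also avoids the per-element allocation and pointer chasing that boxing implies, which is the performance half of the ticket's motivation.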
[jira] [Closed] (SPARK-15180) Support subexpression elimination in Fliter
[ https://issues.apache.org/jira/browse/SPARK-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-15180. --- Resolution: Won't Fix > Support subexpression elimination in Fliter > --- > > Key: SPARK-15180 > URL: https://issues.apache.org/jira/browse/SPARK-15180 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > Wholestage filter doesn't support subexpression elimination now. We should > add this support.
[jira] [Commented] (SPARK-15180) Support subexpression elimination in Fliter
[ https://issues.apache.org/jira/browse/SPARK-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851491#comment-15851491 ] Liang-Chi Hsieh commented on SPARK-15180: - [~hyukjin.kwon] Yes. I resolved this. Thanks! > Support subexpression elimination in Fliter > --- > > Key: SPARK-15180 > URL: https://issues.apache.org/jira/browse/SPARK-15180 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > Wholestage filter doesn't support subexpression elimination now. We should > add this support.
[jira] [Commented] (SPARK-15911) Remove additional Project to be consistent with SQL when insert into table
[ https://issues.apache.org/jira/browse/SPARK-15911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851488#comment-15851488 ] Liang-Chi Hsieh commented on SPARK-15911: - [~hyukjin.kwon] Thanks! > Remove additional Project to be consistent with SQL when insert into table > -- > > Key: SPARK-15911 > URL: https://issues.apache.org/jira/browse/SPARK-15911 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > Currently, in DataFrameWriter's insertInto and the Analyzer's > ResolveRelations, we add an additional Project to adjust column ordering. > However, this resolution should use ordering, not names. We should fix it.
[jira] [Resolved] (SPARK-17161) Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
[ https://issues.apache.org/jira/browse/SPARK-17161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-17161. - Resolution: Fixed Fix Version/s: 2.2.0 > Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays > - > > Key: SPARK-17161 > URL: https://issues.apache.org/jira/browse/SPARK-17161 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Priority: Minor > Fix For: 2.2.0 > > > Often in Spark ML, there are classes that use a Scala Array in a constructor. > In order to add the same API to Python, a Java-friendly alternate > constructor needs to exist to be compatible with py4j when converting from a > list. This is because the current conversion in PySpark _py2java creates a > java.util.ArrayList, as shown in this error msg > {noformat} > Py4JError: An error occurred while calling > None.org.apache.spark.ml.feature.CountVectorizerModel. Trace: > py4j.Py4JException: Constructor > org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) > does not exist > at > py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179) > at > py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196) > at py4j.Gateway.invoke(Gateway.java:235) > {noformat} > Creating an alternate constructor can be avoided by creating a py4j JavaArray > using {{new_array}}. This type is compatible with the Scala Array currently > used in classes like {{CountVectorizerModel}} and {{StringIndexerModel}}. > Most of the boiler-plate Python code to do this can be put in a convenience > function inside of ml.JavaWrapper to give a clean way of constructing ML > objects without adding special constructors.
[jira] [Commented] (SPARK-16200) Rename AggregateFunction#supportsPartial
[ https://issues.apache.org/jira/browse/SPARK-16200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851474#comment-15851474 ] Hyukjin Kwon commented on SPARK-16200: -- (it might be good to double-check this one too, per https://github.com/apache/spark/pull/13852#issuecomment-242347430) > Rename AggregateFunction#supportsPartial > > > Key: SPARK-16200 > URL: https://issues.apache.org/jira/browse/SPARK-16200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > We'd also be better off renaming the `supportsPartial` variable, because the > current name is kind of misleading.
[jira] [Created] (SPARK-19448) unify some duplication function in MetaStoreRelation
Song Jun created SPARK-19448: Summary: unify some duplication function in MetaStoreRelation Key: SPARK-19448 URL: https://issues.apache.org/jira/browse/SPARK-19448 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor 1. MetaStoreRelation's hiveQlTable can be replaced by calling HiveClientImpl's toHiveTable 2. MetaStoreRelation's toHiveColumn can be replaced by calling HiveClientImpl's toHiveColumn 3. address another TODO: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234
[jira] [Commented] (SPARK-16094) Support HashAggregateExec for non-partial aggregates
[ https://issues.apache.org/jira/browse/SPARK-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851473#comment-15851473 ] Hyukjin Kwon commented on SPARK-16094: -- [~maropu], I just happened to see this JIRA. Might this JIRA be resolvable per https://github.com/apache/spark/pull/13802#issuecomment-243756620? > Support HashAggregateExec for non-partial aggregates > > > Key: SPARK-16094 > URL: https://issues.apache.org/jira/browse/SPARK-16094 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.1 >Reporter: Takeshi Yamamuro > > Spark currently cannot use `HashAggregateExec` for non-partial aggregates > because `Collect` (`CollectSet`/`CollectList`) has a single shared buffer > inside. Since SortAggregateExec is expensive for bigger data, we'd be better > off fixing this. > This ticket is intended to change plans from > {code} > SortAggregate(key=[key#3077], functions=[collect_set(value#3078, 0, 0)], > output=[key#3077,collect_set(value)#3088]) > +- *Sort [key#3077 ASC], false, 0 >+- Exchange hashpartitioning(key#3077, 5) > +- Scan ExistingRDD[key#3077,value#3078] > {code} > into > {code} > HashAggregate(keys=[key#3077], functions=[collect_set(value#3078, 0, 0)], > output=[key#3077, collect_set(value)#3088]) > +- Exchange hashpartitioning(key#3077, 5) >+- Scan ExistingRDD[key#3077,value#3078] > {code}
[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit
[ https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851472#comment-15851472 ] Kazuaki Ishizaki commented on SPARK-19372: -- I was able to reproduce this. I am thinking about how to reduce the bytecode size per Java method. > Code generation for Filter predicate including many OR conditions exceeds JVM > method size limit > > > Key: SPARK-19372 > URL: https://issues.apache.org/jira/browse/SPARK-19372 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi > Attachments: wide400cols.csv > > > For the attached csv file, the code below causes the exception > "org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" > grows beyond 64 KB" > Code: > {code:borderStyle=solid} > val conf = new SparkConf().setMaster("local[1]") > val sqlContext = > SparkSession.builder().config(conf).getOrCreate().sqlContext > val dataframe = > sqlContext > .read > .format("com.databricks.spark.csv") > .load("wide400cols.csv") > val filter = (0 to 399) > .foldLeft(lit(false))((e, index) => > e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}")) > val filtered = dataframe.filter(filter) > filtered.show(100) > {code}
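One common way around a per-method code-size limit like the one hit in SPARK-19372 is to split a wide OR over many predicates into smaller groups, each of which could be compiled into its own (small) method. The sketch below shows the grouping idea in Python purely as an illustration of the approach under discussion; it is not the actual Spark fix, and all names are hypothetical:

```python
def build_chunked_or(predicates, chunk_size=50):
    """Compose many per-row predicates into one, grouped so that each
    group could be emitted as a separate small generated method."""
    chunks = [predicates[i:i + chunk_size]
              for i in range(0, len(predicates), chunk_size)]
    # Each group function ORs only its own slice of the predicates.
    groups = [lambda row, c=chunk: any(p(row) for p in c) for chunk in chunks]
    return lambda row: any(g(row) for g in groups)

# 400 predicates of the shape col_i != "column{i+1}", mirroring the report.
preds = [lambda row, i=i: row[i] != f"column{i + 1}" for i in range(400)]
matches = build_chunked_or(preds)

row_all_defaults = [f"column{i + 1}" for i in range(400)]
print(matches(row_all_defaults))  # False: every predicate fails on this row
```

The composed predicate is semantically identical to the flat 400-way OR; only the shape of the generated code changes, which is what matters for the 64 KB JVM method limit.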
[jira] [Commented] (SPARK-16043) Prepare GenericArrayData implementation specialized for a primitive array
[ https://issues.apache.org/jira/browse/SPARK-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851471#comment-15851471 ] Hyukjin Kwon commented on SPARK-16043: -- (maybe it would be great if this one is checked too, per https://github.com/apache/spark/pull/13758#issuecomment-269589087) > Prepare GenericArrayData implementation specialized for a primitive array > - > > Key: SPARK-16043 > URL: https://issues.apache.org/jira/browse/SPARK-16043 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > There is a TODO in the GenericArrayData class to eliminate > boxing/unboxing for a primitive array (described > [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31]). > It would be good to prepare a GenericArrayData implementation specialized for > primitive arrays to eliminate boxing/unboxing, for both runtime memory > footprint and performance.
[jira] [Commented] (SPARK-16042) Eliminate nullcheck code at projection for an array type
[ https://issues.apache.org/jira/browse/SPARK-16042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851469#comment-15851469 ] Hyukjin Kwon commented on SPARK-16042: -- [~kiszk], would this JIRA maybe be resolvable per https://github.com/apache/spark/pull/13757#issuecomment-270453328? > Eliminate nullcheck code at projection for an array type > > > Key: SPARK-16042 > URL: https://issues.apache.org/jira/browse/SPARK-16042 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kazuaki Ishizaki > > When we run a Spark program with a projection for an array type, a null check > is generated at each call that writes an array element. If we know at > compilation time that none of the elements can be {{null}}, we can eliminate > the null-check code. > {code} > val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v") > df.selectExpr("Array(v + 2.2, v + 3.3)").collect > {code}
[jira] [Commented] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy`
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851468#comment-15851468 ] Hyukjin Kwon commented on SPARK-16041: -- [~smilegator], I just happened to see this JIRA while looking through the history. Per https://github.com/apache/spark/pull/13756#issuecomment-237658676, would this JIRA maybe be resolvable? > Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy` > > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `bucketBy`, or `sortBy` in > DataFrameWriter. Duplicate columns can cause unpredictable results, for > example resolution failures. > We should detect the duplicates and issue exceptions with appropriate > messages.
[jira] [Resolved] (SPARK-15911) Remove additional Project to be consistent with SQL when insert into table
[ https://issues.apache.org/jira/browse/SPARK-15911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-15911. -- Resolution: Duplicate I am resolving this per https://github.com/apache/spark/pull/13631#issuecomment-227087585 > Remove additional Project to be consistent with SQL when insert into table > -- > > Key: SPARK-15911 > URL: https://issues.apache.org/jira/browse/SPARK-15911 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > Currently, in DataFrameWriter's insertInto and the Analyzer's > ResolveRelations, we add an additional Project to adjust column ordering. > However, this resolution should use ordering, not names. We should fix it.