[jira] [Commented] (SPARK-19337) Documentation and examples for LinearSVC

2017-01-31 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848099#comment-15848099
 ] 

yuhao yang commented on SPARK-19337:


I'll start work on this if no one has started.

> Documentation and examples for LinearSVC
> 
>
> Key: SPARK-19337
> URL: https://issues.apache.org/jira/browse/SPARK-19337
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> User guide + example code for LinearSVC
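
For illustration, a minimal Scala sketch of the kind of example code this would cover (a sketch only, assuming the Spark 2.2 {{spark.ml}} API and a DataFrame {{training}} with {{label}}/{{features}} columns; all names here are placeholders):

{code}
import org.apache.spark.ml.classification.LinearSVC

// Train a linear SVM; the parameter values are arbitrary placeholders.
val lsvc = new LinearSVC()
  .setMaxIter(10)
  .setRegParam(0.1)

val model = lsvc.fit(training)
println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
{code}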



[jira] [Assigned] (SPARK-19421) Remove numClasses and numFeatures methods in LinearSVC

2017-01-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19421:


Assignee: Apache Spark

> Remove numClasses and numFeatures methods in LinearSVC
> --
>
> Key: SPARK-19421
> URL: https://issues.apache.org/jira/browse/SPARK-19421
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> As suggested by [~holdenk], opening another JIRA to track the following issue:
> Methods {{numClasses}} and {{numFeatures}} in {{LinearSVCModel}} are already
> available through inheritance from {{JavaClassificationModel}}, so we should
> not add them explicitly.



[jira] [Assigned] (SPARK-19421) Remove numClasses and numFeatures methods in LinearSVC

2017-01-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19421:


Assignee: (was: Apache Spark)

> Remove numClasses and numFeatures methods in LinearSVC
> --
>
> Key: SPARK-19421
> URL: https://issues.apache.org/jira/browse/SPARK-19421
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> As suggested by [~holdenk], opening another JIRA to track the following issue:
> Methods {{numClasses}} and {{numFeatures}} in {{LinearSVCModel}} are already
> available through inheritance from {{JavaClassificationModel}}, so we should
> not add them explicitly.



[jira] [Commented] (SPARK-19421) Remove numClasses and numFeatures methods in LinearSVC

2017-01-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848063#comment-15848063
 ] 

Apache Spark commented on SPARK-19421:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/16727

> Remove numClasses and numFeatures methods in LinearSVC
> --
>
> Key: SPARK-19421
> URL: https://issues.apache.org/jira/browse/SPARK-19421
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> As suggested by [~holdenk], opening another JIRA to track the following issue:
> Methods {{numClasses}} and {{numFeatures}} in {{LinearSVCModel}} are already
> available through inheritance from {{JavaClassificationModel}}, so we should
> not add them explicitly.



[jira] [Created] (SPARK-19421) Remove numClasses and numFeatures methods in LinearSVC

2017-01-31 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-19421:


 Summary: Remove numClasses and numFeatures methods in LinearSVC
 Key: SPARK-19421
 URL: https://issues.apache.org/jira/browse/SPARK-19421
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: zhengruifeng
Priority: Minor


As suggested by [~holdenk], opening another JIRA to track the following issue:
Methods {{numClasses}} and {{numFeatures}} in {{LinearSVCModel}} are already
available through inheritance from {{JavaClassificationModel}}, so we should
not add them explicitly.



[jira] [Commented] (SPARK-19420) Confusing error message when using outer join on two large tables

2017-01-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848058#comment-15848058
 ] 

Apache Spark commented on SPARK-19420:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16762

> Confusing error message when using outer join on two large tables
> -
>
> Key: SPARK-19420
> URL: https://issues.apache.org/jira/browse/SPARK-19420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> When users run outer joins where both sides of the tables are too large to
> be broadcast, Spark still selects `BroadcastNestedLoopJoin`. The CROSS JOIN
> syntax cannot cover the outer-join scenario, but we still issue the following
> error message:
> {noformat}
> Use the CROSS JOIN syntax to allow cartesian products between these relations
> {noformat}
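
For illustration, a Scala sketch of the scenario described above (a sketch only; the table names and the non-equi condition are placeholders, and both sides are assumed to exceed {{spark.sql.autoBroadcastJoinThreshold}}):

{code}
// Outer join with no equi-join key: per the description, Spark 2.1 still plans
// this as a BroadcastNestedLoopJoin and, when neither side fits the broadcast
// threshold, reports the CROSS JOIN message quoted above, even though CROSS
// JOIN syntax does not apply to outer joins.
val joined = bigLeft.join(bigRight, bigLeft("ts") < bigRight("ts"), "full_outer")
joined.count()
{code}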



[jira] [Assigned] (SPARK-19420) Confusing error message when using outer join on two large tables

2017-01-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19420:


Assignee: Apache Spark  (was: Xiao Li)

> Confusing error message when using outer join on two large tables
> -
>
> Key: SPARK-19420
> URL: https://issues.apache.org/jira/browse/SPARK-19420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> When users run outer joins where both sides of the tables are too large to
> be broadcast, Spark still selects `BroadcastNestedLoopJoin`. The CROSS JOIN
> syntax cannot cover the outer-join scenario, but we still issue the following
> error message:
> {noformat}
> Use the CROSS JOIN syntax to allow cartesian products between these relations
> {noformat}



[jira] [Commented] (SPARK-19419) Unable to detect all the cases of cartesian products when spark.sql.crossJoin.enabled is false

2017-01-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848057#comment-15848057
 ] 

Apache Spark commented on SPARK-19419:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16762

> Unable to detect all the cases of cartesian products when 
> spark.sql.crossJoin.enabled is false
> --
>
> Key: SPARK-19419
> URL: https://issues.apache.org/jira/browse/SPARK-19419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> The existing detection is unable to cover all the cartesian product cases. 
> For example:
>   - Case 1) an inner join whose join conditions contain only non-equality 
> predicates.
>   - Case 2) an equi-join whose key columns are not sortable and whose sides 
> are both too large to broadcast.
> This PR moves the cross-join detection back to 
> `BroadcastNestedLoopJoinExec` and `CartesianProductExec`.
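
For illustration, a Scala sketch of Case 1 above (a sketch of the scenario only, not of the detection code; the table names are placeholders):

{code}
// Inner join whose only condition is a non-equality predicate: logically this
// degenerates into a cartesian product, but per the description above the
// pre-fix check could miss it even with spark.sql.crossJoin.enabled=false.
spark.conf.set("spark.sql.crossJoin.enabled", "false")
val result = t1.join(t2, t1("a") < t2("b"))  // inner join, non-equi condition only
result.explain()
{code}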



[jira] [Assigned] (SPARK-19419) Unable to detect all the cases of cartesian products when spark.sql.crossJoin.enabled is false

2017-01-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19419:


Assignee: Xiao Li  (was: Apache Spark)

> Unable to detect all the cases of cartesian products when 
> spark.sql.crossJoin.enabled is false
> --
>
> Key: SPARK-19419
> URL: https://issues.apache.org/jira/browse/SPARK-19419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> The existing detection is unable to cover all the cartesian product cases. 
> For example:
>   - Case 1) an inner join whose join conditions contain only non-equality 
> predicates.
>   - Case 2) an equi-join whose key columns are not sortable and whose sides 
> are both too large to broadcast.
> This PR moves the cross-join detection back to 
> `BroadcastNestedLoopJoinExec` and `CartesianProductExec`.



[jira] [Assigned] (SPARK-19420) Confusing error message when using outer join on two large tables

2017-01-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19420:


Assignee: Xiao Li  (was: Apache Spark)

> Confusing error message when using outer join on two large tables
> -
>
> Key: SPARK-19420
> URL: https://issues.apache.org/jira/browse/SPARK-19420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> When users run outer joins where both sides of the tables are too large to
> be broadcast, Spark still selects `BroadcastNestedLoopJoin`. The CROSS JOIN
> syntax cannot cover the outer-join scenario, but we still issue the following
> error message:
> {noformat}
> Use the CROSS JOIN syntax to allow cartesian products between these relations
> {noformat}



[jira] [Assigned] (SPARK-19419) Unable to detect all the cases of cartesian products when spark.sql.crossJoin.enabled is false

2017-01-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19419:


Assignee: Apache Spark  (was: Xiao Li)

> Unable to detect all the cases of cartesian products when 
> spark.sql.crossJoin.enabled is false
> --
>
> Key: SPARK-19419
> URL: https://issues.apache.org/jira/browse/SPARK-19419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> The existing detection is unable to cover all the cartesian product cases. 
> For example:
>   - Case 1) an inner join whose join conditions contain only non-equality 
> predicates.
>   - Case 2) an equi-join whose key columns are not sortable and whose sides 
> are both too large to broadcast.
> This PR moves the cross-join detection back to 
> `BroadcastNestedLoopJoinExec` and `CartesianProductExec`.



[jira] [Created] (SPARK-19420) Confusing error message when using outer join on two large tables

2017-01-31 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19420:
---

 Summary: Confusing error message when using outer join on two 
large tables
 Key: SPARK-19420
 URL: https://issues.apache.org/jira/browse/SPARK-19420
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li


When users run outer joins where both sides of the tables are too large to be 
broadcast, Spark still selects `BroadcastNestedLoopJoin`. The CROSS JOIN syntax 
cannot cover the outer-join scenario, but we still issue the following error 
message:
{noformat}
Use the CROSS JOIN syntax to allow cartesian products between these relations
{noformat}




[jira] [Created] (SPARK-19419) Unable to detect all the cases of cartesian products when spark.sql.crossJoin.enabled is false

2017-01-31 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19419:
---

 Summary: Unable to detect all the cases of cartesian products when 
spark.sql.crossJoin.enabled is false
 Key: SPARK-19419
 URL: https://issues.apache.org/jira/browse/SPARK-19419
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li


The existing detection is unable to cover all the cartesian product cases. For 
example:
  - Case 1) an inner join whose join conditions contain only non-equality 
predicates.
  - Case 2) an equi-join whose key columns are not sortable and whose sides are 
both too large to broadcast.

This PR moves the cross-join detection back to 
`BroadcastNestedLoopJoinExec` and `CartesianProductExec`.



[jira] [Commented] (SPARK-19319) SparkR Kmeans summary returns error when the cluster size doesn't equal to k

2017-01-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848051#comment-15848051
 ] 

Apache Spark commented on SPARK-19319:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/16761

> SparkR Kmeans summary returns error when the cluster size doesn't equal to k
> 
>
> Key: SPARK-19319
> URL: https://issues.apache.org/jira/browse/SPARK-19319
> Project: Spark
>  Issue Type: Bug
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> When KMeans runs with initMode = "random" and certain random seeds, the 
> actual number of clusters may not equal the configured `k`.
> In this case, summary(model) returns an error because the number of columns 
> in the coefficient matrix doesn't equal k.



[jira] [Commented] (SPARK-19304) Kinesis checkpoint recovery is 10x slow

2017-01-31 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848042#comment-15848042
 ] 

Gaurav Shah commented on SPARK-19304:
-

[~srowen] I tried working through the code but am unsure how to write clean code 
in this state. `KinesisBackedBlockRDD.getPartitions` returns one partition per 
block, and with one partition per block we cannot optimise the `getRecords` call, 
since there is generally one sequence number per block.

I tried modifying `KinesisBackedBlockRDD.getPartitions` to return one partition 
for validBlockIds and one partition for invalidBlockIds. That makes 
`KinesisBackedBlockRDDPartition` look odd, since it now accepts an array of 
`blockIds` while it inherits from `BlockRDD`, where there is one partition per 
block. I tried this path and it works; it also brings the recovery time down to 
the normal batch processing time.

Any direction on where I can improve this code?

I can open a pull request to show the differences.

> Kinesis checkpoint recovery is 10x slow
> ---
>
> Key: SPARK-19304
> URL: https://issues.apache.org/jira/browse/SPARK-19304
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: using s3 for checkpoints using 1 executor, with 19g mem 
> & 3 cores per executor
>Reporter: Gaurav Shah
>  Labels: kinesis
>
> The application runs fine initially, running 1-hour batches with an average 
> processing time of less than 30 minutes. Now suppose the application crashes 
> and we try to restart from the checkpoint: processing then takes forever and 
> does not move forward. We tested the same thing at a batch interval of 1 
> minute; normally each batch finishes in about 1.2 minutes, but when recovering 
> from the checkpoint each batch takes about 15 minutes. After the recovery the 
> batches again process at normal speed.
> I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown.
> Stackoverflow post with more details: 
> http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow



[jira] [Resolved] (SPARK-19319) SparkR Kmeans summary returns error when the cluster size doesn't equal to k

2017-01-31 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19319.
--
  Resolution: Fixed
Assignee: Miao Wang
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> SparkR Kmeans summary returns error when the cluster size doesn't equal to k
> 
>
> Key: SPARK-19319
> URL: https://issues.apache.org/jira/browse/SPARK-19319
> Project: Spark
>  Issue Type: Bug
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> When KMeans runs with initMode = "random" and certain random seeds, the 
> actual number of clusters may not equal the configured `k`.
> In this case, summary(model) returns an error because the number of columns 
> in the coefficient matrix doesn't equal k.



[jira] [Commented] (SPARK-19418) Dataset generated java code fails to compile as java.lang.Long does not accept UTF8String in constructor

2017-01-31 Thread Suresh Avadhanula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848017#comment-15848017
 ] 

Suresh Avadhanula commented on SPARK-19418:
---

I can attach the sample code to reproduce the issue if required.


> Dataset generated java code fails to compile as java.lang.Long does not 
> accept UTF8String in constructor
> 
>
> Key: SPARK-19418
> URL: https://issues.apache.org/jira/browse/SPARK-19418
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Suresh Avadhanula
>
> I have the following in a Java Spark driver. 
> The DealerPerson model object is
> {code:title=DealerPerson.java|borderStyle=solid}
> public class DealerPerson
> {
>     Long schemaOrgUnitId;
>     List<Person> personList;
> }
> {code}
> I populate it using a group-by as follows.
> {code}
> Dataset<DealerPerson> dps = persondds.groupByKey(new MapFunction<Person, Long>() {
>     @Override
>     public Long call(Person person) throws Exception {
>         return person.getSchemaOrgUnitId();
>     }
> }, Encoders.LONG()).
> mapGroups(new MapGroupsFunction<Long, Person, DealerPerson>() {
>     @Override
>     public DealerPerson call(Long dp,
>             java.util.Iterator<Person> iterator) throws Exception {
>         DealerPerson retDp = new DealerPerson();
>         retDp.setSchemaOrgUnitId(dp);
>         ArrayList<Person> persons = new ArrayList<>();
>         while (iterator.hasNext())
>             persons.add(iterator.next());
>         retDp.setPersons(persons);
>         return retDp;
>     }
> }, Encoders.bean(DealerPerson.class));
> {code}
> The generated code throws a compile exception because a UTF8String is passed 
> to the java.lang.Long constructor:
> {noformat}
> 7/01/31 20:32:28 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 5, 
> localhost, executor driver, partition 0, PROCESS_LOCAL, 6442 bytes)
> 17/01/31 20:32:28 INFO Executor: Running task 0.0 in stage 5.0 (TID 5)
> 17/01/31 20:32:28 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 58: No applicable constructor/method found for actual parameters 
> "org.apache.spark.unsafe.types.UTF8String"; candidates are: 
> "java.lang.Long(long)", "java.lang.Long(java.lang.String)"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificSafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private InternalRow mutableRow;
> /* 009 */   private com.xtime.spark.model.Person javaBean;
> /* 010 */   private boolean resultIsNull;
> /* 011 */   private UTF8String argValue;
> /* 012 */   private boolean resultIsNull1;
> /* 013 */   private UTF8String argValue1;
> /* 014 */   private boolean resultIsNull2;
> /* 015 */   private UTF8String argValue2;
> /* 016 */   private boolean resultIsNull3;
> /* 017 */   private UTF8String argValue3;
> /* 018 */   private boolean resultIsNull4;
> /* 019 */   private UTF8String argValue4;
> /* 020 */
> /* 021 */   public SpecificSafeProjection(Object[] references) {
> /* 022 */ this.references = references;
> /* 023 */ mutableRow = (InternalRow) references[references.length - 1];
> /* 024 */
> /* 025 */
> /* 026 */
> /* 027 */
> /* 028 */
> /* 029 */
> /* 030 */
> /* 031 */
> /* 032 */
> /* 033 */
> /* 034 */
> /* 035 */
> /* 036 */   }
> /* 037 */
> /* 038 */   public void initialize(int partitionIndex) {
> /* 039 */
> /* 040 */   }
> /* 041 */
> /* 042 */
> /* 043 */   private void apply_4(InternalRow i) {
> /* 044 */
> /* 045 */
> /* 046 */ resultIsNull1 = false;
> /* 047 */ if (!resultIsNull1) {
> /* 048 */
> /* 049 */   boolean isNull21 = i.isNullAt(17);
> /* 050 */   UTF8String value21 = isNull21 ? null : (i.getUTF8String(17));
> /* 051 */   resultIsNull1 = isNull21;
> /* 052 */   argValue1 = value21;
> /* 053 */ }
> /* 054 */
> /* 055 */
> /* 056 */ final java.lang.Long value20 = resultIsNull1 ? null : new 
> java.lang.Long(argValue1);
> /* 057 */ javaBean.setSchemaOrgUnitId(value20);
> /* 058 */
> /* 059 */
> /* 060 */ resultIsNull2 = false;
> /* 061 */ if (!resultIsNull2) {
> /* 062 */
> /* 063 */   boolean isNull23 = i.isNullAt(0);
> /* 064 */   UTF8String value23 = isNull23 ? null : (i.getUTF8String(0));
> /* 065 */   resultIsNull2 = isNull23;
> /* 066 */   argValue2 = value23;
> /* 067 */ }
> 

[jira] [Created] (SPARK-19418) Dataset generated java code fails to compile as java.lang.Long does not accept UTF8String in constructor

2017-01-31 Thread Suresh Avadhanula (JIRA)
Suresh Avadhanula created SPARK-19418:
-

 Summary: Dataset generated java code fails to compile as 
java.lang.Long does not accept UTF8String in constructor
 Key: SPARK-19418
 URL: https://issues.apache.org/jira/browse/SPARK-19418
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Suresh Avadhanula



I have the following in a Java Spark driver. 

The DealerPerson model object is
{code:title=DealerPerson.java|borderStyle=solid}
public class DealerPerson
{
    Long schemaOrgUnitId;
    List<Person> personList;
}
{code}

I populate it using a group-by as follows.
{code}
Dataset<DealerPerson> dps = persondds.groupByKey(new MapFunction<Person, Long>() {
    @Override
    public Long call(Person person) throws Exception {
        return person.getSchemaOrgUnitId();
    }
}, Encoders.LONG()).
mapGroups(new MapGroupsFunction<Long, Person, DealerPerson>() {
    @Override
    public DealerPerson call(Long dp,
            java.util.Iterator<Person> iterator) throws Exception {
        DealerPerson retDp = new DealerPerson();
        retDp.setSchemaOrgUnitId(dp);
        ArrayList<Person> persons = new ArrayList<>();
        while (iterator.hasNext())
            persons.add(iterator.next());
        retDp.setPersons(persons);
        return retDp;
    }
}, Encoders.bean(DealerPerson.class));
{code}

The generated code throws a compile exception because a UTF8String is passed to 
the java.lang.Long constructor:

{noformat}
7/01/31 20:32:28 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 5, 
localhost, executor driver, partition 0, PROCESS_LOCAL, 6442 bytes)
17/01/31 20:32:28 INFO Executor: Running task 0.0 in stage 5.0 (TID 5)
17/01/31 20:32:28 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, 
Column 58: No applicable constructor/method found for actual parameters 
"org.apache.spark.unsafe.types.UTF8String"; candidates are: 
"java.lang.Long(long)", "java.lang.Long(java.lang.String)"
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificSafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificSafeProjection extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private com.xtime.spark.model.Person javaBean;
/* 010 */   private boolean resultIsNull;
/* 011 */   private UTF8String argValue;
/* 012 */   private boolean resultIsNull1;
/* 013 */   private UTF8String argValue1;
/* 014 */   private boolean resultIsNull2;
/* 015 */   private UTF8String argValue2;
/* 016 */   private boolean resultIsNull3;
/* 017 */   private UTF8String argValue3;
/* 018 */   private boolean resultIsNull4;
/* 019 */   private UTF8String argValue4;
/* 020 */
/* 021 */   public SpecificSafeProjection(Object[] references) {
/* 022 */ this.references = references;
/* 023 */ mutableRow = (InternalRow) references[references.length - 1];
/* 024 */
/* 025 */
/* 026 */
/* 027 */
/* 028 */
/* 029 */
/* 030 */
/* 031 */
/* 032 */
/* 033 */
/* 034 */
/* 035 */
/* 036 */   }
/* 037 */
/* 038 */   public void initialize(int partitionIndex) {
/* 039 */
/* 040 */   }
/* 041 */
/* 042 */
/* 043 */   private void apply_4(InternalRow i) {
/* 044 */
/* 045 */
/* 046 */ resultIsNull1 = false;
/* 047 */ if (!resultIsNull1) {
/* 048 */
/* 049 */   boolean isNull21 = i.isNullAt(17);
/* 050 */   UTF8String value21 = isNull21 ? null : (i.getUTF8String(17));
/* 051 */   resultIsNull1 = isNull21;
/* 052 */   argValue1 = value21;
/* 053 */ }
/* 054 */
/* 055 */
/* 056 */ final java.lang.Long value20 = resultIsNull1 ? null : new 
java.lang.Long(argValue1);
/* 057 */ javaBean.setSchemaOrgUnitId(value20);
/* 058 */
/* 059 */
/* 060 */ resultIsNull2 = false;
/* 061 */ if (!resultIsNull2) {
/* 062 */
/* 063 */   boolean isNull23 = i.isNullAt(0);
/* 064 */   UTF8String value23 = isNull23 ? null : (i.getUTF8String(0));
/* 065 */   resultIsNull2 = isNull23;
/* 066 */   argValue2 = value23;
/* 067 */ }
/* 068 */
/* 069 */
/* 070 */ final java.lang.Long value22 = resultIsNull2 ? null : new 
java.lang.Long(argValue2);
/* 071 */ javaBean.setPersonId(value22);
/* 072 */
/* 073 */
/* 074 */ boolean isNull25 = i.isNullAt(10);
/* 075 */ UTF8String value25 = isNull25 ? null : (i.getUTF8String(10));
/* 076 */ boolean isNull24 = true;
/* 077 */ java.lang.String value24 = null;
/* 078 */ if (!isNull25) {
/* 079 */
/* 080 */   isNull24 = false;
/* 081 */   if (!isNull24) {
/* 082 */
/* 083 */ Object funcResult8 = 

[jira] [Updated] (SPARK-19386) Bisecting k-means in SparkR documentation

2017-01-31 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19386:
-
Target Version/s: 2.2.0

> Bisecting k-means in SparkR documentation
> -
>
> Key: SPARK-19386
> URL: https://issues.apache.org/jira/browse/SPARK-19386
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Miao Wang
>
> we need updates to programming guide, example and vignettes



[jira] [Created] (SPARK-19417) spark.files.overwrite is ignored

2017-01-31 Thread Chris Kanich (JIRA)
Chris Kanich created SPARK-19417:


 Summary: spark.files.overwrite is ignored
 Key: SPARK-19417
 URL: https://issues.apache.org/jira/browse/SPARK-19417
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Chris Kanich


I have not been able to get Spark to actually overwrite a file after I have 
changed it on the driver node, re-called addFile, and then used it on the 
executors again. Here's a failing test.

{code}

  test("can overwrite files when spark.files.overwrite is true") {
val dir = Utils.createTempDir()
val file = new File(dir, "file")
try {
  Files.write("one", file, StandardCharsets.UTF_8)
  sc = new SparkContext(new 
SparkConf().setAppName("test").setMaster("local-cluster[1,1,1024]")
 .set("spark.files.overwrite", "true"))
  sc.addFile(file.getAbsolutePath)
  def getAddedFileContents(): String = {
sc.parallelize(Seq(0)).map { _ =>
  scala.io.Source.fromFile(SparkFiles.get("file")).mkString
}.first()
  }
  assert(getAddedFileContents() === "one")
  Files.write("two", file, StandardCharsets.UTF_8)
  sc.addFile(file.getAbsolutePath)
  assert(getAddedFileContents() === "onetwo")
} finally {
  Utils.deleteRecursively(dir)
  sc.stop()
}
  }

{code}



[jira] [Commented] (SPARK-19098) Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations

2017-01-31 Thread Steven Ruppert (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847913#comment-15847913
 ] 

Steven Ruppert commented on SPARK-19098:


Actually, the "fix" of checkpointing the RDDs only seems to work in Spark 
2.0.2. Whatever is getting into the kryo-serialized form is very subtle.
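
For reference, a minimal Scala sketch of what "checkpointing the RDDs" can look like here (the checkpoint directory and the choice of RDDs to checkpoint are assumptions, and per the comment above this only appears to help on 2.0.2):

{code}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.graphx.lib.ConnectedComponents

// Checkpointing the graph's underlying RDDs to reliable storage truncates the
// lineage that otherwise grows across Pregel/ConnectedComponents iterations.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // placeholder path

val graph = Graph(vertices, edges)
graph.vertices.checkpoint()
graph.edges.checkpoint()

val components = ConnectedComponents.run(graph, 10).vertices
{code}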

> Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
> -
>
> Key: SPARK-19098
> URL: https://issues.apache.org/jira/browse/SPARK-19098
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.2, 2.1.0
> Environment: Linux x64
> Cloudera CDH 5.8.0 hadoop (roughly hadoop 2.7.0)
> Spark on YARN, dynamic allocation with shuffle service
> Input/Output data on HDFS
> kryo serialization turned on
> checkpointing directory set on HDFS
>Reporter: Steven Ruppert
>Priority: Minor
> Attachments: doubling-season.png, Screen Shot 2017-01-30 at 
> 18.36.43-fullpage.png
>
>
> I'm seeing a strange memory-leak-but-not-really problem in a pretty vanilla 
> ConnectedComponents use, notably one that works fine with identical code on 
> Spark 2.0.1, but not on 2.1.0.
> I unfortunately haven't narrowed this down to a test case yet, nor do I have 
> access to the original logs, so this initial report will be a little vague. 
> However, this behavior as described might ring a bell to somebody.
> Roughly: 
> {noformat}
> val edges: RDD[Edge[Int]] = _ // from file
> val vertices: RDD[(VertexId, Int)] = _ // from file
> val graph = Graph(vertices, edges)
> val components: RDD[(VertexId, ComponentId)] = ConnectedComponents
>   .run(graph, 10)
>   .vertices
> {noformat}
> Running this against my input of ~5B edges and ~3B vertices leads to a 
> strange doubling of shuffle traffic in each round of Pregel (inside 
> ConnectedComponents), increasing from the actual data size of ~50 GB, to 
> 100GB, to 200GB, all the way to around 40TB before I killed the job. The data 
> being shuffled was apparently an RDD of ShippableVertexPartition.
> Oddly enough, only the kryo-serialized shuffled data doubled in size. The 
> heap usage of the executors themselves remained stable, or at least did not 
> account 1 to 1 for the 40TB of shuffled data, for I definitely do not have 
> 40TB of RAM. Furthermore, I also have kryo reference tracking turned on 
> still, so whatever is leaking somehow gets around that.
> I'll update this ticket once I have more details, unless somebody else with 
> the same problem reports back first.



[jira] [Comment Edited] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-01-31 Thread Jun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847889#comment-15847889
 ] 

Jun Ye edited comment on SPARK-16599 at 2/1/17 2:25 AM:


I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

{code}
case class MyClass(id: Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://myS3Bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
{code}

The exception only happened at the following line:
{code}
.map(a => MyClass(a(0).toInt, a(1)))
{code}
If I removed this line, there is no exception. 

So I changed it to the following to bypass this exception.
{code}
val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")
{code}

It works for me.





was (Author: yetsun):
I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

{code}
case class MyClass(id: Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
{code}

The exception only happened at the following line:
{code}
.map(a => MyClass(a(0).toInt, a(1)))
{code}
If I removed this line, there is no exception. 

So I changed it to the following to bypass this exception.
{code}
val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")
{code}

It works for me.




> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> run a spark job with spark 2.0, error message
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



[jira] [Comment Edited] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-01-31 Thread Jun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847889#comment-15847889
 ] 

Jun Ye edited comment on SPARK-16599 at 2/1/17 2:24 AM:


I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

{code}
case class MyClass(id: Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
{code}

The exception only happened at the following line:
{code}
.map(a => MyClass(a(0).toInt, a(1)))
{code}
If I removed this line, there is no exception. 

So I changed it to the following to bypass this exception.
{code}
val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")
{code}

It works for me.





was (Author: yetsun):
I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

{code}
case class MyClass(id:Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
{code}

The exception only happened at the following line:
{code}
.map(a => MyClass(a(0).toInt, a(1)))
{code}
If I removed this line, there is no exception. 

So I changed it to the following to bypass this exception.
{code}
val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")
{code}

It works for me.




> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> run a spark job with spark 2.0, error message
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



[jira] [Comment Edited] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-01-31 Thread Jun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847889#comment-15847889
 ] 

Jun Ye edited comment on SPARK-16599 at 2/1/17 2:25 AM:


I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

{code}
case class MyClass(id: Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://myS3Bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
{code}

The exception only happened at the following line:
{code}
.map(a => MyClass(a(0).toInt, a(1)))
{code}
If I removed this line, there is no exception. 

So I changed it to the following to bypass this exception.
{code}
val myDF = sparkSession.sparkContext
.textFile("s3a://myS3Bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")
{code}

It works for me.





was (Author: yetsun):
I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

{code}
case class MyClass(id: Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://myS3Bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
{code}

The exception only happened at the following line:
{code}
.map(a => MyClass(a(0).toInt, a(1)))
{code}
If I removed this line, there is no exception. 

So I changed it to the following to bypass this exception.
{code}
val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")
{code}

It works for me.




> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> run a spark job with spark 2.0, error message
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



[jira] [Comment Edited] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-01-31 Thread Jun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847889#comment-15847889
 ] 

Jun Ye edited comment on SPARK-16599 at 2/1/17 2:24 AM:


I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

{code}
case class MyClass(id:Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
{code}

The exception only happened at the following line:
{code}
.map(a => MyClass(a(0).toInt, a(1)))
{code}
If I removed this line, there is no exception. 

So I changed it to the following to bypass this exception.
{code}
val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")
{code}

It works for me.





was (Author: yetsun):
I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

{code}
case class MyClass(id:Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
{code}

The exception only happened at the following line 
{code}
.map(a => MyClass(a(0).toInt, a(1)))
{code}
If I removed this line, there is no exception. 

So I change to the following to bypass this exception.
{code}
val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")
{code}

It works for me.




> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> run a spark job with spark 2.0, error message
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



[jira] [Comment Edited] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-01-31 Thread Jun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847889#comment-15847889
 ] 

Jun Ye edited comment on SPARK-16599 at 2/1/17 2:22 AM:


I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

{code:scala}
case class MyClass(id:Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
{code}

The exception only happened at the following line 
{code:scala}
.map(a => MyClass(a(0).toInt, a(1)))
{code}
If I removed this line, there is no exception. 

So I change to the following to bypass this exception.
{code:scala}
val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")
{code}

It works for me.





was (Author: yetsun):
I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

```
case class MyClass(id:Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
```

The exception only happened at the following line 
```
.map(a => MyClass(a(0).toInt, a(1)))
```
If I removed this line, there is no exception. 

So I change to the following to bypass this exception.
```
val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")

```

It works for me.




> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> run a spark job with spark 2.0, error message
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



[jira] [Comment Edited] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-01-31 Thread Jun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847889#comment-15847889
 ] 

Jun Ye edited comment on SPARK-16599 at 2/1/17 2:23 AM:


I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

{code}
case class MyClass(id:Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
{code}

The exception only happened at the following line 
{code}
.map(a => MyClass(a(0).toInt, a(1)))
{code}
If I removed this line, there is no exception. 

So I change to the following to bypass this exception.
{code}
val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")
{code}

It works for me.





was (Author: yetsun):
I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

{code:scala}
case class MyClass(id:Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
{code}

The exception only happened at the following line 
{code:scala}
.map(a => MyClass(a(0).toInt, a(1)))
{code}
If I removed this line, there is no exception. 

So I change to the following to bypass this exception.
{code:scala}
val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")
{code}

It works for me.




> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> run a spark job with spark 2.0, error message
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



[jira] [Commented] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-01-31 Thread Jun Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847889#comment-15847889
 ] 

Jun Ye commented on SPARK-16599:


I got the same exception with the following code: 

(My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3)

```
case class MyClass(id:Int, name: String)

val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map(a => MyClass(a(0).toInt, a(1)))
.toDF
```

The exception only happened at the following line 
```
.map(a => MyClass(a(0).toInt, a(1)))
```
If I removed this line, there is no exception. 

So I change to the following to bypass this exception.
```
val myDF = sparkSession.sparkContext
.textFile("s3a://mys3bucket/myTextFile.txt")
.map(_.split("\t"))
.map(_.map(_.trim))
.map {
  case a: Array[String] => (a(0).toInt, a(1))
}.toDF("id", "name")

```

It works for me.




> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> run a spark job with spark 2.0, error message
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



[jira] [Resolved] (SPARK-19378) StateOperator metrics should still return the total number of rows in state even if there was no data for a trigger

2017-01-31 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-19378.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.1.1

Issue resolved by pull request 16716
[https://github.com/apache/spark/pull/16716]

> StateOperator metrics should still return the total number of rows in state 
> even if there was no data for a trigger
> ---
>
> Key: SPARK-19378
> URL: https://issues.apache.org/jira/browse/SPARK-19378
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.1, 3.0.0
>
>
> If you have a StreamingDataFrame with an aggregation, we report a metric 
> called stateOperators which consists of a list of data points per aggregation 
> for our query (with Spark 2.1, only one aggregation is supported).
> These data points report:
>  - numUpdatedStateRows
>  - numTotalStateRows
> If a trigger had no data - and therefore was not fired - we return 0 data points; 
> however, we should actually return a data point with
>  - numTotalStateRows: numTotalStateRows in lastExecution
>  - numUpdatedStateRows: 0
> This also affects eventTime statistics. We should still provide the min, max, and 
> avg even though the data didn't change.
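
For illustration, a hedged sketch of where these metrics surface to users (assumes Spark 2.1 APIs; the socket source and the aggregation below are placeholders, not part of this issue):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("state-metrics-sketch").getOrCreate()

// Placeholder streaming aggregation over a socket source (any stateful aggregation works).
val counts = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .groupBy("value")
  .count()

val query = counts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("counts")
  .start()

// stateOperators carries numRowsUpdated / numRowsTotal per stateful operator.
// The proposal above is that a trigger with no data should still report
// numRowsTotal from the last execution instead of dropping the data point.
Option(query.lastProgress).foreach { progress =>
  progress.stateOperators.foreach { op =>
    println(s"numRowsUpdated=${op.numRowsUpdated}, numRowsTotal=${op.numRowsTotal}")
  }
}
{code}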



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

2017-01-31 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847786#comment-15847786
 ] 

Joseph K. Bradley edited comment on SPARK-12965 at 2/1/17 12:43 AM:


I'd say this is both a SQL and MLlib issue, where the MLlib issue is blocked by 
the SQL one.
* SQL: {{schema}} handles periods/quotes inconsistently relative to the rest of 
the Dataset API
* ML: StringIndexer could avoid using schema.fieldNames and instead use an API 
provided by StructType for checking for the existence of a field.  That said, 
that API needs to be added to StructType...

I'm going to update this issue to be for ML only to handle fixing StringIndexer 
and link to a separate JIRA for the SQL issue.
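
For illustration, a minimal sketch of the kind of StructType helper being suggested (the object and method names below are hypothetical; no such API exists yet):

{code}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper: check for a field by its literal name (dots and all),
// without going through column resolution or backtick parsing.
object StructTypeUtil {
  def hasField(schema: StructType, name: String): Boolean =
    schema.fields.exists(_.name == name)
}

val schema = StructType(Seq(
  StructField("a.b", StringType),
  StructField("id", StringType)))

assert(StructTypeUtil.hasField(schema, "a.b"))    // matches the literal field name
assert(!StructTypeUtil.hasField(schema, "`a.b`")) // backticks are not part of the name
{code}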



was (Author: josephkb):
I'd say this is both a SQL and MLlib issue.

I'm going to update this issue to be for ML only to handle fixing StringIndexer 
and link to a separate JIRA for the SQL issue.

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> ---
>
> Key: SPARK-12965
> URL: https://issues.apache.org/jira/browse/SPARK-12965
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Joshua Taylor
> Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way 
> that other methods do.  E.g., Given a DataFrame df, {{df.col("`a.b`")}} will 
> return a column.  On a StringIndexer indexer, 
> {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting 
> and transforming seem to have no effect.  Running the following code produces:
> {noformat}
> +---+---+--------+
> |a.b|a_b|a_bIndex|
> +---+---+--------+
> |foo|foo|     0.0|
> |bar|bar|     1.0|
> +---+---+--------+
> {noformat}
> but I think it should have another column, {{abIndex}} with the same contents 
> as a_bIndex.
> {code}
> public class SparkMLDotColumn {
>   public static void main(String[] args) {
>   // Get the contexts
>   SparkConf conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false");
>   JavaSparkContext sparkContext = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sparkContext);
>   
>   // Create a schema with a single string column named "a.b"
>   StructType schema = new StructType(new StructField[] {
>   DataTypes.createStructField("a.b", 
> DataTypes.StringType, false)
>   });
>   // Create an empty RDD and DataFrame
>   List<Row> rows = Arrays.asList(RowFactory.create("foo"), 
> RowFactory.create("bar")); 
>   JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>   DataFrame df = sqlContext.createDataFrame(rdd, schema);
>   
>   df = df.withColumn("a_b", df.col("`a.b`"));
>   
>   StringIndexer indexer0 = new StringIndexer();
>   indexer0.setInputCol("a_b");
>   indexer0.setOutputCol("a_bIndex");
>   df = indexer0.fit(df).transform(df);
>   
>   StringIndexer indexer1 = new StringIndexer();
>   indexer1.setInputCol("`a.b`");
>   indexer1.setOutputCol("abIndex");
>   df = indexer1.fit(df).transform(df);
>   
>   df.show();
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

2017-01-31 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847786#comment-15847786
 ] 

Joseph K. Bradley commented on SPARK-12965:
---

I'd say this is both a SQL and MLlib issue.

I'm going to update this issue to be for ML only to handle fixing StringIndexer 
and link to a separate JIRA for the SQL issue.

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> ---
>
> Key: SPARK-12965
> URL: https://issues.apache.org/jira/browse/SPARK-12965
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Joshua Taylor
> Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way 
> that other methods do.  E.g., Given a DataFrame df, {{df.col("`a.b`")}} will 
> return a column.  On a StringIndexer indexer, 
> {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting 
> and transforming seem to have no effect.  Running the following code produces:
> {noformat}
> +---+---+--------+
> |a.b|a_b|a_bIndex|
> +---+---+--------+
> |foo|foo|     0.0|
> |bar|bar|     1.0|
> +---+---+--------+
> {noformat}
> but I think it should have another column, {{abIndex}} with the same contents 
> as a_bIndex.
> {code}
> public class SparkMLDotColumn {
>   public static void main(String[] args) {
>   // Get the contexts
>   SparkConf conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false");
>   JavaSparkContext sparkContext = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sparkContext);
>   
>   // Create a schema with a single string column named "a.b"
>   StructType schema = new StructType(new StructField[] {
>   DataTypes.createStructField("a.b", 
> DataTypes.StringType, false)
>   });
>   // Create an empty RDD and DataFrame
>   List<Row> rows = Arrays.asList(RowFactory.create("foo"), 
> RowFactory.create("bar")); 
>   JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>   DataFrame df = sqlContext.createDataFrame(rdd, schema);
>   
>   df = df.withColumn("a_b", df.col("`a.b`"));
>   
>   StringIndexer indexer0 = new StringIndexer();
>   indexer0.setInputCol("a_b");
>   indexer0.setOutputCol("a_bIndex");
>   df = indexer0.fit(df).transform(df);
>   
>   StringIndexer indexer1 = new StringIndexer();
>   indexer1.setInputCol("`a.b`");
>   indexer1.setOutputCol("abIndex");
>   df = indexer1.fit(df).transform(df);
>   
>   df.show();
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

2017-01-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12965:
--
Affects Version/s: (was: 1.6.0)
   2.2.0
   1.6.3
   2.0.2
   2.1.0

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> ---
>
> Key: SPARK-12965
> URL: https://issues.apache.org/jira/browse/SPARK-12965
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Joshua Taylor
> Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way 
> that other methods do.  E.g., Given a DataFrame df, {{df.col("`a.b`")}} will 
> return a column.  On a StringIndexer indexer, 
> {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting 
> and transforming seem to have no effect.  Running the following code produces:
> {noformat}
> +---+---+--------+
> |a.b|a_b|a_bIndex|
> +---+---+--------+
> |foo|foo|     0.0|
> |bar|bar|     1.0|
> +---+---+--------+
> {noformat}
> but I think it should have another column, {{abIndex}} with the same contents 
> as a_bIndex.
> {code}
> public class SparkMLDotColumn {
>   public static void main(String[] args) {
>   // Get the contexts
>   SparkConf conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false");
>   JavaSparkContext sparkContext = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sparkContext);
>   
>   // Create a schema with a single string column named "a.b"
>   StructType schema = new StructType(new StructField[] {
>   DataTypes.createStructField("a.b", 
> DataTypes.StringType, false)
>   });
>   // Create an empty RDD and DataFrame
>   List<Row> rows = Arrays.asList(RowFactory.create("foo"), 
> RowFactory.create("bar")); 
>   JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>   DataFrame df = sqlContext.createDataFrame(rdd, schema);
>   
>   df = df.withColumn("a_b", df.col("`a.b`"));
>   
>   StringIndexer indexer0 = new StringIndexer();
>   indexer0.setInputCol("a_b");
>   indexer0.setOutputCol("a_bIndex");
>   df = indexer0.fit(df).transform(df);
>   
>   StringIndexer indexer1 = new StringIndexer();
>   indexer1.setInputCol("`a.b`");
>   indexer1.setOutputCol("abIndex");
>   df = indexer1.fit(df).transform(df);
>   
>   df.show();
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

2017-01-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12965:
--
Component/s: (was: Spark Core)

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> ---
>
> Key: SPARK-12965
> URL: https://issues.apache.org/jira/browse/SPARK-12965
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Joshua Taylor
> Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way 
> that other methods do.  E.g., Given a DataFrame df, {{df.col("`a.b`")}} will 
> return a column.  On a StringIndexer indexer, 
> {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting 
> and transforming seem to have no effect.  Running the following code produces:
> {noformat}
> +---+---+--------+
> |a.b|a_b|a_bIndex|
> +---+---+--------+
> |foo|foo|     0.0|
> |bar|bar|     1.0|
> +---+---+--------+
> {noformat}
> but I think it should have another column, {{abIndex}} with the same contents 
> as a_bIndex.
> {code}
> public class SparkMLDotColumn {
>   public static void main(String[] args) {
>   // Get the contexts
>   SparkConf conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false");
>   JavaSparkContext sparkContext = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sparkContext);
>   
>   // Create a schema with a single string column named "a.b"
>   StructType schema = new StructType(new StructField[] {
>   DataTypes.createStructField("a.b", 
> DataTypes.StringType, false)
>   });
>   // Create an empty RDD and DataFrame
>   List rows = Arrays.asList(RowFactory.create("foo"), 
> RowFactory.create("bar")); 
>   JavaRDD rdd = sparkContext.parallelize(rows);
>   DataFrame df = sqlContext.createDataFrame(rdd, schema);
>   List<Row> rows = Arrays.asList(RowFactory.create("foo"), 
> RowFactory.create("bar")); 
>   JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>   StringIndexer indexer0 = new StringIndexer();
>   indexer0.setInputCol("a_b");
>   indexer0.setOutputCol("a_bIndex");
>   df = indexer0.fit(df).transform(df);
>   
>   StringIndexer indexer1 = new StringIndexer();
>   indexer1.setInputCol("`a.b`");
>   indexer1.setOutputCol("abIndex");
>   df = indexer1.fit(df).transform(df);
>   
>   df.show();
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19416) Dataset.schema is inconsistent with Dataset in handling columns with periods

2017-01-31 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19416:
-

 Summary: Dataset.schema is inconsistent with Dataset in handling 
columns with periods
 Key: SPARK-19416
 URL: https://issues.apache.org/jira/browse/SPARK-19416
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 1.6.3, 2.2.0
Reporter: Joseph K. Bradley
Priority: Minor


When you have a DataFrame with a column with a period in its name, the API is 
inconsistent about how to quote the column name.

Here's a reproduction:
{code}
import org.apache.spark.sql.functions.col
val rows = Seq(
  ("foo", 1),
  ("bar", 2)
)
val df = spark.createDataFrame(rows).toDF("a.b", "id")
{code}

These methods are all consistent:
{code}
df.select("a.b") // fails
df.select("`a.b`") // succeeds
df.select(col("a.b")) // fails
df.select(col("`a.b`")) // succeeds
df("a.b") // fails
df("`a.b`") // succeeds
{code}

But {{schema}} is inconsistent:
{code}
df.schema("a.b") // succeeds
df.schema("`a.b`") // fails
{code}


"fails" produces error messages like:
{code}
org.apache.spark.sql.AnalysisException: cannot resolve '`a.b`' given input 
columns: [a.b, id];;
'Project ['a.b]
+- Project [_1#1511 AS a.b#1516, _2#1512 AS id#1517]
   +- LocalRelation [_1#1511, _2#1512]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1139)
at 
line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw$$iw$$iw.<init>(<console>:34)
at 
line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw$$iw.<init>(<console>:41)
at 
line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw.<init>(<console>:43)
at line9667c6d14e79417280e5882aa52e0de727.$read$$iw.<init>(<console>:45)
at 
line9667c6d14e79417280e5882aa52e0de727.$eval$.$print$lzycompute(<console>:7)
at line9667c6d14e79417280e5882aa52e0de727.$eval$.$print(<console>:6)
{code}

"succeeds" produces:
{code}
org.apache.spark.sql.DataFrame = [a.b: 

[jira] [Resolved] (SPARK-19415) Improve the implicit type conversion between numeric type and string to avoid precesion loss

2017-01-31 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-19415.

   Resolution: Duplicate
Fix Version/s: 2.2.0

> Improve the implicit type conversion between numeric type and string to avoid 
> precesion loss
> 
>
> Key: SPARK-19415
> URL: https://issues.apache.org/jira/browse/SPARK-19415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
> Fix For: 2.2.0
>
>
> Currently, Spark SQL will convert both the numeric type and the string into 
> DoubleType if the two children of an expression do not match (for example, 
> comparing a LongType against a StringType); this can cause precision loss in 
> some cases.
> Some databases do a better job of this (for example, SQL Server [1]); we 
> should follow that.
> [1] https://msdn.microsoft.com/en-us/library/ms191530.aspx 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19382) Test sparse vectors in LinearSVCSuite

2017-01-31 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847683#comment-15847683
 ] 

Miao Wang commented on SPARK-19382:
---

[~josephkb] If I understand correctly, I think we have to create separate tests 
for SparseVector. For example, consider assert(model.numFeatures === 2) in 
test("linear svc: default params").
In the DenseVector case, each vector has size 2, which determines 
model.numFeatures = summarizer.mean.size = n = instance.size = 2.

However, if I create a SparseVector of size 20 with the same non-zero values as 
the DenseVector (i.e., 2 non-zero values and 18 zeros), model.numFeatures = 20, 
based on the logic above.

Therefore, we should either create a separate test case for SparseVector or 
remove the assertion above.

test("linearSVC comparison with R e1071 and scikit-learn") also fails in the 
all-SparseVector case.

The other tests pass in the all-SparseVector case.

I am generating a mixed test now. 
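
For illustration, a small sketch of the size mismatch described above (plain Spark ML vectors, not part of the suite):

{code}
import org.apache.spark.ml.linalg.Vectors

val dense  = Vectors.dense(1.0, 2.0)                          // size 2
val sparse = Vectors.sparse(20, Array(0, 5), Array(1.0, 2.0)) // size 20, two non-zeros

// numFeatures is ultimately derived from the vector size (via the summarizer), so:
println(dense.size)  // 2  -> assert(model.numFeatures === 2) would hold
println(sparse.size) // 20 -> the same assertion would fail
{code}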




> Test sparse vectors in LinearSVCSuite
> -
>
> Key: SPARK-19382
> URL: https://issues.apache.org/jira/browse/SPARK-19382
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Currently, LinearSVCSuite does not test sparse vectors.  We should.  I 
> recommend that generateSVMInput be modified to create a mix of dense and 
> sparse vectors, rather than adding an additional test.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19415) Improve the implicit type conversion between numeric type and string to avoid precesion loss

2017-01-31 Thread Davies Liu (JIRA)
Davies Liu created SPARK-19415:
--

 Summary: Improve the implicit type conversion between numeric type 
and string to avoid precesion loss
 Key: SPARK-19415
 URL: https://issues.apache.org/jira/browse/SPARK-19415
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


Currently, Spark SQL will convert both the numeric type and the string into 
DoubleType if the two children of an expression do not match (for example, 
comparing a LongType against a StringType); this can cause precision loss in some cases.

Some databases do a better job of this (for example, SQL Server [1]); we should 
follow that.

[1] https://msdn.microsoft.com/en-us/library/ms191530.aspx 
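
For illustration, a hedged example of the kind of precision loss this causes (assumes a SparkSession named {{spark}}):

{code}
// 9007199254740993 is 2^53 + 1, which is not exactly representable as a double,
// so once both sides are widened to DoubleType the comparison can report two
// different values as equal.
spark.sql("SELECT 9007199254740993 = '9007199254740992' AS looks_equal").show()
// with the current double-based coercion this returns true
{code}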



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread sam elamin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847608#comment-15847608
 ] 

sam elamin commented on SPARK-19414:


Perfect, I'll do that, or should I say attempt to!

But I'm loving the Structured Streaming API! Keep up the amazing work!

Thanks again!

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19414.
--
Resolution: Not A Bug

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847606#comment-15847606
 ] 

Shixiong Zhu commented on SPARK-19414:
--

Yes.

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread sam elamin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847602#comment-15847602
 ] 

sam elamin commented on SPARK-19414:


[~zsxwing] Most probably, yes. Ah, OK, I think I understand what you mean:

in the source schema provider I can call BQ to get the schema and override it 
at that layer?

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18872) New test cases for EXISTS subquery

2017-01-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847596#comment-15847596
 ] 

Apache Spark commented on SPARK-18872:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/16760

> New test cases for EXISTS subquery
> --
>
> Key: SPARK-18872
> URL: https://issues.apache.org/jira/browse/SPARK-18872
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>Assignee: Dilip Biswal
> Fix For: 2.2.0
>
>
> This JIRA is for submitting a PR for new EXISTS/NOT EXISTS subquery test 
> cases. It follows the same idea as the IN subquery test cases, which contain 
> simple patterns and then build more complex constructs on both the parent and 
> subquery sides. This batch of test cases is mostly, if not all, positive 
> test cases that do not return any syntax errors or unsupported functionality. 
> We make an effort to have test cases returning rows in the result set so that 
> they can indirectly detect incorrect-result problems.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847598#comment-15847598
 ] 

Shixiong Zhu commented on SPARK-19414:
--

[~samelamin] I know little about BigQuery. Does it provide an API to get the 
table schema?

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18871) New test cases for IN/NOT IN subquery

2017-01-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847592#comment-15847592
 ] 

Apache Spark commented on SPARK-18871:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/16759

> New test cases for IN/NOT IN subquery
> -
>
> Key: SPARK-18871
> URL: https://issues.apache.org/jira/browse/SPARK-18871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>Assignee: kevin yu
> Fix For: 2.2.0
>
>
> This JIRA is open for submitting a PR for new test cases for IN/NOT IN 
> subquery. We plan to put approximately 100+ test cases under 
> `SQLQueryTestSuite`. The test cases range from IN/NOT IN subqueries with a 
> simple SELECT in both parent and subquery to subqueries with more complex 
> constructs on both sides (joins, aggregates, etc.). Test data include null 
> values and duplicate values. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread sam elamin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847581#comment-15847581
 ] 

sam elamin commented on SPARK-19414:


Ah, thanks for clarifying!

That's a bit of a shame; ideally I'd like to read a stream and infer the schema 
from it, but I guess spark.read.json isn't quite the same as 
spark.readStream.json, since one is under the sql namespace!


My solution so far is to read one record from BigQuery, infer the schema there, 
and then pass it down to the readStream method.

It's a bit convoluted to be honest, so if there is a cleaner or nicer way other 
than hardcoding the schema, I'm happy to hear it! 
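
A rough sketch of that workaround, assuming a SparkSession named {{spark}}; the "bigquery" format name and the "table" option are placeholders for whatever the connector ends up using:

{code}
// Batch read of a single record, used only to obtain the schema.
val sampleDF = spark.read
  .format("bigquery")
  .option("table", "project.dataset.table")
  .load()
  .limit(1)

val inferredSchema = sampleDF.schema   // infer once from the batch read

// Streaming read with the inferred schema handed to the source.
val streamDF = spark.readStream
  .format("bigquery")
  .schema(inferredSchema)
  .option("table", "project.dataset.table")
  .load()
{code}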

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847578#comment-15847578
 ] 

Shixiong Zhu commented on SPARK-19414:
--

Since you can get the DataFrame, the place to create this DataFrame must know 
the schema. Right?

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847574#comment-15847574
 ] 

Shixiong Zhu commented on SPARK-19414:
--

Oh, I see. You cannot infer the schema when getting the DataFrame. The schema must 
be provided before running the streaming query, as SQL needs it to verify the 
query.

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847571#comment-15847571
 ] 

Shixiong Zhu commented on SPARK-19414:
--

[~samelamin] I would suggest that you take a look at the Kafka source: 
https://github.com/apache/spark/tree/master/external/kafka-0-10-sql

The magic part is this file: 
https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread sam elamin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam elamin reopened SPARK-19414:


> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread sam elamin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847565#comment-15847565
 ] 

sam elamin commented on SPARK-19414:


Thanks [~zsxwing], but I don't have access to the DataFrame at the 
StreamSourceProvider, so how can I set it at that level?

Let me explain: I am calling the source first and then setting the schema from the 
returned DataFrame.

At the source level I have the DataFrame, so I can override the schema, but at the 
StreamSourceProvider I don't know where to get it from.

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19414:
-
Issue Type: Question  (was: Bug)

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19414.
--
Resolution: Not A Bug

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847558#comment-15847558
 ] 

Shixiong Zhu commented on SPARK-19414:
--

You also need to override 
`org.apache.spark.sql.sources.StreamSourceProvider.sourceSchema`.
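
A minimal sketch of such an override (the provider class and the schema-inference helper are hypothetical; the trait signatures follow Spark 2.x):

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.StreamSourceProvider
import org.apache.spark.sql.types.StructType

class BigQuerySourceProvider extends StreamSourceProvider {

  // Called before any Source is created; this is where an inferred schema can be returned.
  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) = {
    val resolved = schema.getOrElse(inferBigQuerySchema(parameters)) // prefer a user-supplied schema
    ("bigquery", resolved)
  }

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = {
    // Build the actual Source here, reusing the same schema resolution.
    ???
  }

  // Hypothetical helper that would call the BigQuery API to fetch the table schema.
  private def inferBigQuerySchema(parameters: Map[String, String]): StructType = ???
}
{code}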

> Inferring schema in a structured streaming source
> -
>
> Key: SPARK-19414
> URL: https://issues.apache.org/jira/browse/SPARK-19414
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: sam elamin
>
> Hi All
> I am writing a connector to BigQuery that uses structured streaming, my 
> question is about schemas
> I would like to be able to infer the schema from BQ rather than pass it in, 
> is there any way to overwrite the source schema in anything that extends 
> org.apache.spark.sql.execution.streaming.Source
> Overriding the schema method doesn't seem to work 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18085) Better History Server scalability for many / large applications

2017-01-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847519#comment-15847519
 ] 

Marcelo Vanzin commented on SPARK-18085:


Storage tab available at this branch:
https://github.com/vanzin/spark/tree/shs-ng/M4.3


> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18131) Support returning Vector/Dense Vector from backend

2017-01-31 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847473#comment-15847473
 ] 

Shivaram Venkataraman commented on SPARK-18131:
---

Hmm, this is tricky. We ran into a similar issue in SQL, and we added reader and 
writer objects in SQL that were registered with the method in core. See 
https://github.com/apache/spark/blob/ce112cec4f9bff222aa256893f94c316662a2a7e/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L39
 for how we did that. We could do a similar thing in MLlib as well?

cc [~mengxr]

> Support returning Vector/Dense Vector from backend
> --
>
> Key: SPARK-18131
> URL: https://issues.apache.org/jira/browse/SPARK-18131
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> For `spark.logit`, there is a `probabilityCol`, which is a vector in the 
> backend (Scala side). When we do collect(select(df, "probabilityCol")), the 
> backend returns the Java object handle (memory address). We need to implement 
> a method to convert a Vector/DenseVector column to an R vector, which can be 
> read in SparkR. It is a follow-up JIRA of adding `spark.logit`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19414) Inferring schema in a structured streaming source

2017-01-31 Thread sam elamin (JIRA)
sam elamin created SPARK-19414:
--

 Summary: Inferring schema in a structured streaming source
 Key: SPARK-19414
 URL: https://issues.apache.org/jira/browse/SPARK-19414
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Reporter: sam elamin


Hi All,

I am writing a connector to BigQuery that uses structured streaming; my 
question is about schemas.

I would like to be able to infer the schema from BQ rather than pass it in. Is 
there any way to overwrite the source schema in anything that extends 

org.apache.spark.sql.execution.streaming.Source



Overriding the schema method doesn't seem to work.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2

2017-01-31 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847452#comment-15847452
 ] 

Felix Cheung commented on SPARK-18864:
--

SPARK-19066 LDA doesn't set optimizer correctly
(This is also in 2.1.1 though)

> Changes of MLlib and SparkR behavior for 2.2
> 
>
> Key: SPARK-18864
> URL: https://issues.apache.org/jira/browse/SPARK-18864
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> This JIRA is for tracking changes of behavior within MLlib and SparkR for the 
> Spark 2.2 release.  If any JIRAs change behavior, please list them below with 
> a short description of the change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2

2017-01-31 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18864:
-
Comment: was deleted

(was: SPARK-19291 spark.gaussianMixture supports output log-likelihood)

> Changes of MLlib and SparkR behavior for 2.2
> 
>
> Key: SPARK-18864
> URL: https://issues.apache.org/jira/browse/SPARK-18864
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> This JIRA is for tracking changes of behavior within MLlib and SparkR for the 
> Spark 2.2 release.  If any JIRAs change behavior, please list them below with 
> a short description of the change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2

2017-01-31 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847449#comment-15847449
 ] 

Felix Cheung commented on SPARK-18864:
--

SPARK-19291 spark.gaussianMixture supports output log-likelihood

> Changes of MLlib and SparkR behavior for 2.2
> 
>
> Key: SPARK-18864
> URL: https://issues.apache.org/jira/browse/SPARK-18864
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> This JIRA is for tracking changes of behavior within MLlib and SparkR for the 
> Spark 2.2 release.  If any JIRAs change behavior, please list them below with 
> a short description of the change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19395) Convert coefficients in summary to matrix

2017-01-31 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19395.
--
  Resolution: Fixed
Assignee: Wayne Zhang
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> Convert coefficients in summary to matrix
> -
>
> Key: SPARK-19395
> URL: https://issues.apache.org/jira/browse/SPARK-19395
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
> Fix For: 2.2.0
>
>
> The coefficients component in the model summary should be a 'matrix', but the 
> underlying structure is actually a list. This affects several models, except for 
> 'AFTSurvivalRegressionModel', which has the correct implementation. The fix is 
> to first unlist the coefficients returned from callJMethod before 
> converting to a matrix.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2

2017-01-31 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847442#comment-15847442
 ] 

Felix Cheung edited comment on SPARK-18864 at 1/31/17 8:24 PM:
---

SPARK-19395 changes to summary format for coefficients 


was (Author: felixcheung):
SPARK-19395 changes to summary format

> Changes of MLlib and SparkR behavior for 2.2
> 
>
> Key: SPARK-18864
> URL: https://issues.apache.org/jira/browse/SPARK-18864
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> This JIRA is for tracking changes of behavior within MLlib and SparkR for the 
> Spark 2.2 release.  If any JIRAs change behavior, please list them below with 
> a short description of the change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2

2017-01-31 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847442#comment-15847442
 ] 

Felix Cheung commented on SPARK-18864:
--

SPARK-19395 changes to summary format

> Changes of MLlib and SparkR behavior for 2.2
> 
>
> Key: SPARK-18864
> URL: https://issues.apache.org/jira/browse/SPARK-18864
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> This JIRA is for tracking changes of behavior within MLlib and SparkR for the 
> Spark 2.2 release.  If any JIRAs change behavior, please list them below with 
> a short description of the change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19382) Test sparse vectors in LinearSVCSuite

2017-01-31 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847391#comment-15847391
 ] 

Miao Wang commented on SPARK-19382:
---

I can try to submit a PR today or tomorrow. Thanks!

> Test sparse vectors in LinearSVCSuite
> -
>
> Key: SPARK-19382
> URL: https://issues.apache.org/jira/browse/SPARK-19382
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Currently, LinearSVCSuite does not test sparse vectors.  We should.  I 
> recommend that generateSVMInput be modified to create a mix of dense and 
> sparse vectors, rather than adding an additional test.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18131) Support returning Vector/Dense Vector from backend

2017-01-31 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847390#comment-15847390
 ] 

Miao Wang commented on SPARK-18131:
---

[~felixcheung][~yanboliang][~shivaram] I am trying to add the serialization 
utility to SerDe.scala.

Inside def writeObject(dos: DataOutputStream, obj: Object, jvmObjectTracker: 
JVMObjectTracker): Unit, one case should be added:

case v: org.apache.spark.ml.linalg.DenseVector => 

This file is in spark-core, so I can't import org.apache.spark.ml.linalg._ in 
this file because of a dependency issue. Do you have any suggestions?

One possibility is to move `Vectors` from mllib-local into core. I am not 
sure whether there are other options.

Thanks! 
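
For illustration, a self-contained sketch of the DenseVector-to-double-array conversion such a case would perform. The length-prefixed wire format below is illustrative only, not the actual SparkR SerDe protocol:

{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}
import org.apache.spark.ml.linalg.{DenseVector, Vectors}

// Write the vector as its underlying double array so R could read it back as a
// plain numeric vector.
def writeDenseVector(dos: DataOutputStream, v: DenseVector): Unit = {
  val arr = v.toArray
  dos.writeInt(arr.length)       // length prefix
  arr.foreach(dos.writeDouble)   // raw doubles
}

val bos = new ByteArrayOutputStream()
val dos = new DataOutputStream(bos)
writeDenseVector(dos, Vectors.dense(0.1, 0.9).toDense)
println(s"wrote ${bos.size()} bytes") // 4 + 2 * 8 = 20
{code}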

> Support returning Vector/Dense Vector from backend
> --
>
> Key: SPARK-18131
> URL: https://issues.apache.org/jira/browse/SPARK-18131
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> For `spark.logit`, there is a `probabilityCol`, which is a vector in the 
> backend (Scala side). When we do collect(select(df, "probabilityCol")), the 
> backend returns the Java object handle (memory address). We need to implement 
> a method to convert a Vector/DenseVector column to an R vector, which can be 
> read in SparkR. It is a follow-up JIRA of adding `spark.logit`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19413) Basic mapGroupsWithState API

2017-01-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847356#comment-15847356
 ] 

Apache Spark commented on SPARK-19413:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16758

> Basic mapGroupsWithState API
> 
>
> Key: SPARK-19413
> URL: https://issues.apache.org/jira/browse/SPARK-19413
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Basic API (without timeouts) as described in the parent JIRA SPARK-19067



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19413) Basic mapGroupsWithState API

2017-01-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19413:


Assignee: Tathagata Das  (was: Apache Spark)

> Basic mapGroupsWithState API
> 
>
> Key: SPARK-19413
> URL: https://issues.apache.org/jira/browse/SPARK-19413
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Basic API (without timeouts) as described in the parent JIRA SPARK-19067



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19413) Basic mapGroupsWithState API

2017-01-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19413:


Assignee: Apache Spark  (was: Tathagata Das)

> Basic mapGroupsWithState API
> 
>
> Key: SPARK-19413
> URL: https://issues.apache.org/jira/browse/SPARK-19413
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Basic API (without timeouts) as described in the parent JIRA SPARK-19067



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)

2017-01-31 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-19067:
--
Summary: mapGroupsWithState - arbitrary stateful operations with Structured 
Streaming (similar to DStream.mapWithState)  (was: mapWithState Style API)

> mapGroupsWithState - arbitrary stateful operations with Structured Streaming 
> (similar to DStream.mapWithState)
> --
>
> Key: SPARK-19067
> URL: https://issues.apache.org/jira/browse/SPARK-19067
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
>
> Right now the only way to do stateful operations is with Aggregator or 
> UDAF.  However, this does not give users control over emission or expiration of 
> state, making it hard to implement things like sessionization.  We should add 
> a more general construct (probably similar to {{DStream.mapWithState}}) to 
> Structured Streaming. Here is the design. 
> *Requirements*
> - Users should be able to specify a function that can do the following
> - Access the input row corresponding to a key
> - Access the previous state corresponding to a key
> - Optionally, update or remove the state
> - Output any number of new rows (or none at all)
> *Proposed API*
> {code}
> //  New methods on KeyValueGroupedDataset 
> class KeyValueGroupedDataset[K, V] {
>   // Scala friendly
>   def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], State[S]) => U)
>   def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], State[S]) => Iterator[U])
>
>   // Java friendly
>   def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
>   def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
> }
>
> // --- New Java-friendly function classes --- 
> public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   R call(K key, Iterator<V> values, State<S> state) throws Exception;
> }
> public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   Iterator<R> call(K key, Iterator<V> values, State<S> state) throws Exception;
> }
>
> // -- Wrapper class for state data -- 
> trait State[S] {
>   def exists(): Boolean
>   def get(): S                  // throws an exception if the state does not exist
>   def getOption(): Option[S]
>   def update(newState: S): Unit
>   def remove(): Unit            // exists() will be false after this
> }
> {code}
> Key Semantics of the State class
> - The state can be null.
> - If state.remove() is called, then state.exists() will return false, and 
> getOption will return None.
> - After state.update(newState) is called, state.exists() will 
> return true, and getOption will return Some(...). 
> - None of the operations are thread-safe. This is to avoid memory barriers.
> *Usage*
> {code}
> val stateFunc = (word: String, words: Iterator[String], runningCount: State[Long]) => {
>   val newCount = words.size + runningCount.getOption.getOrElse(0L)
>   runningCount.update(newCount)
>   (word, newCount)
> }
>
> dataset                                                // type is Dataset[String]
>   .groupByKey[String](w => w)                          // generates KeyValueGroupedDataset[String, String]
>   .mapGroupsWithState[Long, (String, Long)](stateFunc) // returns Dataset[(String, Long)]
> {code}
> *Future Directions*
> - Timeout based state expiration (that has not received data for a while)
> - General expression based expiration 
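
To make the proposed semantics concrete, here is a small illustrative sketch written against the {{State[S]}} trait from the design above; the trait is only a proposal at this point, and the per-user session and "END" marker logic are invented for the example:

{code}
// Count events per user and close the session once an "END" marker arrives.
val sessionFunc = (user: String, events: Iterator[String], session: State[Long]) => {
  val evs = events.toSeq                                  // an Iterator can only be traversed once
  val count = session.getOption.getOrElse(0L) + evs.size
  if (evs.contains("END")) {
    session.remove()        // afterwards exists() is false and getOption returns None
    (user, count, true)     // emit the final count for the closed session
  } else {
    session.update(count)   // afterwards exists() is true and getOption returns Some(count)
    (user, count, false)
  }
}
{code}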



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-19413) Basic mapGroupsWithState API

2017-01-31 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-19413:
-

 Summary: Basic mapGroupsWithState API
 Key: SPARK-19413
 URL: https://issues.apache.org/jira/browse/SPARK-19413
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das


Basic API (without timeouts) as described in the parent JIRA SPARK-19067



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-10057) Faill to load class org.slf4j.impl.StaticLoggerBinder

2017-01-31 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847242#comment-15847242
 ] 

Wes McKinney commented on SPARK-10057:
--

I have been having a really horrible time trying to get Spark SQL to emit log 
messages when run from PySpark

{code}
$ pyspark
Python 3.5.3 | packaged by conda-forge | (default, Jan 23 2017, 19:01:48) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
17/01/31 12:59:35 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/01/31 12:59:35 WARN Utils: Your hostname, badgerpad15 resolves to a loopback 
address: 127.0.0.1; using 172.31.22.23 instead (on interface wlan0)
17/01/31 12:59:35 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/

Using Python version 3.5.3 (default, Jan 23 2017 19:01:48)
SparkSession available as 'spark'.
>>> df = sqlContext.read.parquet('example.parquet')
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
>>> 
{code}

After this, any logging messages in Spark SQL that I'd like to see are 
suppressed

This is on vanilla Spark trunk. I'm not sure if this is the same bug, but can 
anyone give any advice? 
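
One quick way to confirm which SLF4J binding is actually active (a small diagnostic sketch, not part of the original report; run it in spark-shell or any JVM with the same classpath) is to ask SLF4J for its logger factory; with the NOP fallback described above it reports the no-op implementation:

{code}
import org.slf4j.LoggerFactory

// With a proper binding this prints something like org.slf4j.impl.Log4jLoggerFactory;
// with the NOP fallback it prints org.slf4j.helpers.NOPLoggerFactory.
println(LoggerFactory.getILoggerFactory.getClass.getName)
{code}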

> Faill to load class org.slf4j.impl.StaticLoggerBinder
> -
>
> Key: SPARK-10057
> URL: https://issues.apache.org/jira/browse/SPARK-10057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Davies Liu
>
> Some loggings are dropped, because it can't load class 
> "org.slf4j.impl.StaticLoggerBinder"
> {code}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-18609) [SQL] column mixup with CROSS JOIN

2017-01-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847103#comment-15847103
 ] 

Apache Spark commented on SPARK-18609:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/16757

> [SQL] column mixup with CROSS JOIN
> --
>
> Key: SPARK-18609
> URL: https://issues.apache.org/jira/browse/SPARK-18609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Furcy Pin
>
> Reproduced on spark-sql v2.0.2 and on branch master.
> {code}
> DROP TABLE IF EXISTS p1 ;
> DROP TABLE IF EXISTS p2 ;
> CREATE TABLE p1 (col TIMESTAMP) ;
> CREATE TABLE p2 (col TIMESTAMP) ;
> set spark.sql.crossJoin.enabled = true;
> -- EXPLAIN
> WITH CTE AS (
>   SELECT
> s2.col as col
>   FROM p1
>   CROSS JOIN (
> SELECT
>   e.col as col
> FROM p2 E
>   ) s2
> )
> SELECT
>   T1.col as c1,
>   T2.col as c2
> FROM CTE T1
> CROSS JOIN CTE T2
> ;
> {code}
> This returns the following stacktrace:
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: col#21
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> 

[jira] (SPARK-19407) defaultFS is used FileSystem.get instead of getting it from uri scheme

2017-01-31 Thread Amit Assudani (JIRA)

Amit Assudani commented on SPARK-19407

Re: defaultFS is used FileSystem.get instead of getting it from uri scheme

I'll do it over the weekend, you may assign it to me. I may need some logistics support as it will be my first PR. 

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19412) HiveContext hive queries differ from CLI hive queries

2017-01-31 Thread Enrique Sainz Baixauli (JIRA)

Enrique Sainz Baixauli updated an issue

Spark / SPARK-19412
HiveContext hive queries differ from CLI hive queries

Change By: Enrique Sainz Baixauli

Hi,

I've been wandering around trying to find a report related to this, but haven't found one so far. So here it is.

I have found two differences between Hive queries made from SparkSQL and from Hive CLI (or Beeswax or Hue):

First: a query against a single partitioned table (although other, non-partitioned tables seem to work well) returns all column values in the first column, and null in the rest of them (field separator is '\t'):

{code}
scala> sqlContext.sql("use [database_name]")
scala> val t1 = sqlContext.sql("select * from [table_name]")
scala> t1.first.get(0)
res9: Any = "0001 1 [...]"
scala> t1.first.get(1)
res10: Any = null
{code}

And the same for the following column indexes from 1 on. However, the same query in Hive CLI/Beeswax/Hue shows all columns filled and no null values at the end.

Second: a join between two partitioned tables returns an empty DataFrame when the same query in Hive returns rows filled with values:

{code}
scala> val join = sqlContext.sql("select * from [table_1_name] t1 join [table_2_name] t2 on t1.id = t2.id where t2.data_timestamp_part = '1485428556075'")
scala> join.first
[...]
java.util.NoSuchElementException: next on empty iterator
{code}

These two issues happen in Spark 1.5 (CDH 5.5.1) but seem to be solved in 1.6 (CDH 5.7.0). Is there a chance of backporting this fix? And is there a bug report that I can't find where this was reported before?

Thank you all!

P.S.: in this test, even if both tables are partitioned, the table in the first issue and the first table in the join, which are the same, only have one partition. That's why there is no where clause regarding partitions.

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)

[jira] (SPARK-19412) HiveContext hive queries differ from CLI hive queries

2017-01-31 Thread Enrique Sainz Baixauli (JIRA)

Enrique Sainz Baixauli created an issue

Spark / SPARK-19412
HiveContext hive queries differ from CLI hive queries

Issue Type: Bug
Affects Versions: 1.5.1
Assignee: Unassigned
Components: SQL
Created: 31/Jan/17 14:29
Priority: Major
Reporter: Enrique Sainz Baixauli

Hi, 
I've been wandering around trying to find a report related to this, but haven't found such so far. So here it is. 
I have found two differences between Hive queries made from SparkSQL and from Hive CLI (or beeswax or Hue): 
First: a query against a single partitioned table (although other, non-partitioned tables seem to work well) returns all column values in the first column, and null in the rest of them (field separator is '\t'): 

 
scala> sqlContext.sql("use [database_name]")
scala> val t1 = sqlContext.sql("select * from [table_name]")
scala> t1.first.get(0)
res9: Any = "0001	1	[...]"
scala> t1.first.get(1)
res10: Any = null

 
And the same for 

[jira] (SPARK-19411) Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark commented on SPARK-19411

Re: Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown

User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/16756

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19411) Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark assigned an issue to Apache Spark

Spark / SPARK-19411
Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown

Change By: Apache Spark
Assignee: Apache Spark

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19411) Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark assigned an issue to Unassigned

Spark / SPARK-19411
Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown

Change By: Apache Spark
Assignee: (was: Apache Spark)

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19411) Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown

2017-01-31 Thread Liang-Chi Hsieh (JIRA)

Liang-Chi Hsieh created an issue

Spark / SPARK-19411
Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown

Issue Type: Improvement
Assignee: Unassigned
Created: 31/Jan/17 13:46
Priority: Major
Reporter: Liang-Chi Hsieh

A metadata flag was introduced earlier to mark the optional columns in the merged Parquet schema for filter predicate pushdown. As we upgrade to Parquet 1.8.2, which includes the fix for the pushdown of optional columns, we no longer need this metadata.

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)

[jira] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2017-01-31 Thread Hamel Ajay Kothari (JIRA)

Hamel Ajay Kothari commented on SPARK-18278

Re: Support native submission of spark jobs to a kubernetes cluster

Hi all, the pluggable scheduler sounds pretty valuable for my workflow as well. We've got our own scheduler where I work that we need to support in Spark, and currently that requires us to maintain an entire fork that we must keep up to date. If we could get a pluggable scheduler, that would reduce our maintenance burden significantly.

Has a JIRA been filed for this? I'd be happy to contribute some work on it.

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19410) Links to API documentation are broken

2017-01-31 Thread zhengruifeng (JIRA)

zhengruifeng edited a comment on SPARK-19410

Re: Links to API documentation are broken

Actually, except {{ml-tuning.md}} and {{ml-pipeline.md}}, all markdown files use {{}}. Only those broken links are in {{}}:

{code}
egrep 'data\-lang=\"[a-z]+\">' docs/*md -n
docs/ml-pipeline.md:209:
docs/ml-pipeline.md:218:
docs/ml-pipeline.md:227:
docs/ml-pipeline.md:244:
docs/ml-pipeline.md:251:
docs/ml-pipeline.md:259:
docs/ml-tuning.md:77:
docs/ml-tuning.md:84:
docs/ml-tuning.md:91:
docs/ml-tuning.md:131:
{code}

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19410) Links to API documentation are broken

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark assigned an issue to Apache Spark

Spark / SPARK-19410
Links to API documentation are broken

Change By: Apache Spark
Assignee: Apache Spark

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19410) Links to API documentation are broken

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark assigned an issue to Unassigned

Spark / SPARK-19410
Links to API documentation are broken

Change By: Apache Spark
Assignee: (was: Apache Spark)

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19410) Links to API documentation are broken

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark commented on SPARK-19410

Re: Links to API documentation are broken

User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/16754

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19410) Links to API documentation are broken

2017-01-31 Thread zhengruifeng (JIRA)

zhengruifeng commented on SPARK-19410

Re: Links to API documentation are broken

Actually, except ml-tuning.md and ml-pipeline.md, all markdown files use . Only those broken links are in 

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19410) Links to API documentation are broken

2017-01-31 Thread zhengruifeng (JIRA)

zhengruifeng commented on SPARK-19410

Re: Links to API documentation are broken

Sean Owen This also happens in ml-tuning.md. In TrainValidationSplit, the links for Scala and Java are OK, but the one for Python doesn't work. After comparing the right links and the broken ones, I found that this should be caused by the div: links in  are rendered correctly, but links in  are broken. Thanks for pinging me. I will create a PR to fix this. 

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)



[jira] (SPARK-19410) Links to API documentation are broken

2017-01-31 Thread Sean Owen (JIRA)

Sean Owen updated an issue

Spark / SPARK-19410
Links to API documentation are broken

Hm, sure enough. I don't see an obvious problem with the markdown. These were added in https://issues.apache.org/jira/browse/SPARK-18446, yet other apparently identical links added there seem to render correctly.

zhengruifeng I'm a little stumped. Do you have any ideas? Still looking at the markdown here.

Change By: Sean Owen
Priority: Trivial (was: Major)

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19409) Upgrade Parquet to 1.8.2

2017-01-31 Thread Reynold Xin (JIRA)

Reynold Xin resolved as Fixed

Spark / SPARK-19409
Upgrade Parquet to 1.8.2

Change By: Reynold Xin
Resolution: Fixed
Assignee: Dongjoon Hyun
Fix Version/s: 2.2.0
Status: Resolved (was: In Progress)

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19394) "assertion failed: Expected hostname" on macOS when self-assigned IP contains a percent sign

2017-01-31 Thread Sean Owen (JIRA)

Sean Owen resolved as Not A Problem

I think that's the answer then. This is a corner case, and the hostname the machine is returning is not actually valid, as per RFC: https://en.wikipedia.org/wiki/Hostname

Spark / SPARK-19394
"assertion failed: Expected hostname" on macOS when self-assigned IP contains a percent sign

Change By: Sean Owen
Resolution: Not A Problem
Status: Resolved (was: Open)

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19296) Awkward changes for JdbcUtils.saveTable in Spark 2.1.0

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark assigned an issue to Unassigned

Spark / SPARK-19296
Awkward changes for JdbcUtils.saveTable in Spark 2.1.0

Change By: Apache Spark
Assignee: (was: Apache Spark)

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19296) Awkward changes for JdbcUtils.saveTable in Spark 2.1.0

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark assigned an issue to Apache Spark

Spark / SPARK-19296
Awkward changes for JdbcUtils.saveTable in Spark 2.1.0

Change By: Apache Spark
Assignee: Apache Spark

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19296) Awkward changes for JdbcUtils.saveTable in Spark 2.1.0

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark commented on SPARK-19296

Re: Awkward changes for JdbcUtils.saveTable in Spark 2.1.0

User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16753

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19410) Links to API documentation are broken

2017-01-31 Thread Aseem Bansal (JIRA)

Aseem Bansal created an issue

Spark / SPARK-19410
Links to API documentation are broken

Issue Type: Documentation
Affects Versions: 2.1.0
Assignee: Unassigned
Components: Documentation
Created: 31/Jan/17 08:55
Priority: Major
Reporter: Aseem Bansal

I was looking at https://spark.apache.org/docs/latest/ml-pipeline.html#example-estimator-transformer-and-param and saw that the links to API documentation are broken.


[jira] (SPARK-19409) Upgrade Parquet to 1.8.2

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark assigned an issue to Unassigned

Spark / SPARK-19409
Upgrade Parquet to 1.8.2

Change By: Apache Spark
Assignee: (was: Apache Spark)

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19409) Upgrade Parquet to 1.8.2

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark commented on SPARK-19409

Re: Upgrade Parquet to 1.8.2

User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/16751

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19409) Upgrade Parquet to 1.8.2

2017-01-31 Thread Dongjoon Hyun (JIRA)

Dongjoon Hyun updated an issue

Spark / SPARK-19409
Upgrade Parquet to 1.8.2

Change By: Dongjoon Hyun

Apache Parquet 1.8.2 was released officially last week, on 26 Jan. This issue aims to bump the Parquet version to 1.8.2 since it includes many fixes.

https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142@%3Cdev.parquet.apache.org%3E

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19409) Upgrade Parquet to 1.8.2

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark assigned an issue to Apache Spark

Spark / SPARK-19409
Upgrade Parquet to 1.8.2

Change By: Apache Spark
Assignee: Apache Spark

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-19409) Upgrade Parquet to 1.8.2

2017-01-31 Thread Dongjoon Hyun (JIRA)

Dongjoon Hyun created an issue

Spark / SPARK-19409
Upgrade Parquet to 1.8.2

Issue Type: Bug
Affects Versions: 2.1.0
Assignee: Unassigned
Components: Build
Created: 31/Jan/17 08:41
Priority: Major
Reporter: Dongjoon Hyun

Apache Parquet 1.8.2 was released officially last week, on 26 Jan. This issue aims to bump the Parquet version to 1.8.2.

https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142@%3Cdev.parquet.apache.org%3E


[jira] (SPARK-18937) Timezone support in CSV/JSON parsing

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark commented on SPARK-18937

Re: Timezone support in CSV/JSON parsing

User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/16750

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-18937) Timezone support in CSV/JSON parsing

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark assigned an issue to Unassigned

Spark / SPARK-18937
Timezone support in CSV/JSON parsing

Change By: Apache Spark
Assignee: (was: Apache Spark)

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


[jira] (SPARK-18937) Timezone support in CSV/JSON parsing

2017-01-31 Thread Apache Spark (JIRA)

Apache Spark assigned an issue to Apache Spark

Spark / SPARK-18937
Timezone support in CSV/JSON parsing

Change By: Apache Spark
Assignee: Apache Spark

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)


