[jira] [Commented] (SPARK-12608) Remove submitJobThreadPool since submitJob doesn't create a separate thread to wait for the job result

2016-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076777#comment-15076777
 ] 

Apache Spark commented on SPARK-12608:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/10560

> Remove submitJobThreadPool since submitJob doesn't create a separate thread 
> to wait for the job result
> --
>
> Key: SPARK-12608
> URL: https://issues.apache.org/jira/browse/SPARK-12608
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>
> Before [#9264|https://github.com/apache/spark/pull/9264], submitJob would 
> create a separate thread to wait for the job result. `submitJobThreadPool` 
> was a workaround in `ReceiverTracker` to run these waiting-for-job-result 
> threads. Now that [#9264|https://github.com/apache/spark/pull/9264] has been 
> merged to master and resolved this blocking behavior, `submitJobThreadPool` 
> can be removed.
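
For readers unfamiliar with the non-blocking submitJob API, here is a minimal, hypothetical sketch (application-level code, not the actual ReceiverTracker change) of submitting a job and reacting to its result asynchronously; the object name and the toy per-partition work are illustrative only:

{code}
import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}

import org.apache.spark.{SparkConf, SparkContext}

object SubmitJobSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("submitJob-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // submitJob returns a FutureAction immediately; the caller is never blocked,
    // so no dedicated "wait for the job result" thread (or thread pool) is needed.
    val future = sc.submitJob[Int, Int, Unit](
      rdd,
      iter => iter.sum,                       // work done on each partition
      0 until rdd.partitions.length,          // partitions to run on
      (index, partialSum) => println(s"partition $index summed to $partialSum"),
      ())                                     // overall result built once the job finishes

    // React to completion asynchronously instead of parking a thread on the result.
    future.onComplete {
      case Success(_)  => println("job finished")
      case Failure(ex) => println(s"job failed: $ex")
    }

    Await.ready(future, 1.minute)             // only to keep this demo alive until the job ends
    sc.stop()
  }
}
{code}

Because the returned FutureAction is handled with callbacks, nothing has to park a pool thread per submitted job, which is why the workaround pool becomes unnecessary.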






[jira] [Assigned] (SPARK-12608) Remove submitJobThreadPool since submitJob doesn't create a separate thread to wait for the job result

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12608:


Assignee: Apache Spark

> Remove submitJobThreadPool since submitJob doesn't create a separate thread 
> to wait for the job result
> --
>
> Key: SPARK-12608
> URL: https://issues.apache.org/jira/browse/SPARK-12608
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Before [#9264|https://github.com/apache/spark/pull/9264], submitJob would 
> create a separate thread to wait for the job result. `submitJobThreadPool` 
> was a workaround in `ReceiverTracker` to run these waiting-for-job-result 
> threads. Now that [#9264|https://github.com/apache/spark/pull/9264] has been 
> merged to master and resolved this blocking behavior, `submitJobThreadPool` 
> can be removed.






[jira] [Assigned] (SPARK-12608) Remove submitJobThreadPool since submitJob doesn't create a separate thread to wait for the job result

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12608:


Assignee: (was: Apache Spark)

> Remove submitJobThreadPool since submitJob doesn't create a separate thread 
> to wait for the job result
> --
>
> Key: SPARK-12608
> URL: https://issues.apache.org/jira/browse/SPARK-12608
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>
> Before [#9264|https://github.com/apache/spark/pull/9264], submitJob would 
> create a separate thread to wait for the job result. `submitJobThreadPool` 
> was a workaround in `ReceiverTracker` to run these waiting-for-job-result 
> threads. Now that [#9264|https://github.com/apache/spark/pull/9264] has been 
> merged to master and resolved this blocking behavior, `submitJobThreadPool` 
> can be removed.






[jira] [Created] (SPARK-12608) Remove submitJobThreadPool since submitJob doesn't create a separate thread to wait for the job result

2016-01-02 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-12608:


 Summary: Remove submitJobThreadPool since submitJob doesn't create 
a separate thread to wait for the job result
 Key: SPARK-12608
 URL: https://issues.apache.org/jira/browse/SPARK-12608
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu


Before [#9264|https://github.com/apache/spark/pull/9264], submitJob would 
create a separate thread to wait for the job result. `submitJobThreadPool` was 
a workaround in `ReceiverTracker` to run these waiting-for-job-result threads. 
Now that [#9264|https://github.com/apache/spark/pull/9264] has been merged to 
master and resolved this blocking behavior, `submitJobThreadPool` can be removed.






[jira] [Commented] (SPARK-12600) Remove deprecated methods in SQL / DataFrames

2016-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076772#comment-15076772
 ] 

Apache Spark commented on SPARK-12600:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10559

> Remove deprecated methods in SQL / DataFrames
> -
>
> Key: SPARK-12600
> URL: https://issues.apache.org/jira/browse/SPARK-12600
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Commented] (SPARK-12286) Support UnsafeRow in all SparkPlan (if possible)

2016-01-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076770#comment-15076770
 ] 

Reynold Xin commented on SPARK-12286:
-

[~davies] can you close the rest if they no longer apply? Thanks.


> Support UnsafeRow in all SparkPlan (if possible)
> 
>
> Key: SPARK-12286
> URL: https://issues.apache.org/jira/browse/SPARK-12286
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> There are still some SparkPlan implementations that do not support UnsafeRow 
> (or do not support it well).






[jira] [Resolved] (SPARK-12286) Support UnsafeRow in all SparkPlan (if possible)

2016-01-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12286.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Support UnsafeRow in all SparkPlan (if possible)
> 
>
> Key: SPARK-12286
> URL: https://issues.apache.org/jira/browse/SPARK-12286
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> There are still some SparkPlan implementations that do not support UnsafeRow 
> (or do not support it well).






[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2016-01-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076769#comment-15076769
 ] 

Reynold Xin commented on SPARK-12537:
-

I don't really care as much whether the default value should be true or false. 
My primary objection was to Sean's point that we shouldn't have this option at 
all.

[~Cazen] let's just change the default to false and merge it so we don't need 
to spend time arguing about this.


> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provide an option that controls whether the JSON parser accepts 
> backslash quoting of any character or not.
> For example, if a JSON file contains an escape sequence that is not listed in 
> the JSON backslash-quoting specification, it is returned as a corrupt_record:
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record(returns null)
> {code}
> scala> df.show
> +--------------------+---------+-----+
> |     _corrupt_record|     name|price|
> +--------------------+---------+-----+
> |                null|Cazen Lee|  $10|
> |{"name": "John Do...|     null| null|
> |                null|    Tracy|  $10|
> +--------------------+---------+-----+
> {code}
> And after applying this patch, we can enable the 
> allowBackslashEscapingAnyCharacter option as below:
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +---------+-----+
> |     name|price|
> +---------+-----+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |    Tracy|  $10|
> +---------+-----+
> {code}
> This issue is similar to HIVE-11825 and HIVE-12717.






[jira] [Updated] (SPARK-12597) Use udf to replace callUDF for ML

2016-01-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12597:

Fix Version/s: 2.0.0

> Use udf to replace callUDF for ML
> -
>
> Key: SPARK-12597
> URL: https://issues.apache.org/jira/browse/SPARK-12597
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
> Fix For: 2.0.0
>
>
> callUDF has been deprecated and will be removed in Spark 2.0. We should 
> replace the use of callUDF with udf for ML.
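
For illustration, a hedged sketch of the intended replacement pattern (the toy column names and scaling function are assumptions, not the actual ML code): build the Column expression with udf instead of the deprecated callUDF overloads.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

object UdfInsteadOfCallUDFSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("udf-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = Seq((1, 2.0), (2, 3.5)).toDF("id", "feature")

    // Preferred: turn a plain Scala function into a Column expression with udf(...)
    // instead of going through the deprecated callUDF(function, returnType, column) style.
    val scale = udf { x: Double => x * 10.0 }
    df.withColumn("scaled", scale($"feature")).show()

    sc.stop()
  }
}
{code}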






[jira] [Resolved] (SPARK-12599) Remove the use of the deprecated callUDF in MLlib

2016-01-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12599.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Remove the use of the deprecated callUDF in MLlib
> -
>
> Key: SPARK-12599
> URL: https://issues.apache.org/jira/browse/SPARK-12599
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> MLlib's Transformer uses the deprecated callUDF API.






[jira] [Updated] (SPARK-12602) Join Reordering: Pushing Inner Join Through Outer Join

2016-01-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12602:

Description: 
If applicable, we can push Inner Join through Outer Join. The basic idea is 
built on the associativity property of outer and inner joins:
{code}
R1 inner (R2 left R3 on p23) on p12 = (R1 inner R2 on p12) left R3 on p23
R1 inner (R2 right R3 on p23) on p13 = R2 right (R1 inner R3 on p13) on p23 = 
(R1 inner R3 on p13) left R2 on p23
(R1 left R2 on p12) inner R3 on p13 = (R1 inner R3 on p13) left R2 on p12
(R1 right R2 on p12) inner R3 on p23 = R1 right (R2 inner R3 on p23) on p12 = 
(R2 inner R3 on p23) left R1 on p12
{code}

The reordering can reduce the number of processed rows, since the Inner Join 
always generates fewer (or equally many) rows than the Left/Right Outer Join. 
This change can improve query performance in most cases.

When cost-based optimization is available, we can switch the order of tables in 
each join type based on their costs. The order of joined tables in the inner 
join does not affect the results and the right outer join can be changed to the 
left outer join. This part is out of scope here.

For example, given the following eligible query:
{code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === 
$"b.int", "inner"){code}

Before the fix, the logical plan is like
{code}
Join Inner, Some((int#15 = int#9))
:- Join RightOuter, Some((int#3 = int#9))
:  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
:  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
+- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
{code}
After the fix, the logical plan is like
{code}
Join LeftOuter, Some((int#3 = int#9))
:- Join Inner, Some((int#15 = int#9))
:  :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
:  +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
+- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
{code}

  was:
If applicable, we can push Inner Join through Outer Join. 

The reordering can reduce the number of processed rows since the `Inner Join` 
always can generate less rows than `Left/Right Outer Join`. Thus, it can 
improve the query performance.

For example, given the following eligible query:
{code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === 
$"b.int", "inner"){code}

Before the fix, the logical plan is like
{code}
Join Inner, Some((int#15 = int#9))
:- Join RightOuter, Some((int#3 = int#9))
:  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
:  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
+- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
{code}
After the fix, the logical plan should be like
{code}
Join RightOuter, Some((int#3 = int#9))
:- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
+- Join Inner, Some((int#15 = int#9))
   :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
   +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
{code}


> Join Reordering: Pushing Inner Join Through Outer Join
> --
>
> Key: SPARK-12602
> URL: https://issues.apache.org/jira/browse/SPARK-12602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Priority: Critical
>
> If applicable, we can push Inner Join through Outer Join. The basic idea is 
> built on the associativity property of outer and inner joins:
> {code}
> R1 inner (R2 left R3 on p23) on p12 = (R1 inner R2 on p12) left R3 on p23
> R1 inner (R2 right R3 on p23) on p13 = R2 right (R1 inner R3 on p13) on p23 = 
> (R1 inner R3 on p13) left R2 on p23
> (R1 left R2 on p12) inner R3 on p13 = (R1 inner R3 on p13) left R2 on p12
> (R1 right R2 on p12) inner R3 on p23 = R1 right (R2 inner R3 on p23) on p12 = 
> (R2 inner R3 on p23) left R1 on p12
> {code}
> The reordering can reduce the number of processed rows, since the Inner Join 
> always generates fewer (or equally many) rows than the Left/Right Outer Join. 
> This change can improve query performance in most cases.
> When cost-based optimization is available, we can switch the order of tables 
> in each join type based on their costs. The order of joined tables in the 
> inner join does not affect the results and the right outer join can be 
> changed to the left outer join. This part is out of scope here.
> For example, given the following eligible query:
> {code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === 
> $"b.int", "inner"){code}
> Before the fix, the logical plan is like
> {code}
> Join Inner, Some((int#15 = int#9))
> :- Join RightOuter, Some((int#3 = int#9))
> :  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> :  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],

[jira] [Commented] (SPARK-6416) RDD.fold() requires the operator to be commutative

2016-01-02 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076704#comment-15076704
 ] 

Mark Hamstra commented on SPARK-6416:
-

I still don't see RDD#fold as being out of bounds with what should be expected 
from the Scala parallel collections model -- there, too, you can get confusing 
results if you don't pay attention to the partitioned nature of the operation:
{code}
scala> val list1 = (1 to 1).toList

scala> val list2 = (1 to 100).toList

scala> list1.fold(0){ case (a, b) => a + 1 }
res0: Int = 1

scala> list1.par.fold(0){ case (a, b) => a + 1 }
res1: Int = 162

scala> list2.fold(0){ case (a, b) => a + 1 }
res2: Int = 100

scala> list2.par.fold(0){ case (a, b) => a + 1 }
res3: Int = 7
{code}

> RDD.fold() requires the operator to be commutative
> --
>
> Key: SPARK-6416
> URL: https://issues.apache.org/jira/browse/SPARK-6416
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Reporter: Josh Rosen
>Priority: Critical
>
> Spark's {{RDD.fold}} operation has some confusing behaviors when a 
> non-commutative reduce function is used.
> Here's an example, which was originally reported on StackOverflow 
> (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output):
> {code}
> sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 )
> 8
> {code}
> To understand what's going on here, let's look at the definition of Spark's 
> `fold` operation.  
> I'm going to show the Python version of the code, but the Scala version 
> exhibits the exact same behavior (you can also [browse the source on 
> GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]):
> {code}
> def fold(self, zeroValue, op):
> """
> Aggregate the elements of each partition, and then the results for all
> the partitions, using a given associative function and a neutral "zero
> value."
> The function C{op(t1, t2)} is allowed to modify C{t1} and return it
> as its result value to avoid object allocation; however, it should not
> modify C{t2}.
> >>> from operator import add
> >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
> 15
> """
> def func(iterator):
> acc = zeroValue
> for obj in iterator:
> acc = op(obj, acc)
> yield acc
> vals = self.mapPartitions(func).collect()
> return reduce(op, vals, zeroValue)
> {code}
> (For comparison, see the [Scala implementation of 
> `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]).
> Spark's `fold` operates by first folding each partition and then folding the 
> results.  The problem is that an empty partition gets folded down to the zero 
> element, so the final driver-side fold ends up folding one value for _every_ 
> partition rather than one value for each _non-empty_ partition.  This means 
> that the result of `fold` is sensitive to the number of partitions:
> {code}
> >>> sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 )
> 100
> >>> sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 )
> 50
> >>> sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 )
> 1
> {code}
> In this last case, what's happening is that the single partition is being 
> folded down to the correct value, then that value is folded with the 
> zero-value at the driver to yield 1.
> I think the underlying problem here is that our fold() operation implicitly 
> requires the operator to be commutative in addition to associative, but this 
> isn't documented anywhere.  Due to ordering non-determinism elsewhere in 
> Spark, such as SPARK-5750, I don't think there's an easy way to fix this.  
> Therefore, I think we should update the documentation and examples to clarify 
> this requirement and explain that our fold acts more like a reduce with a 
> default value than the type of ordering-sensitive fold() that users may 
> expect in functional languages.






[jira] [Assigned] (SPARK-12579) User-specified JDBC driver should always take precedence

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12579:


Assignee: Apache Spark  (was: Josh Rosen)

> User-specified JDBC driver should always take precedence
> 
>
> Key: SPARK-12579
> URL: https://issues.apache.org/jira/browse/SPARK-12579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Spark SQL's JDBC data source allows users to specify an explicit JDBC driver 
> to load using the {{driver}} argument, but in the current code it's possible 
> that the user-specified driver will not be used when it comes time to 
> actually create a JDBC connection.
> In a nutshell, the problem is that you might have multiple JDBC drivers on 
> your classpath that claim to be able to handle the same subprotocol and there 
> doesn't seem to be an intuitive way to control which of those drivers takes 
> precedence.
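
As a hedged illustration of where the user-specified driver enters the picture, here is a minimal reader configuration; the JDBC URL, table name, and driver class are placeholders, not values from this issue:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JdbcDriverOptionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-driver-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // The "driver" option names the driver class the user wants; this issue is about
    // making that choice always win over other drivers found on the classpath.
    val df = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb") // placeholder URL
      .option("dbtable", "public.events")                   // placeholder table
      .option("driver", "org.postgresql.Driver")            // explicitly chosen driver class
      .load()

    df.printSchema()
    sc.stop()
  }
}
{code}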






[jira] [Assigned] (SPARK-12579) User-specified JDBC driver should always take precedence

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12579:


Assignee: Josh Rosen  (was: Apache Spark)

> User-specified JDBC driver should always take precedence
> 
>
> Key: SPARK-12579
> URL: https://issues.apache.org/jira/browse/SPARK-12579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark SQL's JDBC data source allows users to specify an explicit JDBC driver 
> to load using the {{driver}} argument, but in the current code it's possible 
> that the user-specified driver will not be used when it comes time to 
> actually create a JDBC connection.
> In a nutshell, the problem is that you might have multiple JDBC drivers on 
> your classpath that claim to be able to handle the same subprotocol and there 
> doesn't seem to be an intuitive way to control which of those drivers takes 
> precedence.






[jira] [Created] (SPARK-12607) spark-class produced null command strings for "exec"

2016-01-02 Thread SM Wang (JIRA)
SM Wang created SPARK-12607:
---

 Summary: spark-class produced null command strings for "exec"
 Key: SPARK-12607
 URL: https://issues.apache.org/jira/browse/SPARK-12607
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.5.2, 1.4.1, 1.4.0
 Environment: MSYS64 on Windows 7 64 bit
Reporter: SM Wang


When using the run-example script in 1.4.0 to run the SparkPi example, I found 
that it did not print any text to the terminal (e.g., stdout, stderr). After 
further investigation I found that the while loop producing the exec command 
from the launcher class produced a null command array.

This discrepancy was also observed on 1.5.2 and 1.4.1. 1.3.1's behavior seems 
to be correct.






[jira] [Commented] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2016-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076684#comment-15076684
 ] 

Apache Spark commented on SPARK-10359:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10558

> Enumerate Spark's dependencies in a file and diff against it for new pull 
> requests 
> ---
>
> Key: SPARK-10359
> URL: https://issues.apache.org/jira/browse/SPARK-10359
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> Sometimes when we have dependency changes it can be pretty unclear what 
> transitive set of things are changing. If we enumerate all of the 
> dependencies and put them in a source file in the repo, we can make it so 
> that it is very explicit what is changing.






[jira] [Commented] (SPARK-10963) Make KafkaCluster api public

2016-01-02 Thread Youcef HILEM (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076669#comment-15076669
 ] 

Youcef HILEM commented on SPARK-10963:
--

The Cloudera Oryx project uses the KafkaCluster API:
https://github.com/OryxProject/oryx/blob/master/framework/oryx-lambda/src/main/java/com/cloudera/oryx/lambda/AbstractSparkLayer.java#L241

Another Java sample is at 
http://blog.csdn.net/rongyongfeikai2/article/details/49784785



> Make KafkaCluster api public
> 
>
> Key: SPARK-10963
> URL: https://issues.apache.org/jira/browse/SPARK-10963
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Cody Koeninger
>Priority: Minor
>
> Per mailing list discussion, there's enough interest from people using 
> KafkaCluster (e.g. to access the latest offsets) to justify making it public.






[jira] [Commented] (SPARK-12528) Make Apache Spark’s gateway hidden REST API (in standalone cluster mode) public API

2016-01-02 Thread Youcef HILEM (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076662#comment-15076662
 ] 

Youcef HILEM commented on SPARK-12528:
--

You mean: the relevant issue is 
https://issues.apache.org/jira/browse/SPARK-5388, "Provide a stable application 
submission gateway in standalone cluster mode".

> Make Apache Spark’s gateway hidden REST API (in standalone cluster mode) 
> public API
> ---
>
> Key: SPARK-12528
> URL: https://issues.apache.org/jira/browse/SPARK-12528
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Youcef HILEM
>Priority: Minor
>
> Spark has a hidden REST API which handles application submission, status 
> checking and cancellation (https://issues.apache.org/jira/browse/SPARK-5388).
> There is enough interest in using this API to justify making it public:
> - https://github.com/ywilkof/spark-jobs-rest-client
> - https://github.com/yohanliyanage/jenkins-spark-deploy
> - https://github.com/spark-jobserver/spark-jobserver
> - http://stackoverflow.com/questions/28992802/triggering-spark-jobs-with-rest
> - http://stackoverflow.com/questions/34225879/how-to-submit-a-job-via-rest-api
> - http://arturmkrtchyan.com/apache-spark-hidden-rest-api






[jira] [Issue Comment Deleted] (SPARK-7683) Confusing behavior of fold function of RDD in pyspark

2016-01-02 Thread Maciej Szymkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-7683:
--
Comment: was deleted

(was: [~srowen] Do you have any example how it could break existing code? In 
Scala it is pretty obvious but it looks like the current implementation 
isolates Python RDDs from the effects of modifying mutable elements in place. )

> Confusing behavior of fold function of RDD in pyspark
> -
>
> Key: SPARK-7683
> URL: https://issues.apache.org/jira/browse/SPARK-7683
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.3.1
>Reporter: Ai He
>Priority: Minor
>  Labels: releasenotes
>
> This will make the “fold” function consistent with the "fold" in rdd.scala 
> and other "aggregate" functions where “acc” goes first. Otherwise, users have 
> to write a lambda function like “lambda x, y: op(y, x)” if they want to use 
> “zeroValue” to change the result type.






[jira] [Commented] (SPARK-7683) Confusing behavior of fold function of RDD in pyspark

2016-01-02 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076650#comment-15076650
 ] 

Maciej Szymkiewicz commented on SPARK-7683:
---

[~srowen] Do you have any example how it could break existing code? In Scala it 
is pretty obvious but it looks like the current implementation isolates Python 
RDDs from the effects of modifying mutable elements in place. 

> Confusing behavior of fold function of RDD in pyspark
> -
>
> Key: SPARK-7683
> URL: https://issues.apache.org/jira/browse/SPARK-7683
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.3.1
>Reporter: Ai He
>Priority: Minor
>  Labels: releasenotes
>
> This will make the “fold” function consistent with the "fold" in rdd.scala 
> and other "aggregate" functions where “acc” goes first. Otherwise, users have 
> to write a lambda function like “lambda x, y: op(y, x)” if they want to use 
> “zeroValue” to change the result type.






[jira] [Created] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?

2016-01-02 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12606:
---

 Summary: Scala/Java compatibility issue Re: how to extend java 
transformer from Scala UnaryTransformer ?
 Key: SPARK-12606
 URL: https://issues.apache.org/jira/browse/SPARK-12606
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.5.2
 Environment: Java 8, Mac OS, Spark-1.5.2
Reporter: Andrew Davidson



Hi Andy,

I suspect that you hit the Scala/Java compatibility issue. I can also reproduce 
it, so could you file a JIRA to track it?

Yanbo

2016-01-02 3:38 GMT+08:00 Andy Davidson :
I am trying to write a trivial transformer to use in my pipeline. I am using 
Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala class as 
an example. This should be very easy; however, I do not understand Scala and am 
having trouble debugging the following exception.

Any help would be greatly appreciated.

Happy New Year

Andy

java.lang.IllegalArgumentException: requirement failed: Param null__inputCol 
does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
at org.apache.spark.ml.param.Params$class.set(params.scala:436)
at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
at org.apache.spark.ml.param.Params$class.set(params.scala:422)
at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
at 
org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)



public class StemmerTest extends AbstractSparkTest {
@Test
public void test() {
Stemmer stemmer = new Stemmer()
.setInputCol("raw”) //line 30
.setOutputCol("filtered");
}
}

/**
 * @ see 
spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
 * @ see 
https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
 * @ see 
http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
 * 
 * @author andrewdavidson
 *
 */
public class Stemmer extends UnaryTransformer<List<String>, List<String>, 
Stemmer> implements Serializable {
static Logger logger = LoggerFactory.getLogger(Stemmer.class);
private static final long serialVersionUID = 1L;
private static final  ArrayType inputType = 
DataTypes.createArrayType(DataTypes.StringType, true);
private final String uid = Stemmer.class.getSimpleName() + "_" + 
UUID.randomUUID().toString();

@Override
public String uid() {
return uid;
}

/*
   override protected def validateInputType(inputType: DataType): Unit = {
require(inputType == StringType, s"Input type must be string type but got 
$inputType.")
  }
 */
@Override
public void validateInputType(DataType inputTypeArg) {
String msg = "inputType must be " + inputType.simpleString() + " but 
got " + inputTypeArg.simpleString();
assert (inputType.equals(inputTypeArg)) : msg; 
}

@Override
public Function1<List<String>, List<String>> createTransformFunc() {
// 
http://stackoverflow.com/questions/6545066/using-scala-from-java-passing-functions-as-parameters
Function1<List<String>, List<String>> f = new 
AbstractFunction1<List<String>, List<String>>() {
public List<String> apply(List<String> words) {
for(String word : words) {
logger.error("AEDWIP input word: {}", word);
}
return words;
}
};

return f;
}

@Override
public DataType outputDataType() {
return DataTypes.createArrayType(DataTypes.StringType, true);
}
}







[jira] [Commented] (SPARK-12578) Parser should not silently ignore the distinct keyword used in an aggregate function when OVER clause is used

2016-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076631#comment-15076631
 ] 

Apache Spark commented on SPARK-12578:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/10557

> Parser should not silently ignore the distinct keyword used in an aggregate 
> function when OVER clause is used
> -
>
> Key: SPARK-12578
> URL: https://issues.apache.org/jira/browse/SPARK-12578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> Right now, when an aggregate function is used as a window function and 
> DISTINCT is used, Hive's parser silently drops the DISTINCT keyword. It is 
> fine not to support DISTINCT aggregation in window functions, but it is not 
> good to silently drop the keyword.
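
A hedged, minimal reproduction sketch of the kind of query involved (the table and column names are made up; depending on whether the parser fix is applied, the DISTINCT is either silently discarded or rejected with an explicit error):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object DistinctOverWindowSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("distinct-over-sketch").setMaster("local[2]"))
    val sqlContext = new HiveContext(sc)   // window functions need HiveContext on 1.5/1.6
    import sqlContext.implicits._

    sc.parallelize(Seq(("x", 1), ("x", 1), ("x", 2))).toDF("a", "b").registerTempTable("t")

    // The DISTINCT inside the window aggregate is what the HiveQL parser used to
    // discard silently; the requested behavior is an explicit "not supported" error.
    sqlContext.sql(
      "SELECT a, count(DISTINCT b) OVER (PARTITION BY a) AS distinct_b FROM t").show()

    sc.stop()
  }
}
{code}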






[jira] [Assigned] (SPARK-12578) Parser should not silently ignore the distinct keyword used in an aggregate function when OVER clause is used

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12578:


Assignee: (was: Apache Spark)

> Parser should not silently ignore the distinct keyword used in an aggregate 
> function when OVER clause is used
> -
>
> Key: SPARK-12578
> URL: https://issues.apache.org/jira/browse/SPARK-12578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> Right now, when an aggregate function is used as a window function and 
> DISTINCT is used, Hive's parser silently drops the DISTINCT keyword. It is 
> fine not to support DISTINCT aggregation in window functions, but it is not 
> good to silently drop the keyword.






[jira] [Assigned] (SPARK-12578) Parser should not silently ignore the distinct keyword used in an aggregate function when OVER clause is used

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12578:


Assignee: Apache Spark

> Parser should not silently ignore the distinct keyword used in an aggregate 
> function when OVER clause is used
> -
>
> Key: SPARK-12578
> URL: https://issues.apache.org/jira/browse/SPARK-12578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Right now, when an aggregate function is used as a window function and 
> DISTINCT is used, Hive's parser silently drops the DISTINCT keyword. It is 
> fine not to support DISTINCT aggregation in window functions, but it is not 
> good to silently drop the keyword.






[jira] [Assigned] (SPARK-12605) Pushing Join Predicates Through Union All

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12605:


Assignee: (was: Apache Spark)

> Pushing Join Predicates Through Union All
> -
>
> Key: SPARK-12605
> URL: https://issues.apache.org/jira/browse/SPARK-12605
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> When selectivity of Join predicates is high, we can push join through union 
> all.






[jira] [Commented] (SPARK-12605) Pushing Join Predicates Through Union All

2016-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076595#comment-15076595
 ] 

Apache Spark commented on SPARK-12605:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10556

> Pushing Join Predicates Through Union All
> -
>
> Key: SPARK-12605
> URL: https://issues.apache.org/jira/browse/SPARK-12605
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> When selectivity of Join predicates is high, we can push join through union 
> all.






[jira] [Assigned] (SPARK-12605) Pushing Join Predicates Through Union All

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12605:


Assignee: Apache Spark

> Pushing Join Predicates Through Union All
> -
>
> Key: SPARK-12605
> URL: https://issues.apache.org/jira/browse/SPARK-12605
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> When selectivity of Join predicates is high, we can push join through union 
> all.






[jira] [Created] (SPARK-12605) Pushing Join Predicates Through Union All

2016-01-02 Thread Xiao Li (JIRA)
Xiao Li created SPARK-12605:
---

 Summary: Pushing Join Predicates Through Union All
 Key: SPARK-12605
 URL: https://issues.apache.org/jira/browse/SPARK-12605
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Xiao Li


When the selectivity of the join predicates is high, we can push the join 
through the union all (see the sketch below).
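
A hedged DataFrame-level sketch of the rewrite such an optimizer rule would perform automatically (the table names and data are made up):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JoinThroughUnionAllSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-through-union").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val part1 = Seq((1, "a"), (2, "b")).toDF("key", "v")
    val part2 = Seq((3, "c"), (4, "d")).toDF("key", "v")
    val dim   = Seq((1, "kept")).toDF("key", "label")   // highly selective join side

    // As written: the join sits on top of the union of both parts.
    val original = part1.unionAll(part2).join(dim, "key")

    // After pushing the join below the union: each part is joined (and thinned
    // out by the selective predicate) first, and only the surviving rows are unioned.
    val pushed = part1.join(dim, "key").unionAll(part2.join(dim, "key"))

    original.explain(true)
    pushed.explain(true)
    sc.stop()
  }
}
{code}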






[jira] [Commented] (SPARK-12528) Make Apache Spark’s gateway hidden REST API (in standalone cluster mode) public API

2016-01-02 Thread Jim Lohse (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076572#comment-15076572
 ] 

Jim Lohse commented on SPARK-12528:
---

This PR is relevant: https://issues.apache.org/jira/browse/SPARK-12528 Provide 
a stable application submission gateway in standalone cluster mode

> Make Apache Spark’s gateway hidden REST API (in standalone cluster mode) 
> public API
> ---
>
> Key: SPARK-12528
> URL: https://issues.apache.org/jira/browse/SPARK-12528
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Youcef HILEM
>Priority: Minor
>
> Spark has a hidden REST API which handles application submission, status 
> checking and cancellation (https://issues.apache.org/jira/browse/SPARK-5388).
> There is enough interest in using this API to justify making it public:
> - https://github.com/ywilkof/spark-jobs-rest-client
> - https://github.com/yohanliyanage/jenkins-spark-deploy
> - https://github.com/spark-jobserver/spark-jobserver
> - http://stackoverflow.com/questions/28992802/triggering-spark-jobs-with-rest
> - http://stackoverflow.com/questions/34225879/how-to-submit-a-job-via-rest-api
> - http://arturmkrtchyan.com/apache-spark-hidden-rest-api






[jira] [Commented] (SPARK-10963) Make KafkaCluster api public

2016-01-02 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076536#comment-15076536
 ] 

Cody Koeninger commented on SPARK-10963:


There's a nonzero chance that the 0.9 integration will be a totally
separate subproject.



> Make KafkaCluster api public
> 
>
> Key: SPARK-10963
> URL: https://issues.apache.org/jira/browse/SPARK-10963
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Cody Koeninger
>Priority: Minor
>
> Per mailing list discussion, there's enough interest from people using 
> KafkaCluster (e.g. to access the latest offsets) to justify making it public.






[jira] [Updated] (SPARK-12478) Dataset fields of product types can't be null

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12478:
--
Target Version/s: 1.6.1, 2.0.0  (was: 2.0.0)

> Dataset fields of product types can't be null
> -
>
> Key: SPARK-12478
> URL: https://issues.apache.org/jira/browse/SPARK-12478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>  Labels: backport-needed
>
> Spark shell snippet for reproduction:
> {code}
> import sqlContext.implicits._
> case class Inner(f: Int)
> case class Outer(i: Inner)
> Seq(Outer(null)).toDS().toDF().show()
> Seq(Outer(null)).toDS().show()
> {code}
> Expected output should be:
> {noformat}
> +----+
> |   i|
> +----+
> |null|
> +----+
> +----+
> |   i|
> +----+
> |null|
> +----+
> {noformat}
> Actual output:
> {noformat}
> +------+
> |     i|
> +------+
> |[null]|
> +------+
> java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: 
> Null value appeared in non-nullable field Inner.f of type scala.Int. If the 
> schema is inferred from a Scala tuple/case class, or a Java bean, please try 
> to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
> instead of int/scala.Int).
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, 
> StructType(StructField(f,IntegerType,false))])) null else newinstance(class 
> $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class 
> $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3))
> +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null 
> else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>:- isnull(input[0, StructType(StructField(f,IntegerType,false))])
>:  +- input[0, StructType(StructField(f,IntegerType,false))]
>:- null
>+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>   +- assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int)
>  +- input[0, StructType(StructField(f,IntegerType,false))].f
> +- input[0, StructType(StructField(f,IntegerType,false))]
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.Dataset.collect(Dataset.scala:704)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:725)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:240)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:46)
> at $iwC$$iwC$$iwC$$iwC.(:48)
> at $iwC$$iwC$$iwC.(:50)
> at $iwC$$iwC.(:52)
> at $iwC.(:54)
> at (:56)
> at .(:60)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1045)
> at 
> org.apache.spa

[jira] [Resolved] (SPARK-8119) HeartbeatReceiver should not adjust application executor resources

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8119.
--
  Resolution: Fixed
Target Version/s:   (was: 1.4.2)

I don't think this will be back-ported to 1.4.x at this point

> HeartbeatReceiver should not adjust application executor resources
> --
>
> Key: SPARK-8119
> URL: https://issues.apache.org/jira/browse/SPARK-8119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: SaintBacchus
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.5.0
>
>
> Dynamic allocation will set the total number of executors to a small number 
> when it wants to kill some executors.
> But in the non-dynamic-allocation scenario, Spark will also set the total 
> number of executors.
> This can cause the following problem: when an executor fails, no replacement 
> executor is brought up by Spark.
> === EDIT by andrewor14 ===
> The issue is that the AM forgets about the original number of executors it 
> wants after calling sc.killExecutor. Even if dynamic allocation is not 
> enabled, this is still possible because of heartbeat timeouts.
> I think the problem is that sc.killExecutor is used incorrectly in 
> HeartbeatReceiver. The intention of the method is to permanently adjust the 
> number of executors the application will get. In HeartbeatReceiver, however, 
> this is used as a best-effort mechanism to ensure that the timed out executor 
> is dead.






[jira] [Updated] (SPARK-8119) HeartbeatReceiver should not adjust application executor resources

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8119:
-
Labels:   (was: backport-needed)

> HeartbeatReceiver should not adjust application executor resources
> --
>
> Key: SPARK-8119
> URL: https://issues.apache.org/jira/browse/SPARK-8119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: SaintBacchus
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.5.0
>
>
> Dynamic allocation will set the total number of executors to a small number 
> when it wants to kill some executors.
> But in the non-dynamic-allocation scenario, Spark will also set the total 
> number of executors.
> This can cause the following problem: when an executor fails, no replacement 
> executor is brought up by Spark.
> === EDIT by andrewor14 ===
> The issue is that the AM forgets about the original number of executors it 
> wants after calling sc.killExecutor. Even if dynamic allocation is not 
> enabled, this is still possible because of heartbeat timeouts.
> I think the problem is that sc.killExecutor is used incorrectly in 
> HeartbeatReceiver. The intention of the method is to permanently adjust the 
> number of executors the application will get. In HeartbeatReceiver, however, 
> this is used as a best-effort mechanism to ensure that the timed out executor 
> is dead.






[jira] [Updated] (SPARK-6197) handle json parse exception for eventlog file not finished writing

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6197:
-
Labels:   (was: backport-needed)

I don't think this will ever be back-ported to 1.3.x at this point

> handle json parse exception for eventlog file not finished writing 
> ---
>
> Key: SPARK-6197
> URL: https://issues.apache.org/jira/browse/SPARK-6197
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>Priority: Minor
> Fix For: 1.4.0
>
>
> This is a follow-up JIRA to 
> [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107]. In 
> [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107], the web UI can 
> display event log files with the suffix *.inprogress*. However, the event log 
> file may not be completely written in some abnormal cases (e.g. Ctrl+C), in 
> which case the file may be truncated in its last line, leaving that line in an 
> invalid JSON format and causing a JSON parse exception when the file is read. 
> For this case, we can simply ignore the content of the last line, since the 
> history shown on the web for abnormal cases is only a reference for the user: 
> it demonstrates the status of the app before it terminated abnormally (we 
> cannot guarantee that the history shows exactly the last moment when the app 
> encountered the abnormal situation).






[jira] [Resolved] (SPARK-6197) handle json parse exception for eventlog file not finished writing

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6197.
--
  Resolution: Fixed
   Fix Version/s: 1.4.0
Target Version/s:   (was: 1.3.2, 1.4.2)

> handle json parse exception for eventlog file not finished writing 
> ---
>
> Key: SPARK-6197
> URL: https://issues.apache.org/jira/browse/SPARK-6197
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>Priority: Minor
> Fix For: 1.4.0
>
>
> This is a follow-up JIRA to 
> [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107]. In 
> [SPARK-6107|https://issues.apache.org/jira/browse/SPARK-6107], the web UI can 
> display event log files with the suffix *.inprogress*. However, the event log 
> file may not be completely written in some abnormal cases (e.g. Ctrl+C), in 
> which case the file may be truncated in its last line, leaving that line in an 
> invalid JSON format and causing a JSON parse exception when the file is read. 
> For this case, we can simply ignore the content of the last line, since the 
> history shown on the web for abnormal cases is only a reference for the user: 
> it demonstrates the status of the app before it terminated abnormally (we 
> cannot guarantee that the history shows exactly the last moment when the app 
> encountered the abnormal situation).






[jira] [Updated] (SPARK-8447) Test external shuffle service with all shuffle managers

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8447:
-
Issue Type: Improvement  (was: Bug)

> Test external shuffle service with all shuffle managers
> ---
>
> Key: SPARK-8447
> URL: https://issues.apache.org/jira/browse/SPARK-8447
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Tests
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Priority: Critical
>
> There is a mismatch between the shuffle managers in Spark core and in the 
> external shuffle service. The latest unsafe shuffle manager is an example of 
> this (SPARK-8430). This issue arose because we apparently do not have 
> sufficient tests for making sure that these two components deal with the same 
> set of shuffle managers.






[jira] [Updated] (SPARK-11834) Ignore thresholds in LogisticRegression and update documentation

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11834:
--
Target Version/s: 2.0.0  (was: 1.6.1, 2.0.0)

> Ignore thresholds in LogisticRegression and update documentation
> 
>
> Key: SPARK-11834
> URL: https://issues.apache.org/jira/browse/SPARK-11834
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> ml.LogisticRegression does not support multiclass yet. So we should ignore 
> `thresholds` and update the documentation. In the next release, we can do 
> SPARK-11543.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11259) Params.validateParams() should be called automatically

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11259:
--
Target Version/s:   (was: 1.6.1)
Priority: Minor  (was: Major)
  Issue Type: Improvement  (was: Bug)

> Params.validateParams() should be called automatically
> --
>
> Key: SPARK-11259
> URL: https://issues.apache.org/jira/browse/SPARK-11259
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> Params.validateParams() is currently not called automatically. For example, 
> the following code snippet will not throw an exception, which is not as expected.
> {code}
> val df = sqlContext.createDataFrame(
>   Seq(
> (1, Vectors.dense(0.0, 1.0, 4.0), 1.0),
> (2, Vectors.dense(1.0, 0.0, 4.0), 2.0),
> (3, Vectors.dense(1.0, 0.0, 5.0), 3.0),
> (4, Vectors.dense(0.0, 0.0, 5.0), 4.0))
> ).toDF("id", "features", "label")
> val scaler = new MinMaxScaler()
>  .setInputCol("features")
>  .setOutputCol("features_scaled")
>  .setMin(10)
>  .setMax(0)
> val pipeline = new Pipeline().setStages(Array(scaler))
> pipeline.fit(df)
> {code}
> validateParams() should be called automatically by PipelineStage 
> (Pipeline/Estimator/Transformer), so I propose to put it in transformSchema(). 
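A sketch of that proposal, assuming the 1.x Params.validateParams() API; the class and method names below are illustrative, not the eventual implementation:

{code}
import org.apache.spark.ml.PipelineStage
import org.apache.spark.sql.types.StructType

// Any stage built on this base validates its params as soon as the pipeline
// checks schemas, i.e. at fit()/transform() time, before touching any data.
abstract class ValidatingStage extends PipelineStage {
  protected def checkAndTransformSchema(schema: StructType): StructType

  override def transformSchema(schema: StructType): StructType = {
    validateParams()                    // fail fast, e.g. on min > max above
    checkAndTransformSchema(schema)
  }
}
{code}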



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8447) Test external shuffle service with all shuffle managers

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8447:
-
Target Version/s:   (was: 1.6.1)

> Test external shuffle service with all shuffle managers
> ---
>
> Key: SPARK-8447
> URL: https://issues.apache.org/jira/browse/SPARK-8447
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Tests
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Priority: Critical
>
> There is a mismatch between the shuffle managers in Spark core and in the 
> external shuffle service. The latest unsafe shuffle manager is an example of 
> this (SPARK-8430). This issue arose because we apparently do not have 
> sufficient tests for making sure that these two components deal with the same 
> set of shuffle managers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12334) Support read from multiple input paths for orc file in DataFrameReader.orc

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12334:
--
Target Version/s:   (was: 1.6.1)

> Support read from multiple input paths for orc file in DataFrameReader.orc
> --
>
> Key: SPARK-12334
> URL: https://issues.apache.org/jira/browse/SPARK-12334
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> DataFrameReader.json/text/parquet support multiple input paths; orc should be 
> consistent with that. 
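The intended usage, sketched against a 1.6 sqlContext in the shell; the varargs orc(paths: String*) form is the proposal here, not an existing API, and the paths are illustrative:

{code}
// Already supported: multiple input paths for parquet/json/text.
val pq = sqlContext.read.parquet("/data/events/2015-12-31", "/data/events/2016-01-01")

// Proposed: orc should accept the same varargs form instead of a single path.
val orc = sqlContext.read.orc("/data/events/2015-12-31", "/data/events/2016-01-01")
{code}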



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11262:
--
Target Version/s:   (was: 1.6.1, 2.0.0)

> Unit test for gradient, loss layers, memory management for multilayer 
> perceptron
> 
>
> Key: SPARK-11262
> URL: https://issues.apache.org/jira/browse/SPARK-11262
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Tests
>Affects Versions: 1.5.1
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Multi-layer perceptron requires more rigorous tests and refactoring of layer 
> interfaces to accommodate development of new features.
> 1)Implement unit test for gradient and loss
> 2)Refactor the internal layer interface to extract "loss function" 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12286) Support UnsafeRow in all SparkPlan (if possible)

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12286:
--
Fix Version/s: (was: 2.0.0)

> Support UnsafeRow in all SparkPlan (if possible)
> 
>
> Key: SPARK-12286
> URL: https://issues.apache.org/jira/browse/SPARK-12286
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> There are still some SparkPlan does not support UnsafeRow (or does not 
> support well).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-529) Have a single file that controls the environmental variables and spark config options

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-529:

   Priority: Major
Component/s: Spark Core

> Have a single file that controls the environmental variables and spark config 
> options
> -
>
> Key: SPARK-529
> URL: https://issues.apache.org/jira/browse/SPARK-529
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>
> E.g. multiple places in the code base use SPARK_MEM, each with its own default 
> set to 512. We need a central place to enforce default values as well as to 
> document the variables.
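A sketch of the idea; the object name and the memory-string handling are illustrative. One object owns the default instead of each call site re-reading the environment variable with its own fallback:

{code}
// Central place for the default; call sites use SparkDefaults.executorMemoryMb
// instead of re-reading SPARK_MEM with a hard-coded 512.
object SparkDefaults {
  val executorMemoryMb: Int =
    sys.env.get("SPARK_MEM")
      .map(_.trim.toLowerCase.stripSuffix("m").toInt)  // simplistic: assumes "512m"-style values
      .getOrElse(512)
}
{code}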



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12586) Wrong answer with registerTempTable and union sql query

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12586:
--
Component/s: SQL

> Wrong answer with registerTempTable and union sql query
> ---
>
> Key: SPARK-12586
> URL: https://issues.apache.org/jira/browse/SPARK-12586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Windows 7
>Reporter: shao lo
> Attachments: sql_bug.py
>
>
> The following sequence of sql(), registerTempTable() calls gets the wrong 
> answer.
> The correct answer is returned if the temp table is rewritten?
> sql_text = """select row, col, foo, bar, value2 value
> from (select row, col, foo, bar, 8 value2 from t0 where row=1 
> and col=2) s1
>   union select row, col, foo, bar, value from t0 where 
> not (row=1 and col=2)"""
> df2 = sqlContext.sql(sql_text)
> df2.registerTempTable("t1")
> # # The following 2 line workaround fixes the problem somehow?
> # df3 = sqlContext.createDataFrame(df2.collect())
> # df3.registerTempTable("t1")
> # # The following 4 line workaround fixes the problem too..but takes way 
> longer
> # filename = "t1.json"
> # df2.write.json(filename, mode='overwrite')
> # df3 = sqlContext.read.json(filename)
> # df3.registerTempTable("t1")
> sql_text2 = """select row, col, v1 value from
> (select v1 from
> (select v_value v1 from values) s1
>   left join
> (select value v2,foo,bar,row,col from t1
>   where foo=1
> and bar=2 and value is not null) s2
>   on v1=v2 where v2 is null
> ) sa join
> (select row, col from t1 where foo=1
> and bar=2 and value is null) sb"""
> result = sqlContext.sql(sql_text2)
> result.show()
> 
> # Expected result
> # +---+---+-----+
> # |row|col|value|
> # +---+---+-----+
> # |  3|  4|    1|
> # |  3|  4|    2|
> # |  3|  4|    3|
> # |  3|  4|    4|
> # +---+---+-----+
> # Getting this wrong result...when not using the workarounds above
> # +---+---+-----+
> # |row|col|value|
> # +---+---+-----+
> # +---+---+-----+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12598) Bug in setMinPartitions function of StreamFileInputFormat

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12598:
--
Component/s: Spark Core

> Bug in setMinPartitions function of StreamFileInputFormat
> -
>
> Key: SPARK-12598
> URL: https://issues.apache.org/jira/browse/SPARK-12598
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Darek Blasiak
>Priority: Minor
>
> The maxSplitSize should be computed as:
> val maxSplitSize = Math.ceil(totalLen * 1.0 / minPartitions).toLong
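A small worked example of why the ceiling matters; the byte counts are illustrative. Truncating integer division under-sizes maxSplitSize, which then yields more partitions than requested:

{code}
val totalLen      = 10L   // total bytes across all files
val minPartitions = 3

// Truncating division: maxSplitSize = 3, so 10 bytes need ceil(10/3) = 4 splits.
val truncated = totalLen / minPartitions                          // 3

// Ceiling division: maxSplitSize = 4, so 10 bytes fit in ceil(10/4) = 3 splits.
val ceiled = Math.ceil(totalLen * 1.0 / minPartitions).toLong     // 4
{code}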



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12546) Writing to partitioned parquet table can fail with OOM

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12546:
--
Component/s: SQL

> Writing to partitioned parquet table can fail with OOM
> --
>
> Key: SPARK-12546
> URL: https://issues.apache.org/jira/browse/SPARK-12546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Nong Li
>
> It is possible to have jobs fail with OOM when writing to a partitioned 
> parquet table. While this was probably always possible, it is more likely in 
> 1.6 due to the memory manager changes. The unified memory manager enables 
> Spark to use more of the process memory (in particular, for execution) which 
> gets us in this state more often. This issue can happen for libraries that 
> consume a lot of memory, such as parquet. Prior to 1.6, these libraries would 
> more likely use memory that spark was not using (i.e. from the storage pool). 
> In 1.6, this storage memory can now be used for execution.
> There are a couple of configs that can help with this issue.
>   - parquet.memory.pool.ratio: This is a Parquet config controlling how much of 
> the heap the Parquet writers should use. It defaults to 0.95; consider a much 
> lower value (e.g. 0.1).
>   - spark.memory.fraction: This is a Spark config controlling how much of the 
> memory should be allocated to Spark. Consider setting this to 0.6.
> This should cause jobs to potentially spill more but require less memory. 
> More aggressive tuning will control this trade-off.
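How those two settings might be applied to a job, as a sketch with illustrative values; forwarding the Parquet property through the spark.hadoop. prefix is an assumption about the job's setup, not something stated in the issue:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")                    // give Spark execution less of the heap
  .set("spark.hadoop.parquet.memory.pool.ratio", "0.1")   // cap the Parquet writers' share
{code}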



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12582:


Assignee: Apache Spark  (was: yucai)

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, Windows
>Reporter: yucai
>Assignee: Apache Spark
>
> IndexShuffleBlockResolverSuite fails on my Windows development machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> The root cause is that when "afterEach" tries to clean up data, some files are 
> still open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> The stream above lacks an "in.close()".
> On Linux this is not a problem, since a file can be deleted even while it is 
> open, but it does not work on Windows, which reports "resource is busy".
> Another issue is that IndexShuffleBlockResolverSuite.scala is a Scala file 
> but is placed in "test/java".
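A sketch of the fix for the snippet above (dataFile and === come from the suite's scope): close the stream so the temporary directory can be deleted on Windows.

{code}
// Inside the test, replacing the unclosed FileInputStream:
val in = new java.io.FileInputStream(dataFile)
try {
  val firstByte = new Array[Byte](1)
  in.read(firstByte)
  assert(firstByte(0) === 0)
} finally {
  in.close()   // without this, afterEach() cannot delete the file on Windows
}
{code}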



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12582:


Assignee: yucai  (was: Apache Spark)

> IndexShuffleBlockResolverSuite fails in windows
> ---
>
> Key: SPARK-12582
> URL: https://issues.apache.org/jira/browse/SPARK-12582
> Project: Spark
>  Issue Type: Bug
>  Components: Tests, Windows
>Reporter: yucai
>Assignee: yucai
>
> IndexShuffleBlockResolverSuite fails on my Windows development machine.
> {code}
> [info] IndexShuffleBlockResolverSuite:
> [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds)
> [info]   Array(10, 0, 20) equaled Array(10, 0, 20) 
> (IndexShuffleBlockResolverSuite.scala:108)
> [info]   org.scalatest.exceptions.TestFailedException:
> .
> .
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.shuffle.sort.IndexShuffleB
> lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712
> -4b1c-a089-f421db149e65
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 60)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220)
> [info]   at 
> org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala:
> 36)
> {code}
> The root cause is that when "afterEach" tries to clean up data, some files are 
> still open. For example:
> {code}
> // The dataFile should be the previous one
> val in = new FileInputStream(dataFile)
> val firstByte = new Array[Byte](1)
> in.read(firstByte)
> assert(firstByte(0) === 0)
> {code}
> The stream above lacks an "in.close()".
> On Linux this is not a problem, since a file can be deleted even while it is 
> open, but it does not work on Windows, which reports "resource is busy".
> Another issue is that IndexShuffleBlockResolverSuite.scala is a Scala file 
> but is placed in "test/java".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12481) Remove usage of Hadoop deprecated APIs and reflection that supported 1.x

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12481.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10446
[https://github.com/apache/spark/pull/10446]

> Remove usage of Hadoop deprecated APIs and reflection that supported 1.x
> 
>
> Key: SPARK-12481
> URL: https://issues.apache.org/jira/browse/SPARK-12481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, Streaming
>Reporter: Sean Owen
>Assignee: Sean Owen
> Fix For: 2.0.0
>
>
> Many API calls that were deprecated as of Hadoop 2.2 can be fixed now to use 
> the non-deprecated methods. Also, some reflection-based acrobatics to support 
> 2.x and 1.x can be removed now too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12604) Java count(AprroxDistinct)ByKey methods return Scala Long not Java

2016-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076513#comment-15076513
 ] 

Apache Spark commented on SPARK-12604:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10554

> Java count(AprroxDistinct)ByKey methods return Scala Long not Java
> --
>
> Key: SPARK-12604
> URL: https://issues.apache.org/jira/browse/SPARK-12604
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, Spark Core
>Affects Versions: 1.6.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: releasenotes
>
> Another minor API problem I noticed while digging around. The following Java 
> API methods return a Long as part of their signature, but it's a 
> {{scala.Long}}, not a {{java.lang.Long}}:
> * countByKey
> * countApproxDistinctByKey
> Other similar methods correctly return a Java Long, like countByValue, and 
> the whole Java streaming API.
> Of course, changing this is probably the right thing to do, but also is an 
> API change. I think it's worth fixing up 
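A sketch of the mismatch; the trait names are illustrative stand-ins, not Spark's classes. When the Scala primitive leaks into a Java-facing generic signature, Java callers do not get java.lang.Long:

{code}
// What the Java-facing API currently exposes: scala.Long in a generic position,
// which is not java.lang.Long from a Java caller's point of view.
trait CurrentShape[K]  { def countByKey(): java.util.Map[K, Long] }

// What Java callers expect, and what countByValue already provides.
trait ExpectedShape[K] { def countByKey(): java.util.Map[K, java.lang.Long] }
{code}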



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12604) Java count(AprroxDistinct)ByKey methods return Scala Long not Java

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12604:


Assignee: Sean Owen  (was: Apache Spark)

> Java count(AprroxDistinct)ByKey methods return Scala Long not Java
> --
>
> Key: SPARK-12604
> URL: https://issues.apache.org/jira/browse/SPARK-12604
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, Spark Core
>Affects Versions: 1.6.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: releasenotes
>
> Another minor API problem I noticed while digging around. The following Java 
> API methods return a Long as part of their signature, but it's a 
> {{scala.Long}}, not a {{java.lang.Long}}:
> * countByKey
> * countApproxDistinctByKey
> Other similar methods correctly return a Java Long, like countByValue, and 
> the whole Java streaming API.
> Of course, changing this is probably the right thing to do, but also is an 
> API change. I think it's worth fixing up 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12604) Java count(AprroxDistinct)ByKey methods return Scala Long not Java

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12604:


Assignee: Apache Spark  (was: Sean Owen)

> Java count(AprroxDistinct)ByKey methods return Scala Long not Java
> --
>
> Key: SPARK-12604
> URL: https://issues.apache.org/jira/browse/SPARK-12604
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, Spark Core
>Affects Versions: 1.6.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>  Labels: releasenotes
>
> Another minor API problem I noticed while digging around. The following Java 
> API methods return a Long as part of their signature, but it's a 
> {{scala.Long}}, not a {{java.lang.Long}}:
> * countByKey
> * countApproxDistinctByKey
> Other similar methods correctly return a Java Long, like countByValue, and 
> the whole Java streaming API.
> Of course, changing this is probably the right thing to do, but also is an 
> API change. I think it's worth fixing up 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12537) Add option to accept quoting of all character backslash quoting mechanism

2016-01-02 Thread Cazen Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076510#comment-15076510
 ] 

Cazen Lee commented on SPARK-12537:
---

Happy New Year!
The situation seemed to require further discussion.
Please tell me what I can do to help on this issue.
Thank you

> Add option to accept quoting of all character backslash quoting mechanism
> -
>
> Key: SPARK-12537
> URL: https://issues.apache.org/jira/browse/SPARK-12537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Cazen Lee
>Assignee: Apache Spark
>
> We can provide an option to choose whether the JSON parser accepts backslash 
> quoting of any character or not.
> For example, if a JSON file includes an escape that is not listed in the JSON 
> backslash quoting specification, it returns corrupt_record:
> {code:title=JSON File|borderStyle=solid}
> {"name": "Cazen Lee", "price": "$10"}
> {"name": "John Doe", "price": "\$20"}
> {"name": "Tracy", "price": "$10"}
> {code}
> corrupt_record (returns null)
> {code}
> scala> df.show
> +--------------------+---------+-----+
> |     _corrupt_record|     name|price|
> +--------------------+---------+-----+
> |                null|Cazen Lee|  $10|
> |{"name": "John Do...|     null| null|
> |                null|    Tracy|  $10|
> +--------------------+---------+-----+
> {code}
> And after apply this patch, we can enable allowBackslashEscapingAnyCharacter 
> option like below
> {code}
> scala> val df = sqlContext.read.option("allowBackslashEscapingAnyCharacter", 
> "true").json("/user/Cazen/test/test2.txt")
> df: org.apache.spark.sql.DataFrame = [name: string, price: string]
> scala> df.show
> +---------+-----+
> |     name|price|
> +---------+-----+
> |Cazen Lee|  $10|
> | John Doe|  $20|
> |    Tracy|  $10|
> +---------+-----+
> {code}
> This issue similar to HIVE-11825, HIVE-12717.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12598) Bug in setMinPartitions function of StreamFileInputFormat

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12598:


Assignee: (was: Apache Spark)

> Bug in setMinPartitions function of StreamFileInputFormat
> -
>
> Key: SPARK-12598
> URL: https://issues.apache.org/jira/browse/SPARK-12598
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
>Reporter: Darek Blasiak
>Priority: Minor
>
> The maxSplitSize should be computed as:
> val maxSplitSize = Math.ceil(totalLen * 1.0 / minPartitions).toLong



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12598) Bug in setMinPartitions function of StreamFileInputFormat

2016-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076509#comment-15076509
 ] 

Apache Spark commented on SPARK-12598:
--

User 'datafarmer' has created a pull request for this issue:
https://github.com/apache/spark/pull/10546

> Bug in setMinPartitions function of StreamFileInputFormat
> -
>
> Key: SPARK-12598
> URL: https://issues.apache.org/jira/browse/SPARK-12598
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
>Reporter: Darek Blasiak
>Priority: Minor
>
> The maxSplitSize should be computed as:
> val maxSplitSize = Math.ceil(totalLen * 1.0 / minPartitions).toLong



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12598) Bug in setMinPartitions function of StreamFileInputFormat

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12598:


Assignee: Apache Spark

> Bug in setMinPartitions function of StreamFileInputFormat
> -
>
> Key: SPARK-12598
> URL: https://issues.apache.org/jira/browse/SPARK-12598
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
>Reporter: Darek Blasiak
>Assignee: Apache Spark
>Priority: Minor
>
> The maxSplitSize should be computed as:
> val maxSplitSize = Math.ceil(totalLen * 1.0 / minPartitions).toLong



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12604) Java count(AprroxDistinct)ByKey methods return Scala Long not Java

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12604:
--
Description: 
Another minor API problem I noticed while digging around. The following Java 
API methods return a Long as part of their signature, but it's a 
{{scala.Long}}, not a {{java.lang.Long}}:

* countByKey
* countApproxDistinctByKey

Other similar methods correctly return a Java Long, like countByValue, and the 
whole Java streaming API.

Of course, changing this is probably the right thing to do, but also is an API 
change. I think it's worth fixing up 

  was:
Another minor API problem I noticed while digging around. The following Java 
API methods return a Long as part of their signature, but it's a 
{{scala.Long}}, not a {{java.lang.Long}}:

* zipWithIndex
* zipWithUniqueId
* countByKey
* countAsync

Other similar methods correctly return a Java Long, like countByValue, and the 
whole Java streaming API.

Of course, changing this is probably the right thing to do, but also is an API 
change. I think it's worth fixing up 

Summary: Java count(AprroxDistinct)ByKey methods return Scala Long not 
Java  (was: Java countByKey, zipWith*, etc methods return Scala Long not Java)

> Java count(AprroxDistinct)ByKey methods return Scala Long not Java
> --
>
> Key: SPARK-12604
> URL: https://issues.apache.org/jira/browse/SPARK-12604
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, Spark Core
>Affects Versions: 1.6.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: releasenotes
>
> Another minor API problem I noticed while digging around. The following Java 
> API methods return a Long as part of their signature, but it's a 
> {{scala.Long}}, not a {{java.lang.Long}}:
> * countByKey
> * countApproxDistinctByKey
> Other similar methods correctly return a Java Long, like countByValue, and 
> the whole Java streaming API.
> Of course, changing this is probably the right thing to do, but also is an 
> API change. I think it's worth fixing up 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12604) Java countByKey, zipWith*, etc methods return Scala Long not Java

2016-01-02 Thread Sean Owen (JIRA)
Sean Owen created SPARK-12604:
-

 Summary: Java countByKey, zipWith*, etc methods return Scala Long 
not Java
 Key: SPARK-12604
 URL: https://issues.apache.org/jira/browse/SPARK-12604
 Project: Spark
  Issue Type: Sub-task
  Components: Java API, Spark Core
Affects Versions: 1.6.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor


Another minor API problem I noticed while digging around. The following Java 
API methods return a Long as part of their signature, but it's a 
{{scala.Long}}, not a {{java.lang.Long}}:

* zipWithIndex
* zipWithUniqueId
* countByKey
* countAsync

Other similar methods correctly return a Java Long, like countByValue, and the 
whole Java streaming API.

Of course, changing this is probably the right thing to do, but also is an API 
change. I think it's worth fixing up 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12421) Fix copy() method of GenericRow

2016-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076501#comment-15076501
 ] 

Apache Spark commented on SPARK-12421:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/10553

> Fix copy() method of GenericRow 
> 
>
> Key: SPARK-12421
> URL: https://issues.apache.org/jira/browse/SPARK-12421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: doepfner
>Priority: Minor
>
> The copy() method of the GenericRow class does not actually copy the row; the 
> method just returns itself.
> Simple reproduction code for the issue:
>  import org.apache.spark.sql.Row;
> val row = Row.fromSeq(Array(1,2,3,4,5))
> val arr = row.toSeq.toArray
> arr(0) = 6
> row // first value changed to 6
> val rowCopied = row.copy()
> val arrCopied = rowCopied.toSeq.toArray
> arrCopied(0) = 7
> row // first value still changed (to 7)
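A minimal stand-in (not Spark's class and not the eventual patch) showing the intended semantics: copy() must clone the backing array rather than return this.

{code}
// Simplified stand-in for a row backed by an Array[Any].
class MiniRow(private val values: Array[Any]) {
  def toSeq: Seq[Any] = values.toSeq
  def copy(): MiniRow = new MiniRow(values.clone())   // clone, instead of `this`
}

val row    = new MiniRow(Array(1, 2, 3, 4, 5))
val copied = row.copy()
// `copied` is backed by its own array, independent of `row`.
{code}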



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12453:


Assignee: (was: Apache Spark)

> Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
> 
>
> Key: SPARK-12453
> URL: https://issues.apache.org/jira/browse/SPARK-12453
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Martin Schade
>Priority: Critical
>  Labels: easyfix
>
> The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS 
> Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0).
> AWS KCL 1.3.0 references AWS Java SDK version 1.9.37.
> Using 1.9.16 in combination with 1.3.0 does fail to get data out of the 
> stream.
> I tested Spark Streaming with 1.9.37 and it works fine. 
> Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also 
> fails, so the failure is due to the specific versions used in 1.5.2 and not to 
> Spark's implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12453:


Assignee: Apache Spark

> Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
> 
>
> Key: SPARK-12453
> URL: https://issues.apache.org/jira/browse/SPARK-12453
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Martin Schade
>Assignee: Apache Spark
>Priority: Critical
>  Labels: easyfix
>
> The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS 
> Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0).
> AWS KCL 1.3.0 references AWS Java SDK version 1.9.37.
> Using 1.9.16 in combination with 1.3.0 does fail to get data out of the 
> stream.
> I tested Spark Streaming with 1.9.37 and it works fine. 
> Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also 
> fails, so the failure is due to the specific versions used in 1.5.2 and not to 
> Spark's implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8555) Online Variational Inference for the Hierarchical Dirichlet Process

2016-01-02 Thread Tu Dinh Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076492#comment-15076492
 ] 

Tu Dinh Nguyen commented on SPARK-8555:
---

Oh, I see. Thank you for pointing this out!


> Online Variational Inference for the Hierarchical Dirichlet Process
> ---
>
> Key: SPARK-8555
> URL: https://issues.apache.org/jira/browse/SPARK-8555
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>Priority: Minor
>
> The task is created for exploration on the online HDP algorithm described in
> http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf.
> Major advantage for the algorithm: one pass on corpus, streaming friendly, 
> automatic K (topic number).
> Currently the scope is to support online HDP for topic modeling, i.e. 
> probably an optimizer for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8555) Online Variational Inference for the Hierarchical Dirichlet Process

2016-01-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076488#comment-15076488
 ] 

Sean Owen commented on SPARK-8555:
--

Have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
 ; generally speaking there are so many algorithms to implement and most aren't 
that useful or widely used, and so few really belong in MLlib itself. I'm not 
commenting on HDP here, though I don't think it's that commonly used. The idea 
is that it should prove itself out externally.

> Online Variational Inference for the Hierarchical Dirichlet Process
> ---
>
> Key: SPARK-8555
> URL: https://issues.apache.org/jira/browse/SPARK-8555
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>Priority: Minor
>
> The task is created for exploration on the online HDP algorithm described in
> http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf.
> Major advantage for the algorithm: one pass on corpus, streaming friendly, 
> automatic K (topic number).
> Currently the scope is to support online HDP for topic modeling, i.e. 
> probably an optimizer for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8555) Online Variational Inference for the Hierarchical Dirichlet Process

2016-01-02 Thread Tu Dinh Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076482#comment-15076482
 ] 

Tu Dinh Nguyen commented on SPARK-8555:
---

Hi Sean,

Thank you for your reply! Would you mind if I ask why Spark is not interested 
in HDP?



> Online Variational Inference for the Hierarchical Dirichlet Process
> ---
>
> Key: SPARK-8555
> URL: https://issues.apache.org/jira/browse/SPARK-8555
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>Priority: Minor
>
> The task is created for exploration on the online HDP algorithm described in
> http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf.
> Major advantage for the algorithm: one pass on corpus, streaming friendly, 
> automatic K (topic number).
> Currently the scope is to support online HDP for topic modeling, i.e. 
> probably an optimizer for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12603) PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft

2016-01-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12603:

Description: PySpark MLlib GaussianMixtureModel should support single 
instance predict/predictSoft just like Scala do.  (was: PySpark MLlib 
GaussianMixtureModel should support single instance predict/predictSoft just 
like Scala one.)

> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft
> -
>
> Key: SPARK-12603
> URL: https://issues.apache.org/jira/browse/SPARK-12603
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>
> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft just like Scala do.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12603) PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12603:


Assignee: (was: Apache Spark)

> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft
> -
>
> Key: SPARK-12603
> URL: https://issues.apache.org/jira/browse/SPARK-12603
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>
> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft just like Scala one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12603) PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft

2016-01-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076478#comment-15076478
 ] 

Apache Spark commented on SPARK-12603:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10552

> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft
> -
>
> Key: SPARK-12603
> URL: https://issues.apache.org/jira/browse/SPARK-12603
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>
> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft just like Scala one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12603) PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft

2016-01-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12603:


Assignee: Apache Spark

> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft
> -
>
> Key: SPARK-12603
> URL: https://issues.apache.org/jira/browse/SPARK-12603
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft just like Scala one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12601) worker output a large number of log when size RollingPolicy shouldRollover method use loginfo

2016-01-02 Thread Ricky Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricky Yang reopened SPARK-12601:


Please check this issue; I think it still needs to be fixed.

>  worker output a large number of log when size RollingPolicy shouldRollover  
> method use  loginfo
> 
>
> Key: SPARK-12601
> URL: https://issues.apache.org/jira/browse/SPARK-12601
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.2
> Environment: standalone 
>Reporter: Ricky Yang
>
> When using the size-based RollingPolicy, this code causes the worker to output 
> a large amount of log. It should be changed to logDebug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6416) RDD.fold() requires the operator to be commutative

2016-01-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076474#comment-15076474
 ] 

Sean Owen commented on SPARK-6416:
--

That's an interesting example, in that I wouldn't have thought the 3rd and 4th 
examples would differ. However your example does violate the contract, since 
you're providing 1 as a neutral element for addition, which isn't valid. In 
Josh's example, he passes 0 and still gets the differing results depending on 
partitions. Your 1st and 2nd examples show Scala APIs would give the same 
answer. Does that change your thinking?

> RDD.fold() requires the operator to be commutative
> --
>
> Key: SPARK-6416
> URL: https://issues.apache.org/jira/browse/SPARK-6416
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Reporter: Josh Rosen
>Priority: Critical
>
> Spark's {{RDD.fold}} operation has some confusing behaviors when a 
> non-commutative reduce function is used.
> Here's an example, which was originally reported on StackOverflow 
> (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output):
> {code}
> sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 )
> 8
> {code}
> To understand what's going on here, let's look at the definition of Spark's 
> `fold` operation.  
> I'm going to show the Python version of the code, but the Scala version 
> exhibits the exact same behavior (you can also [browse the source on 
> GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]):
> {code}
> def fold(self, zeroValue, op):
> """
> Aggregate the elements of each partition, and then the results for all
> the partitions, using a given associative function and a neutral "zero
> value."
> The function C{op(t1, t2)} is allowed to modify C{t1} and return it
> as its result value to avoid object allocation; however, it should not
> modify C{t2}.
> >>> from operator import add
> >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
> 15
> """
> def func(iterator):
> acc = zeroValue
> for obj in iterator:
> acc = op(obj, acc)
> yield acc
> vals = self.mapPartitions(func).collect()
> return reduce(op, vals, zeroValue)
> {code}
> (For comparison, see the [Scala implementation of 
> `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]).
> Spark's `fold` operates by first folding each partition and then folding the 
> results.  The problem is that an empty partition gets folded down to the zero 
> element, so the final driver-side fold ends up folding one value for _every_ 
> partition rather than one value for each _non-empty_ partition.  This means 
> that the result of `fold` is sensitive to the number of partitions:
> {code}
> >>> sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 )
> 100
> >>> sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 )
> 50
> >>> sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 )
> 1
> {code}
> In this last case, what's happening is that the single partition is being 
> folded down to the correct value, then that value is folded with the 
> zero-value at the driver to yield 1.
> I think the underlying problem here is that our fold() operation implicitly 
> requires the operator to be commutative in addition to associative, but this 
> isn't documented anywhere.  Due to ordering non-determinism elsewhere in 
> Spark, such as SPARK-5750, I don't think there's an easy way to fix this.  
> Therefore, I think we should update the documentation and examples to clarify 
> this requirement and explain that our fold acts more like a reduce with a 
> default value than the type of ordering-sensitive fold() that users may 
> expect in functional languages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12597) Use udf to replace callUDF for ML

2016-01-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang closed SPARK-12597.
---
Resolution: Duplicate

> Use udf to replace callUDF for ML
> -
>
> Key: SPARK-12597
> URL: https://issues.apache.org/jira/browse/SPARK-12597
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> callUDF has been deprecated and will be removed in Spark 2.0. We should 
> replace the use of callUDF with udf for ML.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12601) worker output a large number of log when size RollingPolicy shouldRollover method use loginfo

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-12601:
---

>  worker output a large number of log when size RollingPolicy shouldRollover  
> method use  loginfo
> 
>
> Key: SPARK-12601
> URL: https://issues.apache.org/jira/browse/SPARK-12601
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.2
> Environment: standalone 
>Reporter: Ricky Yang
>
> When using the size-based RollingPolicy, this code causes the worker to output 
> a large amount of log. It should be changed to logDebug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12601) worker output a large number of log when size RollingPolicy shouldRollover method use loginfo

2016-01-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12601.
---
  Resolution: Invalid
   Fix Version/s: (was: 1.6.1)
  (was: 1.6.0)
Target Version/s:   (was: 1.6.0, 1.6.1)

I'm not sure what's going on with this JIRA/PR, but you need to read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first 
before opening one. It's not 'fixed' and not even clear something should change.

>  worker output a large number of log when size RollingPolicy shouldRollover  
> method use  loginfo
> 
>
> Key: SPARK-12601
> URL: https://issues.apache.org/jira/browse/SPARK-12601
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.2
> Environment: standalone 
>Reporter: Ricky Yang
>
> When using the size-based RollingPolicy, this code causes the worker to output 
> a large amount of log. It should be changed to logDebug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12603) PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft

2016-01-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12603:

Description: PySpark MLlib GaussianMixtureModel should support single 
instance predict/predictSoft just like Scala one.  (was: PySpark MLlib 
GaussianMixtureModel should support single instance predict/predictSoft.)

> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft
> -
>
> Key: SPARK-12603
> URL: https://issues.apache.org/jira/browse/SPARK-12603
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>
> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft just like Scala one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12603) PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft

2016-01-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12603:

Description: PySpark MLlib GaussianMixtureModel should support single 
instance predict/predictSoft.  (was: MLlib GaussianMixtureModel should support 
single instance predict/predictSoft.)

> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft
> -
>
> Key: SPARK-12603
> URL: https://issues.apache.org/jira/browse/SPARK-12603
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>
> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12603) PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft

2016-01-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12603:

Summary: PySpark MLlib GaussianMixtureModel should support single instance 
predict/predictSoft  (was: MLlib GaussianMixtureModel should support single 
instance predict/predictSoft)

> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft
> -
>
> Key: SPARK-12603
> URL: https://issues.apache.org/jira/browse/SPARK-12603
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>
> MLlib GaussianMixtureModel should support single instance predict/predictSoft.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12603) PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft

2016-01-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12603:

Component/s: PySpark

> PySpark MLlib GaussianMixtureModel should support single instance 
> predict/predictSoft
> -
>
> Key: SPARK-12603
> URL: https://issues.apache.org/jira/browse/SPARK-12603
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>
> MLlib GaussianMixtureModel should support single instance predict/predictSoft.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12603) MLlib GaussianMixtureModel should support single instance predict/predictSoft

2016-01-02 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-12603:
---

 Summary: MLlib GaussianMixtureModel should support single instance 
predict/predictSoft
 Key: SPARK-12603
 URL: https://issues.apache.org/jira/browse/SPARK-12603
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Yanbo Liang


MLlib GaussianMixtureModel should support single instance predict/predictSoft.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10180) JDBCRDD does not process EqualNullSafe filter.

2016-01-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10180.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.0.0

> JDBCRDD does not process EqualNullSafe filter.
> --
>
> Key: SPARK-10180
> URL: https://issues.apache.org/jira/browse/SPARK-10180
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.0.0
>
>
> Simply {{JDBCRelation}} passes EqualNullSafe (source.filter) but 
> {{compileFilter()}} in {{JDBCRDD}} does not apply this.
> It would be a single-line update.
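A sketch of the kind of single-line addition meant here, using the public filter classes; the literal formatting is simplified and this is not the merged change:

{code}
import org.apache.spark.sql.sources.{EqualNullSafe, EqualTo, Filter}

// compileFilter-style translation; real code must quote/escape literals per type.
def compileSketch(f: Filter): Option[String] = f match {
  case EqualTo(attr, value)       => Some(s"$attr = '$value'")
  case EqualNullSafe(attr, null)  => Some(s"$attr IS NULL")     // null-safe equal to NULL
  case EqualNullSafe(attr, value) => Some(s"$attr = '$value'")  // non-null literal
  case _                          => None
}
{code}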



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org