[jira] [Commented] (SPARK-11728) Replace example code in ml-ensembles.md using include_example

2015-11-13 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005225#comment-15005225
 ] 

Xusen Yin commented on SPARK-11728:
---

I'll take this.

> Replace example code in ml-ensembles.md using include_example
> -
>
> Key: SPARK-11728
> URL: https://issues.apache.org/jira/browse/SPARK-11728
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>







[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join

2015-11-13 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005217#comment-15005217
 ] 

Takeshi Yamamuro commented on SPARK-11704:
--

You're right; they're not automatically cached.
I was just saying that the earlier stages of rdd2 are skipped and the iterator 
only fetches blocks from the remote BlockManager (the blocks are written by the 
ShuffleRDD). Do you mean that fetching remote blocks is too slow?

Anyway, adding a cleanup hook could have a big impact on the SparkPlan interfaces.
As an alternative idea, how about caching rdd2 in unsafe space, with logic 
similar to UnsafeExternalSorter's?
We can release the space using TaskContext#addTaskCompletionListener.
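
A minimal sketch of that alternative, assuming a plain on-heap buffer stands in 
for the unsafe space; cartesianIterator and its parameters are illustrative 
names, not existing Spark APIs:

{code}
// Sketch only: materialize the right side once per task and free it when the task completes.
// A plain ArrayBuffer stands in for the proposed unsafe/off-heap storage.
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.TaskContext

def cartesianIterator[T, U](
    left: Iterator[T],
    right: Iterator[U],
    context: TaskContext): Iterator[(T, U)] = {
  val buffer = new ArrayBuffer[U]()
  right.foreach(buffer += _)                      // materialized once per task
  context.addTaskCompletionListener { (_: TaskContext) =>
    buffer.clear()                                // release the cached rows with the task
  }
  left.flatMap(l => buffer.iterator.map(r => (l, r)))
}
{code}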

If you don't have time, I'm happy to take it.

> Optimize the Cartesian Join
> ---
>
> Key: SPARK-11704
> URL: https://issues.apache.org/jira/browse/SPARK-11704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Zhan Zhang
>
> Currently CartesianProduct relies on RDD.cartesian, in which the computation 
> is realized as follows
>   override def compute(split: Partition, context: TaskContext): Iterator[(T, 
> U)] = {
> val currSplit = split.asInstanceOf[CartesianPartition]
> for (x <- rdd1.iterator(currSplit.s1, context);
>  y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
>   }
> From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times, 
> which is really heavy and may never finish if n is large, especially when 
> rdd2 comes from a ShuffleRDD.
> We should optimize CartesianProduct by caching rightResults. The problem is 
> that, AFAIK, we don't have a cleanup hook to unpersist rightResults. I think 
> we should have a cleanup hook that runs after query execution.
> With such a hook available, we can easily optimize this kind of Cartesian 
> join. I believe such a cleanup hook may also benefit other query optimizations.
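
For reference, a minimal user-level sketch of the caching idea above, assuming 
an existing SparkContext {{sc}}; the RDDs and sizes are illustrative, and the 
explicit persist/unpersist calls play the role of the missing cleanup hook:

{code}
// User-level sketch only (assumes an existing SparkContext `sc`).
import org.apache.spark.storage.StorageLevel

val rdd1 = sc.parallelize(1 to 100000)                     // large left side
val rdd2 = sc.parallelize(1 to 100).map(i => (i % 10, i))  // stands in for the shuffled right side
  .reduceByKey(_ + _)

rdd2.persist(StorageLevel.MEMORY_AND_DISK)                 // cache so it is not recomputed per partition
val product = rdd1.cartesian(rdd2)
println(product.count())
rdd2.unpersist(blocking = false)                           // what the proposed cleanup hook would do
{code}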






[jira] [Commented] (SPARK-11725) Let UDF to handle null value

2015-11-13 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005208#comment-15005208
 ] 

Jeff Zhang commented on SPARK-11725:


Thanks [~hvanhovell]. Should we prevent the use of primitive types in UDF 
arguments? Otherwise users may get confusing results.
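
As a user-level workaround sketch, the argument can be declared with a boxed 
type so that the function actually sees the null; the UDF name below is 
hypothetical, {{df}} and {{sqlContext}} are the ones from the example in the 
description, and the boxed-Integer behaviour is my assumption about the 
reflection-based type mapping:

{code}
// Sketch: declare the argument as java.lang.Integer (nullable) instead of Int (primitive),
// so the function receives the null and can decide what to return.
sqlContext.udf.register("f_nullsafe", (x: java.lang.Integer) =>
  if (x == null) null else java.lang.Integer.valueOf(x + 1))

df.withColumn("age2", org.apache.spark.sql.functions.expr("f_nullsafe(age)")).show()
{code}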

> Let UDF to handle null value
> 
>
> Key: SPARK-11725
> URL: https://issues.apache.org/jira/browse/SPARK-11725
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jeff Zhang
>
> I notice that currently Spark will take the long field as -1 if it is null.
> Here's the sample code.
> {code}
> sqlContext.udf.register("f", (x:Int)=>x+1)
> df.withColumn("age2", expr("f(age)")).show()
> /// Output ///
> +----+-------+----+
> | age|   name|age2|
> +----+-------+----+
> |null|Michael|   0|
> |  30|   Andy|  31|
> |  19| Justin|  20|
> +----+-------+----+
> {code}
> I think for the null value we have 3 options:
> * Use a special value to represent it (what Spark does now)
> * Always return null if any argument of the UDF input is null
> * Let the UDF itself handle null
> I would prefer the third option.






[jira] [Assigned] (SPARK-11729) Replace example code in ml-linear-methods.md using include_example

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11729:


Assignee: Apache Spark

> Replace example code in ml-linear-methods.md using include_example
> --
>
> Key: SPARK-11729
> URL: https://issues.apache.org/jira/browse/SPARK-11729
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Apache Spark
>  Labels: starter
>







[jira] [Assigned] (SPARK-11729) Replace example code in ml-linear-methods.md using include_example

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11729:


Assignee: (was: Apache Spark)

> Replace example code in ml-linear-methods.md using include_example
> --
>
> Key: SPARK-11729
> URL: https://issues.apache.org/jira/browse/SPARK-11729
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>







[jira] [Commented] (SPARK-11729) Replace example code in ml-linear-methods.md using include_example

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005207#comment-15005207
 ] 

Apache Spark commented on SPARK-11729:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9713

> Replace example code in ml-linear-methods.md using include_example
> --
>
> Key: SPARK-11729
> URL: https://issues.apache.org/jira/browse/SPARK-11729
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>







[jira] [Assigned] (SPARK-11743) Add UserDefinedType support to RowEncoder

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11743:


Assignee: Apache Spark

> Add UserDefinedType support to RowEncoder
> -
>
> Key: SPARK-11743
> URL: https://issues.apache.org/jira/browse/SPARK-11743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> RowEncoder doesn't support UserDefinedType now. We should add the support for 
> it.






[jira] [Commented] (SPARK-11743) Add UserDefinedType support to RowEncoder

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005197#comment-15005197
 ] 

Apache Spark commented on SPARK-11743:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9712

> Add UserDefinedType support to RowEncoder
> -
>
> Key: SPARK-11743
> URL: https://issues.apache.org/jira/browse/SPARK-11743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> RowEncoder doesn't support UserDefinedType now. We should add the support for 
> it.






[jira] [Assigned] (SPARK-11743) Add UserDefinedType support to RowEncoder

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11743:


Assignee: (was: Apache Spark)

> Add UserDefinedType support to RowEncoder
> -
>
> Key: SPARK-11743
> URL: https://issues.apache.org/jira/browse/SPARK-11743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> RowEncoder doesn't support UserDefinedType now. We should add the support for 
> it.






[jira] [Created] (SPARK-11743) Add UserDefinedType support to RowEncoder

2015-11-13 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-11743:
---

 Summary: Add UserDefinedType support to RowEncoder
 Key: SPARK-11743
 URL: https://issues.apache.org/jira/browse/SPARK-11743
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


RowEncoder doesn't support UserDefinedType now. We should add the support for 
it.






[jira] [Commented] (SPARK-11742) Show batch failures in the Streaming UI landing page

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005175#comment-15005175
 ] 

Apache Spark commented on SPARK-11742:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/9711

> Show batch failures in the Streaming UI landing page
> 
>
> Key: SPARK-11742
> URL: https://issues.apache.org/jira/browse/SPARK-11742
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>







[jira] [Assigned] (SPARK-11742) Show batch failures in the Streaming UI landing page

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11742:


Assignee: (was: Apache Spark)

> Show batch failures in the Streaming UI landing page
> 
>
> Key: SPARK-11742
> URL: https://issues.apache.org/jira/browse/SPARK-11742
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>







[jira] [Assigned] (SPARK-11742) Show batch failures in the Streaming UI landing page

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11742:


Assignee: Apache Spark

> Show batch failures in the Streaming UI landing page
> 
>
> Key: SPARK-11742
> URL: https://issues.apache.org/jira/browse/SPARK-11742
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-11742) Show batch failures in the Streaming UI landing page

2015-11-13 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-11742:


 Summary: Show batch failures in the Streaming UI landing page
 Key: SPARK-11742
 URL: https://issues.apache.org/jira/browse/SPARK-11742
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu









[jira] [Comment Edited] (SPARK-11704) Optimize the Cartesian Join

2015-11-13 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005136#comment-15005136
 ] 

Zhan Zhang edited comment on SPARK-11704 at 11/14/15 5:16 AM:
--

I think we can add a register hook and a cleanup hook in the query context. 
Before the query is performed, the registered handlers are invoked (such as 
persist), and after the query is done, the cleanup hooks are invoked (e.g., 
unpersist). That way, CartesianProduct can cache the rightResult in the 
registered handler and unpersist it after the query, so we avoid recomputing 
RDD2.
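
A rough sketch of the proposed shape, where the register/cleanup hook API does 
not exist in Spark today; the two function parameters are purely hypothetical 
stand-ins for it:

{code}
// Hypothetical sketch only: the register/cleanup hook API proposed above does not exist in
// Spark; the two function parameters below stand in for it.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def cachedCartesian[L, R: ClassTag](left: RDD[L], rightResult: RDD[R])(
    registerHook: (() => Unit) => Unit,   // invoked before the query runs
    cleanupHook: (() => Unit) => Unit     // invoked after the query finishes
  ): RDD[(L, R)] = {
  registerHook(() => { rightResult.persist(StorageLevel.MEMORY_AND_DISK); () })
  cleanupHook(() => { rightResult.unpersist(blocking = false); () })
  left.cartesian(rightResult)
}
{code}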

In my testing, because rdd2 is quite small, I actually reversed the cartesian 
join by rewriting cartesian(rdd1, rdd2) as cartesian(rdd2, rdd1). With that 
change the computation finishes quite fast, but the original form cannot finish.


was (Author: zzhan):
I think we can add a cleanup hook in SQLContext, and when the query is done, we 
invoke all the registered cleanup hooks. That way, CartesianProduct can cache 
the rightResult and register the cleanup handler (unpersist), so we avoid 
recomputing RDD2. 

In my testing, because rdd2 is quite small, I actually reversed the cartesian 
join by rewriting cartesian(rdd1, rdd2) as cartesian(rdd2, rdd1). With that 
change the computation finishes quite fast, but the original form cannot finish.

> Optimize the Cartesian Join
> ---
>
> Key: SPARK-11704
> URL: https://issues.apache.org/jira/browse/SPARK-11704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Zhan Zhang
>
> Currently CartesianProduct relies on RDD.cartesian, in which the computation 
> is realized as follows
>   override def compute(split: Partition, context: TaskContext): Iterator[(T, 
> U)] = {
> val currSplit = split.asInstanceOf[CartesianPartition]
> for (x <- rdd1.iterator(currSplit.s1, context);
>  y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
>   }
> From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times, 
> which is really heavy and may never finish if n is large, especially when 
> rdd2 comes from a ShuffleRDD.
> We should optimize CartesianProduct by caching rightResults. The problem is 
> that, AFAIK, we don't have a cleanup hook to unpersist rightResults. I think 
> we should have a cleanup hook that runs after query execution.
> With such a hook available, we can easily optimize this kind of Cartesian 
> join. I believe such a cleanup hook may also benefit other query optimizations.






[jira] [Updated] (SPARK-11729) Replace example code in ml-linear-methods.md using include_example

2015-11-13 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11729:
--
Description: (was: Process these two markdown files in one JIRA issue 
because they have fewer code changes than other files.)

> Replace example code in ml-linear-methods.md using include_example
> --
>
> Key: SPARK-11729
> URL: https://issues.apache.org/jira/browse/SPARK-11729
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>







[jira] [Updated] (SPARK-11729) Replace example code in ml-linear-methods.md using include_example

2015-11-13 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11729:
--
Summary: Replace example code in ml-linear-methods.md using include_example 
 (was: Replace example code in ml-linear-methods.md and ml-ann.md using 
include_example)

> Replace example code in ml-linear-methods.md using include_example
> --
>
> Key: SPARK-11729
> URL: https://issues.apache.org/jira/browse/SPARK-11729
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>
> Process these two markdown files in one JIRA issue because they have fewer 
> code changes than other files.






[jira] [Commented] (SPARK-11729) Replace example code in ml-linear-methods.md and ml-ann.md using include_example

2015-11-13 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005138#comment-15005138
 ] 

Xusen Yin commented on SPARK-11729:
---

OK I'll take it.

> Replace example code in ml-linear-methods.md and ml-ann.md using 
> include_example
> 
>
> Key: SPARK-11729
> URL: https://issues.apache.org/jira/browse/SPARK-11729
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>
> Process these two markdown files in one JIRA issue because they have fewer 
> code changes than other files.






[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join

2015-11-13 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005136#comment-15005136
 ] 

Zhan Zhang commented on SPARK-11704:


I think we can add a cleanup hook in SQLContext, and when the query is done, we 
invoke all the registered cleanup hooks. That way, CartesianProduct can cache 
the rightResult and register the cleanup handler (unpersist), so we avoid 
recomputing RDD2. 

In my testing, because rdd2 is quite small, I actually reversed the cartesian 
join by rewriting cartesian(rdd1, rdd2) as cartesian(rdd2, rdd1). With that 
change the computation finishes quite fast, but the original form cannot finish.

> Optimize the Cartesian Join
> ---
>
> Key: SPARK-11704
> URL: https://issues.apache.org/jira/browse/SPARK-11704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Zhan Zhang
>
> Currently CartesianProduct relies on RDD.cartesian, in which the computation 
> is realized as follows
>   override def compute(split: Partition, context: TaskContext): Iterator[(T, 
> U)] = {
> val currSplit = split.asInstanceOf[CartesianPartition]
> for (x <- rdd1.iterator(currSplit.s1, context);
>  y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
>   }
> From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times, 
> which is really heavy and may never finish if n is large, especially when 
> rdd2 comes from a ShuffleRDD.
> We should optimize CartesianProduct by caching rightResults. The problem is 
> that, AFAIK, we don't have a cleanup hook to unpersist rightResults. I think 
> we should have a cleanup hook that runs after query execution.
> With such a hook available, we can easily optimize this kind of Cartesian 
> join. I believe such a cleanup hook may also benefit other query optimizations.






[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join

2015-11-13 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005134#comment-15005134
 ] 

Zhan Zhang commented on SPARK-11704:


[~maropu] Maybe I misunderstand. If RDD2 comes from a ShuffleRDD, each new 
iterator will try to fetch over the network because RDD2 is not cached. Is the 
ShuffleRDD cached automatically?

> Optimize the Cartesian Join
> ---
>
> Key: SPARK-11704
> URL: https://issues.apache.org/jira/browse/SPARK-11704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Zhan Zhang
>
> Currently CartesianProduct relies on RDD.cartesian, in which the computation 
> is realized as follows
>   override def compute(split: Partition, context: TaskContext): Iterator[(T, 
> U)] = {
> val currSplit = split.asInstanceOf[CartesianPartition]
> for (x <- rdd1.iterator(currSplit.s1, context);
>  y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
>   }
> From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times, 
> which is really heavy and may never finish if n is large, especially when 
> rdd2 comes from a ShuffleRDD.
> We should optimize CartesianProduct by caching rightResults. The problem is 
> that, AFAIK, we don't have a cleanup hook to unpersist rightResults. I think 
> we should have a cleanup hook that runs after query execution.
> With such a hook available, we can easily optimize this kind of Cartesian 
> join. I believe such a cleanup hook may also benefit other query optimizations.






[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join

2015-11-13 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005111#comment-15005111
 ] 

Takeshi Yamamuro commented on SPARK-11704:
--

ISTM that some earlier stages in rdd2 are skipped in all the iterations except 
the first one when rdd2 comes from a ShuffleRDD.
That said, this optimization is still worth doing.

> Optimize the Cartesian Join
> ---
>
> Key: SPARK-11704
> URL: https://issues.apache.org/jira/browse/SPARK-11704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Zhan Zhang
>
> Currently CartesianProduct relies on RDD.cartesian, in which the computation 
> is realized as follows
>   override def compute(split: Partition, context: TaskContext): Iterator[(T, 
> U)] = {
> val currSplit = split.asInstanceOf[CartesianPartition]
> for (x <- rdd1.iterator(currSplit.s1, context);
>  y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
>   }
> From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times, 
> which is really heavy and may never finish if n is large, especially when 
> rdd2 comes from a ShuffleRDD.
> We should optimize CartesianProduct by caching rightResults. The problem is 
> that, AFAIK, we don't have a cleanup hook to unpersist rightResults. I think 
> we should have a cleanup hook that runs after query execution.
> With such a hook available, we can easily optimize this kind of Cartesian 
> join. I believe such a cleanup hook may also benefit other query optimizations.






[jira] [Resolved] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-7970.
--
  Resolution: Fixed
   Fix Version/s: 1.6.0
Target Version/s: 1.6.0

> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> --
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Nitin Goyal
>Assignee: Nitin Goyal
> Fix For: 1.6.0
>
> Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
> 2015-05-27 at 11.07.02 pm.png
>
>
> The closure cleaner slows down the execution of Spark SQL queries fired on a 
> union of RDDs. The time increases linearly on the driver side with the number 
> of RDDs unioned. Refer to the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, a lot of time is 
> consumed in the "getClassReader" method of ClosureCleaner and the rest in 
> "ensureSerializable" (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create 
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
> the ClosureCleaner clean method (see PR 
> https://github.com/apache/spark/pull/6256).
> 2. Fix at the Spark core level -
>   (i) Make "checkSerializable" property-driven in SparkContext's clean method
>   (ii) Somehow cache the ClassReader for the last 'n' classes






[jira] [Commented] (SPARK-11702) Guava ClassLoading Issue When Using Different Hive Metastore Version

2015-11-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005074#comment-15005074
 ] 

Marcelo Vanzin commented on SPARK-11702:


Sean, it's a real bug. He's not trying to use a different Guava. He's trying to 
use the metastore client libraries for his version of Hive, but HiveContext, 
when loading those libraries, says "everything under com.google.common should 
be loaded from the parent class loader".

Except that when you build Spark with maven, the parent class loader does not 
contain any of those classes, since they're shaded.
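
For reference, a sketch of the configuration involved; the jar paths are 
placeholders, and only the last setting is the reported work-around:

{code}
// Illustrative configuration only; the paths are placeholders.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.sql.hive.metastore.version", "1.0.0")
  .set("spark.sql.hive.metastore.jars", "/opt/hive/lib/*:/opt/hadoop/lib/*")  // placeholder paths
  // Reported work-around: make Guava visible to the driver. Note that extraClassPath has to be
  // in place before the driver JVM starts (e.g. via spark-defaults.conf or spark-submit --conf).
  .set("spark.driver.extraClassPath", "/opt/jars/guava.jar")
{code}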

> Guava ClassLoading Issue When Using Different Hive Metastore Version
> 
>
> Key: SPARK-11702
> URL: https://issues.apache.org/jira/browse/SPARK-11702
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Joey Paskhay
>
> A Guava classloading error can occur when using a different version of the 
> Hive metastore.
> Running the latest version of Spark at this time (1.5.1) and patched versions 
> of Hadoop 2.2.0 and Hive 1.0.0. We set "spark.sql.hive.metastore.version" to 
> "1.0.0" and "spark.sql.hive.metastore.jars" to 
> "/lib/*:". When trying to 
> launch the spark-shell, the sqlContext would fail to initialize with:
> {code}
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> com/google/common/base/Predicate when creating Hive client using classpath: 
> 
> Please make sure that jars for your version of hive and hadoop are included 
> in the paths passed to SQLConfEntry(key = spark.sql.hive.metastore.jars, 
> defaultValue=builtin, doc=...
> {code}
> We verified the Guava libraries are in the huge list of the included jars, 
> but we saw that in the 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method it 
> seems to assume that *all* "com.google" (excluding "com.google.cloud") 
> classes should be loaded from the base class loader. The Spark libraries seem 
> to have *some* "com.google.common.base" classes shaded in but not all.
> See 
> [https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAB51Vx4ipV34e=eishlg7bzldm0uefd_mpyqfe4dodbnbv9...@mail.gmail.com%3E]
>  and its replies.
> The work-around is to add the guava JAR to the "spark.driver.extraClassPath" 
> property.






[jira] [Reopened] (SPARK-11702) Guava ClassLoading Issue When Using Different Hive Metastore Version

2015-11-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-11702:


> Guava ClassLoading Issue When Using Different Hive Metastore Version
> 
>
> Key: SPARK-11702
> URL: https://issues.apache.org/jira/browse/SPARK-11702
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Joey Paskhay
>
> A Guava classloading error can occur when using a different version of the 
> Hive metastore.
> Running the latest version of Spark at this time (1.5.1) and patched versions 
> of Hadoop 2.2.0 and Hive 1.0.0. We set "spark.sql.hive.metastore.version" to 
> "1.0.0" and "spark.sql.hive.metastore.jars" to 
> "/lib/*:". When trying to 
> launch the spark-shell, the sqlContext would fail to initialize with:
> {code}
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> com/google/common/base/Predicate when creating Hive client using classpath: 
> 
> Please make sure that jars for your version of hive and hadoop are included 
> in the paths passed to SQLConfEntry(key = spark.sql.hive.metastore.jars, 
> defaultValue=builtin, doc=...
> {code}
> We verified the Guava libraries are in the huge list of the included jars, 
> but we saw that in the 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method it 
> seems to assume that *all* "com.google" (excluding "com.google.cloud") 
> classes should be loaded from the base class loader. The Spark libraries seem 
> to have *some* "com.google.common.base" classes shaded in but not all.
> See 
> [https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAB51Vx4ipV34e=eishlg7bzldm0uefd_mpyqfe4dodbnbv9...@mail.gmail.com%3E]
>  and its replies.
> The work-around is to add the guava JAR to the "spark.driver.extraClassPath" 
> property.






[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-11-13 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005023#comment-15005023
 ] 

Cheng Lian commented on SPARK-11153:


Good question. We tried, see [PR 
#9225|https://github.com/apache/spark/pull/9225], but in the end decided not to 
do so. There are several reasons:

# Parquet-mr 1.8.1 is not in very good shape, and we don't have much time left 
to test it before the 1.6 release. The two major issues we found are:
#- We observed a performance regression for full Parquet table scans, but the 
reason is still unknown.
#- Parquet-mr 1.8.1 introduced PARQUET-363, which brings a performance 
regression for queries like {{SELECT COUNT(1) FROM t}}. (This issue can be 
hacked and worked around though.)
# Parquet-mr 1.8.1 hasn't been widely deployed yet (e.g., Hive 1.2.1 is still 
using 1.6.0), which means most Parquet files out there still suffer from the 
corrupted-statistics issue. Thus, using parquet-mr 1.7.0 in Spark 1.6 while 
disabling filter push-down for string/binary columns doesn't have too much 
negative impact.
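
In the meantime, a user who hits the corrupted-statistics issue can turn 
Parquet filter push-down off entirely at the session level; a minimal sketch 
(the SPARK-11153 change itself is narrower, covering only string/binary 
columns, and the path/filter below are placeholders):

{code}
// Session-level switch a user can flip; the SPARK-11153 fix itself only skips
// push-down for string/binary columns.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
sqlContext.read.parquet("/path/to/table").filter("name = 'foo'").show()  // placeholder path/filter
{code}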

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.5.2, 1.6.0
>
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet file could be produced by any 
> Parquet data model.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.






[jira] [Comment Edited] (SPARK-10863) Method coltypes() to return the R column types of a DataFrame

2015-11-13 Thread Oscar D. Lara Yejas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005019#comment-15005019
 ] 

Oscar D. Lara Yejas edited comment on SPARK-10863 at 11/14/15 1:19 AM:
---

[~felixcheung] I think a solution to all three issues would be to implement 
wrapper classes for complex types. For example, for StructType, we could have 
something like the small prototype I implemented below (still very raw, but 
just to give you an idea). I'd also need to implement class Row accordingly to 
handle the values.

I could do something similar for MapType, and I believe a list/vector should 
suffice for ArrayType.

Thoughts?

{code:title=Struct.R|borderStyle=solid}
# You can actually just copy and paste the code below on R to run it
setClass("StructField",
 representation(
   name = "character",
   type = "character"
))

# A Struct is a set of StructField objects, modeled as an environment
setClass("Struct",
 representation(
   struct = "environment"
))

# Initialize a Struct from a list of StructField objects
setMethod("initialize", signature = "Struct", definition=
function(.Object, fields) {
  lapply(fields, function(field) {
.Object@struct[[field@name]] <- field
  })
  return(.Object)
})

# Overwrite [[ operator to access the environment directly
setGeneric("[[")
setMethod("[[", signature="Struct", definition=
function(x, i) {
  return(x@struct[[i]])
})

# Overwrite [[<- operator to access the environment directly
setGeneric("[[<-")
setMethod("[[<-", signature="Struct", definition=
function(x, i, value) {
  if (class(value) == "StructField") {
x@struct[[i]] <- value
  }
  return(x)
})

field1 <- new("StructField", name="x", type="numeric")
field2 <- new("StructField", name="y", type="character")
s <- new("Struct", fields=list(field1, field2))
s[["x"]]
s[["z"]] <- new("StructField", name="z", type="logical")

{code}


was (Author: olarayej):
[~felixcheung] I think a solution to all three issues would be to implement 
wrapper classes for complex types. For example, for StructType, we could have 
something like the small prototype I implemented below (still very raw, but 
just to give you an idea). I'd also need to implement class Row accordingly to 
handle the values.

I could do something similar for MapType, and I believe a list/vector should 
suffice for ArrayType.

Thoughts?

# You can actually just copy and paste the code below on R to run it
setClass("StructField",
 representation(
   name = "character",
   type = "character"
))

# A Struct is a set of StructField objects, modeled as an environment
setClass("Struct",
 representation(
   struct = "environment"
))

# Initialize a Struct from a list of StructField objects
setMethod("initialize", signature = "Struct", definition=
function(.Object, fields) {
  lapply(fields, function(field) {
.Object@struct[[field@name]] <- field
  })
  return(.Object)
})

# Overwrite [[ operator to access the environment directly
setGeneric("[[")
setMethod("[[", signature="Struct", definition=
function(x, i) {
  return(x@struct[[i]])
})

# Overwrite [[<- operator to access the environment directly
setGeneric("[[<-")
setMethod("[[<-", signature="Struct", definition=
function(x, i, value) {
  if (class(value) == "StructField") {
x@struct[[i]] <- value
  }
  return(x)
})

field1 <- new("StructField", name="x", type="numeric")
field2 <- new("StructField", name="y", type="character")
s <- new("Struct", fields=list(field1, field2))
s[["x"]]
s[["z"]] <- new("StructField", name="z", type="logical")

> Method coltypes() to return the R column types of a DataFrame
> -
>
> Key: SPARK-10863
> URL: https://issues.apache.org/jira/browse/SPARK-10863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-10863) Method coltypes() to return the R column types of a DataFrame

2015-11-13 Thread Oscar D. Lara Yejas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005019#comment-15005019
 ] 

Oscar D. Lara Yejas commented on SPARK-10863:
-

[~felixcheung] I think a solution to all three issues would be to implement 
wrapper classes for complex types. For example, for StructType, we could have 
something like the small prototype I implemented below (still very raw, but 
just to give you an idea). I'd also need to implement class Row accordingly to 
handle the values.

I could do something similar for MapType, and I believe a list/vector should 
suffice for ArrayType.

Thoughts?

# You can actually just copy and paste the code below on R to run it
setClass("StructField",
 representation(
   name = "character",
   type = "character"
))

# A Struct is a set of StructField objects, modeled as an environment
setClass("Struct",
 representation(
   struct = "environment"
))

# Initialize a Struct from a list of StructField objects
setMethod("initialize", signature = "Struct", definition=
function(.Object, fields) {
  lapply(fields, function(field) {
.Object@struct[[field@name]] <- field
  })
  return(.Object)
})

# Overwrite [[ operator to access the environment directly
setGeneric("[[")
setMethod("[[", signature="Struct", definition=
function(x, i) {
  return(x@struct[[i]])
})

# Overwrite [[<- operator to access the environment directly
setGeneric("[[<-")
setMethod("[[<-", signature="Struct", definition=
function(x, i, value) {
  if (class(value) == "StructField") {
x@struct[[i]] <- value
  }
  return(x)
})

field1 <- new("StructField", name="x", type="numeric")
field2 <- new("StructField", name="y", type="character")
s <- new("Struct", fields=list(field1, field2))
s[["x"]]
s[["z"]] <- new("StructField", name="z", type="logical")

> Method coltypes() to return the R column types of a DataFrame
> -
>
> Key: SPARK-10863
> URL: https://issues.apache.org/jira/browse/SPARK-10863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-11-13 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005005#comment-15005005
 ] 

Cheng Lian commented on SPARK-11153:


Yes.

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.5.2, 1.6.0
>
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet file could be produced by any 
> Parquet data model.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.






[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-11-13 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005004#comment-15005004
 ] 

Cheng Lian commented on SPARK-11153:


Yes.

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.5.2, 1.6.0
>
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet file could be produced by any 
> Parquet data model.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.






[jira] [Issue Comment Deleted] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-11-13 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11153:
---
Comment: was deleted

(was: Yes.)

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.5.2, 1.6.0
>
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet file could be produced by any 
> Parquet data model.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.






[jira] [Commented] (SPARK-11741) Process doctests using TextTestRunner/XMLTestRunner

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004986#comment-15004986
 ] 

Apache Spark commented on SPARK-11741:
--

User 'gliptak' has created a pull request for this issue:
https://github.com/apache/spark/pull/9710

> Process doctests using TextTestRunner/XMLTestRunner
> ---
>
> Key: SPARK-11741
> URL: https://issues.apache.org/jira/browse/SPARK-11741
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Gabor Liptak
>Priority: Minor
>







[jira] [Assigned] (SPARK-11741) Process doctests using TextTestRunner/XMLTestRunner

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11741:


Assignee: (was: Apache Spark)

> Process doctests using TextTestRunner/XMLTestRunner
> ---
>
> Key: SPARK-11741
> URL: https://issues.apache.org/jira/browse/SPARK-11741
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Gabor Liptak
>Priority: Minor
>







[jira] [Assigned] (SPARK-11741) Process doctests using TextTestRunner/XMLTestRunner

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11741:


Assignee: Apache Spark

> Process doctests using TextTestRunner/XMLTestRunner
> ---
>
> Key: SPARK-11741
> URL: https://issues.apache.org/jira/browse/SPARK-11741
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Gabor Liptak
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Created] (SPARK-11741) Process doctests using TextTestRunner/XMLTestRunner

2015-11-13 Thread Gabor Liptak (JIRA)
Gabor Liptak created SPARK-11741:


 Summary: Process doctests using TextTestRunner/XMLTestRunner
 Key: SPARK-11741
 URL: https://issues.apache.org/jira/browse/SPARK-11741
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Gabor Liptak
Priority: Minor









[jira] [Commented] (SPARK-11648) IllegalReferenceCountException in Spark workloads

2015-11-13 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004976#comment-15004976
 ] 

Nishkam Ravi commented on SPARK-11648:
--

The updated patch seems to have resolved this issue. Closing this JIRA.

> IllegalReferenceCountException in Spark workloads
> -
>
> Key: SPARK-11648
> URL: https://issues.apache.org/jira/browse/SPARK-11648
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Nishkam Ravi
>
> This exception is thrown for multiple workloads. Can be reproduced with 
> WordCount/PageRank/TeraSort.
> -
> Stack trace:
> 15/11/10 01:11:31 WARN TaskSetManager: Lost task 6.0 in stage 1.0 (TID 459, 
> 10.20.78.15): io.netty.util.IllegalReferenceCountException: refCnt: 0
>   at 
> io.netty.buffer.AbstractByteBuf.ensureAccessible(AbstractByteBuf.java:1178)
>   at io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1129)
>   at io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:180)
>   at io.netty.buffer.CompositeByteBuf.getBytes(CompositeByteBuf.java:687)
>   at io.netty.buffer.CompositeByteBuf.getBytes(CompositeByteBuf.java:42)
>   at io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:181)
>   at io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:677)
>   at io.netty.buffer.ByteBufInputStream.read(ByteBufInputStream.java:120)
>   at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:360)
>   at com.ning.compress.lzf.ChunkDecoder.readHeader(ChunkDecoder.java:213)
>   at 
> com.ning.compress.lzf.impl.UnsafeChunkDecoder.decodeChunk(UnsafeChunkDecoder.java:49)
>   at 
> com.ning.compress.lzf.LZFInputStream.readyBuffer(LZFInputStream.java:363)
>   at com.ning.compress.lzf.LZFInputStream.read(LZFInputStream.java:193)
>   at 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
>   at 
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
>   at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
>   at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:64)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:64)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:123)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:64)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:60)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
>   at 
> org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)






[jira] [Resolved] (SPARK-11648) IllegalReferenceCountException in Spark workloads

2015-11-13 Thread Nishkam Ravi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishkam Ravi resolved SPARK-11648.
--
Resolution: Fixed

> IllegalReferenceCountException in Spark workloads
> -
>
> Key: SPARK-11648
> URL: https://issues.apache.org/jira/browse/SPARK-11648
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Nishkam Ravi
>
> This exception is thrown for multiple workloads. Can be reproduced with 
> WordCount/PageRank/TeraSort.
> -
> Stack trace:
> 15/11/10 01:11:31 WARN TaskSetManager: Lost task 6.0 in stage 1.0 (TID 459, 
> 10.20.78.15): io.netty.util.IllegalReferenceCountException: refCnt: 0
>   at 
> io.netty.buffer.AbstractByteBuf.ensureAccessible(AbstractByteBuf.java:1178)
>   at io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1129)
>   at io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:180)
>   at io.netty.buffer.CompositeByteBuf.getBytes(CompositeByteBuf.java:687)
>   at io.netty.buffer.CompositeByteBuf.getBytes(CompositeByteBuf.java:42)
>   at io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:181)
>   at io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:677)
>   at io.netty.buffer.ByteBufInputStream.read(ByteBufInputStream.java:120)
>   at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:360)
>   at com.ning.compress.lzf.ChunkDecoder.readHeader(ChunkDecoder.java:213)
>   at 
> com.ning.compress.lzf.impl.UnsafeChunkDecoder.decodeChunk(UnsafeChunkDecoder.java:49)
>   at 
> com.ning.compress.lzf.LZFInputStream.readyBuffer(LZFInputStream.java:363)
>   at com.ning.compress.lzf.LZFInputStream.read(LZFInputStream.java:193)
>   at 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
>   at 
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
>   at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
>   at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:64)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:64)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:123)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:64)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:60)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
>   at 
> org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-11-13 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004944#comment-15004944
 ] 

Mark Hamstra commented on SPARK-11153:
--

Is there a reason why parquet.version hasn't been pushed up in Spark 1.6 and 
this issue actually fixed instead of just disabling filter push-down for 
strings and binaries?

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.5.2, 1.6.0
>
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet file could be produced by any 
> Parquet data model.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.
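Until the parquet-mr upgrade lands, users who still hit wrong results can also turn the optimization off themselves; a minimal sketch, assuming the standard Spark SQL configuration key:

{code}
// Sketch: disable Parquet filter push-down for this SQLContext
// (spark.sql.parquet.filterPushdown is the SQL conf key that gates it).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

// or equivalently at submit time:
//   spark-submit --conf spark.sql.parquet.filterPushdown=false ...
{code}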



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11583) Make MapStatus use less memory usage

2015-11-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004924#comment-15004924
 ] 

Reynold Xin commented on SPARK-11583:
-

I will get somebody to take a look at this and recommend a solution for 1.6. 
One possibility is just to revert the old patch.


> Make MapStatus use less memory usage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a Spark job that contains quite a lot of tasks, 20% seems like a drop in 
> the ocean. Essentially, BitSet uses long[]; for example a BitSet[200k] = long[3125].
> So we can use a HashSet[Int] to store reduceIds instead (when non-empty blocks 
> are dense, use the reduceIds of empty blocks; when sparse, use the non-empty ones).
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks.
> For sparse cases: if HashSet[Int](numEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks.
> Sparse case, 299/300 are empty:
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> Dense case, no block is empty:
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)
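A rough sketch of the selection idea described above; the helper and its cost model are illustrative only, not the actual MapStatus implementation:

{code}
// Illustrative only: store whichever id set (empty or non-empty reduceIds) is
// smaller than a BitSet over all blocks. Real sizes would be compared in bytes;
// simple counts are used here to keep the sketch short.
def chooseTracking(totalBlocks: Int, numNonEmptyBlocks: Int): String = {
  val numEmptyBlocks = totalBlocks - numNonEmptyBlocks
  val bitSetWords = (totalBlocks + 63) / 64   // a BitSet costs roughly totalBlocks / 64 longs
  if (numNonEmptyBlocks < bitSetWords) "store non-empty reduceIds in a HashSet[Int]"
  else if (numEmptyBlocks < bitSetWords) "store empty reduceIds in a HashSet[Int]"
  else "keep a BitSet over all reduceIds"
}

chooseTracking(3000, 3)     // very sparse: track the few non-empty blocks
chooseTracking(3000, 3000)  // fully dense: track the (zero) empty blocks
{code}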



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11569) StringIndexer transform fails when column contains nulls

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11569:


Assignee: Apache Spark

> StringIndexer transform fails when column contains nulls
> 
>
> Key: SPARK-11569
> URL: https://issues.apache.org/jira/browse/SPARK-11569
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>
> Transforming column containing {{null}} values using {{StringIndexer}} 
> results in {{java.lang.NullPointerException}}
> {code}
> from pyspark.ml.feature import StringIndexer
> df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
> df.printSchema()
> ## root
> ##  |-- k: string (nullable = true)
> ##  |-- v: long (nullable = true)
> indexer = StringIndexer(inputCol="k", outputCol="kIdx")
> indexer.fit(df).transform(df)
> ## ) failed: 
> py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
> ## : java.lang.NullPointerException
> {code}
> Problem disappears when we drop 
> {code}
> df1 = df.na.drop()
> indexer.fit(df1).transform(df1)
> {code}
> or replace {{nulls}}
> {code}
> from pyspark.sql.functions import col, when
> k = col("k")
> df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
> indexer.fit(df2).transform(df2)
> {code}
> and cannot be reproduced using Scala API
> {code}
> import org.apache.spark.ml.feature.StringIndexer
> val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
> df.printSchema
> // root
> //  |-- k: string (nullable = true)
> //  |-- v: integer (nullable = false)
> val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
> indexer.fit(df).transform(df).count
> // 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11569) StringIndexer transform fails when column contains nulls

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11569:


Assignee: (was: Apache Spark)

> StringIndexer transform fails when column contains nulls
> 
>
> Key: SPARK-11569
> URL: https://issues.apache.org/jira/browse/SPARK-11569
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>
> Transforming column containing {{null}} values using {{StringIndexer}} 
> results in {{java.lang.NullPointerException}}
> {code}
> from pyspark.ml.feature import StringIndexer
> df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
> df.printSchema()
> ## root
> ##  |-- k: string (nullable = true)
> ##  |-- v: long (nullable = true)
> indexer = StringIndexer(inputCol="k", outputCol="kIdx")
> indexer.fit(df).transform(df)
> ## ) failed: 
> py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
> ## : java.lang.NullPointerException
> {code}
> Problem disappears when we drop 
> {code}
> df1 = df.na.drop()
> indexer.fit(df1).transform(df1)
> {code}
> or replace {{nulls}}
> {code}
> from pyspark.sql.functions import col, when
> k = col("k")
> df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
> indexer.fit(df2).transform(df2)
> {code}
> and cannot be reproduced using Scala API
> {code}
> import org.apache.spark.ml.feature.StringIndexer
> val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
> df.printSchema
> // root
> //  |-- k: string (nullable = true)
> //  |-- v: integer (nullable = false)
> val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
> indexer.fit(df).transform(df).count
> // 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11569) StringIndexer transform fails when column contains nulls

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004912#comment-15004912
 ] 

Apache Spark commented on SPARK-11569:
--

User 'jliwork' has created a pull request for this issue:
https://github.com/apache/spark/pull/9709

> StringIndexer transform fails when column contains nulls
> 
>
> Key: SPARK-11569
> URL: https://issues.apache.org/jira/browse/SPARK-11569
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>
> Transforming column containing {{null}} values using {{StringIndexer}} 
> results in {{java.lang.NullPointerException}}
> {code}
> from pyspark.ml.feature import StringIndexer
> df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
> df.printSchema()
> ## root
> ##  |-- k: string (nullable = true)
> ##  |-- v: long (nullable = true)
> indexer = StringIndexer(inputCol="k", outputCol="kIdx")
> indexer.fit(df).transform(df)
> ## ) failed: 
> py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
> ## : java.lang.NullPointerException
> {code}
> Problem disappears when we drop 
> {code}
> df1 = df.na.drop()
> indexer.fit(df1).transform(df1)
> {code}
> or replace {{nulls}}
> {code}
> from pyspark.sql.functions import col, when
> k = col("k")
> df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
> indexer.fit(df2).transform(df2)
> {code}
> and cannot be reproduced using Scala API
> {code}
> import org.apache.spark.ml.feature.StringIndexer
> val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
> df.printSchema
> // root
> //  |-- k: string (nullable = true)
> //  |-- v: integer (nullable = false)
> val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
> indexer.fit(df).transform(df).count
> // 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8582:
-
Target Version/s: 1.7.0  (was: 1.6.0)

> Optimize checkpointing to avoid computing an RDD twice
> --
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Shixiong Zhu
>
> In Spark, checkpointing allows the user to truncate the lineage of his RDD 
> and save the intermediate contents to HDFS for fault tolerance. However, this 
> is not currently implemented super efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during 
> the action that triggered the checkpointing in the first place, and once 
> while we checkpoint (we iterate through an RDD's partitions and write them to 
> disk). See this line for more detail: 
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint 
> data to HDFS while we run the action. This will speed up many usages of 
> `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but 
> this is not always viable for very large input data. It's also not a great 
> API to use in general.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8582) Optimize checkpointing to avoid computing an RDD twice

2015-11-13 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004874#comment-15004874
 ] 

Andrew Or commented on SPARK-8582:
--

Hi everyone, I have bumped this to 1.7.0 because of the potential performance 
regressions a fix could introduce. If you are affected by this and would like 
a solution sooner, you can work around it by calling `persist` before you call 
`checkpoint`. This ensures that the second computation of the RDD reads from 
the cache instead, which is much faster for many workloads.
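A minimal sketch of that workaround, assuming a toy RDD and a hypothetical HDFS checkpoint directory:

{code}
import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // hypothetical path

val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
rdd.persist(StorageLevel.MEMORY_AND_DISK)  // cache before checkpointing
rdd.checkpoint()                           // marked; materialized by the next action
rdd.count()                                // the checkpoint's recomputation now hits the cache
rdd.unpersist()                            // optional, once the checkpoint has been written
{code}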

> Optimize checkpointing to avoid computing an RDD twice
> --
>
> Key: SPARK-8582
> URL: https://issues.apache.org/jira/browse/SPARK-8582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Shixiong Zhu
>
> In Spark, checkpointing allows the user to truncate the lineage of his RDD 
> and save the intermediate contents to HDFS for fault tolerance. However, this 
> is not currently implemented super efficiently:
> Every time we checkpoint an RDD, we actually compute it twice: once during 
> the action that triggered the checkpointing in the first place, and once 
> while we checkpoint (we iterate through an RDD's partitions and write them to 
> disk). See this line for more detail: 
> https://github.com/apache/spark/blob/0401cbaa8ee51c71f43604f338b65022a479da0a/core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala#L102.
> Instead, we should have a `CheckpointingIterator` that writes checkpoint 
> data to HDFS while we run the action. This will speed up many usages of 
> `RDD#checkpoint` by 2X.
> (Alternatively, the user can just cache the RDD before checkpointing it, but 
> this is not always viable for very large input data. It's also not a great 
> API to use in general.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11648) IllegalReferenceCountException in Spark workloads

2015-11-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004866#comment-15004866
 ] 

Marcelo Vanzin commented on SPARK-11648:


I'm pretty sure the fix for SPARK-11617 will also fix this; there's an updated 
patch since your last comment.

> IllegalReferenceCountException in Spark workloads
> -
>
> Key: SPARK-11648
> URL: https://issues.apache.org/jira/browse/SPARK-11648
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Nishkam Ravi
>
> This exception is thrown for multiple workloads. Can be reproduced with 
> WordCount/PageRank/TeraSort.
> -
> Stack trace:
> 15/11/10 01:11:31 WARN TaskSetManager: Lost task 6.0 in stage 1.0 (TID 459, 
> 10.20.78.15): io.netty.util.IllegalReferenceCountException: refCnt: 0
>   at 
> io.netty.buffer.AbstractByteBuf.ensureAccessible(AbstractByteBuf.java:1178)
>   at io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1129)
>   at io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:180)
>   at io.netty.buffer.CompositeByteBuf.getBytes(CompositeByteBuf.java:687)
>   at io.netty.buffer.CompositeByteBuf.getBytes(CompositeByteBuf.java:42)
>   at io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:181)
>   at io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:677)
>   at io.netty.buffer.ByteBufInputStream.read(ByteBufInputStream.java:120)
>   at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:360)
>   at com.ning.compress.lzf.ChunkDecoder.readHeader(ChunkDecoder.java:213)
>   at 
> com.ning.compress.lzf.impl.UnsafeChunkDecoder.decodeChunk(UnsafeChunkDecoder.java:49)
>   at 
> com.ning.compress.lzf.LZFInputStream.readyBuffer(LZFInputStream.java:363)
>   at com.ning.compress.lzf.LZFInputStream.read(LZFInputStream.java:193)
>   at 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
>   at 
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
>   at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
>   at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:64)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:64)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:123)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:64)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:60)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
>   at 
> org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11740) Fix DStream checkpointing logic to prevent failures during checkpoint recovery

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004849#comment-15004849
 ] 

Apache Spark commented on SPARK-11740:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/9707

> Fix DStream checkpointing logic to prevent failures during checkpoint recovery
> --
>
> Key: SPARK-11740
> URL: https://issues.apache.org/jira/browse/SPARK-11740
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>
> We checkpoint both when generating a batch and when completing a batch. When 
> the processing time of a batch is greater than the batch interval, the 
> checkpoint for completing an old batch may run after the checkpoint of a 
> new batch. If this happens, the checkpoint of the old batch actually has the 
> latest information, but we won't recover from it. We may then see RDD 
> checkpoint file missing exceptions during checkpoint recovery. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11740) Fix DStream checkpointing logic to prevent failures during checkpoint recovery

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11740:


Assignee: Apache Spark

> Fix DStream checkpointing logic to prevent failures during checkpoint recovery
> --
>
> Key: SPARK-11740
> URL: https://issues.apache.org/jira/browse/SPARK-11740
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> We checkpoint both when generating a batch and when completing a batch. When 
> the processing time of a batch is greater than the batch interval, the 
> checkpoint for completing an old batch may run after the checkpoint of a 
> new batch. If this happens, the checkpoint of the old batch actually has the 
> latest information, but we won't recover from it. We may then see RDD 
> checkpoint file missing exceptions during checkpoint recovery. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11740) Fix DStream checkpointing logic to prevent failures during checkpoint recovery

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11740:


Assignee: (was: Apache Spark)

> Fix DStream checkpointing logic to prevent failures during checkpoint recovery
> --
>
> Key: SPARK-11740
> URL: https://issues.apache.org/jira/browse/SPARK-11740
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>
> We checkpoint both when generating a batch and when completing a batch. When 
> the processing time of a batch is greater than the batch interval, the 
> checkpoint for completing an old batch may run after the checkpoint of a 
> new batch. If this happens, the checkpoint of the old batch actually has the 
> latest information, but we won't recover from it. We may then see RDD 
> checkpoint file missing exceptions during checkpoint recovery. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11740) Fix DStream checkpointing logic to prevent failures during checkpoint recovery

2015-11-13 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-11740:


 Summary: Fix DStream checkpointing logic to prevent failures 
during checkpoint recovery
 Key: SPARK-11740
 URL: https://issues.apache.org/jira/browse/SPARK-11740
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu


We checkpoint both when generating a batch and when completing a batch. When the 
processing time of a batch is greater than the batch interval, the checkpoint 
for completing an old batch may run after the checkpoint of a new batch. If this 
happens, the checkpoint of the old batch actually has the latest information, but 
we won't recover from it. We may then see RDD checkpoint file missing exceptions 
during checkpoint recovery.
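For context, the recovery path where such a missing-file exception surfaces is the usual getOrCreate restart; a sketch with hypothetical paths and a stub DStream graph:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoints"   // hypothetical

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-recovery-sketch")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream graph here ...
  ssc
}

// On restart, this reads the most recent checkpoint from checkpointDir; if that
// checkpoint is older than the latest completed batch, recovery can fail with
// missing RDD checkpoint files.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}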



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7308) Should there be multiple concurrent attempts for one stage?

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7308:
-
Assignee: Davies Liu

> Should there be multiple concurrent attempts for one stage?
> ---
>
> Key: SPARK-7308
> URL: https://issues.apache.org/jira/browse/SPARK-7308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Imran Rashid
>Assignee: Davies Liu
> Fix For: 1.5.3, 1.6.0
>
> Attachments: SPARK-7308_discussion.pdf
>
>
> Currently, when there is a fetch failure, you can end up with multiple 
> concurrent attempts for the same stage.  Is this intended?  At best, it leads 
> to some very confusing behavior, and it makes it hard for the user to make 
> sense of what is going on.  At worst, I think this is the cause of some very 
> strange errors we've seen from users, where stages start 
> executing before all the dependent stages have completed.
> This can happen in the following scenario:  there is a fetch failure in 
> attempt 0, so the stage is retried.  attempt 1 starts.  But, tasks from 
> attempt 0 are still running -- some of them can also hit fetch failures after 
> attempt 1 starts.  That will cause additional stage attempts to get fired up.
> There is an attempt to handle this already 
> https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
> but that only checks whether the **stage** is running.  It really should 
> check whether that **attempt** is still running, but there isn't enough info 
> to do that.  
> I'll also post some info on how to reproduce this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11739) Dead SQLContext may not be cleared

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11739:


Assignee: Apache Spark  (was: Davies Liu)

> Dead SQLContext may not be cleared
> --
>
> Key: SPARK-11739
> URL: https://issues.apache.org/jira/browse/SPARK-11739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> The onApplicationEnd callback may not be called for a SQLContext, which will 
> then sit there forever.
> We should clear them in a more robust way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11739) Dead SQLContext may not be cleared

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11739:


Assignee: Davies Liu  (was: Apache Spark)

> Dead SQLContext may not be cleared
> --
>
> Key: SPARK-11739
> URL: https://issues.apache.org/jira/browse/SPARK-11739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> The onApplicationEnd callback may not be called for a SQLContext, which will 
> then sit there forever.
> We should clear them in a more robust way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11739) Dead SQLContext may not be cleared

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004807#comment-15004807
 ] 

Apache Spark commented on SPARK-11739:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9706

> Dead SQLContext may not be cleared
> --
>
> Key: SPARK-11739
> URL: https://issues.apache.org/jira/browse/SPARK-11739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> The onApplicationEnd callback may not be called for a SQLContext, which will 
> then sit there forever.
> We should clear them in a more robust way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11739) Dead SQLContext may not be cleared

2015-11-13 Thread Davies Liu (JIRA)
Davies Liu created SPARK-11739:
--

 Summary: Dead SQLContext may not be cleared
 Key: SPARK-11739
 URL: https://issues.apache.org/jira/browse/SPARK-11739
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Davies Liu
Assignee: Davies Liu


The onApplicationEnd callback may not be called for a SQLContext, which will then 
sit there forever.

We should clear them in a more robust way.
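For reference, the callback in question is the SparkListener application-end hook; an illustrative sketch of that mechanism (not the actual SQLContext bookkeeping) is:

{code}
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Illustrative only: something holding a context reference that is cleared on
// application end, the event that, per the report, may never be delivered.
val activeContext = new AtomicReference[AnyRef](sqlContext)

sc.addSparkListener(new SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    activeContext.set(null)
  }
})
{code}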



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7308) Should there be multiple concurrent attempts for one stage?

2015-11-13 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004798#comment-15004798
 ] 

Imran Rashid commented on SPARK-7308:
-

yeah I think you are right, I just marked it as fixed.

> Should there be multiple concurrent attempts for one stage?
> ---
>
> Key: SPARK-7308
> URL: https://issues.apache.org/jira/browse/SPARK-7308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Imran Rashid
> Fix For: 1.5.3, 1.6.0
>
> Attachments: SPARK-7308_discussion.pdf
>
>
> Currently, when there is a fetch failure, you can end up with multiple 
> concurrent attempts for the same stage.  Is this intended?  At best, it leads 
> to some very confusing behavior, and it makes it hard for the user to make 
> sense of what is going on.  At worst, I think this is the cause of some very 
> strange errors we've seen from users, where stages start 
> executing before all the dependent stages have completed.
> This can happen in the following scenario:  there is a fetch failure in 
> attempt 0, so the stage is retried.  attempt 1 starts.  But, tasks from 
> attempt 0 are still running -- some of them can also hit fetch failures after 
> attempt 1 starts.  That will cause additional stage attempts to get fired up.
> There is an attempt to handle this already 
> https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
> but that only checks whether the **stage** is running.  It really should 
> check whether that **attempt** is still running, but there isn't enough info 
> to do that.  
> I'll also post some info on how to reproduce this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7308) Should there be multiple concurrent attempts for one stage?

2015-11-13 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-7308.
-
   Resolution: Fixed
 Assignee: (was: Imran Rashid)
Fix Version/s: 1.6.0
   1.5.3

> Should there be multiple concurrent attempts for one stage?
> ---
>
> Key: SPARK-7308
> URL: https://issues.apache.org/jira/browse/SPARK-7308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Imran Rashid
> Fix For: 1.5.3, 1.6.0
>
> Attachments: SPARK-7308_discussion.pdf
>
>
> Currently, when there is a fetch failure, you can end up with multiple 
> concurrent attempts for the same stage.  Is this intended?  At best, it leads 
> to some very confusing behavior, and it makes it hard for the user to make 
> sense of what is going on.  At worst, I think this is the cause of some very 
> strange errors we've seen from users, where stages start 
> executing before all the dependent stages have completed.
> This can happen in the following scenario:  there is a fetch failure in 
> attempt 0, so the stage is retried.  attempt 1 starts.  But, tasks from 
> attempt 0 are still running -- some of them can also hit fetch failures after 
> attempt 1 starts.  That will cause additional stage attempts to get fired up.
> There is an attempt to handle this already 
> https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
> but that only checks whether the **stage** is running.  It really should 
> check whether that **attempt** is still running, but there isn't enough info 
> to do that.  
> I'll also post some info on how to reproduce this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11720:


Assignee: Apache Spark

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>Assignee: Apache Spark
>Priority: Minor
>
> Change the default behavior of mean when count = 0 from null to 
> Double.NaN, to make it consistent with all other univariate stats functions. 
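The intended semantics in miniature (plain Scala, not the actual SQL aggregate code):

{code}
// Sketch of the proposed behavior: an empty input yields NaN rather than null,
// matching the other univariate statistics.
def mean(values: Seq[Double]): Double =
  if (values.isEmpty) Double.NaN else values.sum / values.size

mean(Seq(1.0, 2.0, 3.0))   // 2.0
mean(Seq.empty[Double])    // NaN (the SQL aggregate currently returns null here)
{code}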



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004791#comment-15004791
 ] 

Apache Spark commented on SPARK-11720:
--

User 'JihongMA' has created a pull request for this issue:
https://github.com/apache/spark/pull/9705

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>Priority: Minor
>
> Change the default behavior of mean when count = 0 from null to 
> Double.NaN, to make it consistent with all other univariate stats functions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11720:


Assignee: (was: Apache Spark)

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>Priority: Minor
>
> Change the default behavior of mean when count = 0 from null to 
> Double.NaN, to make it consistent with all other univariate stats functions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7308) Should there be multiple concurrent attempts for one stage?

2015-11-13 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004783#comment-15004783
 ] 

Andrew Or commented on SPARK-7308:
--

Should this still be open given that all associated JIRAs are closed? I think 
we've already established that there's no bullet-proof way to do this on the 
scheduler side so we need to make the write side robust.

> Should there be multiple concurrent attempts for one stage?
> ---
>
> Key: SPARK-7308
> URL: https://issues.apache.org/jira/browse/SPARK-7308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Attachments: SPARK-7308_discussion.pdf
>
>
> Currently, when there is a fetch failure, you can end up with multiple 
> concurrent attempts for the same stage.  Is this intended?  At best, it leads 
> to some very confusing behavior, and it makes it hard for the user to make 
> sense of what is going on.  At worst, I think this is the cause of some very 
> strange errors we've seen from users, where stages start 
> executing before all the dependent stages have completed.
> This can happen in the following scenario:  there is a fetch failure in 
> attempt 0, so the stage is retried.  attempt 1 starts.  But, tasks from 
> attempt 0 are still running -- some of them can also hit fetch failures after 
> attempt 1 starts.  That will cause additional stage attempts to get fired up.
> There is an attempt to handle this already 
> https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
> but that only checks whether the **stage** is running.  It really should 
> check whether that **attempt** is still running, but there isn't enough info 
> to do that.  
> I'll also post some info on how to reproduce this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8029:
-
Description: 
When stages get retried, a task may have more than one attempt running at the 
same time, on the same executor.  Currently this causes problems for 
ShuffleMapTasks, since all attempts try to write to the same output files.

This is resolved through 

  was:When stages get retried, a task may have more than one attempt running at 
the same time, on the same executor.  Currently this causes problems for 
ShuffleMapTasks, since all attempts try to write to the same output files.


> ShuffleMapTasks must be robust to concurrent attempts on the same executor
> --
>
> Key: SPARK-8029
> URL: https://issues.apache.org/jira/browse/SPARK-8029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Imran Rashid
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
> Attachments: 
> AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf
>
>
> When stages get retried, a task may have more than one attempt running at the 
> same time, on the same executor.  Currently this causes problems for 
> ShuffleMapTasks, since all attempts try to write to the same output files.
> This is resolved through 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8029:
-
Description: 
When stages get retried, a task may have more than one attempt running at the 
same time, on the same executor.  Currently this causes problems for 
ShuffleMapTasks, since all attempts try to write to the same output files.

This is finally resolved through https://github.com/apache/spark/pull/9610, 
which uses a first-writer-wins approach.

  was:
When stages get retried, a task may have more than one attempt running at the 
same time, on the same executor.  Currently this causes problems for 
ShuffleMapTasks, since all attempts try to write to the same output files.

This is resolved through 


> ShuffleMapTasks must be robust to concurrent attempts on the same executor
> --
>
> Key: SPARK-8029
> URL: https://issues.apache.org/jira/browse/SPARK-8029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Imran Rashid
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
> Attachments: 
> AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf
>
>
> When stages get retried, a task may have more than one attempt running at the 
> same time, on the same executor.  Currently this causes problems for 
> ShuffleMapTasks, since all attempts try to write to the same output files.
> This is finally resolved through https://github.com/apache/spark/pull/9610, 
> which uses a first-writer-wins approach.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7829) SortShuffleWriter writes inconsistent data & index files on stage retry

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-7829.
--
  Resolution: Fixed
Assignee: Davies Liu  (was: Imran Rashid)
   Fix Version/s: 1.6.0
  1.5.3
Target Version/s: 1.5.3, 1.6.0

> SortShuffleWriter writes inconsistent data & index files on stage retry
> ---
>
> Key: SPARK-7829
> URL: https://issues.apache.org/jira/browse/SPARK-7829
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.3.1
>Reporter: Imran Rashid
>Assignee: Davies Liu
> Fix For: 1.5.3, 1.6.0
>
>
> When a stage is retried, even if a shuffle map task was successful, it may 
> be retried anyway.  If it happens to get scheduled on the same 
> executor, the old data file is *appended to*, while the index file still assumes 
> the data starts at position 0.  This leads to an apparently corrupt shuffle 
> map output, since when the data file is read, the index file points to the 
> wrong location.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7829) SortShuffleWriter writes inconsistent data & index files on stage retry

2015-11-13 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004780#comment-15004780
 ] 

Andrew Or commented on SPARK-7829:
--

I believe this is now fixed due to https://github.com/apache/spark/pull/9610. 
Let me know if this is not the case.

> SortShuffleWriter writes inconsistent data & index files on stage retry
> ---
>
> Key: SPARK-7829
> URL: https://issues.apache.org/jira/browse/SPARK-7829
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.3.1
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Fix For: 1.5.3, 1.6.0
>
>
> When a stage is retried, even if a shuffle map task was successful, it may 
> be retried anyway.  If it happens to get scheduled on the same 
> executor, the old data file is *appended to*, while the index file still assumes 
> the data starts at position 0.  This leads to an apparently corrupt shuffle 
> map output, since when the data file is read, the index file points to the 
> wrong location.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11738) Make array orderable

2015-11-13 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11738:


 Summary: Make array orderable
 Key: SPARK-11738
 URL: https://issues.apache.org/jira/browse/SPARK-11738
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11705) Eliminate unnecessary Cartesian Join

2015-11-13 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004744#comment-15004744
 ] 

Zhan Zhang commented on SPARK-11705:


Simple steps to reproduce:

{code}
import sqlContext.implicits._
case class SimpleRecord(key: Int, value: String)
def withDF(name: String) = {
  val df = sc.parallelize((0 until 10).map(x => SimpleRecord(x, s"record_$x"))).toDF()
  df.registerTempTable(name)
}
withDF("p")
withDF("s")
withDF("l")

val d = sqlContext.sql("select p.key, p.value, s.value, l.value from p, s, l where l.key = s.key and p.key = l.key")
d.queryExecution.sparkPlan

res15: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [key#0,value#1,value#3,value#5]
 SortMergeJoin [key#2,key#0], [key#4,key#4]
  CartesianProduct
   Scan PhysicalRDD[key#0,value#1]
   Scan PhysicalRDD[key#2,value#3]
  Scan PhysicalRDD[key#4,value#5]

val d1 = sqlContext.sql("select p.key, p.value, s.value, l.value from s, l, p where l.key = s.key and p.key = l.key")
d1.queryExecution.sparkPlan

res16: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [key#0,value#1,value#3,value#5]
 SortMergeJoin [key#4], [key#0]
  TungstenProject [key#4,value#5,value#3]
   SortMergeJoin [key#2], [key#4]
    Scan PhysicalRDD[key#2,value#3]
    Scan PhysicalRDD[key#4,value#5]
  Scan PhysicalRDD[key#0,value#1]
{code}

> Eliminate unnecessary Cartesian Join
> 
>
> Key: SPARK-11705
> URL: https://issues.apache.org/jira/browse/SPARK-11705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Zhan Zhang
>
> When we have some queries similar to the following (I don't remember the exact 
> form):
> select * from a, b, c, d where a.key1 = c.key1 and b.key2 = c.key2 and c.key3 
> = d.key3
> There will be a cartesian join between a and b. But if we simply change the 
> table order, for example to a, c, b, d, the cartesian join is eliminated (see 
> the sketch below).
> Without such manual tuning, the query will never finish if a and b are big, 
> and we should not rely on such manual optimization.
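The sketch referenced above, using the query from the description; only the FROM order changes:

{code}
// Table order a, b, c, d: a and b share no join key, so the planner currently
// inserts a CartesianProduct between them.
val slow = sqlContext.sql(
  "select * from a, b, c, d where a.key1 = c.key1 and b.key2 = c.key2 and c.key3 = d.key3")

// Reordered to a, c, b, d: every adjacent pair is connected by a join key, so
// the same query plans as a chain of joins with no cartesian product.
val fast = sqlContext.sql(
  "select * from a, c, b, d where a.key1 = c.key1 and b.key2 = c.key2 and c.key3 = d.key3")

fast.queryExecution.sparkPlan   // should contain no CartesianProduct
{code}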



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11336) Include a link to the source file in generated example code

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11336.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9320
[https://github.com/apache/spark/pull/9320]

> Include a link to the source file in generated example code
> ---
>
> Key: SPARK-11336
> URL: https://issues.apache.org/jira/browse/SPARK-11336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
> Fix For: 1.6.0
>
>
> It would be nice to include a link to the example source file at the bottom 
> of each code example, so that users who want to try it know where to find it. 
> The font size should be small and unobtrusive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11336) Include path to the source file in generated example code

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11336:
--
Description: It would be nice to include -a link- the path to the example 
source file at the bottom of each code example. So if users want to try them, 
they know where to find. The font size should be small and not interrupting.  
(was: It would be nice to include a link to the example source file at the 
bottom of each code example. So if users want to try them, they know where to 
find. The font size should be small and not interrupting.)

> Include path to the source file in generated example code
> -
>
> Key: SPARK-11336
> URL: https://issues.apache.org/jira/browse/SPARK-11336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
> Fix For: 1.6.0
>
>
> It would be nice to include -a link- the path to the example source file at 
> the bottom of each code example, so that users who want to try it know 
> where to find it. The font size should be small and unobtrusive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11336) Include path to the source file in generated example code

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11336:
--
Summary: Include path to the source file in generated example code  (was: 
Include a link to the source file in generated example code)

> Include path to the source file in generated example code
> -
>
> Key: SPARK-11336
> URL: https://issues.apache.org/jira/browse/SPARK-11336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
> Fix For: 1.6.0
>
>
> It would be nice to include a link to the example source file at the bottom 
> of each code example, so that users who want to try it know where to find it. 
> The font size should be small and unobtrusive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11737) String may not be serialized correctly with Kryo

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11737:


Assignee: Apache Spark  (was: Davies Liu)

> String may not be serialized correctly with Kryo
> 
>
> Key: SPARK-11737
> URL: https://issues.apache.org/jira/browse/SPARK-11737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>Priority: Critical
>
> When run in cluster mode, the driver may have different memory (and configs) 
> than the executors; additionally, if Kryo is used, strings are not collected 
> back to the driver correctly:
> {code}
> >>> sqlContext.range(10).selectExpr("repeat(cast(id as string), 9)").show()
> +----------------------------+
> |repeat(cast(id as string),9)|
> +----------------------------+
> |                           0|
> |                           1|
> |                           2|
> |                           3|
> |                           4|
> |                           5|
> |                           6|
> |                           7|
> |                           8|
> |                           9|
> +----------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11737) String may not be serialized correctly with Kryo

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004734#comment-15004734
 ] 

Apache Spark commented on SPARK-11737:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9704

> String may not be serialized correctly with Kryo
> 
>
> Key: SPARK-11737
> URL: https://issues.apache.org/jira/browse/SPARK-11737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> When run in cluster mode, the driver may have different memory (and configs) 
> than the executors; additionally, if Kryo is used, strings are not collected 
> back to the driver correctly:
> {code}
> >>> sqlContext.range(10).selectExpr("repeat(cast(id as string), 9)").show()
> +----------------------------+
> |repeat(cast(id as string),9)|
> +----------------------------+
> |                           0|
> |                           1|
> |                           2|
> |                           3|
> |                           4|
> |                           5|
> |                           6|
> |                           7|
> |                           8|
> |                           9|
> +----------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11737) String may not be serialized correctly with Kryo

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11737:


Assignee: Davies Liu  (was: Apache Spark)

> String may not be serialized correctly with Kryo
> 
>
> Key: SPARK-11737
> URL: https://issues.apache.org/jira/browse/SPARK-11737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> When run in cluster mode, the driver may have different memory (and configs) 
> than the executors; additionally, if Kryo is used, strings are not collected 
> back to the driver correctly:
> {code}
> >>> sqlContext.range(10).selectExpr("repeat(cast(id as string), 9)").show()
> +----------------------------+
> |repeat(cast(id as string),9)|
> +----------------------------+
> |                           0|
> |                           1|
> |                           2|
> |                           3|
> |                           4|
> |                           5|
> |                           6|
> |                           7|
> |                           8|
> |                           9|
> +----------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11672) Flaky test: ml.JavaDefaultReadWriteSuite

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11672.
---
Resolution: Fixed

Issue resolved by pull request 9694
[https://github.com/apache/spark/pull/9694]

> Flaky test: ml.JavaDefaultReadWriteSuite
> 
>
> Key: SPARK-11672
> URL: https://issues.apache.org/jira/browse/SPARK-11672
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.6.0
>
>
> Saw several failures on Jenkins, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8029:
-
Target Version/s: 1.5.3, 1.6.0  (was: 1.5.2, 1.6.0)

> ShuffleMapTasks must be robust to concurrent attempts on the same executor
> --
>
> Key: SPARK-8029
> URL: https://issues.apache.org/jira/browse/SPARK-8029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Imran Rashid
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
> Attachments: 
> AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf
>
>
> When stages get retried, a task may have more than one attempt running at the 
> same time, on the same executor.  Currently this causes problems for 
> ShuffleMapTasks, since all attempts try to write to the same output files.
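
One common way to make such writes safe, sketched below purely for illustration (the attached design doc weighs the actual alternatives considered for Spark, and the names here are hypothetical): each attempt writes to an attempt-private temporary file and then commits it with a check-then-rename, so concurrent attempts never clobber each other's partially written output.

{code}
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Illustrative sketch only: each attempt writes to its own temp file (named with
// the attempt id) and commits it under a lock; the first attempt to commit wins
// and later attempts discard their identical output.
object AttemptSafeWriter {
  private val commitLock = new Object

  def writeShuffleOutput(dir: File, mapId: Int, attemptId: Long, bytes: Array[Byte]): File = {
    val finalFile = new File(dir, s"shuffle_$mapId.data")
    val tmpFile   = new File(dir, s"shuffle_$mapId.data.$attemptId.tmp")
    Files.write(tmpFile.toPath, bytes)              // private to this attempt
    commitLock.synchronized {
      if (finalFile.exists()) {
        Files.deleteIfExists(tmpFile.toPath)        // another attempt already committed
      } else {
        Files.move(tmpFile.toPath, finalFile.toPath, StandardCopyOption.ATOMIC_MOVE)
      }
    }
    finalFile
  }
}
{code}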



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8029:
-
Fix Version/s: (was: 1.5.2)
   1.5.3

> ShuffleMapTasks must be robust to concurrent attempts on the same executor
> --
>
> Key: SPARK-8029
> URL: https://issues.apache.org/jira/browse/SPARK-8029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Imran Rashid
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
> Attachments: 
> AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf
>
>
> When stages get retried, a task may have more than one attempt running at the 
> same time, on the same executor.  Currently this causes problems for 
> ShuffleMapTasks, since all attempts try to write to the same output files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8029:
-
Fix Version/s: 1.5.2

> ShuffleMapTasks must be robust to concurrent attempts on the same executor
> --
>
> Key: SPARK-8029
> URL: https://issues.apache.org/jira/browse/SPARK-8029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Imran Rashid
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.2, 1.6.0
>
> Attachments: 
> AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf
>
>
> When stages get retried, a task may have more than one attempt running at the 
> same time, on the same executor.  Currently this causes problems for 
> ShuffleMapTasks, since all attempts try to write to the same output files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11737) String may not be serialized correctly with Kryo

2015-11-13 Thread Davies Liu (JIRA)
Davies Liu created SPARK-11737:
--

 Summary: String may not be serialized correctly with Kryo
 Key: SPARK-11737
 URL: https://issues.apache.org/jira/browse/SPARK-11737
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1, 1.6.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical


When run in cluster mode, the driver may have different memory (and configs)
than the executors; also, if Kryo is used, strings cannot be collected back to
the driver correctly:

{code}
>>> sqlContext.range(10).selectExpr("repeat(cast(id as string), 9)").show()
+----------------------------+
|repeat(cast(id as string),9)|
+----------------------------+
|                           0|
|                           1|
|                           2|
|                           3|
|                           4|
|                           5|
|                           6|
|                           7|
|                           8|
|                           9|
+----------------------------+
{code}
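
Purely as an illustration of the setting involved (not taken from the ticket), a spark-shell-style sketch that enables Kryo and collects the repeated strings back to the driver; in a healthy run each row should be the id repeated nine times (e.g. "000000000" for id 0):

{code}
// Hypothetical reproduction sketch: enable Kryo, then collect the strings.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("kryo-string-check")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Expected values are "000000000" ... "999999999"; the ticket reports that with
// Kryo in cluster mode the strings do not come back to the driver correctly.
sqlContext.range(10)
  .selectExpr("repeat(cast(id as string), 9)")
  .collect()
  .foreach(println)
{code}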



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11736) Add MonotonicallyIncreasingID to function registry

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11736:


Assignee: Apache Spark  (was: Yin Huai)

> Add MonotonicallyIncreasingID to function registry
> --
>
> Key: SPARK-11736
> URL: https://issues.apache.org/jira/browse/SPARK-11736
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11736) Add MonotonicallyIncreasingID to function registry

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11736:


Assignee: Yin Huai  (was: Apache Spark)

> Add MonotonicallyIncreasingID to function registry
> --
>
> Key: SPARK-11736
> URL: https://issues.apache.org/jira/browse/SPARK-11736
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11736) Add MonotonicallyIncreasingID to function registry

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004662#comment-15004662
 ] 

Apache Spark commented on SPARK-11736:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/9703

> Add MonotonicallyIncreasingID to function registry
> --
>
> Key: SPARK-11736
> URL: https://issues.apache.org/jira/browse/SPARK-11736
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11736) Add MonotonicallyIncreasingID to function registry

2015-11-13 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11736:


 Summary: Add MonotonicallyIncreasingID to function registry
 Key: SPARK-11736
 URL: https://issues.apache.org/jira/browse/SPARK-11736
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10712) JVM crashes with spark.sql.tungsten.enabled = true

2015-11-13 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004636#comment-15004636
 ] 

Davies Liu commented on SPARK-10712:


What does your small table look like? Does 1.5.2-RC2 still have this issue?

> JVM crashes with spark.sql.tungsten.enabled = true
> --
>
> Key: SPARK-10712
> URL: https://issues.apache.org/jira/browse/SPARK-10712
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
> Environment: 1 node - Linux, 64GB ram, 8 core
>Reporter: Mauro Pirrone
>Priority: Critical
>
> When turning on tungsten, I get the following error when executing a 
> query/job with a few joins. When tungsten is turned off, the error does not 
> appear. Also note that tungsten works for me in other cases.
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7ffadaf59200, pid=7598, tid=140710015645440
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_45-b14) (build 
> 1.8.0_45-b14)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.45-b02 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x7eb200]
> #
> # Core dump written. Default location: //core or core.7598 (max size 100 
> kB). To ensure a full core dump, try "ulimit -c unlimited" before starting 
> Java again
> #
> # An error report file with more information is saved as:
> # //hs_err_pid7598.log
> Compiled method (nm)   44403 10436 n 0   sun.misc.Unsafe::copyMemory 
> (native)
>  total in heap  [0x7ffac6b49290,0x7ffac6b495f8] = 872
>  relocation [0x7ffac6b493b8,0x7ffac6b49400] = 72
>  main code  [0x7ffac6b49400,0x7ffac6b495f8] = 504
> Compiled method (nm)   44403 10436 n 0   sun.misc.Unsafe::copyMemory 
> (native)
>  total in heap  [0x7ffac6b49290,0x7ffac6b495f8] = 872
>  relocation [0x7ffac6b493b8,0x7ffac6b49400] = 72
>  main code  [0x7ffac6b49400,0x7ffac6b495f8] = 504
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> ---  T H R E A D  ---
> Current thread (0x7ff7902e7800):  JavaThread "broadcast-hash-join-1" 
> daemon [_thread_in_vm, id=16548, stack(0x7ff66bd98000,0x7ff66be99000)]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 
> 0x00069f572b10
> Registers:
> RAX=0x00069f672b08, RBX=0x7ff7902e7800, RCX=0x000394132140, 
> RDX=0xfffe0004
> RSP=0x7ff66be97048, RBP=0x7ff66be970a0, RSI=0x000394032148, 
> RDI=0x00069f572b10
> R8 =0x7ff66be970d0, R9 =0x0028, R10=0x7ff79cc0e1e7, 
> R11=0x7ff79cc0e198
> R12=0x7ff66be970c0, R13=0x7ff66be970d0, R14=0x0028, 
> R15=0x30323048
> RIP=0x7ff7b0dae200, EFLAGS=0x00010282, CSGSFS=0xe033, 
> ERR=0x0004
>   TRAPNO=0x000e
> Top of Stack: (sp=0x7ff66be97048)
> 0x7ff66be97048:   7ff7b1042b1a 7ff7902e7800
> 0x7ff66be97058:   7ff7 7ff7902e7800
> 0x7ff66be97068:   7ff7902e7800 7ff7ad2846a0
> 0x7ff66be97078:   7ff7897048d8 
> 0x7ff66be97088:   7ff66be97110 7ff66be971f0
> 0x7ff66be97098:   7ff7902e7800 7ff66be970f0
> 0x7ff66be970a8:   7ff79cc0e261 0010
> 0x7ff66be970b8:   000390c04048 00066f24fac8
> 0x7ff66be970c8:   7ff7902e7800 000394032120
> 0x7ff66be970d8:   7ff7902e7800 7ff66f971af0
> 0x7ff66be970e8:   7ff7902e7800 7ff66be97198
> 0x7ff66be970f8:   7ff79c9d4c4d 7ff66a454b10
> 0x7ff66be97108:   7ff79c9d4c4d 0010
> 0x7ff66be97118:   7ff7902e5a90 0028
> 0x7ff66be97128:   7ff79c9d4760 000394032120
> 0x7ff66be97138:   30323048 7ff66be97160
> 0x7ff66be97148:   00066f24fac8 000390c04048
> 0x7ff66be97158:   7ff66be97158 7ff66f978eeb
> 0x7ff66be97168:   7ff66be971f0 7ff66f9791c8
> 0x7ff66be97178:   7ff668e90c60 7ff66f978f60
> 0x7ff66be97188:   7ff66be97110 7ff66be971b8
> 0x7ff66be97198:   7ff66be97238 7ff79c9d4c4d
> 0x7ff66be971a8:   0010 
> 0x7ff66be971b8:   38363130 38363130
> 0x7ff66be971c8:   0028 7ff66f973388
> 0x7ff66be971d8:   000394032120 30323048
> 0x7ff66be971e8:   000665823080 00066f24fac8
> 0x7ff66be971f8:   7ff66be971f8 7ff66f973357
> 0x7ff66be97208:   7ff66be97260 7ff66f976fe0
> 0x7ff66be97218:    7ff66f973388
> 0x7ff66be97228:   7ff66b

[jira] [Commented] (SPARK-11735) Add a check in the constructor of SqlContext to make sure the SparkContext is not stopped

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004627#comment-15004627
 ] 

Apache Spark commented on SPARK-11735:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/9702

> Add a check in the constructor of SqlContext to make sure the SparkContext is 
> not stopped
> -
>
> Key: SPARK-11735
> URL: https://issues.apache.org/jira/browse/SPARK-11735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11735) Add a check in the constructor of SqlContext to make sure the SparkContext is not stopped

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11735:


Assignee: Apache Spark

> Add a check in the constructor of SqlContext to make sure the SparkContext is 
> not stopped
> -
>
> Key: SPARK-11735
> URL: https://issues.apache.org/jira/browse/SPARK-11735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data

2015-11-13 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004626#comment-15004626
 ] 

Davies Liu commented on SPARK-9278:
---

cc [~lian cheng]

> DataFrameWriter.insertInto inserts incorrect data
> -
>
> Key: SPARK-9278
> URL: https://issues.apache.org/jira/browse/SPARK-9278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Linux, S3, Hive Metastore
>Reporter: Steve Lindemann
>Priority: Blocker
>
> After creating a partitioned Hive table (stored as Parquet) via the 
> DataFrameWriter.createTable command, subsequent attempts to insert additional 
> data into new partitions of this table result in inserting incorrect data 
> rows. Reordering the columns in the data to be written seems to avoid this 
> issue.
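
A sketch of the workaround the reporter hints at (reordering the columns before the write); the table and DataFrame names below are made up, and it assumes a SQLContext and a DataFrame named newData are already in scope:

{code}
import org.apache.spark.sql.functions.col

// Align the DataFrame's column order with the target table's schema as defined
// in the metastore before calling insertInto.
val targetColumns = sqlContext.table("events_partitioned").columns
val aligned = newData.select(targetColumns.map(col): _*)
aligned.write.mode("append").insertInto("events_partitioned")
{code}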



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11735) Add a check in the constructor of SqlContext to make sure the SparkContext is not stopped

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11735:


Assignee: (was: Apache Spark)

> Add a check in the constructor of SqlContext to make sure the SparkContext is 
> not stopped
> -
>
> Key: SPARK-11735
> URL: https://issues.apache.org/jira/browse/SPARK-11735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11643) inserting date with leading zero inserts null example '0001-12-10'

2015-11-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-11643:
--

Assignee: Davies Liu

> inserting date with leading zero inserts null example '0001-12-10'
> --
>
> Key: SPARK-11643
> URL: https://issues.apache.org/jira/browse/SPARK-11643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Chip Sands
>Assignee: Davies Liu
>
> Inserting a date with a leading zero inserts a null value, for example '0001-12-10'.
> This worked until 1.5/1.5.1.
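
The ticket does not say exactly how the insert was issued; one minimal way to exercise the date parsing from SQL (a sketch, not necessarily the reporter's path) is:

{code}
// Per the report, dates such as '0001-12-10' arrive as null on 1.5/1.5.1.
sqlContext.sql("select cast('0001-12-10' as date)").show()
{code}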



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11643) inserting date with leading zero inserts null example '0001-12-10'

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004622#comment-15004622
 ] 

Apache Spark commented on SPARK-11643:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9701

> inserting date with leading zero inserts null example '0001-12-10'
> --
>
> Key: SPARK-11643
> URL: https://issues.apache.org/jira/browse/SPARK-11643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Chip Sands
>Assignee: Davies Liu
>
> Inserting a date with a leading zero inserts a null value, for example '0001-12-10'.
> This worked until 1.5/1.5.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11643) inserting date with leading zero inserts null example '0001-12-10'

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11643:


Assignee: Apache Spark  (was: Davies Liu)

> inserting date with leading zero inserts null example '0001-12-10'
> --
>
> Key: SPARK-11643
> URL: https://issues.apache.org/jira/browse/SPARK-11643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Chip Sands
>Assignee: Apache Spark
>
> Inserting a date with a leading zero inserts a null value, for example '0001-12-10'.
> This worked until 1.5/1.5.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11643) inserting date with leading zero inserts null example '0001-12-10'

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11643:


Assignee: Davies Liu  (was: Apache Spark)

> inserting date with leading zero inserts null example '0001-12-10'
> --
>
> Key: SPARK-11643
> URL: https://issues.apache.org/jira/browse/SPARK-11643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Chip Sands
>Assignee: Davies Liu
>
> Inserting a date with a leading zero inserts a null value, for example '0001-12-10'.
> This worked until 1.5/1.5.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11735) Add a check in the constructor of SqlContext to make sure the SparkContext is not stopped

2015-11-13 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11735:


 Summary: Add a check in the constructor of SqlContext to make sure 
the SparkContext is not stopped
 Key: SPARK-11735
 URL: https://issues.apache.org/jira/browse/SPARK-11735
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11734) Move reference sort into test and standardize on TungstenSort

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004597#comment-15004597
 ] 

Apache Spark commented on SPARK-11734:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9700

> Move reference sort into test and standardize on TungstenSort
> -
>
> Key: SPARK-11734
> URL: https://issues.apache.org/jira/browse/SPARK-11734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11734) Move reference sort into test and standardize on TungstenSort

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11734:


Assignee: Reynold Xin  (was: Apache Spark)

> Move reference sort into test and standardize on TungstenSort
> -
>
> Key: SPARK-11734
> URL: https://issues.apache.org/jira/browse/SPARK-11734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11734) Move reference sort into test and standardize on TungstenSort

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11734:


Assignee: Apache Spark  (was: Reynold Xin)

> Move reference sort into test and standardize on TungstenSort
> -
>
> Key: SPARK-11734
> URL: https://issues.apache.org/jira/browse/SPARK-11734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11734) Move reference sort into test and standardize on TungstenSort

2015-11-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11734:
---

 Summary: Move reference sort into test and standardize on 
TungstenSort
 Key: SPARK-11734
 URL: https://issues.apache.org/jira/browse/SPARK-11734
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2344:
---

Assignee: (was: Apache Spark)

> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Priority: Minor
>  Labels: clustering
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented, and they
> differ only in the degree of relationship each point has with each cluster
> (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like to:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its own class)
> I'd like this to be assigned to me.
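
For reference, a minimal sketch of the standard FCM membership update the description above alludes to (illustrative names, not code from the pull request): each point gets a degree of membership in [0, 1] for every center, instead of the hard 0/1 assignment K-Means uses.

{code}
// Fuzzy membership of one point given its distances to all cluster centers,
// using the usual u_ij = 1 / sum_k (d_ij / d_kj)^(2 / (m - 1)) update.
def fuzzyMemberships(distances: Array[Double], m: Double = 2.0): Array[Double] = {
  val p = 2.0 / (m - 1.0)
  distances.map { dij =>
    if (dij == 0.0) 1.0   // the point sits exactly on this center
    else 1.0 / distances.map { dkj =>
      if (dkj == 0.0) Double.PositiveInfinity else math.pow(dij / dkj, p)
    }.sum
  }
}

// A point equidistant from two centers belongs half to each:
// fuzzyMemberships(Array(1.0, 1.0)) == Array(0.5, 0.5)
{code}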



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004589#comment-15004589
 ] 

Apache Spark commented on SPARK-2344:
-

User 'acflorea' has created a pull request for this issue:
https://github.com/apache/spark/pull/9699

> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Priority: Minor
>  Labels: clustering
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented, and they
> differ only in the degree of relationship each point has with each cluster
> (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like to:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its own class)
> I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2344:
---

Assignee: Apache Spark

> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Assignee: Apache Spark
>Priority: Minor
>  Labels: clustering
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented, and they
> differ only in the degree of relationship each point has with each cluster
> (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like to:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its own class)
> I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11727) split ExpressionEncoder into FlatEncoder and ProductEncoder

2015-11-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11727.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9693
[https://github.com/apache/spark/pull/9693]

> split ExpressionEncoder into FlatEncoder and ProductEncoder
> ---
>
> Key: SPARK-11727
> URL: https://issues.apache.org/jira/browse/SPARK-11727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11724) Casting integer types to timestamp has unexpected semantics

2015-11-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11724:

Assignee: Nong Li

> Casting integer types to timestamp has unexpected semantics
> ---
>
> Key: SPARK-11724
> URL: https://issues.apache.org/jira/browse/SPARK-11724
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Nong Li
>Assignee: Nong Li
>Priority: Minor
>  Labels: releasenotes
>
> Casting from integer types to timestamp treats the source int as being in 
> millis. Casting from timestamp to integer types creates the result in 
> seconds. This leads to behavior like:
> {code}
> scala> sql("select cast(cast (1234 as timestamp) as bigint)").show
> +---+
> |_c0|
> +---+
> |  1|
> +---+
> {code}
> Casting doubles, on the other hand, treats the value as seconds in both directions:
> {code}
> scala> sql("select cast(cast (1234.5 as timestamp) as double)").show
> +------+
> |   _c0|
> +------+
> |1234.5|
> +------+
> {code}
> This also breaks some other functions that return longs in seconds, in
> particular unix_timestamp.
> {code}
> scala> sql("select cast(unix_timestamp() as timestamp)").show
> +--------------------+
> |                 _c0|
> +--------------------+
> |1970-01-17 10:03:...|
> +--------------------+
> scala> sql("select cast(unix_timestamp() *1000 as timestamp)").show
> +--------------------+
> |                 _c0|
> +--------------------+
> |2015-11-12 23:26:...|
> +--------------------+
> {code}
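
The asymmetry in the first example is just the unit mismatch spelled out (a note added here for clarity, not part of the original report):

{code}
// int -> timestamp reads 1234 as milliseconds: 1.234 seconds after the epoch.
// timestamp -> bigint reports whole seconds, so the round trip yields
// 1234 / 1000 = 1, matching the |  1| shown above instead of the original 1234.
val millis  = 1234L
val seconds = millis / 1000L   // 1
{code}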



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11724) Casting integer types to timestamp has unexpected semantics

2015-11-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11724:

Labels: releasenotes  (was: )

> Casting integer types to timestamp has unexpected semantics
> ---
>
> Key: SPARK-11724
> URL: https://issues.apache.org/jira/browse/SPARK-11724
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Nong Li
>Priority: Minor
>  Labels: releasenotes
>
> Casting from integer types to timestamp treats the source int as being in 
> millis. Casting from timestamp to integer types creates the result in 
> seconds. This leads to behavior like:
> {code}
> scala> sql("select cast(cast (1234 as timestamp) as bigint)").show
> +---+
> |_c0|
> +---+
> |  1|
> +---+
> {code}
> Casting doubles, on the other hand, treats the value as seconds in both directions:
> {code}
> scala> sql("select cast(cast (1234.5 as timestamp) as double)").show
> +------+
> |   _c0|
> +------+
> |1234.5|
> +------+
> {code}
> This also breaks some other functions that return longs in seconds, in
> particular unix_timestamp.
> {code}
> scala> sql("select cast(unix_timestamp() as timestamp)").show
> +--------------------+
> |                 _c0|
> +--------------------+
> |1970-01-17 10:03:...|
> +--------------------+
> scala> sql("select cast(unix_timestamp() *1000 as timestamp)").show
> +--------------------+
> |                 _c0|
> +--------------------+
> |2015-11-12 23:26:...|
> +--------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


