[jira] [Created] (SPARK-13799) Add memory back pressure for Spark Streaming

2016-03-10 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-13799:
-

 Summary: Add memory back pressure for Spark Streaming
 Key: SPARK-13799
 URL: https://issues.apache.org/jira/browse/SPARK-13799
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 2.0.0
Reporter: Liwei Lin


The back pressure mechanism added in Spark 1.5.0 does a great job of keeping the
whole application in a stable state. But even with back pressure enabled, we have
at times experienced Receiver-side OOMEs when facing sudden data bursts.

This JIRA proposes to add memory back pressure on the Receiver side, as a
complement to the current back pressure mechanism, to better keep the application
in a stable state.
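
Purely as an illustration of the idea (not the proposed design; every name below
is invented for the sketch), a Receiver-side guard could hold off storing further
records while JVM memory headroom is low, complementing the existing rate-based
back pressure:

{code}
// Hypothetical helper, for illustration only: block until at least
// `minFreeBytes` of heap headroom is available before ingesting more records.
def waitForMemoryHeadroom(minFreeBytes: Long, pollIntervalMs: Long = 50): Unit = {
  val rt = Runtime.getRuntime
  def headroom: Long = rt.maxMemory() - (rt.totalMemory() - rt.freeMemory())
  while (headroom < minFreeBytes) Thread.sleep(pollIntervalMs)
}
{code}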






[jira] [Commented] (SPARK-13799) Add memory back pressure for Spark Streaming

2016-03-10 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188890#comment-15188890
 ] 

Liwei Lin commented on SPARK-13799:
---

Will soon upload a reproducer.
We're working on this; will soon issue a PR.

> Add memory back pressure for Spark Streaming
> 
>
> Key: SPARK-13799
> URL: https://issues.apache.org/jira/browse/SPARK-13799
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>
> The back pressure mechanism added in Spark 1.5.0 does a great job of keeping
> the whole application in a stable state. But even with back pressure enabled,
> we have at times experienced Receiver-side OOMEs when facing sudden data bursts.
> This JIRA proposes to add memory back pressure on the Receiver side, as a
> complement to the current back pressure mechanism, to better keep the
> application in a stable state.






[jira] [Commented] (SPARK-12653) Re-enable test "SPARK-8489: MissingRequirementError during reflection"

2016-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188893#comment-15188893
 ] 

Apache Spark commented on SPARK-12653:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11630

> Re-enable test "SPARK-8489: MissingRequirementError during reflection"
> --
>
> Key: SPARK-12653
> URL: https://issues.apache.org/jira/browse/SPARK-12653
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> This test case was disabled in 
> https://github.com/apache/spark/pull/10569#discussion-diff-48813840
> I think we need to rebuild the jar because it was compiled against an old 
> version of Spark.






[jira] [Commented] (SPARK-8489) Add regression tests for SPARK-8470

2016-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188894#comment-15188894
 ] 

Apache Spark commented on SPARK-8489:
-

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11630

> Add regression tests for SPARK-8470
> ---
>
> Key: SPARK-8489
> URL: https://issues.apache.org/jira/browse/SPARK-8489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.4.1, 1.5.0
>
>
> See SPARK-8470 for more detail. Basically the Spark Hive code silently 
> overwrites the context class loader populated in SparkSubmit, resulting in 
> certain classes missing when we do reflection in `SQLContext#createDataFrame`.
> That issue is already resolved in https://github.com/apache/spark/pull/6891, 
> but we should add a regression test for the specific manifestation of the bug 
> in SPARK-8470.






[jira] [Assigned] (SPARK-12653) Re-enable test "SPARK-8489: MissingRequirementError during reflection"

2016-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12653:


Assignee: Apache Spark

> Re-enable test "SPARK-8489: MissingRequirementError during reflection"
> --
>
> Key: SPARK-12653
> URL: https://issues.apache.org/jira/browse/SPARK-12653
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> This test case was disabled in 
> https://github.com/apache/spark/pull/10569#discussion-diff-48813840
> I think we need to rebuild the jar because it was compiled against an old 
> version of Spark.






[jira] [Assigned] (SPARK-12653) Re-enable test "SPARK-8489: MissingRequirementError during reflection"

2016-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12653:


Assignee: (was: Apache Spark)

> Re-enable test "SPARK-8489: MissingRequirementError during reflection"
> --
>
> Key: SPARK-12653
> URL: https://issues.apache.org/jira/browse/SPARK-12653
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> This test case was disabled in 
> https://github.com/apache/spark/pull/10569#discussion-diff-48813840
> I think we need to rebuild the jar because it was compiled against an old 
> version of Spark.






[jira] [Commented] (SPARK-12653) Re-enable test "SPARK-8489: MissingRequirementError during reflection"

2016-03-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188897#comment-15188897
 ] 

Dongjoon Hyun commented on SPARK-12653:
---

Hi, [~rxin].
As you guessed, I can re-enable the test just by recompiling the jar against the
current master.
Let's see the Jenkins result.

> Re-enable test "SPARK-8489: MissingRequirementError during reflection"
> --
>
> Key: SPARK-12653
> URL: https://issues.apache.org/jira/browse/SPARK-12653
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> This test case was disabled in 
> https://github.com/apache/spark/pull/10569#discussion-diff-48813840
> I think we need to rebuild the jar because it was compiled against an old 
> version of Spark.






[jira] [Updated] (SPARK-13618) Make Streaming web UI display rate-limit lines in the statistics graph

2016-03-10 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-13618:
--
Description: 
This JIRA proposes to make the Streaming web UI display rate-limit lines in the
statistics graph.

Specifically, this JIRA proposes to:

1. add to `RateLimiter` a data structure that keeps a history of rate-limit
changes, so that the upper bound on how many records can be received in a block
interval can be calculated;
2. add the `numRecordsLimit` information to the path from where `BlockGenerator`
generates a `Block` to the `ReceivedBlockInfo` (so that `numRecordsLimit` can be
transferred over the wire to the driver side's `ReceivedBlockTracker`);
3. make changes in `StreamingJobProgressListener` and related places, so that
the aggregated `numRecordsLimit` information for every batch can be calculated;
4. make changes in `StreamingPage` and related places, so that two or more lines
can be drawn on a single statistics graph.
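
A minimal sketch of the kind of structure step 1 above describes (names and shape
are assumptions for illustration, not the actual patch): record every rate change
with its timestamp, then integrate the piecewise-constant rate over a block
interval to bound the record count.

{code}
import scala.collection.mutable.ArrayBuffer

// Illustrative only: RateLimiter would append an entry on every rate update.
class RateLimitHistory {
  private case class RateChange(timeMillis: Long, ratePerSecond: Long)
  private val changes = ArrayBuffer[RateChange]()

  def recordRateChange(timeMillis: Long, ratePerSecond: Long): Unit = synchronized {
    changes += RateChange(timeMillis, ratePerSecond)
  }

  // Upper bound on records receivable in [startMillis, endMillis), treating the
  // rate as piecewise constant between recorded changes.
  def maxRecordsIn(startMillis: Long, endMillis: Long): Long = synchronized {
    if (changes.isEmpty || endMillis <= startMillis) 0L
    else {
      val segmentEnds = changes.drop(1).map(_.timeMillis) :+ Long.MaxValue
      val total = changes.zip(segmentEnds).map { case (change, segmentEnd) =>
        val from = math.max(change.timeMillis, startMillis)
        val to = math.min(segmentEnd, endMillis)
        if (to > from) change.ratePerSecond * (to - from) / 1000.0 else 0.0
      }.sum
      math.ceil(total).toLong
    }
  }
}
{code}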

[Screenshots]

without back pressure:
!https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!

with back pressure:
!https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!

  was:
This JIRA proposes to make the Streaming web UI display rate-limit lines in the
statistics graph.

Specifically, this JIRA proposes to:

1. add to `RateLimiter` a data structure that keeps a history of rate-limit
changes, so that the upper bound on how many records can be received in a block
interval can be calculated;
2. add the `numRecordsLimit` information to the path from where `BlockGenerator`
generates a `Block` to the `ReceivedBlockInfo` (so that `numRecordsLimit` can be
transferred over the wire to the driver side's `ReceivedBlockTracker`);
3. make changes in `StreamingJobProgressListener` and related places, so that
the aggregated `numRecordsLimit` information for every batch can be calculated;
4. make changes in `StreamingPage` and related places, so that two or more lines
can be drawn on a single statistics graph.

[Screenshots]

without back pressure:
!https://issues.apache.org/jira/secure/attachment/12790928/1.png!

with back pressure:
!https://issues.apache.org/jira/secure/attachment/12790929/2.png!


> Make Streaming web UI display rate-limit lines in the statistics graph
> --
>
> Key: SPARK-13618
> URL: https://issues.apache.org/jira/browse/SPARK-13618
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming, Web UI
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Liwei Lin
>
> This JIRA proposes to make the Streaming web UI display rate-limit lines in
> the statistics graph.
> Specifically, this JIRA proposes to:
> 1. add to `RateLimiter` a data structure that keeps a history of rate-limit
> changes, so that the upper bound on how many records can be received in a
> block interval can be calculated;
> 2. add the `numRecordsLimit` information to the path from where
> `BlockGenerator` generates a `Block` to the `ReceivedBlockInfo` (so that
> `numRecordsLimit` can be transferred over the wire to the driver side's
> `ReceivedBlockTracker`);
> 3. make changes in `StreamingJobProgressListener` and related places, so that
> the aggregated `numRecordsLimit` information for every batch can be
> calculated;
> 4. make changes in `StreamingPage` and related places, so that two or more
> lines can be drawn on a single statistics graph.
> [Screenshots]
> without back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!
> with back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!






[jira] [Updated] (SPARK-13618) Make Streaming web UI display rate-limit lines in the statistics graph

2016-03-10 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-13618:
--
Attachment: (was: 1.png)

> Make Streaming web UI display rate-limit lines in the statistics graph
> --
>
> Key: SPARK-13618
> URL: https://issues.apache.org/jira/browse/SPARK-13618
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming, Web UI
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Liwei Lin
>
> This JIRA proposes to make the Streaming web UI display rate-limit lines in
> the statistics graph.
> Specifically, this JIRA proposes to:
> 1. add to `RateLimiter` a data structure that keeps a history of rate-limit
> changes, so that the upper bound on how many records can be received in a
> block interval can be calculated;
> 2. add the `numRecordsLimit` information to the path from where
> `BlockGenerator` generates a `Block` to the `ReceivedBlockInfo` (so that
> `numRecordsLimit` can be transferred over the wire to the driver side's
> `ReceivedBlockTracker`);
> 3. make changes in `StreamingJobProgressListener` and related places, so that
> the aggregated `numRecordsLimit` information for every batch can be
> calculated;
> 4. make changes in `StreamingPage` and related places, so that two or more
> lines can be drawn on a single statistics graph.
> [Screenshots]
> without back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!
> with back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!






[jira] [Resolved] (SPARK-13771) Eliminate child of project if the project has no references to its child

2016-03-10 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-13771.
-
Resolution: Won't Fix

> Eliminate child of project if the project has no references to its child
> -
>
> Key: SPARK-13771
> URL: https://issues.apache.org/jira/browse/SPARK-13771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> This is a corner case where a Project node has no references to its child.
> E.g.,
> {code}
> val input = LocalRelation('key.int, 'value.string)
> Project(Literal(1).as("1") :: Nil, input)
> {code}
> We can actually replace the input with a dummy OneRowRelation.
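
For illustration, a sketch of the kind of optimizer rule being discussed, written
against the Catalyst API of that era (rule name assumed; not an actual patch).
Note that swapping in OneRowRelation changes the result's row count whenever the
child does not produce exactly one row, which is the main caveat of the idea.

{code}
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan, OneRowRelation, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative only: the Project references nothing from its LocalRelation
// child, so the child could be replaced by a dummy single-row relation.
object ReplaceUnreferencedChild extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case p @ Project(projectList, _: LocalRelation) if p.references.isEmpty =>
      Project(projectList, OneRowRelation)
  }
}
{code}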






[jira] [Updated] (SPARK-13618) Make Streaming web UI display rate-limit lines in the statistics graph

2016-03-10 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-13618:
--
Attachment: (was: 2.png)

> Make Streaming web UI display rate-limit lines in the statistics graph
> --
>
> Key: SPARK-13618
> URL: https://issues.apache.org/jira/browse/SPARK-13618
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming, Web UI
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Liwei Lin
>
> This JIRA proposes to make the Streaming web UI display rate-limit lines in
> the statistics graph.
> Specifically, this JIRA proposes to:
> 1. add to `RateLimiter` a data structure that keeps a history of rate-limit
> changes, so that the upper bound on how many records can be received in a
> block interval can be calculated;
> 2. add the `numRecordsLimit` information to the path from where
> `BlockGenerator` generates a `Block` to the `ReceivedBlockInfo` (so that
> `numRecordsLimit` can be transferred over the wire to the driver side's
> `ReceivedBlockTracker`);
> 3. make changes in `StreamingJobProgressListener` and related places, so that
> the aggregated `numRecordsLimit` information for every batch can be
> calculated;
> 4. make changes in `StreamingPage` and related places, so that two or more
> lines can be drawn on a single statistics graph.
> [Screenshots]
> without back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!
> with back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!






[jira] [Updated] (SPARK-13618) Make Streaming web UI display rate-limit lines in the statistics graph

2016-03-10 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-13618:
--
Description: 
This JIRA proposes to make the Streaming web UI display rate-limit lines in the
statistics graph.

[Screenshots]

without back pressure:
!https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!

with back pressure:
!https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!

  was:
This JIRA proposes to make the Streaming web UI display rate-limit lines in the
statistics graph.

Specifically, this JIRA proposes to:

1. add to `RateLimiter` a data structure that keeps a history of rate-limit
changes, so that the upper bound on how many records can be received in a block
interval can be calculated;
2. add the `numRecordsLimit` information to the path from where `BlockGenerator`
generates a `Block` to the `ReceivedBlockInfo` (so that `numRecordsLimit` can be
transferred over the wire to the driver side's `ReceivedBlockTracker`);
3. make changes in `StreamingJobProgressListener` and related places, so that
the aggregated `numRecordsLimit` information for every batch can be calculated;
4. make changes in `StreamingPage` and related places, so that two or more lines
can be drawn on a single statistics graph.

[Screenshots]

without back pressure:
!https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!

with back pressure:
!https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!


> Make Streaming web UI display rate-limit lines in the statistics graph
> --
>
> Key: SPARK-13618
> URL: https://issues.apache.org/jira/browse/SPARK-13618
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming, Web UI
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Liwei Lin
>
> This JIRA proposes to make the Streaming web UI display rate-limit lines in
> the statistics graph.
> [Screenshots]
> without back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!
> with back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!






[jira] [Created] (SPARK-13800) Hive conf will be modified on multi-beeline connect to thriftserver

2016-03-10 Thread Weizhong (JIRA)
Weizhong created SPARK-13800:


 Summary: Hive conf will be modified on multi-beeline connect to 
thriftserver
 Key: SPARK-13800
 URL: https://issues.apache.org/jira/browse/SPARK-13800
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Weizhong
Priority: Minor


1. Start ThriftServer
2. beeline 1 connects to the ThriftServer, then runs
{code:sql}
create database if not exists hive_bin_partitioned_orc_3;
use hive_bin_partitioned_orc_3;

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1;
set spark.sql.autoBroadcastJoinThreshold=-1;

drop table if exists store_returns;
create table store_returns(
  sr_returned_date_sk   int,
  sr_return_time_sk     int,
  sr_item_sk            int,
  sr_customer_sk        int,
  sr_cdemo_sk           int,
  sr_hdemo_sk           int,
  sr_addr_sk            int,
  sr_store_sk           int,
  sr_reason_sk          int,
  sr_ticket_number      int,
  sr_return_quantity    int,
  sr_return_amt         float,
  sr_return_tax         float,
  sr_return_amt_inc_tax float,
  sr_fee                float,
  sr_return_ship_cost   float,
  sr_refunded_cash      float,
  sr_reversed_charge    float,
  sr_store_credit       float,
  sr_net_loss           float
)
partitioned by (sr_returned_date string)
stored as orc;

insert overwrite table store_returns partition (sr_returned_date) 
select
  sr.sr_returned_date_sk,
  sr.sr_return_time_sk,
  sr.sr_item_sk,
  sr.sr_customer_sk,
  sr.sr_cdemo_sk,
  sr.sr_hdemo_sk,
  sr.sr_addr_sk,
  sr.sr_store_sk,
  sr.sr_reason_sk,
  sr.sr_ticket_number,
  sr.sr_return_quantity,
  sr.sr_return_amt,
  sr.sr_return_tax,
  sr.sr_return_amt_inc_tax,
  sr.sr_fee,
  sr.sr_return_ship_cost,
  sr.sr_refunded_cash,
  sr.sr_reversed_charge,
  sr.sr_store_credit,
  sr.sr_net_loss,
  dd.d_date as sr_returned_date 
from tpcds_text_3.store_returns sr
join tpcds_text_3.date_dim dd
on (sr.sr_returned_date_sk = dd.d_date_sk);
{code}
3. beeline 2 connects to the ThriftServer, then runs
{code:sql}
show tables;
{code}

*The INSERT ... SELECT failed because hive.exec.max.dynamic.partitions had been
modified back to the default value (1000).*

{noformat}
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:602)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:895)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:895)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:895)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:322)
at 
org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:269)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:268)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:311)
at 
org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:894)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anon$1.run(InsertIntoHiveTable.scala:228)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anon$1.run(InsertIntoHiveTable.scala:226)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1711)
... 25 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic 
partitions created is 1823, which is more than 1000. To solve this try to set 
hive.exec.max.dynamic.partitions to at least 1823.
at 
org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1584)
... 43 more
{noformat}






[jira] [Updated] (SPARK-13629) Add binary toggle Param to CountVectorizer

2016-03-10 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-13629:
---
Assignee: yuhao yang

> Add binary toggle Param to CountVectorizer
> --
>
> Key: SPARK-13629
> URL: https://issues.apache.org/jira/browse/SPARK-13629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Minor
>
> It would be handy to add a binary toggle Param to CountVectorizer, as in the 
> scikit-learn one: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html]
> If set, then all non-zero counts will be set to 1.
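
A minimal usage sketch, assuming the toggle lands as a `binary` Param with a
`setBinary` setter (assumed name, mirroring the linked scikit-learn option) on a
Spark 2.x CountVectorizer:

{code}
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

object BinaryCountVectorizerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("binary-cv").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (0, Seq("a", "b", "b", "c")),
      (1, Seq("a", "a", "a"))
    ).toDF("id", "words")

    val model = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setBinary(true) // proposed toggle: clamp every non-zero count to 1.0
      .fit(df)

    model.transform(df).show(truncate = false)
    spark.stop()
  }
}
{code}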






[jira] [Updated] (SPARK-13629) Add binary toggle Param to CountVectorizer

2016-03-10 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-13629:
---
Shepherd: Nick Pentreath

> Add binary toggle Param to CountVectorizer
> --
>
> Key: SPARK-13629
> URL: https://issues.apache.org/jira/browse/SPARK-13629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Minor
>
> It would be handy to add a binary toggle Param to CountVectorizer, as in the 
> scikit-learn one: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html]
> If set, then all non-zero counts will be set to 1.






[jira] [Updated] (SPARK-13672) Add python examples of BisectingKMeans in ML and MLLIB

2016-03-10 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-13672:
---
Shepherd: Nick Pentreath
Assignee: zhengruifeng

> Add python examples of BisectingKMeans in ML and MLLIB
> --
>
> Key: SPARK-13672
> URL: https://issues.apache.org/jira/browse/SPARK-13672
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
>
> add the missing python examples of BisectingKMeans for ml and mllib






[jira] [Updated] (SPARK-13600) Use approxQuantile from DataFrame stats in QuantileDiscretizer

2016-03-10 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-13600:
---
Shepherd: Nick Pentreath

> Use approxQuantile from DataFrame stats in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Oliver Pierson
>
> For consistency and code reuse, QuantileDiscretizer should use approxQuantile 
> to find splits in the data rather than implement its own method.
> Additionally, making this change should remedy a bug where 
> QuantileDiscretizer fails to calculate the correct splits in certain 
> circumstances, resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
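
A sketch of the suggested approach, reusing the `df` from the example above and
assuming Spark 2.0's DataFrameStatFunctions.approxQuantile(col, probabilities,
relativeError):

{code}
// Ask for the interior quantiles and pad with +/- Infinity, instead of
// QuantileDiscretizer's own split-finding method.
val numBuckets = 5
val probabilities = (1 until numBuckets).map(_.toDouble / numBuckets).toArray // 0.2, 0.4, 0.6, 0.8
val relativeError = 0.001
val innerSplits = df.stat.approxQuantile("x", probabilities, relativeError)
val splits = Double.NegativeInfinity +: innerSplits :+ Double.PositiveInfinity
// splits.length == numBuckets + 1, i.e. exactly numBuckets buckets
{code}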






[jira] [Updated] (SPARK-11108) OneHotEncoder should support other numeric input types

2016-03-10 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-11108:
---
Assignee: Seth Hendrickson

> OneHotEncoder should support other numeric input types
> --
>
> Key: SPARK-11108
> URL: https://issues.apache.org/jira/browse/SPARK-11108
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Minor
>
> See parent JIRA for more info.
> Also see [SPARK-10513] for motivation behind issue.






[jira] [Updated] (SPARK-11108) OneHotEncoder should support other numeric input types

2016-03-10 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-11108:
---
Shepherd: Nick Pentreath

> OneHotEncoder should support other numeric input types
> --
>
> Key: SPARK-11108
> URL: https://issues.apache.org/jira/browse/SPARK-11108
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Minor
>
> See parent JIRA for more info.
> Also see [SPARK-10513] for motivation behind issue.






[jira] [Updated] (SPARK-13718) Scheduler "creating" straggler node

2016-03-10 Thread Ioannis Deligiannis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ioannis Deligiannis updated SPARK-13718:

Attachment: TestIssue.java

Test case: passing a JavaSparkContext to the test method and tuning some
parameters to match the cluster should reproduce the problem.

> Scheduler "creating" straggler node 
> 
>
> Key: SPARK-13718
> URL: https://issues.apache.org/jira/browse/SPARK-13718
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 1.3.1
> Environment: Spark 1.3.1
> MapR-FS
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 48 cores & 512 RAM / node
> Data Replication factor of 3
> Each Node has 4 Spark executors configured with 12 cores each and 22GB of RAM.
>Reporter: Ioannis Deligiannis
>Priority: Minor
> Attachments: TestIssue.java, spark_struggler.jpg
>
>
> *Data:*
> * Assume an even distribution of data across the cluster with a replication 
> factor of 3.
> * In-memory data are partitioned in 128 chunks (384 cores in total, i.e. 3 
> requests can be executed concurrently(-ish) )
> *Action:*
> * Action is a simple sequence of map/filter/reduce. 
> * The action operates upon and returns a small subset of data (following the 
> full map over the data).
> Data are cached once, serialized in memory (Kryo), so calling the action
> should not hit the disk under normal conditions.
> * Action network usage is low as it returns a small number of aggregated 
> results and does not require excessive shuffling
> * Under low or moderate load, each action is expected to complete in less 
> than 2 seconds
> *H/W Outlook*
> When the action is called in high numbers, initially the cluster CPU gets 
> close to 100% (which is expected & intended). 
> After a while, cluster utilization drops significantly, with only one
> (straggler) node at 100% CPU and with a fully utilized network.
> *Diagnosis:*
> 1. Attached a profiler to the driver and executors to monitor GC and I/O
> issues; everything is normal under low or heavy usage.
> 2. Cluster network usage is very low.
> 3. No issues on the Spark UI except that tasks begin to move from LOCAL to ANY.
> *Cause (corrected after finding the details in the code):*
> 1. Node 'H' is doing marginally more work than the rest (being a little
> slower and at almost 100% CPU).
> 2. The scheduler hits the default 3000 ms spark.locality.wait and assigns the
> task to other nodes.
> 3. One of the nodes, 'X', that accepted the task will try to access the data
> from node 'H''s HDD. This adds network I/O to node 'H' and also some extra CPU
> for I/O.
> 4. 'X''s time to complete increases ~5x as the data goes over the network.
> 5. Eventually, every node has a task that is waiting to fetch that specific
> partition from node 'H', so the cluster is basically blocked on a single node.
> What I managed to figure out from the code is that this happens because, if an
> RDD is cached, Spark will use BlockManager.getRemote() and will not recompute
> the part of the DAG that produced this RDD, and hence always hits the node
> that has cached the RDD.
> *Proposed Fix*
> I have not worked with the Scala & Spark source code enough to propose a code
> fix, but on a high level, when a task hits the 'spark.locality.wait' timeout,
> it could make use of a new configuration, e.g.
> recomputeRddAfterLocalityTimeout, instead of always trying to get the cached
> RDD. This would be very useful if it could also be set manually on the RDD.
> *Workaround*
> Playing with 'spark.locality.wait' values, there is a deterministic value,
> depending on partitions and config, at which the problem ceases to exist.
> *PS1*: I don't have enough Scala skills to follow the issue or propose a fix,
> but I hope this has enough information to make sense.
> *PS2*: Debugging this issue made me realize that there can be a lot of
> use-cases that trigger this behaviour.
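
For reference, a sketch of the workaround described above; the values are
examples to tune against a given cluster, not recommendations:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Raising spark.locality.wait keeps tasks local to the cached partition longer
// (avoiding remote fetches from node 'H'); lowering it gives up locality sooner.
// Which direction helps depends on the partition count and cluster config.
val conf = new SparkConf()
  .setAppName("locality-wait-tuning")
  .set("spark.locality.wait", "10000")      // milliseconds; default is 3000
  .set("spark.locality.wait.node", "10000") // node-local level override
val sc = new SparkContext(conf)
{code}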






[jira] [Commented] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1

2016-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189079#comment-15189079
 ] 

Apache Spark commented on SPARK-13663:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11631

> Upgrade Snappy Java to 1.1.2.1
> --
>
> Key: SPARK-13663
> URL: https://issues.apache.org/jira/browse/SPARK-13663
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Minor
>
> The JVM memory leak problem reported in 
> https://github.com/xerial/snappy-java/issues/131 has been resolved.
> 1.1.2.1 was released on Jan 22nd.
> We should upgrade to this release.






[jira] [Updated] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13663:
--
Assignee: Sean Owen

(I'll assign to "yy2016" if he/she will post his/her JIRA handle.)

> Upgrade Snappy Java to 1.1.2.1
> --
>
> Key: SPARK-13663
> URL: https://issues.apache.org/jira/browse/SPARK-13663
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Assignee: Sean Owen
>Priority: Minor
>
> The JVM memory leak problem reported in 
> https://github.com/xerial/snappy-java/issues/131 has been resolved.
> 1.1.2.1 was released on Jan 22nd.
> We should upgrade to this release.






[jira] [Commented] (SPARK-13691) Scala and Python generate inconsistent results

2016-03-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189084#comment-15189084
 ] 

Sean Owen commented on SPARK-13691:
---

Fair point ... but hm, I wonder if we can change the behavior significantly at 
this point?
Is this specific to running in local mode BTW?

> Scala and Python generate inconsistent results
> --
>
> Key: SPARK-13691
> URL: https://issues.apache.org/jira/browse/SPARK-13691
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
>Reporter: Shixiong Zhu
>
> Here is an example where Scala and Python generate different results:
> {code}
> Scala:
> scala> var i = 0
> i: Int = 0
> scala> val rdd = sc.parallelize(1 to 10).map(_ + i)
> scala> rdd.collect()
> res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> scala> i += 1
> scala> rdd.collect()
> res2: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
> Python:
> >>> i = 0
> >>> rdd = sc.parallelize(range(1, 10)).map(lambda x: x + i)
> >>> rdd.collect()
> [1, 2, 3, 4, 5, 6, 7, 8, 9]
> >>> i += 1
> >>> rdd.collect()
> [1, 2, 3, 4, 5, 6, 7, 8, 9]
> {code}
> The difference is that Scala captures the variables' values every time a job
> runs, but Python captures them once and reuses those values for all jobs.






[jira] [Resolved] (SPARK-13714) Another ConnectedComponents based on Max-Degree Propagation

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13714.
---
Resolution: Won't Fix

I'm going to call this WontFix unless somebody advocates strongly for actually 
*replacing* the current implementation with this one. More realistic is to try 
to merge them. Adding a second one doesn't seem great.

> Another ConnectedComponents based on Max-Degree Propagation
> ---
>
> Key: SPARK-13714
> URL: https://issues.apache.org/jira/browse/SPARK-13714
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: zhengruifeng
>Priority: Minor
>
> The current ConnectedComponents algorithm is based on Min-VertexId propagation,
> which is sensitive to where the vertex with the minimum ID happens to sit.
> This implementation is instead based on Max-Degree propagation.
> First, the degree graph is computed. Then, in the Pregel phase, the vertex
> with the max degree in a CC is the starting point of propagation.
> This new method has advantages over the old one:
> 1. Convergence is determined only by the structure of the CC, and is robust to
> the location of the vertex with the min ID.
> 2. For spherical CCs that have something like a 'center', it can accelerate
> convergence. For example, for GraphGenerators.gridGraph(sc, 3, 3), the old CC
> needs 4 supersteps, while the new one needs only 2.
> 3. If we limit the number of iterations, the new method tends to generate more
> acceptable results.
> 4. The output for each CC is the vertex with the max degree in it, which may be
> more meaningful. And because the vertex ID is nominal in most cases, the
> vertex with the min ID in a CC is somewhat meaningless.
> But there are still two disadvantages:
> 1. The message body grows from (VID) to (VID, Degree), that is, (Long) ->
> (Long, Int).
> 2. For graphs with simple CCs, it may be slower than the old one, because it
> needs an extra degree computation.
> The API is the same as ConnectedComponents:
> val graph = ...
> val cc = graph.ConnectedComponentsWithDegree(100)
> or
> val cc = ConnectedComponentsWithDegree.run(graph, 100)






[jira] [Updated] (SPARK-13788) CholeskyDecomposition has side effects!

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13788:
--
Fix Version/s: (was: 2.0.0)

(Don't set fix version; I think this is invalid anyway)

> CholeskyDecomposition has side effects!
> ---
>
> Key: SPARK-13788
> URL: https://issues.apache.org/jira/browse/SPARK-13788
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Ehsan Mohyedin Kermani
>
> The current CholeskyDecomposition implementation, which calls into LAPACK, has
> side effects: it mutates the array A representing the matrix in packed storage
> format (see http://www.netlib.org/lapack/lug/node123.html) and bx, the known
> right-hand-side values. The inputs should be protected by cloning the arrays first!
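
A minimal illustration of the suggested guard, with a hypothetical `solveInPlace`
standing in for the LAPACK-backed routine (not Spark's actual API):

{code}
// Illustrative only: clone the inputs so the in-place solver cannot mutate the
// caller's arrays A and bx.
def solveDefensively(a: Array[Double], bx: Array[Double])
                    (solveInPlace: (Array[Double], Array[Double]) => Array[Double]): Array[Double] =
  solveInPlace(a.clone(), bx.clone())
{code}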






[jira] [Updated] (SPARK-13726) Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13726:
--
   Flags:   (was: Important)
Target Version/s:   (was: 1.6.2)
Priority: Major  (was: Blocker)

[~michaelmnguyen] please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
opening a JIRA.

This is not a Blocker, and only committers should set Target version. There 
isn't much information here.

> Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable
> 
>
> Key: SPARK-13726
> URL: https://issues.apache.org/jira/browse/SPARK-13726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Nguyen
>
> In Spark 1.5.2, DataFrame.registerTempTable works, and
> hiveContext.table(registerTableName) and HiveThriftServer2 see those tables.
> In Spark 1.6.0, hiveContext.table(registerTableName) and HiveThriftServer2 do
> not see those tables, even though DataFrame.registerTempTable does not return
> an error.
> Since this feature used to work in Spark 1.5.2, there is existing code that
> breaks after upgrading to Spark 1.6.0, so this issue is a blocker and urgent.
> Therefore, please have it fixed ASAP.






[jira] [Updated] (SPARK-13770) Document the ML feature Interaction

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13770:
--
Component/s: ML
 Documentation

> Document the ML feature Interaction
> ---
>
> Key: SPARK-13770
> URL: https://issues.apache.org/jira/browse/SPARK-13770
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 1.6.0
>Reporter: Abbass Marouni
>Priority: Minor
>
> The ML feature Interaction 
> (http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/Interaction.html)
>  is not included in the documentation of ML features. It'd be nice to provide 
> a working example and some documentation.






[jira] [Updated] (SPARK-13773) UDF being applied to filtered data

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13773:
--
Component/s: SQL

> UDF being applied to filtered data 
> ---
>
> Key: SPARK-13773
> URL: https://issues.apache.org/jira/browse/SPARK-13773
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: James Hammerton
>
> Given the following code:
> {code:title=ReproduceSparkBug.scala|borderStyle=solid}
> import scala.reflect.runtime.universe
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.functions.udf
> import org.apache.spark.sql.types.DataTypes
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.SparkConf
> object ReproduceSparkBug {
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setMaster("local")
>   .setAppName("ReproduceSparkBug")
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> val schema = StructType(Array(
>   StructField("userId", DataTypes.StringType),
>   StructField("objectId", DataTypes.StringType),
>   StructField("eventName", DataTypes.StringType),
>   StructField("eventJson", DataTypes.StringType),
>   StructField("timestamp", DataTypes.LongType)))
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val df = sqlContext.read.format("com.databricks.spark.csv")
>   .option("delimiter", "\t")
>   .option("header", "false")
>   .schema(schema).load("src/test/resources/foo.txt")
> val filtered = df.filter((df("eventName")).endsWith("Created"))
> val extracted = filtered.select(filtered(EventFieldNames.ObjectId), 
>   extractorUDF(filtered("eventJson"), filtered("objectId"), 
> filtered("userId")) as "info")
> extracted.filter((extracted("info")).notEqual("NO REFS")).collect.map(r 
> => (r.getString(0), r.getString(1))).foreach(println(_))
> sc.stop()
>   }
>   def extractorUDF = udf(extractor(_: String, _: String, _: String))
>   def extractor(eventJson: String, objectId: String, userId: String): String 
> = {
> println(eventJson + ":" + objectId + ":" + userId)
> eventJson + ":" + objectId + ":" + userId
>   }
> }
> {code}
> If "src/test/resources" contains the following:
> {noformat}
> 113   0c38c6c327224e43a46f663b6424612fCreated {"field":"value"}   
> 1000
> 113   0c38c6c327224e43a46f663b6424612fLabelRemoved{"this":"should 
> not appear"}1000
> {noformat}
> Then the code outputs the following to std out:
> {noformat}
> {"field":"value"}:0c38c6c327224e43a46f663b6424612f:113
> {"field":"value"}:0c38c6c327224e43a46f663b6424612f:113
> {"this":"should not appear"}:0c38c6c327224e43a46f663b6424612f:113
> (0c38c6c327224e43a46f663b6424612f,{"field":"value"}:0c38c6c327224e43a46f663b6424612f:113)
> {noformat}
> If the first filter is cached (i.e. we write 'val filtered = 
> df.filter((df("eventName")).endsWith("Created")).cache'), then only the first 
> and last lines appear.
> What I think is happening is that the UDF is applied to the unfiltered data,
> but then the filtering is applied, so the correct data is output. Also, the
> UDF seems to get applied more than once to the data that isn't filtered out,
> for some reason.
> This caused problems in my original code, where some JSON parsing was done in
> the UDF but threw exceptions because it was applied to data that should have
> been filtered out. The original code was reading from Parquet, but I switched
> to tab-separated format here to make things easier to see/post.
> I suspect the bug hasn't been found hitherto since the correct results do get
> produced in the end, and the UDF would need to cause task failures when
> applied to the filtered-out data for people to notice.
> Note that I could not reproduce this unless the data was read in from a file.






[jira] [Updated] (SPARK-13790) Speed up ColumnVector's getDecimal

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13790:
--
Component/s: SQL

> Speed up ColumnVector's getDecimal
> --
>
> Key: SPARK-13790
> URL: https://issues.apache.org/jira/browse/SPARK-13790
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>Priority: Minor
>
> This should reuse a decimal object for the simple case.






[jira] [Updated] (SPARK-13762) support only column names in schema string at createDataFrame

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13762:
--
Component/s: SQL

> support only column names in schema string at createDataFrame
> -
>
> Key: SPARK-13762
> URL: https://issues.apache.org/jira/browse/SPARK-13762
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> for example, `createDataFrame(rdd, "a b c")`






[jira] [Updated] (SPARK-13767) py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13767:
--
Component/s: PySpark

> py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
> the Java server
> 
>
> Key: SPARK-13767
> URL: https://issues.apache.org/jira/browse/SPARK-13767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Poonam Agrawal
>
> I am trying to create a SparkContext object with the following commands in
> pyspark:
> from pyspark import SparkContext, SparkConf
> conf = 
> SparkConf().setAppName('App_name').setMaster("spark://local-or-remote-ip:7077").set('spark.cassandra.connection.host',
>  'cassandra-machine-ip').set('spark.storage.memoryFraction', 
> '0.2').set('spark.rdd.compress', 'true').set('spark.streaming.blockInterval', 
> 500).set('spark.serializer', 
> 'org.apache.spark.serializer.KryoSerializer').set('spark.scheduler.mode', 
> 'FAIR').set('spark.mesos.coarse', 'true')
> sc = SparkContext(conf=conf)
> but I am getting the following error:
> Traceback (most recent call last):
> File "", line 1, in 
> File "/usr/local/lib/spark-1.4.1/python/pyspark/conf.py", line 106, in 
> __init__
>   self._jconf = _jvm.SparkConf(loadDefaults)
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 766, in __getattr__
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 362, in send_command
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 318, in _get_connection
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 325, in _create_connection
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 432, in start
> py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
> the Java server
> I am getting the same error executing the command : conf = 
> SparkConf().setAppName("App_name").setMaster("spark://127.0.0.1:7077")






[jira] [Updated] (SPARK-13242) Moderately complex `when` expression causes code generation failure

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13242:
--
Assignee: Davies Liu

> Moderately complex `when` expression causes code generation failure
> ---
>
> Key: SPARK-13242
> URL: https://issues.apache.org/jira/browse/SPARK-13242
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Joe Halliwell
>Assignee: Davies Liu
> Fix For: 1.6.2, 2.0.0
>
>
> Moderately complex `when` expressions produce generated code that busts the 
> 64KB method limit. This causes code generation to fail.
> Here's a test case exhibiting the problem: 
> https://github.com/joehalliwell/spark/commit/4dbdf6e15d1116b8e1eb44822fd29ead9b7d817d
> I'm interested in working on a fix. I'm thinking it may be possible to split 
> the expressions along the lines of SPARK-8443, but any pointers would be 
> welcome!
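
For illustration, the general shape of such an expression (helper name assumed;
this is not the linked test case): fold enough `when`/`otherwise` branches
together and the generated Java for a single method can exceed the JVM's 64KB
bytecode limit on affected versions.

{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, when}

// Builds a deeply nested CASE WHEN with `branches` arms.
def bigWhen(c: Column, branches: Int): Column =
  (1 to branches).foldLeft(lit("default"): Column) { (acc, i) =>
    when(c === lit(i), lit(s"value_$i")).otherwise(acc)
  }

// e.g. df.select(bigWhen(df("x"), 500).as("label"))
{code}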






[jira] [Updated] (SPARK-13781) Use ExpressionSets in ConstraintPropagationSuite

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13781:
--
Assignee: Sameer Agarwal

> Use ExpressionSets in ConstraintPropagationSuite
> 
>
> Key: SPARK-13781
> URL: https://issues.apache.org/jira/browse/SPARK-13781
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>Priority: Minor
> Fix For: 2.0.0
>
>
> Small follow up on https://issues.apache.org/jira/browse/SPARK-13092 to use 
> ExpressionSets as part of the verification logic in 
> ConstraintPropagationSuite.






[jira] [Updated] (SPARK-13763) Remove Project when its projectList is Empty

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13763:
--
Assignee: Xiao Li

> Remove Project when its projectList is Empty
> 
>
> Key: SPARK-13763
> URL: https://issues.apache.org/jira/browse/SPARK-13763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> We use 'SELECT 1' as a dummy table when a SQL statement requires a table
> reference but the contents of the table are not important. For example,
> {code}
> SELECT pageid, adid FROM (SELECT 1) dummyTable LATERAL VIEW 
> explode(adid_list) adTable AS adid;
> {code}
> In this case, we will see a useless Project whose projectList is empty after
> executing the ColumnPruning rule.
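
A sketch of the kind of cleanup rule described, against the Catalyst API of that
era (rule name assumed; not the actual patch). In the real optimizer this is only
safe where the child's columns are not needed downstream.

{code}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative only: a Project that selects nothing contributes no columns,
// so it can be dropped and its child used directly.
object RemoveEmptyProject extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Project(projectList, child) if projectList.isEmpty => child
  }
}
{code}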






[jira] [Updated] (SPARK-13648) org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on IBM JDK

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13648:
--
Assignee: Tim Preece

> org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on 
> IBM JDK
> 
>
> Key: SPARK-13648
> URL: https://issues.apache.org/jira/browse/SPARK-13648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Fails on vendor specific JVMs ( e.g IBM JVM )
>Reporter: Tim Preece
>Assignee: Tim Preece
>Priority: Minor
> Fix For: 1.6.2, 2.0.0
>
>
> When running the standard Spark unit tests on the IBM Java SDK, the Hive
> VersionsSuite fails with the following error:
> java.lang.NoClassDefFoundError: org.apache.hadoop.hive.cli.CliSessionState
> when creating Hive client using classpath: ..






[jira] [Updated] (SPARK-13694) QueryPlan.expressions should always include all expressions

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13694:
--
Assignee: Wenchen Fan

> QueryPlan.expressions should always include all expressions
> ---
>
> Key: SPARK-13694
> URL: https://issues.apache.org/jira/browse/SPARK-13694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13651) Generator outputs are not resolved correctly resulting in runtime error

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13651:
--
Assignee: Dilip Biswal

> Generator outputs are not resolved correctly resulting in runtime error
> ---
>
> Key: SPARK-13651
> URL: https://issues.apache.org/jira/browse/SPARK-13651
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
> Fix For: 2.0.0
>
>
> Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src")
> sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100, 
> 'key2', 200)) t1 AS key, value")
> Running above repro results in :
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42)
>   at 
> org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98)
>   at 
> org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:876)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:876)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1794)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1794)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
>   at org.apache.spark.scheduler.Task.run(Task.scala:82)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13726) Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13726:
--
Target Version/s: 1.6.2

(oops actually yhuai set the target, that's fine. Restored.)

> Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable
> 
>
> Key: SPARK-13726
> URL: https://issues.apache.org/jira/browse/SPARK-13726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Nguyen
>
> In Spark 1.5.2, DataFrame.registerTempTable works: both 
> hiveContext.table(registerTableName) and HiveThriftServer2 see those tables.
> In Spark 1.6.0, hiveContext.table(registerTableName) and HiveThriftServer2 do 
> not see those tables, even though DataFrame.registerTempTable does not return 
> an error.
> Since this feature used to work in Spark 1.5.2, existing code breaks after 
> upgrading to Spark 1.6.0, so this issue is urgent and a blocker. Please have 
> it fixed as soon as possible.
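
For reference, a minimal sketch of the pattern being described, assuming an existing HiveContext named hiveContext (names are illustrative, not taken from the reporter's code):

{code}
// Register a DataFrame as a temp table, then look it up through the HiveContext.
// In 1.5.2 the lookup sees the temp table; per this report it does not in 1.6.0.
val df = hiveContext.range(10)
df.registerTempTable("my_temp_table")
hiveContext.table("my_temp_table").show()
{code}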



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13801) DataFrame.col should return unresolved attribute

2016-03-10 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13801:
---

 Summary: DataFrame.col should return unresolved attribute
 Key: SPARK-13801
 URL: https://issues.apache.org/jira/browse/SPARK-13801
 Project: Spark
  Issue Type: Improvement
Reporter: Wenchen Fan


Recently I have seen several JIRAs complaining about wrong results when using the 
DataFrame API. After checking the queries, I found the cause was an indirect 
self-join combined with an incorrectly built join condition. For example:

{code}
val df = ...
val df2 = df.filter(...)
df.join(df2, df("key") === df2("key"))
{code}

In this case, the confusing part is that df("key") and df2("key") refer to the 
same column, even though df and df2 are different DataFrames.

I think the biggest problem is that we give users the resolved attribute. However, 
a resolved attribute is not a real column, because a logical plan's output may 
change; for example, we generate new output for the right child of a self-join.

My proposal is that `DataFrame.apply` should always return an unresolved attribute. 
We can still perform the resolution to make sure the given column name is 
resolvable, but instead of returning the resolved attribute we just take the name 
out and wrap it in an UnresolvedAttribute.

Now, if users run the example query above, they will get an analysis exception, 
and they will understand that they need to alias df and df2 before the join.
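
For illustration, a minimal sketch of the aliasing workaround mentioned above, reusing the hypothetical df and df2 from the example:

{code}
import org.apache.spark.sql.functions.col

// Give each side of the self-join an explicit alias so the analyzer
// can tell the two occurrences of the same underlying columns apart.
val left  = df.as("l")
val right = df2.as("r")
left.join(right, col("l.key") === col("r.key"))
{code}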



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13801) DataFrame.col should return unresolved attribute

2016-03-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-13801:

Component/s: SQL

> DataFrame.col should return unresolved attribute
> 
>
> Key: SPARK-13801
> URL: https://issues.apache.org/jira/browse/SPARK-13801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Recently I have seen several JIRAs complaining about wrong results when using 
> the DataFrame API. After checking the queries, I found the cause was an 
> indirect self-join combined with an incorrectly built join condition. For 
> example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, df("key") === df2("key"))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to the 
> same column, even though df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute. 
> However, a resolved attribute is not a real column, because a logical plan's 
> output may change; for example, we generate new output for the right child of 
> a self-join.
> My proposal is that `DataFrame.apply` should always return an unresolved 
> attribute. We can still perform the resolution to make sure the given column 
> name is resolvable, but instead of returning the resolved attribute we just 
> take the name out and wrap it in an UnresolvedAttribute.
> Now, if users run the example query above, they will get an analysis 
> exception, and they will understand that they need to alias df and df2 before 
> the join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13801) DataFrame.col should return unresolved attribute

2016-03-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-13801:

Description: 
Recently I have seen several JIRAs complaining about wrong results when using the 
DataFrame API. After checking the queries, I found the cause was an indirect 
self-join combined with an incorrectly built join condition. For example:

{code}
val df = ...
val df2 = df.filter(...)
df.join(df2, (df("key") + 1) === df2("key"))
{code}

In this case, the confusing part is that df("key") and df2("key") refer to the 
same column, even though df and df2 are different DataFrames.

I think the biggest problem is that we give users the resolved attribute. However, 
a resolved attribute is not a real column, because a logical plan's output may 
change; for example, we generate new output for the right child of a self-join.

My proposal is that `DataFrame.apply` should always return an unresolved attribute. 
We can still perform the resolution to make sure the given column name is 
resolvable, but instead of returning the resolved attribute we just take the name 
out and wrap it in an UnresolvedAttribute.

Now, if users run the example query above, they will get an analysis exception, 
and they will understand that they need to alias df and df2 before the join.

  was:
Recently I have seen several JIRAs complaining about wrong results when using the 
DataFrame API. After checking the queries, I found the cause was an indirect 
self-join combined with an incorrectly built join condition. For example:

{code}
val df = ...
val df2 = df.filter(...)
df.join(df2, (df("key") + 1) === (df2("key") + 1))
{code}

In this case, the confusing part is that df("key") and df2("key") refer to the 
same column, even though df and df2 are different DataFrames.

I think the biggest problem is that we give users the resolved attribute. However, 
a resolved attribute is not a real column, because a logical plan's output may 
change; for example, we generate new output for the right child of a self-join.

My proposal is that `DataFrame.apply` should always return an unresolved attribute. 
We can still perform the resolution to make sure the given column name is 
resolvable, but instead of returning the resolved attribute we just take the name 
out and wrap it in an UnresolvedAttribute.

Now, if users run the example query above, they will get an analysis exception, 
and they will understand that they need to alias df and df2 before the join.


> DataFrame.col should return unresolved attribute
> 
>
> Key: SPARK-13801
> URL: https://issues.apache.org/jira/browse/SPARK-13801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Recently I have seen several JIRAs complaining about wrong results when using 
> the DataFrame API. After checking the queries, I found the cause was an 
> indirect self-join combined with an incorrectly built join condition. For 
> example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, (df("key") + 1) === df2("key"))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to the 
> same column, even though df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute. 
> However, a resolved attribute is not a real column, because a logical plan's 
> output may change; for example, we generate new output for the right child of 
> a self-join.
> My proposal is that `DataFrame.apply` should always return an unresolved 
> attribute. We can still perform the resolution to make sure the given column 
> name is resolvable, but instead of returning the resolved attribute we just 
> take the name out and wrap it in an UnresolvedAttribute.
> Now, if users run the example query above, they will get an analysis 
> exception, and they will understand that they need to alias df and df2 before 
> the join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13801) DataFrame.col should return unresolved attribute

2016-03-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-13801:

Description: 
Recently I have seen several JIRAs complaining about wrong results when using the 
DataFrame API. After checking the queries, I found the cause was an indirect 
self-join combined with an incorrectly built join condition. For example:

{code}
val df = ...
val df2 = df.filter(...)
df.join(df2, (df("key") + 1) === (df2("key") + 1))
{code}

In this case, the confusing part is that df("key") and df2("key") refer to the 
same column, even though df and df2 are different DataFrames.

I think the biggest problem is that we give users the resolved attribute. However, 
a resolved attribute is not a real column, because a logical plan's output may 
change; for example, we generate new output for the right child of a self-join.

My proposal is that `DataFrame.apply` should always return an unresolved attribute. 
We can still perform the resolution to make sure the given column name is 
resolvable, but instead of returning the resolved attribute we just take the name 
out and wrap it in an UnresolvedAttribute.

Now, if users run the example query above, they will get an analysis exception, 
and they will understand that they need to alias df and df2 before the join.

  was:
Recently I have seen several JIRAs complaining about wrong results when using the 
DataFrame API. After checking the queries, I found the cause was an indirect 
self-join combined with an incorrectly built join condition. For example:

{code}
val df = ...
val df2 = df.filter(...)
df.join(df2, (df("key") + 1) === df2("key"))
{code}

In this case, the confusing part is that df("key") and df2("key") refer to the 
same column, even though df and df2 are different DataFrames.

I think the biggest problem is that we give users the resolved attribute. However, 
a resolved attribute is not a real column, because a logical plan's output may 
change; for example, we generate new output for the right child of a self-join.

My proposal is that `DataFrame.apply` should always return an unresolved attribute. 
We can still perform the resolution to make sure the given column name is 
resolvable, but instead of returning the resolved attribute we just take the name 
out and wrap it in an UnresolvedAttribute.

Now, if users run the example query above, they will get an analysis exception, 
and they will understand that they need to alias df and df2 before the join.


> DataFrame.col should return unresolved attribute
> 
>
> Key: SPARK-13801
> URL: https://issues.apache.org/jira/browse/SPARK-13801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Recently I have seen several JIRAs complaining about wrong results when using 
> the DataFrame API. After checking the queries, I found the cause was an 
> indirect self-join combined with an incorrectly built join condition. For 
> example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, (df("key") + 1) === (df2("key") + 1))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to the 
> same column, even though df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute. 
> However, a resolved attribute is not a real column, because a logical plan's 
> output may change; for example, we generate new output for the right child of 
> a self-join.
> My proposal is that `DataFrame.apply` should always return an unresolved 
> attribute. We can still perform the resolution to make sure the given column 
> name is resolvable, but instead of returning the resolved attribute we just 
> take the name out and wrap it in an UnresolvedAttribute.
> Now, if users run the example query above, they will get an analysis 
> exception, and they will understand that they need to alias df and df2 before 
> the join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13801) DataFrame.col should return unresolved attribute

2016-03-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-13801:

Description: 
Recently I have seen several JIRAs complaining about wrong results when using the 
DataFrame API. After checking the queries, I found the cause was an indirect 
self-join combined with an incorrectly built join condition. For example:

{code}
val df = ...
val df2 = df.filter(...)
df.join(df2, (df("key") + 1) === df2("key"))
{code}

In this case, the confusing part is that df("key") and df2("key") refer to the 
same column, even though df and df2 are different DataFrames.

I think the biggest problem is that we give users the resolved attribute. However, 
a resolved attribute is not a real column, because a logical plan's output may 
change; for example, we generate new output for the right child of a self-join.

My proposal is that `DataFrame.apply` should always return an unresolved attribute. 
We can still perform the resolution to make sure the given column name is 
resolvable, but instead of returning the resolved attribute we just take the name 
out and wrap it in an UnresolvedAttribute.

Now, if users run the example query above, they will get an analysis exception, 
and they will understand that they need to alias df and df2 before the join.

  was:
Recently I have seen several JIRAs complaining about wrong results when using the 
DataFrame API. After checking the queries, I found the cause was an indirect 
self-join combined with an incorrectly built join condition. For example:

{code}
val df = ...
val df2 = df.filter(...)
df.join(df2, df("key") === df2("key"))
{code}

In this case, the confusing part is that df("key") and df2("key") refer to the 
same column, even though df and df2 are different DataFrames.

I think the biggest problem is that we give users the resolved attribute. However, 
a resolved attribute is not a real column, because a logical plan's output may 
change; for example, we generate new output for the right child of a self-join.

My proposal is that `DataFrame.apply` should always return an unresolved attribute. 
We can still perform the resolution to make sure the given column name is 
resolvable, but instead of returning the resolved attribute we just take the name 
out and wrap it in an UnresolvedAttribute.

Now, if users run the example query above, they will get an analysis exception, 
and they will understand that they need to alias df and df2 before the join.


> DataFrame.col should return unresolved attribute
> 
>
> Key: SPARK-13801
> URL: https://issues.apache.org/jira/browse/SPARK-13801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Recently I have seen several JIRAs complaining about wrong results when using 
> the DataFrame API. After checking the queries, I found the cause was an 
> indirect self-join combined with an incorrectly built join condition. For 
> example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, (df("key") + 1) === df2("key"))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to the 
> same column, even though df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute. 
> However, a resolved attribute is not a real column, because a logical plan's 
> output may change; for example, we generate new output for the right child of 
> a self-join.
> My proposal is that `DataFrame.apply` should always return an unresolved 
> attribute. We can still perform the resolution to make sure the given column 
> name is resolvable, but instead of returning the resolved attribute we just 
> take the name out and wrap it in an UnresolvedAttribute.
> Now, if users run the example query above, they will get an analysis 
> exception, and they will understand that they need to alias df and df2 before 
> the join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13801) DataFrame.col should return unresolved attribute

2016-03-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-13801:

Description: 
Recently I have seen several JIRAs complaining about wrong results when using the 
DataFrame API. After checking the queries, I found the cause was an indirect 
self-join combined with an incorrectly built join condition. For example:

{code}
val df = ...
val df2 = df.filter(...)
df.join(df2, (df("key") + 1) === df2("key"))
{code}

In this case, the confusing part is that df("key") and df2("key") refer to the 
same column, even though df and df2 are different DataFrames.

I think the biggest problem is that we give users the resolved attribute. However, 
a resolved attribute is not a real column, because a logical plan's output may 
change; for example, we generate new output for the right child of a self-join.

My proposal is that `DataFrame.col` should always return an unresolved attribute. 
We can still perform the resolution to make sure the given column name is 
resolvable, but instead of returning the resolved attribute we just take the name 
out and wrap it in an UnresolvedAttribute.

Now, if users run the example query above, they will get an analysis exception, 
and they will understand that they need to alias df and df2 before the join.

  was:
Recently I have seen several JIRAs complaining about wrong results when using the 
DataFrame API. After checking the queries, I found the cause was an indirect 
self-join combined with an incorrectly built join condition. For example:

{code}
val df = ...
val df2 = df.filter(...)
df.join(df2, (df("key") + 1) === df2("key"))
{code}

In this case, the confusing part is that df("key") and df2("key") refer to the 
same column, even though df and df2 are different DataFrames.

I think the biggest problem is that we give users the resolved attribute. However, 
a resolved attribute is not a real column, because a logical plan's output may 
change; for example, we generate new output for the right child of a self-join.

My proposal is that `DataFrame.apply` should always return an unresolved attribute. 
We can still perform the resolution to make sure the given column name is 
resolvable, but instead of returning the resolved attribute we just take the name 
out and wrap it in an UnresolvedAttribute.

Now, if users run the example query above, they will get an analysis exception, 
and they will understand that they need to alias df and df2 before the join.


> DataFrame.col should return unresolved attribute
> 
>
> Key: SPARK-13801
> URL: https://issues.apache.org/jira/browse/SPARK-13801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Recently I have seen several JIRAs complaining about wrong results when using 
> the DataFrame API. After checking the queries, I found the cause was an 
> indirect self-join combined with an incorrectly built join condition. For 
> example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, (df("key") + 1) === df2("key"))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to the 
> same column, even though df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute. 
> However, a resolved attribute is not a real column, because a logical plan's 
> output may change; for example, we generate new output for the right child of 
> a self-join.
> My proposal is that `DataFrame.col` should always return an unresolved 
> attribute. We can still perform the resolution to make sure the given column 
> name is resolvable, but instead of returning the resolved attribute we just 
> take the name out and wrap it in an UnresolvedAttribute.
> Now, if users run the example query above, they will get an analysis 
> exception, and they will understand that they need to alias df and df2 before 
> the join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13801) DataFrame.col should return unresolved attribute

2016-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13801:


Assignee: (was: Apache Spark)

> DataFrame.col should return unresolved attribute
> 
>
> Key: SPARK-13801
> URL: https://issues.apache.org/jira/browse/SPARK-13801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Recently I have seen several JIRAs complaining about wrong results when using 
> the DataFrame API. After checking the queries, I found the cause was an 
> indirect self-join combined with an incorrectly built join condition. For 
> example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, (df("key") + 1) === df2("key"))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to the 
> same column, even though df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute. 
> However, a resolved attribute is not a real column, because a logical plan's 
> output may change; for example, we generate new output for the right child of 
> a self-join.
> My proposal is that `DataFrame.col` should always return an unresolved 
> attribute. We can still perform the resolution to make sure the given column 
> name is resolvable, but instead of returning the resolved attribute we just 
> take the name out and wrap it in an UnresolvedAttribute.
> Now, if users run the example query above, they will get an analysis 
> exception, and they will understand that they need to alias df and df2 before 
> the join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13801) DataFrame.col should return unresolved attribute

2016-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189109#comment-15189109
 ] 

Apache Spark commented on SPARK-13801:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11632

> DataFrame.col should return unresolved attribute
> 
>
> Key: SPARK-13801
> URL: https://issues.apache.org/jira/browse/SPARK-13801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Recently I have seen several JIRAs complaining about wrong results when using 
> the DataFrame API. After checking the queries, I found the cause was an 
> indirect self-join combined with an incorrectly built join condition. For 
> example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, (df("key") + 1) === df2("key"))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to the 
> same column, even though df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute. 
> However, a resolved attribute is not a real column, because a logical plan's 
> output may change; for example, we generate new output for the right child of 
> a self-join.
> My proposal is that `DataFrame.col` should always return an unresolved 
> attribute. We can still perform the resolution to make sure the given column 
> name is resolvable, but instead of returning the resolved attribute we just 
> take the name out and wrap it in an UnresolvedAttribute.
> Now, if users run the example query above, they will get an analysis 
> exception, and they will understand that they need to alias df and df2 before 
> the join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13801) DataFrame.col should return unresolved attribute

2016-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13801:


Assignee: Apache Spark

> DataFrame.col should return unresolved attribute
> 
>
> Key: SPARK-13801
> URL: https://issues.apache.org/jira/browse/SPARK-13801
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> Recently I have seen several JIRAs complaining about wrong results when using 
> the DataFrame API. After checking the queries, I found the cause was an 
> indirect self-join combined with an incorrectly built join condition. For 
> example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, (df("key") + 1) === df2("key"))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to the 
> same column, even though df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute. 
> However, a resolved attribute is not a real column, because a logical plan's 
> output may change; for example, we generate new output for the right child of 
> a self-join.
> My proposal is that `DataFrame.col` should always return an unresolved 
> attribute. We can still perform the resolution to make sure the given column 
> name is resolvable, but instead of returning the resolved attribute we just 
> take the name out and wrap it in an UnresolvedAttribute.
> Now, if users run the example query above, they will get an analysis 
> exception, and they will understand that they need to alias df and df2 before 
> the join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11108) OneHotEncoder should support other numeric input types

2016-03-10 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-11108.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 9777
[https://github.com/apache/spark/pull/9777]

> OneHotEncoder should support other numeric input types
> --
>
> Key: SPARK-11108
> URL: https://issues.apache.org/jira/browse/SPARK-11108
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.0.0
>
>
> See parent JIRA for more info.
> Also see [SPARK-10513] for the motivation behind this issue.
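
As a rough, hedged illustration of what this change enables (the column names and the DataFrame df are hypothetical), a numeric index column other than DoubleType should now be accepted directly:

{code}
import org.apache.spark.ml.feature.OneHotEncoder

// Encode a numeric category index (e.g. an integer column) into a sparse 0/1 vector.
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(df)
{code}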



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13796) Lock release errors occur frequently in executor logs

2016-03-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189197#comment-15189197
 ] 

Sean Owen commented on SPARK-13796:
---

[~joshrosen] does that seem like a problem?

> Lock release errors occur frequently in executor logs
> -
>
> Key: SPARK-13796
> URL: https://issues.apache.org/jira/browse/SPARK-13796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Nishkam Ravi
>Priority: Minor
>
> Executor logs contain a lot of these error messages (irrespective of the 
> workload):
> 16/03/08 17:53:07 ERROR executor.Executor: 1 block locks were not released by 
> TID = 1119



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13512) Add example and doc for ml.feature.MaxAbsScaler

2016-03-10 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-13512:
---
Shepherd: Nick Pentreath
Assignee: yuhao yang

> Add example and doc for ml.feature.MaxAbsScaler
> ---
>
> Key: SPARK-13512
> URL: https://issues.apache.org/jira/browse/SPARK-13512
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
>
> Add example and doc for ml.feature.MaxAbsScaler.
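
Until the example lands, a hedged sketch of basic usage, assuming a DataFrame df with a vector column named "features":

{code}
import org.apache.spark.ml.feature.MaxAbsScaler

// Rescale each feature to [-1, 1] by dividing by its maximum absolute value;
// unlike standardization, this does not shift or center the data.
val scaler = new MaxAbsScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
val scaled = scaler.fit(df).transform(df)
{code}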



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13340) [ML] PolynomialExpansion and Normalizer should validate input type

2016-03-10 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-13340:
---
Shepherd: Nick Pentreath

> [ML] PolynomialExpansion and Normalizer should validate input type
> --
>
> Key: SPARK-13340
> URL: https://issues.apache.org/jira/browse/SPARK-13340
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Grzegorz Chilkiewicz
>Assignee: Grzegorz Chilkiewicz
>Priority: Trivial
>
> PolynomialExpansion and Normalizer should override 
> UnaryTransformer::validateInputType
> Currently, when trying to operate on a String column,
> java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.mllib.linalg.Vector
> is thrown. Throwing
> java.lang.IllegalArgumentException: requirement failed: Input type must be 
> VectorUDT but got StringType
> instead would be clearer and more informative.
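
A hedged sketch of the kind of override being suggested (it would live inside the UnaryTransformer subclasses; this is not the actual patch):

{code}
import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.sql.types.DataType

// Fail fast with a readable message instead of a later ClassCastException.
override protected def validateInputType(inputType: DataType): Unit = {
  require(inputType.isInstanceOf[VectorUDT],
    s"Input type must be VectorUDT but got $inputType.")
}
{code}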



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-03-10 Thread Zhongshuai Pei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189227#comment-15189227
 ] 

Zhongshuai Pei commented on SPARK-4105:
---

snappy or lz4?
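
(For anyone reproducing this: the codec used for shuffle blocks is controlled by spark.io.compression.codec; a hedged example of pinning it explicitly, e.g. to compare snappy against lz4:)

{code}
import org.apache.spark.SparkConf

// Select the codec used for shuffle/RDD block compression: "snappy", "lz4" or "lzf".
val conf = new SparkConf().set("spark.io.compression.codec", "lz4")
{code}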

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Attachments: JavaObjectToSerialize.java, 
> SparkFailedToUncompressGenerator.scala
>
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here's another occurrence of a similar erro

[jira] [Commented] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1

2016-03-10 Thread Y Y (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189246#comment-15189246
 ] 

Y Y commented on SPARK-13663:
-

My JIRA handle is yy2016

> Upgrade Snappy Java to 1.1.2.1
> --
>
> Key: SPARK-13663
> URL: https://issues.apache.org/jira/browse/SPARK-13663
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Assignee: Sean Owen
>Priority: Minor
>
> The JVM memory leak reported in 
> https://github.com/xerial/snappy-java/issues/131 has been resolved.
> 1.1.2.1 was released on Jan 22nd.
> We should upgrade to this release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13073) creating R like summary for logistic Regression in Spark - Scala

2016-03-10 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189283#comment-15189283
 ] 

Mohamed Baddar commented on SPARK-13073:


[~josephkb] After looking at the source code of 
org.apache.spark.ml.classification.LogisticRegressionSummary and 
org.apache.spark.ml.classification.LogisticRegressionTrainingSummary, and after 
running a sample GLM in R, which produced the following output:

{code}
Call:
glm(formula = mpg ~ wt + hp + gear, family = gaussian(), data = mtcars)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.3712  -1.9017  -0.3444   0.9883   6.0655  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 32.013657   4.632264   6.911 1.64e-07 ***
wt          -3.197811   0.846546  -3.777 0.000761 ***
hp          -0.036786   0.009891  -3.719 0.000888 ***
gear         1.019981   0.851408   1.198 0.240963    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 6.626347)

    Null deviance: 1126.05  on 31  degrees of freedom
Residual deviance:  185.54  on 28  degrees of freedom
AIC: 157.05

Number of Fisher Scoring iterations: 2
{code}

I have the following comments:
1. I think we should add the following members to LogisticRegressionSummary: 
coefficients and residuals.

2. The toString method should be overridden in the following classes: 
org.apache.spark.ml.classification.BinaryLogisticRegressionSummary and 
org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary.

Any other suggestions? Please correct me if I have missed something.
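
For context, a hedged sketch of what the current Scala API already exposes, assuming a training DataFrame named training with "label" and "features" columns; the R-style fields discussed above would extend this:

{code}
import org.apache.spark.ml.classification.LogisticRegression

// Today the model exposes point estimates and a training summary, but no
// standard errors, test statistics or p-values like R's summary() output.
val model = new LogisticRegression().fit(training)
println(model.coefficients)
println(model.intercept)
println(model.summary.objectiveHistory.mkString(", "))
{code}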

> creating R like summary for logistic Regression in Spark - Scala
> 
>
> Key: SPARK-13073
> URL: https://issues.apache.org/jira/browse/SPARK-13073
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Samsudhin
>Priority: Minor
>
> Currently Spark ML provides coefficients for logistic regression. To evaluate 
> the trained model, tests such as the Wald test and chi-square test should be 
> run, and their results summarized and displayed like R's GLM summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13802) Fields order in Row is not consistent with Schema.toInternal method

2016-03-10 Thread Szymon Matejczyk (JIRA)
Szymon Matejczyk created SPARK-13802:


 Summary: Fields order in Row is not consistent with 
Schema.toInternal method
 Key: SPARK-13802
 URL: https://issues.apache.org/jira/browse/SPARK-13802
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.0
Reporter: Szymon Matejczyk


When using Row constructor from kwargs, fields in the tuple underneath are 
sorted by name. When Schema is reading the row, it is not using the fields in 
this order.

{code:python}
from pyspark.sql import Row
from pyspark.sql.types import *

schema = StructType([
StructField("id", StringType()),
StructField("first_name", StringType())])
row = Row(id="39", first_name="Szymon")
schema.toInternal(row)
Out[5]: ('Szymon', '39')
{code}

{code:python}
df = sqlContext.createDataFrame([row], schema)
df.show(1)

+--+--+
|id  |first_name|
+--+--+
|Szymon|39|
+--+--+
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2016-03-10 Thread Szymon Matejczyk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szymon Matejczyk updated SPARK-13802:
-
Description: 
When using Row constructor from kwargs, fields in the tuple underneath are 
sorted by name. When Schema is reading the row, it is not using the fields in 
this order.

{code}
from pyspark.sql import Row
from pyspark.sql.types import *

schema = StructType([
StructField("id", StringType()),
StructField("first_name", StringType())])
row = Row(id="39", first_name="Szymon")
schema.toInternal(row)
Out[5]: ('Szymon', '39')
{code}

{code}
df = sqlContext.createDataFrame([row], schema)
df.show(1)

+--+--+
|id  |first_name|
+--+--+
|Szymon | 39|
+--+--+
{code}

  was:
When using Row constructor from kwargs, fields in the tuple underneath are 
sorted by name. When Schema is reading the row, it is not using the fields in 
this order.

{code}
from pyspark.sql import Row
from pyspark.sql.types import *

schema = StructType([
StructField("id", StringType()),
StructField("first_name", StringType())])
row = Row(id="39", first_name="Szymon")
schema.toInternal(row)
Out[5]: ('Szymon', '39')
{code}

{code}
df = sqlContext.createDataFrame([row], schema)
df.show(1)

+--+--+
|id  |first_name|
+--+--+
|Szymon|39|
+--+--+
{code}


> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>
> When using Row constructor from kwargs, fields in the tuple underneath are 
> sorted by name. When Schema is reading the row, it is not using the fields in 
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +--+--+
> |id  |first_name|
> +--+--+
> |Szymon | 39|
> +--+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2016-03-10 Thread Szymon Matejczyk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szymon Matejczyk updated SPARK-13802:
-
Summary: Fields order in Row(**kwargs) is not consistent with 
Schema.toInternal method  (was: Fields order in Row is not consistent with 
Schema.toInternal method)

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>
> When using Row constructor from kwargs, fields in the tuple underneath are 
> sorted by name. When Schema is reading the row, it is not using the fields in 
> this order.
> {code:python}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code:python}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +--+--+
> |id  |first_name|
> +--+--+
> |Szymon|39|
> +--+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2016-03-10 Thread Szymon Matejczyk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szymon Matejczyk updated SPARK-13802:
-
Description: 
When using Row constructor from kwargs, fields in the tuple underneath are 
sorted by name. When Schema is reading the row, it is not using the fields in 
this order.

{code}
from pyspark.sql import Row
from pyspark.sql.types import *

schema = StructType([
StructField("id", StringType()),
StructField("first_name", StringType())])
row = Row(id="39", first_name="Szymon")
schema.toInternal(row)
Out[5]: ('Szymon', '39')
{code}

{code}
df = sqlContext.createDataFrame([row], schema)
df.show(1)

+--+--+
|id  |first_name|
+--+--+
|Szymon|39|
+--+--+
{code}

  was:
When using Row constructor from kwargs, fields in the tuple underneath are 
sorted by name. When Schema is reading the row, it is not using the fields in 
this order.

{code:python}
from pyspark.sql import Row
from pyspark.sql.types import *

schema = StructType([
StructField("id", StringType()),
StructField("first_name", StringType())])
row = Row(id="39", first_name="Szymon")
schema.toInternal(row)
Out[5]: ('Szymon', '39')
{code}

{code:python}
df = sqlContext.createDataFrame([row], schema)
df.show(1)

+--+--+
|id  |first_name|
+--+--+
|Szymon|39|
+--+--+
{code}


> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>
> When using Row constructor from kwargs, fields in the tuple underneath are 
> sorted by name. When Schema is reading the row, it is not using the fields in 
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +--+--+
> |id  |first_name|
> +--+--+
> |Szymon|39|
> +--+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2016-03-10 Thread Szymon Matejczyk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szymon Matejczyk updated SPARK-13802:
-
Description: 
When using Row constructor from kwargs, fields in the tuple underneath are 
sorted by name. When Schema is reading the row, it is not using the fields in 
this order.

{code}
from pyspark.sql import Row
from pyspark.sql.types import *

schema = StructType([
StructField("id", StringType()),
StructField("first_name", StringType())])
row = Row(id="39", first_name="Szymon")
schema.toInternal(row)
Out[5]: ('Szymon', '39')
{code}

{code}
df = sqlContext.createDataFrame([row], schema)
df.show(1)

+--+--+
|id|first_name|
+--+--+
|Szymon|39|
+--+--+
{code}

  was:
When using Row constructor from kwargs, fields in the tuple underneath are 
sorted by name. When Schema is reading the row, it is not using the fields in 
this order.

{code}
from pyspark.sql import Row
from pyspark.sql.types import *

schema = StructType([
StructField("id", StringType()),
StructField("first_name", StringType())])
row = Row(id="39", first_name="Szymon")
schema.toInternal(row)
Out[5]: ('Szymon', '39')
{code}

{code}
df = sqlContext.createDataFrame([row], schema)
df.show(1)

+--+--+
|id  |first_name|
+--+--+
|Szymon | 39|
+--+--+
{code}


> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>
> When using Row constructor from kwargs, fields in the tuple underneath are 
> sorted by name. When Schema is reading the row, it is not using the fields in 
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +--+--+
> |id|first_name|
> +--+--+
> |Szymon|39|
> +--+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13618) Make Streaming web UI display rate-limit lines in the statistics graph

2016-03-10 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-13618:
--
Description: 
This JIRA proposes to make the Streaming web UI display rate-limit lines in the 
statistics graph.

Please see the enclosed design doc for details.

[Screenshots]

without back pressure:
!https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!

with back pressure:
!https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!

  was:
This JIRA proposes to make the Streaming web UI display rate-limit lines in the 
statistics graph.

[Screenshots]

without back pressure:
!https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!

with back pressure:
!https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!


> Make Streaming web UI display rate-limit lines in the statistics graph
> --
>
> Key: SPARK-13618
> URL: https://issues.apache.org/jira/browse/SPARK-13618
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming, Web UI
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Liwei Lin
>
> This JIRA proposes to make the Streaming web UI display rate-limit lines in the 
> statistics graph.
> Please see the enclosed design doc for details.
> [Screenshots]
> without back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!
> with back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!
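
(For readers trying this out: the rate-limit lines correspond to the streaming rate-limit settings; a hedged example of enabling them, with illustrative values only:)

{code}
import org.apache.spark.SparkConf

// Back pressure provides a dynamic rate limit; receiver.maxRate caps it statically.
val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.receiver.maxRate", "10000")
{code}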



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13618) Make Streaming web UI display rate-limit lines in the statistics graph

2016-03-10 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-13618:
--
Attachment: Spark-13618_Design_Doc_v1.pdf

> Make Streaming web UI display rate-limit lines in the statistics graph
> --
>
> Key: SPARK-13618
> URL: https://issues.apache.org/jira/browse/SPARK-13618
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming, Web UI
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Liwei Lin
> Attachments: Spark-13618_Design_Doc_v1.pdf
>
>
> This JIRA proposes to make the Streaming web UI display rate-limit lines in the 
> statistics graph.
> Please see the enclosed design doc for details.
> [Screenshots]
> without back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!
> with back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-10 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189325#comment-15189325
 ] 

Cody Koeninger commented on SPARK-12177:


Clearly K and V are serializable somehow, because they were in a byte array
in a Kafka message. So it's not really far fetched to expect a thin wrapper
around K and V to also be serializable, and it would be convenient from a
Spark perspective.

That being said, I agree with you that the Kafka project isn't likely to go
for it. It's also probably better for Spark users if they don't blindly
cache or collect consumer records, because it's a lot of wasted space (e.g.
the topic name). But people are going to be surprised the first time they try
it and it doesn't work, so I wanted to mention it.
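
As an aside, a minimal Scala sketch of that advice, assuming a hypothetical
DStream[ConsumerRecord[String, String]] called {{stream}} from the new consumer
integration (the name and wiring are placeholders, not this ticket's API):

{code}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.DStream

// Copy out only the key and value; the ConsumerRecord itself (topic name,
// partition, offset, ...) is never cached or collected.
def keepOnlyKeyValue(stream: DStream[ConsumerRecord[String, String]])
    : DStream[(String, String)] = {
  val pairs = stream.map(record => (record.key, record.value))
  pairs.cache()
  pairs
}
{code}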



> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So I added the new consumer API in separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I did not remove the 
> old classes, for backward compatibility: users will not need to change their old 
> Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13618) Make Streaming web UI display rate-limit lines in the statistics graph

2016-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189327#comment-15189327
 ] 

Apache Spark commented on SPARK-13618:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11633

> Make Streaming web UI display rate-limit lines in the statistics graph
> --
>
> Key: SPARK-13618
> URL: https://issues.apache.org/jira/browse/SPARK-13618
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming, Web UI
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Liwei Lin
> Attachments: Spark-13618_Design_Doc_v1.pdf
>
>
> This JIRA proposes to make the Streaming web UI display rate-limit lines in the 
> statistics graph.
> Please see the enclosed design doc for details.
> [Screenshots]
> without back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!
> with back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13618) Make Streaming web UI display rate-limit lines in the statistics graph

2016-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189328#comment-15189328
 ] 

Apache Spark commented on SPARK-13618:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11634

> Make Streaming web UI display rate-limit lines in the statistics graph
> --
>
> Key: SPARK-13618
> URL: https://issues.apache.org/jira/browse/SPARK-13618
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming, Web UI
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Liwei Lin
> Attachments: Spark-13618_Design_Doc_v1.pdf
>
>
> This JIRA proposes to make the Streaming web UI display rate-limit lines in the 
> statistics graph.
> Please see the enclosed design doc for details.
> [Screenshots]
> without back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664195/d2264c48-e6e0-11e5-85e6-f13187d4cbde.png!
> with back pressure:
> !https://cloud.githubusercontent.com/assets/15843379/13664196/d2549c7e-e6e0-11e5-9f62-d7f1458f1c27.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13803) Standalone master does not balance cluster-mode drivers across workers

2016-03-10 Thread Brian Wongchaowart (JIRA)
Brian Wongchaowart created SPARK-13803:
--

 Summary: Standalone master does not balance cluster-mode drivers 
across workers
 Key: SPARK-13803
 URL: https://issues.apache.org/jira/browse/SPARK-13803
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1
Reporter: Brian Wongchaowart


The Spark standalone cluster master does not balance drivers running in cluster 
mode across all the available workers. Instead, it assigns each submitted 
driver to the first available worker. The schedule() method attempts to 
randomly shuffle the HashSet of workers before launching drivers, but that 
operation has no effect because the Scala HashSet is an unordered data 
structure. This behavior is a regression introduced by SPARK-1706: previously, 
the workers were copied into an ordered list before the random shuffle was 
performed.
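
A minimal sketch of the no-op (illustrative only, not the actual Master code):

{code}
import scala.collection.mutable.HashSet
import scala.util.Random

val workers = HashSet("worker-1", "worker-2", "worker-3", "worker-4")

// Shuffling a HashSet just builds another HashSet, so iteration order stays
// hash-based -- the "shuffle" is effectively a no-op.
val stillHashOrdered = Random.shuffle(workers)

// Converting to an ordered Seq first (as the pre-SPARK-1706 code did) gives a
// genuinely random order.
val trulyShuffled = Random.shuffle(workers.toSeq)
{code}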

I am able to reproduce this bug in all releases of Spark from 1.4.0 to 1.6.1 
using the following steps:

# Start a standalone master and two workers
# Repeatedly submit applications to the master in cluster mode (--deploy-mode 
cluster)

Observe that all the drivers are scheduled on only one of the two workers as 
long as resources are available on that worker. The expected behavior is that 
the master randomly assigns drivers to both workers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13796) Lock release errors occur frequently in executor logs

2016-03-10 Thread Benjamin Herta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189389#comment-15189389
 ] 

Benjamin Herta commented on SPARK-13796:


We have been seeing this as well, running with trunk as of this morning.  It 
doesn't appear to affect our results, so it may be just log noise, or maybe our 
testing isn't thorough enough to see the problem.  One easy way to reproduce it 
is to run the org.apache.spark.examples.mllib.LBFGSExample example program.

> Lock release errors occur frequently in executor logs
> -
>
> Key: SPARK-13796
> URL: https://issues.apache.org/jira/browse/SPARK-13796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Nishkam Ravi
>Priority: Minor
>
> Executor logs contain a lot of these error messages (irrespective of the 
> workload):
> 16/03/08 17:53:07 ERROR executor.Executor: 1 block locks were not released by 
> TID = 1119



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13663:
--
Assignee: Y Y  (was: Sean Owen)

> Upgrade Snappy Java to 1.1.2.1
> --
>
> Key: SPARK-13663
> URL: https://issues.apache.org/jira/browse/SPARK-13663
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Assignee: Y Y
>Priority: Minor
>
> The JVM memory leak reported in 
> https://github.com/xerial/snappy-java/issues/131 has been resolved.
> 1.1.2.1 was released on Jan 22nd.
> We should upgrade to this release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13663.
---
   Resolution: Fixed
Fix Version/s: 1.6.2
   2.0.0

Issue resolved by pull request 11631
[https://github.com/apache/spark/pull/11631]

> Upgrade Snappy Java to 1.1.2.1
> --
>
> Key: SPARK-13663
> URL: https://issues.apache.org/jira/browse/SPARK-13663
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Assignee: Y Y
>Priority: Minor
> Fix For: 2.0.0, 1.6.2
>
>
> The JVM memory leak reported in 
> https://github.com/xerial/snappy-java/issues/131 has been resolved.
> 1.1.2.1 was released on Jan 22nd.
> We should upgrade to this release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-10 Thread Praveen Devarao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189410#comment-15189410
 ] 

Praveen Devarao commented on SPARK-12177:
-

>>It's also probably better for Spark users if they don't blindly
cache or collect consumer records, because it's a lot of wasted space (e.g.
the topic name).<<

I agree

Thanks
Praveen

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So I added the new consumer API in separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I did not remove the 
> old classes, for backward compatibility: users will not need to change their old 
> Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13758) Error message is misleading when RDD refer to null spark context

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13758:
--
   Priority: Trivial  (was: Major)
Component/s: Documentation
 Issue Type: Improvement  (was: Bug)

> Error message is misleading when RDD refer to null spark context
> 
>
> Key: SPARK-13758
> URL: https://issues.apache.org/jira/browse/SPARK-13758
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, Streaming
>Reporter: Mao, Wei
>Priority: Trivial
>
> We have a recoverable Spark Streaming job with checkpointing enabled. It runs 
> correctly the first time, but throws the following exception when it is 
> restarted and recovered from the checkpoint.
> {noformat}
> org.apache.spark.SparkException: RDD transformations and actions can only be 
> invoked by the driver, not inside of other transformations; for example, 
> rdd1.map(x => rdd2.values.count() * x) is invalid because the values 
> transformation and count action cannot be performed inside of the rdd1.map 
> transformation. For more information, see SPARK-5063.
>   at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
>   at org.apache.spark.rdd.RDD.union(RDD.scala:565)
>   at 
> org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
>   at 
> org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
> ...
> {noformat}
> According to the exception, I invoked transformations and actions inside 
> other transformations, but I did not. The real reason is that I used an external 
> RDD in a DStream operation. External RDD data is not stored in the checkpoint, so 
> during recovery the initial value of _sc in this RDD is set to null, which hits 
> the above exception. The error message is misleading: it says nothing about the 
> real issue.
> Here is the code to reproduce it.
> {code:java}
> object Repo {
>   def createContext(ip: String, port: Int, checkpointDirectory: 
> String):StreamingContext = {
> println("Creating new context")
> val sparkConf = new SparkConf().setAppName("Repo").setMaster("local[2]")
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> ssc.checkpoint(checkpointDirectory)
> var cached = ssc.sparkContext.parallelize(Seq("apple, banana"))
> val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
> words.foreachRDD((rdd: RDD[String]) => {
>   val res = rdd.map(word => (word, word.length)).collect()
>   println("words: " + res.mkString(", "))
>   cached = cached.union(rdd)
>   cached.checkpoint()
>   println("cached words: " + cached.collect.mkString(", "))
> })
> ssc
>   }
>   def main(args: Array[String]) {
> val ip = "localhost"
> val port = 
> val dir = "/home/maowei/tmp"
> val ssc = StreamingContext.getOrCreate(dir,
>   () => {
> createContext(ip, port, dir)
>   })
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}
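
As an aside, a minimal sketch of one way to avoid the trap, building on the
{{words}} DStream from the reproduction above (illustrative only, not the fix
delivered for this issue):

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Rebuild the companion RDD from the live context inside the output operation
// instead of capturing an externally created RDD, so nothing stale is restored
// from the checkpoint. Note this does not accumulate data across batches;
// per-key accumulation belongs in updateStateByKey/mapWithState instead.
def printCombined(words: DStream[String]): Unit = {
  words.foreachRDD { (rdd: RDD[String]) =>
    val sc = rdd.sparkContext                        // always a live SparkContext
    val seed = sc.parallelize(Seq("apple", "banana"))
    val combined = seed.union(rdd)
    println("combined words: " + combined.collect().mkString(", "))
  }
}
{code}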



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13758) Error message is misleading when RDD refer to null spark context

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13758.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11595
[https://github.com/apache/spark/pull/11595]

> Error message is misleading when RDD refer to null spark context
> 
>
> Key: SPARK-13758
> URL: https://issues.apache.org/jira/browse/SPARK-13758
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, Streaming
>Reporter: Mao, Wei
>Priority: Trivial
> Fix For: 2.0.0
>
>
> We have a recoverable Spark Streaming job with checkpointing enabled. It runs 
> correctly the first time, but throws the following exception when it is 
> restarted and recovered from the checkpoint.
> {noformat}
> org.apache.spark.SparkException: RDD transformations and actions can only be 
> invoked by the driver, not inside of other transformations; for example, 
> rdd1.map(x => rdd2.values.count() * x) is invalid because the values 
> transformation and count action cannot be performed inside of the rdd1.map 
> transformation. For more information, see SPARK-5063.
>   at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
>   at org.apache.spark.rdd.RDD.union(RDD.scala:565)
>   at 
> org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
>   at 
> org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
> ...
> {noformat}
> According to the exception, I invoked transformations and actions inside 
> other transformations, but I did not. The real reason is that I used an external 
> RDD in a DStream operation. External RDD data is not stored in the checkpoint, so 
> during recovery the initial value of _sc in this RDD is set to null, which hits 
> the above exception. The error message is misleading: it says nothing about the 
> real issue.
> Here is the code to reproduce it.
> {code:java}
> object Repo {
>   def createContext(ip: String, port: Int, checkpointDirectory: 
> String):StreamingContext = {
> println("Creating new context")
> val sparkConf = new SparkConf().setAppName("Repo").setMaster("local[2]")
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> ssc.checkpoint(checkpointDirectory)
> var cached = ssc.sparkContext.parallelize(Seq("apple, banana"))
> val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
> words.foreachRDD((rdd: RDD[String]) => {
>   val res = rdd.map(word => (word, word.length)).collect()
>   println("words: " + res.mkString(", "))
>   cached = cached.union(rdd)
>   cached.checkpoint()
>   println("cached words: " + cached.collect.mkString(", "))
> })
> ssc
>   }
>   def main(args: Array[String]) {
> val ip = "localhost"
> val port = 
> val dir = "/home/maowei/tmp"
> val ssc = StreamingContext.getOrCreate(dir,
>   () => {
> createContext(ip, port, dir)
>   })
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13758) Error message is misleading when RDD refer to null spark context

2016-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13758:
--
Assignee: Mao, Wei

> Error message is misleading when RDD refer to null spark context
> 
>
> Key: SPARK-13758
> URL: https://issues.apache.org/jira/browse/SPARK-13758
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, Streaming
>Reporter: Mao, Wei
>Assignee: Mao, Wei
>Priority: Trivial
> Fix For: 2.0.0
>
>
> We have a recoverable Spark Streaming job with checkpointing enabled. It runs 
> correctly the first time, but throws the following exception when it is 
> restarted and recovered from the checkpoint.
> {noformat}
> org.apache.spark.SparkException: RDD transformations and actions can only be 
> invoked by the driver, not inside of other transformations; for example, 
> rdd1.map(x => rdd2.values.count() * x) is invalid because the values 
> transformation and count action cannot be performed inside of the rdd1.map 
> transformation. For more information, see SPARK-5063.
>   at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
>   at org.apache.spark.rdd.RDD.union(RDD.scala:565)
>   at 
> org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
>   at 
> org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
> ...
> {noformat}
> According to the exception, I invoked transformations and actions inside 
> other transformations, but I did not. The real reason is that I used an external 
> RDD in a DStream operation. External RDD data is not stored in the checkpoint, so 
> during recovery the initial value of _sc in this RDD is set to null, which hits 
> the above exception. The error message is misleading: it says nothing about the 
> real issue.
> Here is the code to reproduce it.
> {code:java}
> object Repo {
>   def createContext(ip: String, port: Int, checkpointDirectory: 
> String):StreamingContext = {
> println("Creating new context")
> val sparkConf = new SparkConf().setAppName("Repo").setMaster("local[2]")
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> ssc.checkpoint(checkpointDirectory)
> var cached = ssc.sparkContext.parallelize(Seq("apple, banana"))
> val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
> words.foreachRDD((rdd: RDD[String]) => {
>   val res = rdd.map(word => (word, word.length)).collect()
>   println("words: " + res.mkString(", "))
>   cached = cached.union(rdd)
>   cached.checkpoint()
>   println("cached words: " + cached.collect.mkString(", "))
> })
> ssc
>   }
>   def main(args: Array[String]) {
> val ip = "localhost"
> val port = 
> val dir = "/home/maowei/tmp"
> val ssc = StreamingContext.getOrCreate(dir,
>   () => {
> createContext(ip, port, dir)
>   })
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing

2016-03-10 Thread Jun Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189472#comment-15189472
 ] 

Jun Zheng commented on SPARK-12347:
---

Thanks, can you assign this task to me?

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2629) Improved state management for Spark Streaming

2016-03-10 Thread Iain Cundy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189520#comment-15189520
 ] 

Iain Cundy commented on SPARK-2629:
---

I've discovered compaction works if I switch off Kryo.

I was using a workaround to get around mapWithState not supporting Kryo. My 
custom KryoRegistrator Java class has

// workaround until bug fixes in spark 1.6.1
kryo.register(OpenHashMapBasedStateMap.class);
kryo.register(EmptyStateMap.class);
kryo.register(MapWithStateRDDRecord.class);

which certainly made the NullPointerException errors when checkpointing go 
away, but (inexplicably to me) didn't allow compaction to work. 

I wonder whether the "proper" fix in 1.6.1 enables compaction? Has anybody seen 
compaction working with the patch? 

Cheers
Iain 

> Improved state management for Spark Streaming
> -
>
> Key: SPARK-2629
> URL: https://issues.apache.org/jira/browse/SPARK-2629
> Project: Spark
>  Issue Type: Epic
>  Components: Streaming
>Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
>  Current updateStateByKey provides stateful processing in Spark Streaming. It 
> allows the user to maintain per-key state and manage that state using an 
> updateFunction. The updateFunction is called for each key, and it uses new 
> data and existing state of the key, to generate an updated state. However, 
> based on community feedback, we have learnt the following lessons.
> - Need for more optimized state management that does not scan every key
> - Need to make it easier to implement common use cases - (a) timeout of idle 
> data, (b) returning items other than state
> The high level idea that I am proposing is 
> - Introduce a new API -trackStateByKey- *mapWithState* that allows the user 
> to update per-key state, and emit arbitrary records. The new API is necessary 
> as this will have significantly different semantics than the existing 
> updateStateByKey API. This API will have direct support for timeouts.
> - Internally, the system will keep the state data as a map/list within the 
> partitions of the state RDDs. The new data RDDs will be partitioned 
> appropriately, and for all the key-value data, it will lookup the map/list in 
> the state RDD partition and create a new list/map of updated state data. The 
> new state RDD partition will be created based on the update data and if 
> necessary, with old data. 
> Here is the detailed design doc (*outdated, to be updated*). Please take a 
> look and provide feedback as comments.
> https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em
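
For readers unfamiliar with the new API, a minimal sketch of its shape (the
{{pairs}} DStream is a placeholder; only the StateSpec/mapWithState calls come
from the API described here):

{code}
import org.apache.spark.streaming.{Seconds, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// `pairs` stands in for any keyed DStream, e.g. word counts per batch.
def runningTotals(pairs: DStream[(String, Int)]) = {
  val spec = StateSpec.function(
    (key: String, value: Option[Int], state: State[Int]) => {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)     // maintain per-key state
      (key, sum)            // emit an arbitrary record, not just the state
    }).timeout(Seconds(30)) // built-in support for timing out idle keys
  pairs.mapWithState(spec)
}
{code}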



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists

2016-03-10 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189537#comment-15189537
 ] 

Seth Hendrickson commented on SPARK-13068:
--

That sounds good. I will plan on implementing some "standard" type conversions 
for reuse (like lists/vectors/numpy), in the same vein as the standard param 
validators implemented in Scala. 

> Extend pyspark ml paramtype conversion to support lists
> ---
>
> Key: SPARK-13068
> URL: https://issues.apache.org/jira/browse/SPARK-13068
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Trivial
>
> In SPARK-7675 we added type conversion for PySpark ML params. We should 
> follow up and support param type conversion for lists and nested structures 
> as required. This blocks having all PySpark ML params having type information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13739) Predicate Push Down Through Window Operator

2016-03-10 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-13739:

Description: 
Push down the predicate through the Window operator.

In this JIRA, predicates are pushed through Window if and only if the following 
conditions are satisfied:

- Predicate involves one and only one column that is part of window 
partitioning key
- Window partitioning key is just a sequence of attributes. (i.e., none of them 
is an expression)
- Predicate must be deterministic

  was:Push down the predicate through the Window operator.


> Predicate Push Down Through Window Operator
> ---
>
> Key: SPARK-13739
> URL: https://issues.apache.org/jira/browse/SPARK-13739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Push down the predicate through the Window operator.
> In this JIRA, predicates are pushed through Window if and only if the 
> following conditions are satisfied:
> - Predicate involves one and only one column that is part of window 
> partitioning key
> - Window partitioning key is just a sequence of attributes. (i.e., none of 
> them is an expression)
> - Predicate must be deterministic
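
For illustration, a minimal sketch of a query whose filter meets all three
conditions (the {{sales}} DataFrame and column names are placeholders, not from
this JIRA):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// `sales` is assumed to have columns (region: String, amount: Double).
def emeaRanked(sales: DataFrame): DataFrame = {
  val w = Window.partitionBy("region").orderBy("amount")
  val ranked = sales.withColumn("rank", rank().over(w))
  // The filter touches exactly one bare partitioning attribute (`region`) and
  // is deterministic, so under the conditions above it can be evaluated
  // before the Window operator instead of after it.
  ranked.filter(col("region") === "EMEA")
}
{code}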



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13739) Predicate Push Down Through Window Operator

2016-03-10 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-13739:

Description: 
Push down the predicate through the Window operator.

In this JIRA, predicates are pushed through Window if and only if the following 
conditions are satisfied:

- Predicate involves one and only one column that is part of window 
partitioning key
- Window partitioning key is just a sequence of attributeReferences. (i.e., 
none of them is an expression)
- Predicate must be deterministic

  was:
Push down the predicate through the Window operator.

In this JIRA, predicates are pushed through Window if and only if the following 
conditions are satisfied:

- Predicate involves one and only one column that is part of window 
partitioning key
- Window partitioning key is just a sequence of attributes. (i.e., none of them 
is an expression)
- Predicate must be deterministic


> Predicate Push Down Through Window Operator
> ---
>
> Key: SPARK-13739
> URL: https://issues.apache.org/jira/browse/SPARK-13739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Push down the predicate through the Window operator.
> In this JIRA, predicates are pushed through Window if and only if the 
> following conditions are satisfied:
> - Predicate involves one and only one column that is part of window 
> partitioning key
> - Window partitioning key is just a sequence of attributeReferences. (i.e., 
> none of them is an expression)
> - Predicate must be deterministic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13804) HiveThriftServer2 hangs intermittently

2016-03-10 Thread Michael Nguyen (JIRA)
Michael Nguyen created SPARK-13804:
--

 Summary: HiveThriftServer2 hangs intermittently
 Key: SPARK-13804
 URL: https://issues.apache.org/jira/browse/SPARK-13804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Michael Nguyen


1. HiveThriftServer2 was started with startWithContext

2. Multiple temp tables were loaded and registered via registerTempTable .

3. HiveThriftServer2 was accessed via JDBC to query those tables.

4. Some temp tables were dropped via 
hiveContext.dropTempTable(registerTableName); and reloaded to refresh their 
data. There are 1 to 7 million rows in these tables.

5. The same queries run in step 3 were re-run over the existing JDBC 
connection. This time HiveThriftServer2 received those queries, but at times 
it hung and did not return the results. CPU utilization on both the Spark 
driver and child nodes was around 1%. 1GB of RAM was used out of 30GB. So 
there was no resource starvation.

6. After waiting about 5 minutes, the same queries from step 5 were rerun, and 
this time HiveThriftServer2 returned the results fine.

This issue occurs intermittently when the steps 1-5 are repeated, so it may 
take several attempts to reproduce this issue.
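
For concreteness, a minimal sketch of steps 1-4 (the names {{hiveContext}},
{{df}} and {{dfRefreshed}} are placeholders, not from the report):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// `hiveContext`, `df` and `dfRefreshed` stand in for the reporter's context
// and table data.
def reproduceSteps1To4(hiveContext: HiveContext,
                       df: DataFrame,
                       dfRefreshed: DataFrame): Unit = {
  HiveThriftServer2.startWithContext(hiveContext) // step 1
  df.registerTempTable("events")                  // step 2
  // step 3: external JDBC clients query "events" through the Thrift server
  hiveContext.dropTempTable("events")             // step 4: drop ...
  dfRefreshed.registerTempTable("events")         //         ... and reload
}
{code}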




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13782) Model export/import for spark.ml: BisectingKMeans

2016-03-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189649#comment-15189649
 ] 

yuhao yang commented on SPARK-13782:


I'll work on this if no one has started.

> Model export/import for spark.ml: BisectingKMeans
> -
>
> Key: SPARK-13782
> URL: https://issues.apache.org/jira/browse/SPARK-13782
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13636) Direct consume UnsafeRow in wholestage codegen plans

2016-03-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13636.

Resolution: Fixed
  Assignee: Liang-Chi Hsieh

> Direct consume UnsafeRow in wholestage codegen plans
> 
>
> Key: SPARK-13636
> URL: https://issues.apache.org/jira/browse/SPARK-13636
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>
> As shown in the wholestage codegen version of the Sort operator, when Sort is on 
> top of Exchange (or another operator that produces UnsafeRow), we create 
> variables from the UnsafeRow and then create another UnsafeRow from those 
> variables. We should avoid this unnecessary unpacking and repacking of variables 
> from UnsafeRows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13788) CholeskyDecomposition has side effects!

2016-03-10 Thread Ehsan Mohyedin Kermani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189700#comment-15189700
 ] 

Ehsan Mohyedin Kermani commented on SPARK-13788:


Please see my comment on GitHub.

> CholeskyDecomposition has side effects!
> ---
>
> Key: SPARK-13788
> URL: https://issues.apache.org/jira/browse/SPARK-13788
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Ehsan Mohyedin Kermani
>
> The current CholeskyDecomposition implementation, which calls into LAPACK, has 
> side effects: it mutates the array A representing the matrix in packed storage 
> format (see http://www.netlib.org/lapack/lug/node123.html ) and bx, the known 
> right-hand-side values. Both should be protected by cloning the arrays first!
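
A minimal sketch of the cloning the reporter suggests ({{lapackSolve}} is a
placeholder for the native routine; this is not MLlib's actual signature):

{code}
// The in-place routine overwrites both the packed matrix and the right-hand
// side, so pass clones to keep the caller's arrays intact.
def solveWithoutSideEffects(
    a: Array[Double],
    bx: Array[Double],
    lapackSolve: (Array[Double], Array[Double]) => Array[Double]): Array[Double] = {
  lapackSolve(a.clone(), bx.clone())
}
{code}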



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13804) HiveThriftServer2 hangs intermittently

2016-03-10 Thread Michael Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Nguyen updated SPARK-13804:
---
Description: 
1. HiveThriftServer2 was started with startWithContext

2. Multiple temp tables were loaded and registered via registerTempTable .

3. HiveThriftServer2 was accessed via JDBC to query those tables.

4. Some temp tables were dropped via 
hiveContext.dropTempTable(registerTableName); and reloaded to refresh their 
data. There are 1 to 7 million rows in these tables.

5. The same queries run in step 3 were re-run over the existing JDBC 
connection. This time HiveThriftServer2 received those queries, but at times 
it hung and did not return the results. CPU utilization on both the Spark 
driver and child nodes was around 1%. 10GB of RAM was used out of 30GB on the 
driver, and 3GB of RAM out of 30GB was used on the child nodes. So there was 
no resource starvation.

6. After waiting about 5 minutes, the same queries from step 5 were rerun, and 
this time HiveThriftServer2 returned the results fine.

This issue occurs intermittently when the steps 1-5 are repeated, so it may 
take several attempts to reproduce this issue.


  was:
1. HiveThriftServer2 was started with startWithContext

2. Multiple temp tables were loaded and registered via registerTempTable .

3. HiveThriftServer2 was accessed via JDBC to query those tables.

4. Some temp tables were dropped via 
hiveContext.dropTempTable(registerTableName); and reloaded to refresh their 
data. There are 1 to 7 million rows in these tables.

5. The same queries run in step 3 were re-run over the existing JDBC 
connection. This time HiveThriftServer2 received those queries, but at times 
it hung and did not return the results. CPU utilization on both the Spark 
driver and child nodes was around 1%. 1GB of RAM was used out of 30GB. So 
there was no resource starvation.

6. After waiting about 5 minutes, the same queries from step 5 were rerun, and 
this time HiveThriftServer2 returned the results fine.

This issue occurs intermittently when the steps 1-5 are repeated, so it may 
take several attempts to reproduce this issue.



> HiveThriftServer2 hangs intermittently
> --
>
> Key: SPARK-13804
> URL: https://issues.apache.org/jira/browse/SPARK-13804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Nguyen
>
> 1. HiveThriftServer2 was started with startWithContext
> 2. Multiple temp tables were loaded and registered via registerTempTable .
> 3. HiveThriftServer2 was accessed via JDBC to query those tables.
> 4. Some temp tables were dropped via 
> hiveContext.dropTempTable(registerTableName); and reloaded to refresh their 
> data. There are 1 to 7 million rows in these tables.
> 5. The same queries run in step 3 were re-run over the existing JDBC 
> connection. This time HiveThriftServer2 received those queries, but at times 
> it hung and did not return the results. CPU utilization on both the Spark 
> driver and child nodes was around 1%. 10GB of RAM was used out of 30GB on the 
> driver, and 3GB of RAM out of 30GB was used on the child nodes. So there was 
> no resource starvation.
> 6. After waiting about 5 minutes, the same queries from step 5 were rerun, and 
> this time HiveThriftServer2 returned the results fine.
> This issue occurs intermittently when the steps 1-5 are repeated, so it may 
> take several attempts to reproduce this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13739) Predicate Push Down Through Window Operator

2016-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13739:


Assignee: Apache Spark

> Predicate Push Down Through Window Operator
> ---
>
> Key: SPARK-13739
> URL: https://issues.apache.org/jira/browse/SPARK-13739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Push down the predicate through the Window operator.
> In this JIRA, predicates are pushed through Window if and only if the 
> following conditions are satisfied:
> - Predicate involves one and only one column that is part of window 
> partitioning key
> - Window partitioning key is just a sequence of attributeReferences. (i.e., 
> none of them is an expression)
> - Predicate must be deterministic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13739) Predicate Push Down Through Window Operator

2016-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189719#comment-15189719
 ] 

Apache Spark commented on SPARK-13739:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11635

> Predicate Push Down Through Window Operator
> ---
>
> Key: SPARK-13739
> URL: https://issues.apache.org/jira/browse/SPARK-13739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Push down the predicate through the Window operator.
> In this JIRA, predicates are pushed through Window if and only if the 
> following conditions are satisfied:
> - Predicate involves one and only one column that is part of window 
> partitioning key
> - Window partitioning key is just a sequence of attributeReferences. (i.e., 
> none of them is an expression)
> - Predicate must be deterministic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13739) Predicate Push Down Through Window Operator

2016-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13739:


Assignee: (was: Apache Spark)

> Predicate Push Down Through Window Operator
> ---
>
> Key: SPARK-13739
> URL: https://issues.apache.org/jira/browse/SPARK-13739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Push down the predicate through the Window operator.
> In this JIRA, predicates are pushed through Window if and only if the 
> following conditions are satisfied:
> - Predicate involves one and only one column that is part of window 
> partitioning key
> - Window partitioning key is just a sequence of attributeReferences. (i.e., 
> none of them is an expression)
> - Predicate must be deterministic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13795) ClassCast Exception while attempting to show() a DataFrame

2016-03-10 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-13795:
--
Description: 
DataFrame Schema (by printSchema() ) is as follows

allDataJoined.printSchema() 

{noformat}

 |-- eventType: string (nullable = true)
 |-- itemId: string (nullable = true)
 |-- productId: string (nullable = true)
 |-- productVersion: string (nullable = true)
 |-- servicedBy: string (nullable = true)
 |-- ACCOUNT_NAME: string (nullable = true)
 |-- CONTENTGROUPID: string (nullable = true)
 |-- PRODUCT_ID: string (nullable = true)
 |-- PROFILE_ID: string (nullable = true)
 |-- SALESADVISEREMAIL: string (nullable = true)
 |-- businessName: string (nullable = true)
 |-- contentGroupId: string (nullable = true)
 |-- salesAdviserName: string (nullable = true)
 |-- salesAdviserPhone: string (nullable = true)

{noformat}

There is NO column with any datatype other than String. There was previously an 
inferred column of type long, but it was dropped:
 
DataFrame allDataJoined = whiteEventJoinedWithReference.
   drop(rliDataFrame.col("occurredAtDate"));
allDataJoined.printSchema() : output above ^^
Now 
allDataJoined.show() throws the following exception vv

java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at scala.math.Ordering$Int$.compare(Ordering.scala:256)
at scala.math.Ordering$class.gt(Ordering.scala:97)
at scala.math.Ordering$Int$.gt(Ordering.scala:256)
at 
org.apache.spark.sql.catalyst.expressions.GreaterThan.nullSafeEval(predicates.scala:457)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:383)
at 
org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:238)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
at 
scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.prunePartitions(DataSourceStrategy.scala:257)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:82)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.makeBroadcastHashJoin(SparkStrategies.scala:88)
at 
org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:97)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:349)
at 
org.apache.spark.sql.

[jira] [Updated] (SPARK-13795) ClassCast Exception while attempting to show() a DataFrame

2016-03-10 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha updated SPARK-13795:
--
Description: 
DataFrame Schema (by printSchema() ) is as follows

allDataJoined.printSchema() 

{noformat}

 |-- eventType: string (nullable = true)
 |-- itemId: string (nullable = true)
 |-- productId: string (nullable = true)
 |-- productVersion: string (nullable = true)
 |-- servicedBy: string (nullable = true)
 |-- ACCOUNT_NAME: string (nullable = true)
 |-- CONTENTGROUPID: string (nullable = true)
 |-- PRODUCT_ID: string (nullable = true)
 |-- PROFILE_ID: string (nullable = true)
 |-- SALESADVISEREMAIL: string (nullable = true)
 |-- businessName: string (nullable = true)
 |-- contentGroupId: string (nullable = true)
 |-- salesAdviserName: string (nullable = true)
 |-- salesAdviserPhone: string (nullable = true)

{noformat}

There is NO column with any datatype other than String. There was previously an 
inferred column of type long, but it was dropped:
 
{code}

DataFrame allDataJoined = whiteEventJoinedWithReference.
   drop(rliDataFrame.col("occurredAtDate"));
allDataJoined.printSchema() : output above ^^
Now 
allDataJoined.show() 
 
{code}

throws the following exception vv

{noformat}

java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
at scala.math.Ordering$Int$.compare(Ordering.scala:256)
at scala.math.Ordering$class.gt(Ordering.scala:97)
at scala.math.Ordering$Int$.gt(Ordering.scala:256)
at 
org.apache.spark.sql.catalyst.expressions.GreaterThan.nullSafeEval(predicates.scala:457)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:383)
at 
org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:238)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$prunePartitions$1.apply(DataSourceStrategy.scala:257)
at 
scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.prunePartitions(DataSourceStrategy.scala:257)
at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:82)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.makeBroadcastHashJoin(SparkStrategies.scala:88)
at 
org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:97)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:349)
  

[jira] [Created] (SPARK-13805) Direct consume ColumnVector in generated code when ColumnarBatch is used

2016-03-10 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-13805:


 Summary: Direct consume ColumnVector in generated code when 
ColumnarBatch is used
 Key: SPARK-13805
 URL: https://issues.apache.org/jira/browse/SPARK-13805
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Kazuaki Ishizaki


When generated code accesses a {{ColumnarBatch}} object, it is possible to get 
values of each column from {{ColumnVector}} instead of calling {{getRow()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13794) Rename DataFrameWriter.stream DataFrameWriter.startStream

2016-03-10 Thread Ram Sriharsha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram Sriharsha resolved SPARK-13794.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Rename DataFrameWriter.stream DataFrameWriter.startStream
> -
>
> Key: SPARK-13794
> URL: https://issues.apache.org/jira/browse/SPARK-13794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This makes it more obvious with the verb "start" that we are actually 
> starting some execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13727) SparkConf.contains does not consider deprecated keys

2016-03-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-13727.

   Resolution: Fixed
 Assignee: Bo Meng
Fix Version/s: 2.0.0

> SparkConf.contains does not consider deprecated keys
> 
>
> Key: SPARK-13727
> URL: https://issues.apache.org/jira/browse/SPARK-13727
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Assignee: Bo Meng
>Priority: Minor
> Fix For: 2.0.0
>
>
> This makes it kinda inconsistent with other SparkConf APIs. For example:
> {code}
> scala> import org.apache.spark.SparkConf
> import org.apache.spark.SparkConf
> scala> val conf = new SparkConf().set("spark.io.compression.lz4.block.size", 
> "12345")
> 16/03/07 10:55:17 WARN spark.SparkConf: The configuration key 
> 'spark.io.compression.lz4.block.size' has been deprecated as of Spark 1.4 and 
> and may be removed in the future. Please use the new key 
> 'spark.io.compression.lz4.blockSize' instead.
> conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@221e8982
> scala> conf.get("spark.io.compression.lz4.blockSize")
> res0: String = 12345
> scala> conf.contains("spark.io.compression.lz4.blockSize")
> res1: Boolean = false
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13804) HiveThriftServer2 hangs intermittently

2016-03-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189792#comment-15189792
 ] 

Sean Owen commented on SPARK-13804:
---

Isn't this a Hive issue?

> HiveThriftServer2 hangs intermittently
> --
>
> Key: SPARK-13804
> URL: https://issues.apache.org/jira/browse/SPARK-13804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Nguyen
>
> 1. HiveThriftServer2 was started with startWithContext
> 2. Multiple temp tables were loaded and registered via registerTempTable .
> 3. HiveThriftServer2 was accessed via JDBC to query those tables.
> 4. Some temp tables were dropped via 
> hiveContext.dropTempTable(registerTableName); and reloaded to refresh their 
> data. There are 1 to 7 million rows in these tables.
> 5. The same queries run in step 3 were re-run over the existing JDBC 
> connection. This time HiveThriftServer2 received those queries, but at times 
> it hung and did not return the results. CPU utilization on both the Spark 
> driver and child nodes was around 1%. 10GB of RAM was used out of 30GB on the 
> driver, and 3GB of RAM out of 30GB was used on the child nodes. So there was 
> no resource starvation.
> 6. After waiting about 5 minutes, the same queries from step 5 were rerun, and 
> this time HiveThriftServer2 returned the results fine.
> This issue occurs intermittently when the steps 1-5 are repeated, so it may 
> take several attempts to reproduce this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13805) Direct consume ColumnVector in generated code when ColumnarBatch is used

2016-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189821#comment-15189821
 ] 

Apache Spark commented on SPARK-13805:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/11636

> Direct consume ColumnVector in generated code when ColumnarBatch is used
> 
>
> Key: SPARK-13805
> URL: https://issues.apache.org/jira/browse/SPARK-13805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> When generated code accesses a {{ColumnarBatch}} object, it is possible to 
> get values of each column from {{ColumnVector}} instead of calling 
> {{getRow()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13805) Direct consume ColumnVector in generated code when ColumnarBatch is used

2016-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13805:


Assignee: (was: Apache Spark)

> Direct consume ColumnVector in generated code when ColumnarBatch is used
> 
>
> Key: SPARK-13805
> URL: https://issues.apache.org/jira/browse/SPARK-13805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> When generated code accesses a {{ColumnarBatch}} object, it is possible to 
> get values of each column from {{ColumnVector}} instead of calling 
> {{getRow()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13805) Direct consume ColumnVector in generated code when ColumnarBatch is used

2016-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13805:


Assignee: Apache Spark

> Direct consume ColumnVector in generated code when ColumnarBatch is used
> 
>
> Key: SPARK-13805
> URL: https://issues.apache.org/jira/browse/SPARK-13805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> When generated code accesses a {{ColumnarBatch}} object, it is possible to 
> get values of each column from {{ColumnVector}} instead of calling 
> {{getRow()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13804) org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 hangs intermittently

2016-03-10 Thread Michael Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Nguyen updated SPARK-13804:
---
Summary: org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 hangs 
intermittently  (was: HiveThriftServer2 hangs intermittently)

> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 hangs intermittently
> -
>
> Key: SPARK-13804
> URL: https://issues.apache.org/jira/browse/SPARK-13804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Nguyen
>
> 1. HiveThriftServer2 was started with startWithContext.
> 2. Multiple temp tables were loaded and registered via registerTempTable.
> 3. HiveThriftServer2 was accessed via JDBC to query those tables.
> 4. Some temp tables were dropped via 
> hiveContext.dropTempTable(registerTableName); and reloaded to refresh their 
> data. There are 1 to 7 million rows in these tables.
> 5. The same queries run in step 3 were rerun over the existing JDBC 
> connection. This time HiveThriftServer2 receives those queries, but at times 
> it hangs and does not return the results. CPU utilization on both the Spark 
> driver and child nodes was around 1%. 10GB of RAM was used out of 30GB on 
> the driver, and 3GB of RAM out of 30GB was used on the child nodes, so 
> there was no resource starvation.
> 6. Wait about 5 minutes and rerun the same queries from step 5, and this time 
> HiveThriftServer2 returns the results of those queries fine.
> This issue occurs intermittently when steps 1-5 are repeated, so it may take 
> several attempts to reproduce it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13804) org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 hangs intermittently

2016-03-10 Thread Michael Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189827#comment-15189827
 ] 

Michael Nguyen commented on SPARK-13804:


HiveThriftServer2 is part of the org.apache.spark.sql.hive.thriftserver package, 
so it is an issue with Spark SQL. Also, the root cause could be in how 
dynamicDataFrame.registerTempTable interacts with hiveContext.dropTempTable for 
the same table. So further analysis is needed to determine the root cause.
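
For reference, a hedged sketch of the reported sequence; the table name, data, 
and sizes below are illustrative rather than taken from the report.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftServerHangRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-13804-repro"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Steps 2 and 4: build and register a large temp table (millions of rows
    // in the report; the generated data here is a stand-in).
    def load(): Unit = {
      val df = sc.parallelize(1 to 1000000).map(i => (i, s"v$i")).toDF("id", "value")
      df.registerTempTable("events")
    }

    load()
    // Step 1: expose the registered temp tables over JDBC.
    HiveThriftServer2.startWithContext(hiveContext)

    // Step 4: drop and reload the table while JDBC clients (steps 3 and 5)
    // keep querying "events"; the hang is observed intermittently after such
    // a refresh. In a real run the application stays alive at this point.
    hiveContext.dropTempTable("events")
    load()
  }
}
{code}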

> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 hangs intermittently
> -
>
> Key: SPARK-13804
> URL: https://issues.apache.org/jira/browse/SPARK-13804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Nguyen
>
> 1. HiveThriftServer2 was started with startWithContext.
> 2. Multiple temp tables were loaded and registered via registerTempTable.
> 3. HiveThriftServer2 was accessed via JDBC to query those tables.
> 4. Some temp tables were dropped via 
> hiveContext.dropTempTable(registerTableName); and reloaded to refresh their 
> data. There are 1 to 7 million rows in these tables.
> 5. The same queries run in step 3 were rerun over the existing JDBC 
> connection. This time HiveThriftServer2 receives those queries, but at times 
> it hangs and does not return the results. CPU utilization on both the Spark 
> driver and child nodes was around 1%. 10GB of RAM was used out of 30GB on 
> the driver, and 3GB of RAM out of 30GB was used on the child nodes, so 
> there was no resource starvation.
> 6. Wait about 5 minutes and rerun the same queries from step 5, and this time 
> HiveThriftServer2 returns the results of those queries fine.
> This issue occurs intermittently when steps 1-5 are repeated, so it may take 
> several attempts to reproduce it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13804) org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 hangs intermittently

2016-03-10 Thread Michael Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189827#comment-15189827
 ] 

Michael Nguyen edited comment on SPARK-13804 at 3/10/16 8:02 PM:
-

HiveThriftServer2 is part of the org.apache.spark.sql.hive.thriftserver package, 
so it is an issue with Spark SQL. Also, the root cause could be in how 
dynamicDataFrame.registerTempTable interacts with hiveContext.dropTempTable for 
the same table, and possibly with data sources that have 1+ to 7+ million rows. 
So further analysis is needed to determine the root cause.


was (Author: michaelmnguyen):
HiveThriftServer2 is part of the org.apache.spark.sql.hive.thriftserver package, 
so it is an issue with Spark SQL. Also, the root cause could be in how 
dynamicDataFrame.registerTempTable interacts with hiveContext.dropTempTable for 
the same table. So further analysis is needed to determine the root cause.

> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 hangs intermittently
> -
>
> Key: SPARK-13804
> URL: https://issues.apache.org/jira/browse/SPARK-13804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Nguyen
>
> 1. HiveThriftServer2 was started with startWithContext.
> 2. Multiple temp tables were loaded and registered via registerTempTable.
> 3. HiveThriftServer2 was accessed via JDBC to query those tables.
> 4. Some temp tables were dropped via 
> hiveContext.dropTempTable(registerTableName); and reloaded to refresh their 
> data. There are 1 to 7 million rows in these tables.
> 5. The same queries run in step 3 were rerun over the existing JDBC 
> connection. This time HiveThriftServer2 receives those queries, but at times 
> it hangs and does not return the results. CPU utilization on both the Spark 
> driver and child nodes was around 1%. 10GB of RAM was used out of 30GB on 
> the driver, and 3GB of RAM out of 30GB was used on the child nodes, so 
> there was no resource starvation.
> 6. Wait about 5 minutes and rerun the same queries from step 5, and this time 
> HiveThriftServer2 returns the results of those queries fine.
> This issue occurs intermittently when steps 1-5 are repeated, so it may take 
> several attempts to reproduce it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13804) org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 hangs intermittently

2016-03-10 Thread Michael Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189827#comment-15189827
 ] 

Michael Nguyen edited comment on SPARK-13804 at 3/10/16 8:03 PM:
-

HiveThriftServer2 is part of the org.apache.spark.sql.hive.thriftserver package, 
so it is an issue with Spark SQL. Also, the root cause could be in how 
dynamicDataFrame.registerTempTable interacts with hiveContext.dropTempTable for 
the same table, and possibly with large tables of 7+ million rows. So further 
analysis is needed to determine the root cause.


was (Author: michaelmnguyen):
HiveThriftServer2 is part of the org.apache.spark.sql.hive.thriftserver package, 
so it is an issue with Spark SQL. Also, the root cause could be in how 
dynamicDataFrame.registerTempTable interacts with hiveContext.dropTempTable for 
the same table, and possibly with data sources that have 1+ to 7+ million rows. 
So further analysis is needed to determine the root cause.

> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 hangs intermittently
> -
>
> Key: SPARK-13804
> URL: https://issues.apache.org/jira/browse/SPARK-13804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Nguyen
>
> 1. HiveThriftServer2 was started with startWithContext.
> 2. Multiple temp tables were loaded and registered via registerTempTable.
> 3. HiveThriftServer2 was accessed via JDBC to query those tables.
> 4. Some temp tables were dropped via 
> hiveContext.dropTempTable(registerTableName); and reloaded to refresh their 
> data. There are 1 to 7 million rows in these tables.
> 5. The same queries run in step 3 were rerun over the existing JDBC 
> connection. This time HiveThriftServer2 receives those queries, but at times 
> it hangs and does not return the results. CPU utilization on both the Spark 
> driver and child nodes was around 1%. 10GB of RAM was used out of 30GB on 
> the driver, and 3GB of RAM out of 30GB was used on the child nodes, so 
> there was no resource starvation.
> 6. Wait about 5 minutes and rerun the same queries from step 5, and this time 
> HiveThriftServer2 returns the results of those queries fine.
> This issue occurs intermittently when steps 1-5 are repeated, so it may take 
> several attempts to reproduce it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13759) Add IsNotNull constraints for expressions with an inequality

2016-03-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-13759.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11594
[https://github.com/apache/spark/pull/11594]

> Add IsNotNull constraints for expressions with an inequality
> 
>
> Key: SPARK-13759
> URL: https://issues.apache.org/jira/browse/SPARK-13759
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Priority: Minor
> Fix For: 2.0.0
>
>
> Support for adding `IsNotNull` constraints from expressions with an 
> inequality. More specifically, if an operator has a condition on `a !== b`, 
> we know that both `a` and `b` in the operator output can no longer be null.
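
For intuition, a toy sketch of the inference with illustrative case classes 
(not Catalyst's actual {{Expression}} hierarchy): a null-intolerant comparison 
such as {{a !== b}} lets us emit {{IsNotNull}} for both sides.

{code}
sealed trait Expr
case class Attr(name: String) extends Expr
case class NotEqual(left: Expr, right: Expr) extends Expr
case class IsNotNull(child: Expr) extends Expr

object ConstraintInference {
  // Collect the IsNotNull constraints implied by a null-intolerant condition.
  def isNotNullConstraints(condition: Expr): Set[Expr] = condition match {
    case NotEqual(l: Attr, r: Attr) => Set[Expr](IsNotNull(l), IsNotNull(r))
    case _                          => Set.empty[Expr]
  }
}

object ConstraintDemo extends App {
  val condition = NotEqual(Attr("a"), Attr("b"))
  // Prints: Set(IsNotNull(Attr(a)), IsNotNull(Attr(b)))
  println(ConstraintInference.isNotNullConstraints(condition))
}
{code}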



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13759) Add IsNotNull constraints for expressions with an inequality

2016-03-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-13759:
-
Assignee: Sameer Agarwal

> Add IsNotNull constraints for expressions with an inequality
> 
>
> Key: SPARK-13759
> URL: https://issues.apache.org/jira/browse/SPARK-13759
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>Priority: Minor
> Fix For: 2.0.0
>
>
> Support for adding `IsNotNull` constraints from expressions with an 
> inequality. More specifically, if an operator has a condition on `a !== b`, 
> we know that both `a` and `b` in the operator output can no longer be null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


