[GitHub] spark pull request #14940: [SPARK-17383][GRAPHX]LabelPropagation

2016-09-02 Thread bookling
GitHub user bookling opened a pull request:

https://github.com/apache/spark/pull/14940

[SPARK-17383][GRAPHX]LabelPropagation

In the labelPropagation of the graphx lib, each node is initialized with a unique
label, and at every step each node adopts the label that most of its
neighbors currently have, while ignoring the label it currently has. I think this is
unreasonable, because the label a node already has is also useful. When a node tends
toward a stable label, there is an association between consecutive iterations,
so a node is affected not only by its neighbors but also by its own current label.
So I changed the code to use both the labels of its neighbors and its own, as
sketched below.
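
A minimal sketch of the idea, assuming the shape of GraphX's LabelPropagation
(vertices exchange label-count maps via Pregel); the selfWeight parameter is a
hypothetical knob for how strongly a node's own label counts, not part of the
actual patch:

```scala
import org.apache.spark.graphx.VertexId

// message maps each label to the number of neighbors currently holding it.
def vertexProgram(vid: VertexId, attr: VertexId,
    message: Map[VertexId, Long], selfWeight: Long = 1L): VertexId = {
  if (message.isEmpty) {
    attr  // isolated node: keep the current label
  } else {
    // Count the node's own current label alongside its neighbors' labels.
    val counts = message.updated(attr, message.getOrElse(attr, 0L) + selfWeight)
    counts.maxBy(_._2)._1
  }
}
```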

Through this iterative process, densely connected groups of nodes form a
consensus on a unique label, and those labels define communities. But the
communities found by LabelPropagation are often discontinuous, because several
labels can be tied for the majority among a node's neighbors. E.g., if node "0"
has 6 neighbors labeled {"1","1","2","2","3","3"}, it may select a label at
random. In order to get stable community labels and prevent this randomness,
I choose the maximum label among the tied candidates.
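
A hedged sketch of that deterministic tie-break (pickLabel is an illustrative
helper name, not from the patch):

```scala
import org.apache.spark.graphx.VertexId

// Among the labels with the highest count, pick the numerically largest one.
def pickLabel(counts: Map[VertexId, Long]): VertexId = {
  val maxCount = counts.values.max
  // e.g. Map(1L -> 2L, 2L -> 2L, 3L -> 2L) deterministically yields label 3.
  counts.filter { case (_, c) => c == maxCount }.keys.max
}
```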

You can test with a graph with edges {10L->11L, 10L->12L, 11L->12L, 11L->14L,
12L->14L, 13L->14L, 13L->15L, 13L->16L, 15L->16L, 15L->17L, 16L->17L}, or a
dandelion shape {1L->2L, 2L->7L, 2L->3L, 2L->4L, 2L->5L, 2L->6L}, etc.
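
One way to build the first test graph and run the algorithm, as a sketch that
assumes a SparkContext named sc is in scope (edge attributes are unused by LPA):

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.graphx.lib.LabelPropagation

val edges = sc.parallelize(Seq(
  Edge(10L, 11L, 1), Edge(10L, 12L, 1), Edge(11L, 12L, 1), Edge(11L, 14L, 1),
  Edge(12L, 14L, 1), Edge(13L, 14L, 1), Edge(13L, 15L, 1), Edge(13L, 16L, 1),
  Edge(15L, 16L, 1), Edge(15L, 17L, 1), Edge(16L, 17L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 1)
// Inspect the community label each vertex converges to.
val labels = LabelPropagation.run(graph, maxSteps = 20)
labels.vertices.collect.sortBy(_._1).foreach(println)
```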



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bookling/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14940.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14940


commit 11bdab6bb042cd2102570c96db17279cf6ebbd92
Author: bookling 
Date:   2016-08-30T17:51:43Z

to solve "label shock "

I have test the result, which  is more reasonable
Because the LabelPropagation often suffers "labe shock", and the result 
of communities are often non-adjacent.
 I think the label of node  is  userful between adjacent supersteps, and 
the adjacent supersteps are relevant.

commit bb875fef8f47ec99878d972f2c17b50123375a4c
Author: bookling 
Date:   2016-08-30T17:55:06Z

to reduce "label shock " 

I have test the result, which  is more reasonable
Because the LabelPropagation often suffers "labe shock", and the result 
of communities are often non-adjacent.
 I think the label of node  is  userful between adjacent supersteps, and 
the adjacent supersteps are relevant.

commit 60e6f0ee2a3cdfb2b526a6d12887513f3aabed42
Author: XiaoSen Lee 
Date:   2016-09-02T18:57:29Z

Improve the labelPropagation of the graphx lib



In the labelPropagation of the graphx lib, each node is initialized with a unique
label, and at every step each node adopts the label that most of its
neighbors currently have, while ignoring the label it currently has. I think this is
unreasonable, because the label a node already has is also useful. When a node tends
toward a stable label, there is an association between consecutive iterations,
so a node is affected not only by its neighbors but also by its own current label.
So I changed the code to use both the labels of its neighbors and its own.

Through this iterative process, densely connected groups of nodes form a
consensus on a unique label, and those labels define communities. But the
communities found by LabelPropagation are often discontinuous, because several
labels can be tied for the majority among a node's neighbors. E.g., if node "0"
has 6 neighbors labeled {"1","1","2","2","3","3"}, it may select a label at
random. In order to get stable community labels and prevent this randomness,
I choose the maximum label among the tied candidates.

You can test with a graph with edges {10L->11L, 10L->12L, 11L->12L, 11L->14L,
12L->14L, 13L->14L, 13L->15L, 13L->16L, 15L->16L, 15L->17L, 16L->17L}, or a
dandelion shape {1L->2L, 2L->7L, 2L->3L, 2L->4L, 2L->5L, 2L->6L}, etc.







[GitHub] spark pull request #14880: improve the LabelPropagation of graphx lib, and r...

2016-09-01 Thread bookling
Github user bookling closed the pull request at:

https://github.com/apache/spark/pull/14880





[GitHub] spark issue #14880: reduce the "label shock" of LabelPropagation in Graphx, ...

2016-08-31 Thread bookling
Github user bookling commented on the issue:

https://github.com/apache/spark/pull/14880
  
In the labelPropagation of the graphx lib, each node is initialized with a unique
label, and at every step each node adopts the label that most of its
neighbors currently have, while ignoring the label it currently has. I think this is
unreasonable, because the label a node already has is also useful. When a node tends
toward a stable label, there is an association between consecutive iterations,
so a node is affected not only by its neighbors but also by its own current label.
So I changed the code to use both the labels of its neighbors and its own.

Through this iterative process, densely connected groups of nodes form a
consensus on a unique label, and those labels define communities. But the
communities found by LabelPropagation are often discontinuous, because several
labels can be tied for the majority among a node's neighbors. E.g., if node "0"
has 6 neighbors labeled {"1","1","2","2","3","3"}, it may select a label at
random. In order to get stable community labels and prevent this randomness,
I choose the maximum label among the tied candidates.

You can test with a graph with edges {10L->11L, 11L->12L, 11L->14L, 12L->14L,
13L->14L, 13L->15L, 13L->16L, 15L->16L, 15L->17L, 16L->17L}, or a dandelion
shape {1L->2L, 2L->7L, 2L->3L, 2L->4L, 2L->5L, 2L->6L}, etc.





[GitHub] spark issue #14880: a better result of community, reduce "label shock"

2016-08-30 Thread bookling
Github user bookling commented on the issue:

https://github.com/apache/spark/pull/14880
  
a better result of communities





[GitHub] spark pull request #14880: a better result of community, reduce "label shock...

2016-08-30 Thread bookling
GitHub user bookling opened a pull request:

https://github.com/apache/spark/pull/14880

a better result of community, reduce "label shock"


I tested the result, which is more reasonable.
LabelPropagation often suffers from "label shock", and the resulting
communities are often non-adjacent.
I think a node's label is useful across adjacent supersteps, since
adjacent supersteps are correlated.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bookling/spark patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14880.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14880


commit f1a379324c6916c7eefc2b3e652d3f86009c64cb
Author: bookling 
Date:   2016-08-30T18:13:42Z

a better result of community, reduce "label shock"

I tested the result, which is more reasonable.
LabelPropagation often suffers from "label shock", and the resulting
communities are often non-adjacent.
I think a node's label is useful across adjacent supersteps, since
adjacent supersteps are correlated.







[GitHub] spark pull request #14878: to solve "label shock "

2016-08-30 Thread bookling
Github user bookling closed the pull request at:

https://github.com/apache/spark/pull/14878





[GitHub] spark issue #14878: to solve "label shock "

2016-08-30 Thread bookling
Github user bookling commented on the issue:

https://github.com/apache/spark/pull/14878
  
LabelPropagation often suffers from "label shock", and the resulting
communities are often non-adjacent.
I think a node's label is useful across adjacent supersteps, since
adjacent supersteps are correlated.







[GitHub] spark pull request #14878: to solve "label shock "

2016-08-30 Thread bookling
GitHub user bookling opened a pull request:

https://github.com/apache/spark/pull/14878

to solve "label shock "



LabelPropagation often suffers from "label shock", and the resulting
communities are often non-adjacent.
I think a node's label is useful across adjacent supersteps, since
adjacent supersteps are correlated.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bookling/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14878.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14878


commit 5effc016c893ce917d535cc1b5026d8e4c846721
Author: Wenchen Fan 
Date:   2016-08-05T08:50:26Z

[SPARK-16879][SQL] unify logical plans for CREATE TABLE and CTAS

## What changes were proposed in this pull request?

We have various logical plans for CREATE TABLE and CTAS:
`CreateTableUsing`, `CreateTableUsingAsSelect`, and
`CreateHiveTableAsSelectLogicalPlan`. This PR unifies them to reduce the
complexity and centralize the error handling.

## How was this patch tested?

existing tests

Author: Wenchen Fan 

Closes #14482 from cloud-fan/table.

commit c9f2501af278241f780a38b9562e193755ed5af3
Author: cody koeninger 
Date:   2016-08-05T09:13:32Z

[SPARK-16312][STREAMING][KAFKA][DOC] Doc for Kafka 0.10 integration

## What changes were proposed in this pull request?
Doc for the Kafka 0.10 integration

## How was this patch tested?
Scala code examples were taken from my example repo, so hopefully they 
compile.

Author: cody koeninger 

Closes #14385 from koeninger/SPARK-16312.
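
For reference, a sketch of the direct-stream pattern this doc covers, assuming
the spark-streaming-kafka-0-10 artifact and a StreamingContext named ssc; the
broker address, group id, and topic name are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group")
// Subscribe to one topic and surface (key, value) pairs from each record.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topicA"), kafkaParams))
stream.map(record => (record.key, record.value)).print()
```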

commit e026064143367e4614cb866e321cc521fdde3170
Author: petermaxlee 
Date:   2016-08-05T10:06:36Z

[MINOR] Update AccumulatorV2 doc to not mention "+=".

## What changes were proposed in this pull request?
As reported by Bryan Cutler on the mailing list, AccumulatorV2 does not 
have a += method, yet the documentation still references it.

## How was this patch tested?
N/A

Author: petermaxlee 

Closes #14466 from petermaxlee/accumulator.
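
As a quick illustration of the corrected usage (a sketch assuming a
SparkContext named sc is in scope), values go through add(), not +=:

```scala
// Built-in long accumulator; executors call add(), the driver reads value.
val sum = sc.longAccumulator("sum")
sc.parallelize(1 to 100).foreach(x => sum.add(x))
println(sum.value)  // 5050
```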

commit 39a2b2ea74d420caa37019e3684f65b3a6fcb388
Author: Yuming Wang 
Date:   2016-08-05T15:11:54Z

[SPARK-16625][SQL] General data types to be mapped to Oracle

## What changes were proposed in this pull request?

Spark will convert **BooleanType** to **BIT(1)**, **LongType** to
**BIGINT**, and **ByteType** to **BYTE** when saving a DataFrame to Oracle, but
Oracle does not support the BIT, BIGINT, and BYTE types.

This PR converts the following _Spark types_ to _Oracle types_, per the
[Oracle Developer's Guide](https://docs.oracle.com/cd/E19501-01/819-3659/gcmaz/):

Spark Type | Oracle
--- | ---
BooleanType | NUMBER(1)
IntegerType | NUMBER(10)
LongType | NUMBER(19)
FloatType | NUMBER(19, 4)
DoubleType | NUMBER(19, 4)
ByteType | NUMBER(3)
ShortType | NUMBER(5)
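
A hedged sketch of what such a mapping looks like as a custom JdbcDialect
(the real change lives in Spark's built-in OracleDialect; OracleLikeDialect
here is only an illustrative name):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object OracleLikeDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")
  // Map Spark SQL types to Oracle NUMBER precisions per the table above.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case BooleanType => Some(JdbcType("NUMBER(1)", Types.NUMERIC))
    case IntegerType => Some(JdbcType("NUMBER(10)", Types.NUMERIC))
    case LongType    => Some(JdbcType("NUMBER(19)", Types.NUMERIC))
    case FloatType   => Some(JdbcType("NUMBER(19, 4)", Types.NUMERIC))
    case DoubleType  => Some(JdbcType("NUMBER(19, 4)", Types.NUMERIC))
    case ByteType    => Some(JdbcType("NUMBER(3)", Types.NUMERIC))
    case ShortType   => Some(JdbcType("NUMBER(5)", Types.NUMERIC))
    case _           => None
  }
}

// Registering the dialect makes it apply to matching JDBC URLs.
JdbcDialects.registerDialect(OracleLikeDialect)
```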

## How was this patch tested?

Add new tests in 
[JDBCSuite.scala](https://github.com/wangyum/spark/commit/22b0c2a4228cb8b5098ad741ddf4d1904e745ff6#diff-dc4b58851b084b274df6fe6b189db84d)
 and 
[OracleDialect.scala](https://github.com/wangyum/spark/commit/22b0c2a4228cb8b5098ad741ddf4d1904e745ff6#diff-5e0cadf526662f9281aa26315b3750ad)

Author: Yuming Wang 

Closes #14377 from wangyum/SPARK-16625.

commit 2460f03ffe94154b73995e4f16dd799d1a0f56b8
Author: Sylvain Zimmer 
Date:   2016-08-05T19:55:58Z

[SPARK-16826][SQL] Switch to java.net.URI for parse_url()

## What changes were proposed in this pull request?
The java.net.URL class has a globally synchronized Hashtable, which limits 
the throughput of any single executor doing lots of calls to parse_url(). Tests 
have shown that a 36-core machine can only get to 10% CPU use because the 
threads are locked most of the time.

This patch switches to java.net.URI, which has fewer features than
java.net.URL but focuses on URI parsing, which is enough for parse_url().

New tests were added to make sure a few common edge cases didn't change 
behaviour.
https://issues.apache.org/jira/browse/SPARK-16826
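
A quick illustration of the lock-free parsing path (plain JDK calls, nothing
Spark-specific):

```scala
// java.net.URI parses without touching java.net.URL's globally synchronized
// protocol-handler table, so many threads can parse concurrently.
val uri = new java.net.URI("https://example.com/path?query=1")
println(uri.getHost)   // example.com
println(uri.getPath)   // /path
println(uri.getQuery)  // query=1
```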

## How was this patch tested?
I've kept the old URL code commented for now, so that people can verify 
that the new unit tests do pass with java.net.URL.

Thanks to srowen for the help!

Author: Sylvain Zimmer 

Closes #14488 from sylvinus/master.

commit 180fd3e0a3426db200c97170926afb60751dfd0e
Author: Bryan Cutler 
Date:   2016-08-05T19:57:46Z

[SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs

## What changes were proposed in this pull request?
Improve example outputs to better reflect the