[GitHub] spark pull request #16540: Nullability udfs

2017-01-10 Thread damnMeddlingKid
Github user damnMeddlingKid closed the pull request at:

https://github.com/apache/spark/pull/16540


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16540: Nullability udfs

2017-01-10 Thread damnMeddlingKid
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/16540

Nullability udfs

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/damnMeddlingKid/spark nullability_udfs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16540.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16540


commit 5083b08f511b852c66bc006c41656f0560a893d6
Author: Franklyn D'souza 
Date:   2017-01-10T21:46:10Z

python changes

commit fbe780f84226cf72de801ad7931fa1b4d1acd2e5
Author: Franklyn D'souza 
Date:   2017-01-10T22:49:44Z

introduce nullability for scala udfs

commit a00544c8709dce65d3514546348bdd0459dc25a3
Author: Franklyn D'souza 
Date:   2017-01-10T23:05:34Z

add nullability to scala python udfs

commit d85230263156684adfabecef20293477dd30e957
Author: Franklyn D'souza 
Date:   2017-01-10T23:21:25Z

check for none in wrapped function
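
Based only on the commit titles above (the PR body was left as the template), a rough sketch of what "check for none in wrapped function" could mean on the Python side; every name here is hypothetical, not the actual patch:

```python
# Hypothetical sketch: wrap a UDF declared non-nullable so a None result
# fails loudly instead of silently producing a null column value.
def wrap_non_nullable(f):
    def wrapper(*args):
        result = f(*args)
        if result is None:
            raise ValueError("UDF declared non-nullable returned None")
        return result
    return wrapper
```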







[GitHub] spark pull request #14164: [SPARK-16629] Allow comparisons between UDTs and ...

2016-11-29 Thread damnMeddlingKid
Github user damnMeddlingKid closed the pull request at:

https://github.com/apache/spark/pull/14164





[GitHub] spark issue #14164: [SPARK-16629] Allow comparisons between UDTs and Datatyp...

2016-08-05 Thread damnMeddlingKid
Github user damnMeddlingKid commented on the issue:

https://github.com/apache/spark/pull/14164
  
I've tested this successfully with int and timestamp types, but it doesn't 
seem to work with DecimalType. Anyone know what could be wrong?





[GitHub] spark pull request #14164: Allow comparisons between UDTs and Datatypes

2016-07-12 Thread damnMeddlingKid
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/14164

Allow comparisons between UDTs and Datatypes

## What changes were proposed in this pull request?
Currently UDTs cannot be compared to DataTypes even if their sqlTypes 
match. This leads to errors like the following:

```

In [12]: thresholded = df.filter(df['udt_time'] > threshold)
---
AnalysisException Traceback (most recent call last)
/Users/franklyndsouza/dev/starscream/bin/starscream in ()
> 1 thresholded = df.filter(df['tick_tock_est'] > threshold)

AnalysisException: u"cannot resolve '(`tick_tock_est` > 
TIMESTAMP('2015-10-20 01:00:00.0'))' due to data typ mismatch: 
'(`tick_tock_est` > TIMESTAMP('2015-10-20 01:00:00.0'))' requires (boolean or 
tinyint or smallint or int or bigint or float or double or decimal or timestamp 
or date or string or binary) type, not pythonuserdefined"

```

This PR adds some comparisons that allow UDTs to be correctly compared to a 
DataType.
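
As a sketch of the intended behavior (illustrative only, assuming a UDT column `udt_time` whose underlying sqlType is TimestampType, as in the traceback above):

```python
# With this patch, a UDT column backed by TimestampType should compare
# cleanly against a plain timestamp value instead of failing analysis.
from pyspark.sql import functions as F

threshold = F.lit("2015-10-20 01:00:00").cast("timestamp")
thresholded = df.filter(df["udt_time"] > threshold)  # resolves, no AnalysisException
```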


## How was this patch tested?

Built locally and tested in the pyspark repl.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/damnMeddlingKid/spark fix-df-filtering

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14164


commit d0d31ca18c49fd24476d8b7291cb16d5f346ee6e
Author: Franklyn D'souza 
Date:   2016-07-12T22:17:25Z

allow comparisons between UDTs and Datatypes







[GitHub] spark issue #13717: [SPARK-15811] [SQL] fix the Python UDF in Scala 2.10

2016-06-16 Thread damnMeddlingKid
Github user damnMeddlingKid commented on the issue:

https://github.com/apache/spark/pull/13717
  
Just to take a step back: is the suite lacking coverage for this feature? 
This sort of thing should have been caught in the unit tests.





[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...

2016-02-23 Thread damnMeddlingKid
Github user damnMeddlingKid closed the pull request at:

https://github.com/apache/spark/pull/11333





[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...

2016-02-23 Thread damnMeddlingKid
Github user damnMeddlingKid commented on the pull request:

https://github.com/apache/spark/pull/11333#issuecomment-187964146
  
Hoping to get this into 1.6.1.





[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...

2016-02-23 Thread damnMeddlingKid
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/11333

[SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.

## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be 
correctly tested for dataType equality during union operations.

This was previously causing `"AnalysisException: u"unresolved operator 
'Union;""` when trying to unionAll two dataframes with UDT columns as below.

```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types

schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])

a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)

c = a.unionAll(b)
```
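
A minimal sketch of the equality idea (hypothetical, not the exact patch): give a UDT structural equality so the Union type check treats two instances of the same UDT as the same dataType.

```python
from pyspark.sql.types import ArrayType, DoubleType, UserDefinedType

class PointUDT(UserDefinedType):
    # Illustrative UDT: equality by class, mirroring the PR's intent.
    @classmethod
    def sqlType(cls):
        return ArrayType(DoubleType(), False)

    def __eq__(self, other):
        # two instances of the same UDT class describe the same dataType
        return isinstance(other, PointUDT)

    def __hash__(self):
        return hash(type(self))
```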


## How was this patch tested?

Tested using two unit tests in sql/test.py and the DataFrameSuite. 



Additional information here: https://issues.apache.org/jira/browse/SPARK-13410



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/damnMeddlingKid/spark udt-union-patch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11333.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11333


commit c14d1ba953ecbfa141887f801445ffa8ab280dee
Author: Franklyn D'souza 
Date:   2016-02-23T21:48:20Z

support unionAll for dataframes with UDT columns







[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...

2016-02-23 Thread damnMeddlingKid
Github user damnMeddlingKid closed the pull request at:

https://github.com/apache/spark/pull/11330





[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...

2016-02-23 Thread damnMeddlingKid
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/11330

[SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.

## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be 
correctly tested for dataType equality during union operations.

This was previously causing `"AnalysisException: u"unresolved operator 
'Union;""` when trying to unionAll two dataframes with UDT columns as below.

```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types

schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])

a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)

c = a.unionAll(b)
```


## How was this patch tested?

Tested using two unit tests in sql/test.py and the DataFrameSuite. 



Additional information here: https://issues.apache.org/jira/browse/SPARK-13410



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/damnMeddlingKid/spark udt-union-all

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11330.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11330


commit 6f0f1d9e04a8db47e2f6f8fcfe9dea9de0f633da
Author: Cheng Lian 
Date:   2016-01-25T23:05:05Z

[SPARK-12934][SQL] Count-min sketch serialization

This PR adds serialization support for `CountMinSketch`.

A version number is added to version the serialized binary format.

Author: Cheng Lian 

Closes #10893 from liancheng/cms-serialization.
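
The versioning idea in general form (an illustrative Python sketch; the byte layout here is assumed, not Spark's actual format):

```python
import struct

FORMAT_VERSION = 1  # bumped whenever the binary layout changes

def serialize(payload: bytes) -> bytes:
    # prefix the serialized sketch with a big-endian int version number
    return struct.pack(">i", FORMAT_VERSION) + payload

def deserialize(data: bytes) -> bytes:
    (version,) = struct.unpack(">i", data[:4])
    if version != FORMAT_VERSION:
        raise ValueError("unsupported CountMinSketch format version %d" % version)
    return data[4:]
```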

commit be375fcbd200fb0e210b8edcfceb5a1bcdbba94b
Author: Wenchen Fan 
Date:   2016-01-26T00:23:59Z

[SPARK-12879] [SQL] improve the unsafe row writing framework

As we begin to use the unsafe row writing framework (`BufferHolder` and 
`UnsafeRowWriter`) in more and more places (`UnsafeProjection`, 
`UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should add 
more docs to it and make it easier to use.

This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: 
avoid unnecessary operations as much as possible. For example, do not always 
point the row to the buffer at the end; we only need to update the size of the 
row. If all fields are of primitive type, we can even skip updating the row 
size. Then we can apply this technique to more places easily.

A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this 
PR:
**old version**
```
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
unsafe projection:          Avg Time(ms)    Avg Rate(M/s)    Relative Rate
--------------------------------------------------------------------------
single long                      2616.04          102.61           1.00 X
single nullable long             3032.54           88.52           0.86 X
primitive types                  9121.05           29.43           0.29 X
nullable primitive types        12410.60           21.63           0.21 X
```

**new version**
```
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
unsafe projection:          Avg Time(ms)    Avg Rate(M/s)    Relative Rate
--------------------------------------------------------------------------
single long                      1533.34          175.07           1.00 X
single nullable long             2306.73          116.37           0.66 X
primitive types                  8403.93           31.94           0.18 X
nullable primitive types        12448.39           21.56           0.12 X
```

For a single non-nullable long (the best case), we get about a 1.7x speedup. 
Even when it's nullable, we still get a 1.3x speedup. For other cases the boost 
is smaller, as the saved operations account for only a small proportion of the 
whole process. The benchmark code is included in this PR.

Author: Wenchen Fan 

Closes #10809 from cloud-fan/unsafe-projection.

commit 109061f7ad27225669cbe609ec38756b31d4e1b9
Author: Wenchen Fan 
Date:   2016-01-26T01:58:11Z

[SPARK-12936][SQL] Initial bloom filter implementation

This PR adds an initial implementation of a Bloom filter in the newly added 
sketch module. The implementation is based on the [`BloomFilter` class in 
guava](https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java).

Some differences from the design doc:

* expose `bitSize` instead of `sizeInBytes` to us

[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...

2016-02-23 Thread damnMeddlingKid
Github user damnMeddlingKid commented on the pull request:

https://github.com/apache/spark/pull/11279#issuecomment-187749779
  
@rxin any chance this will make it into 1.6.1?





[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...

2016-02-20 Thread damnMeddlingKid
Github user damnMeddlingKid commented on the pull request:

https://github.com/apache/spark/pull/11279#issuecomment-186738811
  
Yeah, I think it's just the order of the output. I've made the ordering more 
explicit now; I've run these tests on my local machine and they pass.





[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...

2016-02-19 Thread damnMeddlingKid
Github user damnMeddlingKid commented on the pull request:

https://github.com/apache/spark/pull/11279#issuecomment-186467863
  
That should be it.





[GitHub] spark pull request: [SPARK-13410][SQL] Support unionAll for DataFr...

2016-02-19 Thread damnMeddlingKid
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/11279

[SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.

## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be 
correctly tested for dataType equality during union operations.


## How was this patch tested?

Tested using two unit tests in test.py and the DataFrameSuite. These tests 
fail without this patch with 

"AnalysisException: u"unresolved operator 'Union;""

Additional information here: https://issues.apache.org/jira/browse/SPARK-13410



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/damnMeddlingKid/spark udt-union-all

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11279


commit fc8ea19bb4deebfc74bedc1d4092d9c9dd9ace00
Author: Franklyn D'souza 
Date:   2016-02-19T04:40:04Z

support union all for UDT

commit 2642f68ad67bf6d7110d0da9f19daad295695fd1
Author: Franklyn D'souza 
Date:   2016-02-19T21:46:12Z

test unionAll for udt dfs







[GitHub] spark pull request: Kafka streaming

2015-12-03 Thread damnMeddlingKid
Github user damnMeddlingKid closed the pull request at:

https://github.com/apache/spark/pull/10136





[GitHub] spark pull request: Kafka streaming

2015-12-03 Thread damnMeddlingKid
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/10136

Kafka streaming



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Shopify/spark kafka_streaming

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10136.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10136


commit 854319e589c89b2b6b4a9d02916f6f748fc5680a
Author: Fernando Otero (ZeoS) 
Date:   2015-01-08T20:42:54Z

SPARK-5148 [MLlib] Make usersOut/productsOut storagelevel in ALS 
configurable

Author: Fernando Otero (ZeoS) 

Closes #3953 from zeitos/storageLevel and squashes the following commits:

0f070b9 [Fernando Otero (ZeoS)] fix imports
6869e80 [Fernando Otero (ZeoS)] fix comment length
90c9f7e [Fernando Otero (ZeoS)] fix comment length
18a992e [Fernando Otero (ZeoS)] changing storage level

commit d9cad94b1df0200207ba03fb0168373ccc3a8597
Author: Kousuke Saruta 
Date:   2015-01-08T21:43:09Z

[SPARK-4973][CORE] Local directory in the driver of client-mode continues 
remaining even if application finished when external shuffle is enabled

When we enable the external shuffle service, local directories in the driver 
of client mode remain even after the application has finished.
I think local directories for drivers should be deleted.

Author: Kousuke Saruta 

Closes #3811 from sarutak/SPARK-4973 and squashes the following commits:

ad944ab [Kousuke Saruta] Fixed DiskBlockManager to cleanup local directory 
if it's the driver
43770da [Kousuke Saruta] Merge branch 'master' of 
git://git.apache.org/spark into SPARK-4973
88feecd [Kousuke Saruta] Merge branch 'master' of 
git://git.apache.org/spark into SPARK-4973
d99718e [Kousuke Saruta] Fixed SparkSubmit.scala and DiskBlockManager.scala 
in order to delete local directories of the driver of local-mode when external 
shuffle service is enabled

commit b14068bf7b2dff450101d48a59e79761e3ca4eb2
Author: RJ Nowling 
Date:   2015-01-08T23:03:43Z

[SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to 
PySpark MLlib

This is a follow-up to PR 3680: https://github.com/apache/spark/pull/3680 .

Author: RJ Nowling 

Closes #3955 from rnowling/spark4891 and squashes the following commits:

1236a01 [RJ Nowling] Fix Python style issues
7a01a78 [RJ Nowling] Fix Python style issues
174beab [RJ Nowling] [SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp 
dist sampling to PySpark MLlib
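
A hedged usage sketch of the samplers this change adds (RandomRDDs API as it existed in Spark 1.x; the exact signatures are assumed from that era and may differ):

```python
# Draw RDDs of random values from the newly supported distributions;
# `sc` is an existing SparkContext.
from pyspark.mllib.random import RandomRDDs

gamma_rdd = RandomRDDs.gammaRDD(sc, shape=2.0, scale=1.5, size=1000)
log_normal_rdd = RandomRDDs.logNormalRDD(sc, mean=0.0, std=1.0, size=1000)
exponential_rdd = RandomRDDs.exponentialRDD(sc, mean=1.0, size=1000)
```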

commit 5a1b7a9c8a77b6d1ef5553490d0ccf291dfac06f
Author: Marcelo Vanzin 
Date:   2015-01-09T01:15:13Z

[SPARK-4048] Enhance and extend hadoop-provided profile.

This change does a few things to make the hadoop-provided profile more 
useful:

- Create new profiles for other libraries / services that might be provided 
by the infrastructure
- Simplify and fix the poms so that the profiles are only activated while 
building assemblies.
- Fix tests so that they're able to run when the profiles are activated
- Add a new env variable to be used by distributions that use these 
profiles to provide the runtime
  classpath for Spark jobs and daemons.

Author: Marcelo Vanzin 

Closes #2982 from vanzin/SPARK-4048 and squashes the following commits:

82eb688 [Marcelo Vanzin] Add a comment.
eb228c0 [Marcelo Vanzin] Fix borked merge.
4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to 
child processes.
371ebee [Marcelo Vanzin] Review feedback.
52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
322f882 [Marcelo Vanzin] Fix merge fail.
f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9640503 [Marcelo Vanzin] Cleanup child process log message.
115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with 
another pom).
e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile.
7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles.
1be73d4 [Marcelo Vanzin] Restore flume-provided profile.
d1399ed [Marcelo Vanzin] Restore jetty dependency.
82a54b9 [Marcelo Vanzin] Remove unused profile.
5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided 
profiles.
1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver.
f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provi

[GitHub] spark pull request: New spark

2015-10-28 Thread damnMeddlingKid
Github user damnMeddlingKid closed the pull request at:

https://github.com/apache/spark/pull/9342





[GitHub] spark pull request: New spark

2015-10-28 Thread damnMeddlingKid
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/9342

New spark



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Shopify/spark new_spark

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9342.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9342


commit 60b922795d0d6a5e0db96c11416804153e307810
Author: Zhang, Liye 
Date:   2015-01-08T18:40:26Z

[SPARK-4989][CORE] avoid wrong eventlog conf cause cluster down in 
standalone mode

When enabling the event log in standalone mode, a wrong configuration will 
take the standalone cluster down (the master restarts and loses its connection 
with the workers).
How to reproduce: just give an invalid value to "spark.eventLog.dir", for 
example spark.eventLog.dir=hdfs://tmp/logdir1, hdfs://tmp/logdir2. This throws 
an IllegalArgumentException, which causes the Master to restart and leaves the 
whole cluster unavailable.

Author: Zhang, Liye 

Closes #3824 from liyezhang556520/wrongConf4Cluster and squashes the 
following commits:

3c24d98 [Zhang, Liye] revert change with logwarning and excetption for 
FileNotFoundException
3c1ac2e [Zhang, Liye] change var to val
a49c52f [Zhang, Liye] revert wrong modification
12eee85 [Zhang, Liye] add more message in log and on webUI
5c1fa33 [Zhang, Liye] cache exceptions when eventlog with wrong conf

commit a9940b5a04c905698f17940669a161fcd414284f
Author: Kousuke Saruta 
Date:   2015-01-08T19:35:56Z

[Minor] Fix the value represented by spark.executor.id for consistency.

The property  `spark.executor.id` can represent both `driver` and 
``  for one driver.
It's inconsistent.

This issue is minor so I didn't file this in JIRA.

Author: Kousuke Saruta 

Closes #3812 from sarutak/fix-driver-identifier and squashes the following 
commits:

d885498 [Kousuke Saruta] Merge branch 'master' of 
git://git.apache.org/spark into fix-driver-identifier
4275663 [Kousuke Saruta] Fixed the value represented by spark.executor.id 
of local mode

commit b4fb97df2cbdd743656e000fefe471406619220c
Author: WangTaoTheTonic 
Date:   2015-01-08T19:45:42Z

[SPARK-5130][Deploy]Take yarn-cluster as cluster mode in spark-submit

https://issues.apache.org/jira/browse/SPARK-5130

Author: WangTaoTheTonic 

Closes #3929 from WangTaoTheTonic/SPARK-5130 and squashes the following 
commits:

c490648 [WangTaoTheTonic] take yarn-cluster as cluster mode in spark-submit

commit 31d67152c2cbbe2e076003b3ff0d0a7e2f801549
Author: Eric Moyer 
Date:   2015-01-08T19:55:23Z

Document that groupByKey will OOM for large keys

This pull request is my own work and I license it under Spark's open-source 
license.

This contribution is an improvement to the documentation. I documented that 
the maximum number of values per key for groupByKey is limited by available RAM 
(see [Datablox][datablox link] and [the spark mailing list][list link]).

Just saying that better performance is available is not sufficient. 
Sometimes you need to do a group-by - your operation needs all the items 
available in order to complete. This warning explains the problem.

[datablox link]: 
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
[list link]: 
http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11466.html

Author: Eric Moyer 

Closes #3936 from RadixSeven/better-group-by-docs and squashes the 
following commits:

5b6f4e9 [Eric Moyer] groupByKey docs naming updates
238e81b [Eric Moyer] Doc that groupByKey will OOM for large keys
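
The guidance above in code form (a sketch of the usual pattern; `rdd` is an assumed RDD of words):

```python
# reduceByKey combines values map-side, so no single key's value list
# ever has to fit in one executor's memory.
counts = rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# groupByKey materializes every value for a key on a single executor
# and can OOM for hot keys, which is what this doc change warns about.
grouped = rdd.map(lambda w: (w, 1)).groupByKey()
```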

commit 854319e589c89b2b6b4a9d02916f6f748fc5680a
Author: Fernando Otero (ZeoS) 
Date:   2015-01-08T20:42:54Z

SPARK-5148 [MLlib] Make usersOut/productsOut storagelevel in ALS 
configurable

Author: Fernando Otero (ZeoS) 

Closes #3953 from zeitos/storageLevel and squashes the following commits:

0f070b9 [Fernando Otero (ZeoS)] fix imports
6869e80 [Fernando Otero (ZeoS)] fix comment length
90c9f7e [Fernando Otero (ZeoS)] fix comment length
18a992e [Fernando Otero (ZeoS)] changing storage level

commit d9cad94b1df0200207ba03fb0168373ccc3a8597
Author: Kousuke Saruta 
Date:   2015-01-08T21:43:09Z

[SPARK-4973][CORE] Local directory in the driver of client-mode continues 
remaining even if application finished when external shuffle is enabled

When we enable the external shuffle service, local directories in the driver 
of client mode remain even after the application has finished.
I 

[GitHub] spark pull request: Update spark

2015-10-28 Thread damnMeddlingKid
Github user damnMeddlingKid closed the pull request at:

https://github.com/apache/spark/pull/9341





[GitHub] spark pull request: Update spark

2015-10-28 Thread damnMeddlingKid
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/9341

Update spark



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Shopify/spark update_spark

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9341.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9341


commit 4639372eb9325f466b78d074bfc24d8d5a93322e
Author: Alex Angelini 
Date:   2015-10-19T17:07:39Z

[SPARK-9643] Upgrade pyrolite to 4.9

Includes: https://github.com/irmen/Pyrolite/pull/23 which fixes datetimes 
with timezones.

JoshRosen

https://issues.apache.org/jira/browse/SPARK-9643

Author: Alex Angelini 

Closes #7950 from angelini/upgrade_pyrolite_up.

commit 7b4cd3da570c098da5adef82d394c84d3df8d602
Author: Holden Karau 
Date:   2015-10-20T17:52:49Z

[SPARK-10447][SPARK-3842][PYSPARK] upgrade pyspark to py4j0.9

Upgrade to Py4j0.9

Author: Holden Karau 
Author: Holden Karau 

Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.

commit 56c6e7846e00c7deacf8349a93e517c7ed496ee5
Author: Nick Evans 
Date:   2015-10-27T08:29:06Z

[SPARK-11270][STREAMING] Add improved equality testing for 
TopicAndPartition from the Kafka Streaming API

jerryshao tdas

I know this is kind of minor, and I know you all are busy, but this brings 
this class in line with the `OffsetRange` class, and makes tests a little more 
concise.

Instead of doing something like:
```
assert topic_and_partition_instance._topic == "foo"
assert topic_and_partition_instance._partition == 0
```

You can do something like:
```
assert topic_and_partition_instance == TopicAndPartition("foo", 0)
```

Before:
```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
False
```

After:
```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
True
```

I couldn't find any tests - am I missing something?

Author: Nick Evans 

Closes #9236 from manygrams/topic_and_partition_equality.







[GitHub] spark pull request: Packserv

2015-10-16 Thread damnMeddlingKid
Github user damnMeddlingKid closed the pull request at:

https://github.com/apache/spark/pull/9151





[GitHub] spark pull request: Packserv

2015-10-16 Thread damnMeddlingKid
GitHub user damnMeddlingKid opened a pull request:

https://github.com/apache/spark/pull/9151

Packserv



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Shopify/spark packserv

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9151.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9151


commit 60fde12bc4e824c1447db69f92387f35e9b67331
Author: hushan[胡珊] 
Date:   2015-01-07T20:09:12Z

[SPARK-5132][Core]Correct stage Attempt Id key in stageInfofromJson

SPARK-5132:
stageInfoToJson writes the key "Stage Attempt Id", but stageInfoFromJson 
looks up "Attempt Id".

Author: hushan[胡珊] 

Closes #3932 from suyanNone/json-stage and squashes the following commits:

41419ab [hushan[胡珊]] Correct stage Attempt Id key in stageInfofromJson
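
The mismatch in miniature (illustrative Python; the real code is Scala in JsonProtocol):

```python
# The writer and reader disagreed on the key name, so the attempt id
# was silently dropped on the round trip.
written = {"Stage Attempt Id": 1}       # what stageInfoToJson emitted
attempt = written.get("Attempt Id")     # what stageInfoFromJson looked up
assert attempt is None                  # the bug this commit fixes
```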

commit 65c9e1022521053e130220802bbfddd1dba0733e
Author: zsxwing 
Date:   2015-01-08T07:01:30Z

[SPARK-5126][Core] Verify Spark urls before creating Actors so that invalid 
urls can crash the process.

Because `actorSelection` will return `deadLetters` for an invalid path, the 
Worker keeps quiet for an invalid master URL. It's better to log an error so 
that people can find such problems quickly.

This PR checks the URL before sending it to `actorSelection`, throwing and 
logging a SparkException for an invalid URL.

Author: zsxwing 

Closes #3927 from zsxwing/SPARK-5126 and squashes the following commits:

9d429ee [zsxwing] Create a utility method in Utils to parse Spark url; 
verify urls before creating Actors so that invalid urls can crash the process.
8286e51 [zsxwing] Check the url before sending to Akka and log the error if 
the url is invalid
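
The fail-fast idea sketched in Python (hypothetical helper; the actual check is in Scala):

```python
# Validate a spark://host:port master URL up front so a typo fails
# loudly instead of being silently routed to deadLetters.
from urllib.parse import urlparse

def validate_spark_url(url: str) -> None:
    parsed = urlparse(url)
    if parsed.scheme != "spark" or not parsed.hostname or parsed.port is None:
        raise ValueError("Invalid master URL: %s" % url)
```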

commit 536b82f9cb5535e57393eee401ebddad524aee26
Author: Shuo Xiang 
Date:   2015-01-08T07:22:37Z

[SPARK-5116][MLlib] Add extractor for SparseVector and DenseVector

Add extractors for SparseVector and DenseVector in MLlib to save some code 
when pattern matching on Vectors. For example, previously we might write:

 vec match {
   case dv: DenseVector =>
     val values = dv.values
     ...
   case sv: SparseVector =>
     val indices = sv.indices
     val values = sv.values
     val size = sv.size
     ...
 }

With the extractors it becomes:

 vec match {
   case DenseVector(values) =>
     ...
   case SparseVector(size, indices, values) =>
     ...
 }

Author: Shuo Xiang 

Closes #3919 from coderxiang/extractor and squashes the following commits:

359e8d5 [Shuo Xiang] merge master
ca5fc3e [Shuo Xiang] merge master
0b1e190 [Shuo Xiang] use extractor for vectors in RowMatrix.scala
e961805 [Shuo Xiang] use extractor for vectors in StandardScaler.scala
c2bbdaf [Shuo Xiang] use extractor for vectors in IDFscala
8433922 [Shuo Xiang] use extractor for vectors in NaiveBayes.scala and 
Normalizer.scala
d83c7ca [Shuo Xiang] use extractor for vectors in Vectors.scala
5523dad [Shuo Xiang] Add extractor for SparseVector and DenseVector

commit 0114e817977782e2e9ae6eeb3d2719f5aa76148b
Author: Sandy Ryza 
Date:   2015-01-08T17:25:43Z

SPARK-5087. [YARN] Merge yarn.Client and yarn.ClientBase

Author: Sandy Ryza 

Closes #3896 from sryza/sandy-spark-5087 and squashes the following commits:

65611d0 [Sandy Ryza] Review feedback
3294176 [Sandy Ryza] SPARK-5087. [YARN] Merge yarn.Client and 
yarn.ClientBase

commit 46dca8c79d6de431a8088f1346ddd500d91a7203
Author: Takeshi Yamamuro 
Date:   2015-01-08T17:55:12Z

[SPARK-4917] Add a function to convert into a graph with canonical edges in 
GraphOps

Convert bi-directional edges into uni-directional ones instead of 
'canonicalOrientation' in GraphLoader.edgeListFile.
This function is useful when a graph is loaded as it is and then is 
transformed into one with canonical edges.
It rewrites the vertex ids of edges so that srcIds are bigger than dstIds, 
and merges the duplicated edges.

Author: Takeshi Yamamuro 

Closes #3760 from maropu/ConvertToCanonicalEdgesSpike and squashes the 
following commits:

7f8b580 [Takeshi Yamamuro] Add a function to convert into a graph with 
canonical edges in GraphOps

commit 60b922795d0d6a5e0db96c11416804153e307810
Author: Zhang, Liye 
Date:   2015-01-08T18:40:26Z

[SPARK-4989][CORE] avoid wrong eventlog conf cause cluster down in 
standalone mode

When enabling the event log in standalone mode, a wrong configuration will 
take the standalone cluster down (the master restarts and loses its connection 
with the workers).
How to reproduce: ju