[GitHub] spark pull request #23179: Fix the rat excludes on .policy.yml

2018-11-29 Thread justinuang
Github user justinuang closed the pull request at: https://github.com/apache/spark/pull/23179 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h

[GitHub] spark pull request #23179: Fix the rat excludes on .policy.yml

2018-11-29 Thread justinuang
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/23179 Fix the rat excludes on .policy.yml ## What changes were proposed in this pull request? Fix the rat excludes on .policy.yml You can merge this pull request into a Git repository

[GitHub] spark issue #20877: [SPARK-23765][SQL] Supports custom line separator for js...

2018-11-27 Thread justinuang
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/20877 Sorry, I won't be able to take it over!

[GitHub] spark pull request #23051: [AE2.3-02][SPARK-23128] Add QueryStage and the fr...

2018-11-15 Thread justinuang
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/23051 [AE2.3-02][SPARK-23128] Add QueryStage and the framework for adaptive execution (auto setting the number of reducer) ## What changes were proposed in this pull request? Add QueryStage

[GitHub] spark pull request #23051: [AE2.3-02][SPARK-23128] Add QueryStage and the fr...

2018-11-15 Thread justinuang
Github user justinuang closed the pull request at: https://github.com/apache/spark/pull/23051

[GitHub] spark pull request #22968: Merge upstream

2018-11-07 Thread justinuang
Github user justinuang closed the pull request at: https://github.com/apache/spark/pull/22968

[GitHub] spark pull request #22968: Merge upstream

2018-11-07 Thread justinuang
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/22968 Merge upstream ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how

[GitHub] spark pull request #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in...

2018-10-18 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/22503#discussion_r226386187 --- Diff: sql/core/src/test/resources/test-data/cars-crlf.csv --- @@ -0,0 +1,7 @@ + +year,make,model,comment,blank +"2012",

[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...

2018-10-17 Thread justinuang
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/22503 done!

[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...

2018-10-16 Thread justinuang
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/22503 So Hadoop's LineReader looks like it handles CR, LF, CRLF: https://github.com/apache/hadoop/blob/f90c64e6242facf38c2baedeeda42e4a8293e642/hadoop-common-project/hadoop-common/src/main
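As an illustration of the line-terminator handling mentioned in the comment above, here is a minimal Python sketch that splits text on CR, LF, and CRLF. It is not Hadoop's LineReader and makes no claim about how that class is implemented; the function name `split_lines` is hypothetical.

```python
import re
from typing import List

# CRLF is listed first so that "\r\n" is treated as a single terminator,
# followed by a lone CR or a lone LF.
_LINE_TERMINATOR = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> List[str]:
    """Split text on CR, LF, or CRLF (hypothetical helper, for illustration only)."""
    lines = _LINE_TERMINATOR.split(text)
    # A trailing terminator leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


print(split_lines("a\r\nb\rc\nd"))
```

Ordering the alternatives with CRLF first is the key design choice: otherwise a Windows-style line ending would be counted as two terminators and produce a spurious empty line.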

[GitHub] spark pull request #22680: [SPARK-25493][SQL] Use auto-detection for CRLF in...

2018-10-09 Thread justinuang
Github user justinuang closed the pull request at: https://github.com/apache/spark/pull/22680

[GitHub] spark pull request #22680: [SPARK-25493][SQL] Use auto-detection for CRLF in...

2018-10-09 Thread justinuang
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/22680 [SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode ## Upstream SPARK-X ticket and PR link (if not applicable, explain) Went through review

[GitHub] spark pull request #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in...

2018-10-02 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/22503#discussion_r222053706 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala --- @@ -212,6 +212,8 @@ class CSVOptions

[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...

2018-09-28 Thread justinuang
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/22503 What does it take to get this to be merged in?

[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...

2018-09-26 Thread justinuang
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/22503 Sounds good, thanks guys =)

[GitHub] spark issue #22503: [SPARK-25493][SQL] Use auto-detection for CRLF in CSV da...

2018-09-25 Thread justinuang
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/22503 It looks like a flake? Can someone retrigger it? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96511/console

[GitHub] spark pull request #22503: [SPARK-25493] [SQL] Fix multiline crlf

2018-09-20 Thread justinuang
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/22503 [SPARK-25493] [SQL] Fix multiline crlf ## What changes were proposed in this pull request? CSVs with Windows-style CRLF (carriage return, line feed) don't work in multiline mode

[GitHub] spark issue #19591: [SPARK-11035][core] Add in-process Spark app launcher.

2017-10-30 Thread justinuang
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/19591 Really looking forward to this PR! For our use case, it will reduce our spark launch times by ~4 seconds

[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...

2017-08-15 Thread justinuang
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/15009 That would be incredible. Launching a new jvm and loading all of hadoop takes about 4 seconds extra each time, versus reusing the launcher jvm, which is really significant for us since we launch

[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...

2017-08-11 Thread justinuang
Github user justinuang commented on the issue: https://github.com/apache/spark/pull/15009 @kishorvpatil this will be quite useful for us! To avoid the 3s cost of spinning up a new jvm just for yarn-cluster --- If your project is set up for it, you can reply to this email and have

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-10-05 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r41220284 --- Diff: python/setup.py --- @@ -0,0 +1,18 @@ +#!/usr/bin/env python + +from setuptools import setup + +exec(compile(open("py

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-10-05 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r41221064 --- Diff: python/pyspark/pyspark_version.py --- @@ -0,0 +1,17 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-29 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-144187766 Thanks for the reminder!

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-29 Thread justinuang
Github user justinuang closed the pull request at: https://github.com/apache/spark/pull/8662

[GitHub] spark pull request: [SPARK-10685] [SPARK-8632] [SQL] [PYSPARK] Pyt...

2015-09-21 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8833#discussion_r40048503 --- Diff: python/pyspark/sql/functions.py --- @@ -1414,7 +1414,7 @@ def __init__(self, func, returnType, name=None): def _create_judf(self, name

[GitHub] spark pull request: [SPARK-10685] [SPARK-8632] [SQL] [PYSPARK] Pyt...

2015-09-21 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8833#issuecomment-142161491 lgtm! So this avoids deadlock by getting rid of the blocking queue (duh!) and then assumes the OS buffer will rate limit how much gets buffered on the writer side

[GitHub] spark pull request: [SPARK-10685] [SPARK-8632] [SQL] [PYSPARK] Pyt...

2015-09-20 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8833#discussion_r39933648 --- Diff: python/pyspark/sql/functions.py --- @@ -1414,7 +1414,7 @@ def __init__(self, func, returnType, name=None): def _create_judf(self, name

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141121225 @rxin what do you mean by local iterators =) I feel like i'm missing some context that you guys have

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141117878 The solution with the iterator wrapper was my first approach that I prototyped (http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-17 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-141232211 I'm not sure there is a solution that satisfies all the requirements. I can say that this approach addresses 1,2,4 by design. Would you guys support a 1.6.0

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-09-16 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8318#issuecomment-140871937 Thanks! Sorry for being demanding, was just hoping to get this into 1.6.0!

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140920743 @davies how do I have a private class in python? In addition, is it possible that the failing unit test is flaky? I ran ./run-tests --python

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-16 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140936346 Hey davies, I think the performance regression for a single UDF may be because there were multiple threads per task that could potentially be taking up CPU time. I

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-09-16 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8318#issuecomment-140866466 What is this blocking on?

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-15 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140413982 Jenkins, retest this please

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-15 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140416126 @rxin or @davies why is this automatically not retriggering when i push a new commit? Also, looks like the "retest this please" only works with

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-14 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140223207 Looks like your intuition was right. The second time it's slightly faster, so I ran the loop twice and took the 2nd's numbers Here are the updated numbers

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-14 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-140181530 Sorry for the delay, here is the code I ran and here are the results from pyspark.sql.functions import udf from pyspark.sql.types import

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-10 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139331466 Hey davies, I don't have any numbers. Are there any benchmarks that we can just rerun?

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-10 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139332688 Is there an example of another benchmark? I'm not sure where they're stored for python

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-09 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-139023500 Should the build have started by now?

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-08 Thread justinuang
GitHub user justinuang opened a pull request: https://github.com/apache/spark/pull/8662 [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF performance because of RDD caching - I wanted to reuse most of the logic from PythonRDD, so I pulled out two methods

[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

2015-09-08 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8662#issuecomment-138758861 @davies @JoshRosen @rxin

[GitHub] spark pull request: [SPARK-10447][WIP][PYSPARK] upgrade pyspark to...

2015-09-07 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8615#issuecomment-138326562 I think we are missing some of the references to 0.8.2.1 git grep py4j- LICENSE:For Py4J (python/lib/py4j-0.8.2.1-src.zip) bin

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37563567 --- Diff: python/pyspark/__init__.py --- @@ -36,6 +36,31 @@ Finer-grained cache persistence levels. +import os +import sys

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37524370 --- Diff: python/setup.py --- @@ -0,0 +1,19 @@ +#!/usr/bin/env python + +from setuptools import setup + +exec(compile(open(pyspark

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37524804 --- Diff: python/pyspark/pyspark_version.py --- @@ -0,0 +1,17 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/8318#issuecomment-132998065 @holdenk , thanks for working on this! Do we have plans to set up PyPI publishing?

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37570377 --- Diff: python/pyspark/__init__.py --- @@ -36,6 +36,31 @@ Finer-grained cache persistence levels. +import os +import sys

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37570006 --- Diff: python/pyspark/__init__.py --- @@ -36,6 +36,31 @@ Finer-grained cache persistence levels. +import os +import sys

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-20 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37570574 --- Diff: python/pyspark/pyspark_version.py --- @@ -0,0 +1,17 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-19 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37459103 --- Diff: python/pyspark/__init__.py --- @@ -36,6 +36,33 @@ Finer-grained cache persistence levels. +import os +import sys

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-19 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37458924 --- Diff: python/setup.py --- @@ -0,0 +1,19 @@ +#!/usr/bin/env python + +from setuptools import setup + +exec(compile(open(pyspark

[GitHub] spark pull request: [SPARK-1267][PYSPARK] Adds pip installer for p...

2015-08-19 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/8318#discussion_r37459213 --- Diff: python/pyspark/__init__.py --- @@ -36,6 +36,33 @@ Finer-grained cache persistence levels. +import os +import sys

[GitHub] spark pull request: [SPARK-7899][PYSPARK] Fix Python 3 pyspark/sql...

2015-05-27 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/6439#discussion_r31166415 --- Diff: python/run-tests --- @@ -57,54 +57,54 @@ function run_test() { function run_core_tests() { echo Run core tests

[GitHub] spark pull request: [SPARK-7899][PYSPARK] Fix Python 3 pyspark/sql...

2015-05-27 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/6439#discussion_r31169697 --- Diff: python/run-tests --- @@ -57,54 +57,54 @@ function run_test() { function run_core_tests() { echo Run core tests

[GitHub] spark pull request: [SPARK-7329][MLLIB] simplify ParamGridBuilder ...

2015-05-03 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/5873#issuecomment-98541714 You can consider using set equality for the test, but other than that, it looks good! Thanks!
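The set-equality suggestion in the comment above could look roughly like the following sketch. It assumes the values under test are a list of parameter maps, modeled here as plain Python dicts with hashable values; the helper name `as_comparable` and the sample parameter names are hypothetical and are not taken from the pull request.

```python
def as_comparable(param_maps):
    """Convert a list of dicts into a set of frozensets so ordering is ignored."""
    return {frozenset(m.items()) for m in param_maps}


# Hypothetical expected and actual grids, listed in different orders.
expected = [{"maxIter": 1, "regParam": 0.1}, {"maxIter": 2, "regParam": 0.1}]
actual = [{"regParam": 0.1, "maxIter": 2}, {"maxIter": 1, "regParam": 0.1}]

# Passes regardless of the order of the maps or of the keys within each map.
assert as_comparable(actual) == as_comparable(expected)
```

One trade-off to keep in mind: a set comparison also ignores duplicates, so a test that must catch repeated entries would need an additional length check.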

[GitHub] spark pull request: [SPARK-7022][PySpark][ML] Add ML.Tuning.ParamG...

2015-05-02 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/5601#issuecomment-98386185 Yea, you should try rebasing. It looks like you're not the only one running into this. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder

[GitHub] spark pull request: [SPARK-7022][PySpark][ML] Add ML.Tuning.ParamG...

2015-04-30 Thread justinuang
Github user justinuang commented on a diff in the pull request: https://github.com/apache/spark/pull/5601#discussion_r29482941 --- Diff: python/pyspark/ml/tuning.py --- @@ -0,0 +1,94 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-2213][SQL] Sort Merge Join

2015-02-15 Thread justinuang
Github user justinuang commented on the pull request: https://github.com/apache/spark/pull/3173#issuecomment-74416823 Hi, this looks great! Is there a reason why sort based join is not in spark core, only in spark SQL?