[jira] [Created] (SPARK-2494) Hash of None is different across machines in CPython

2014-07-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2494: - Summary: Hash of None is different across machines in CPython Key: SPARK-2494 URL: https://issues.apache.org/jira/browse/SPARK-2494 Project: Spark Issue Type: Bug

[jira] [Created] (SPARK-2538) External aggregation in Python

2014-07-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2538: - Summary: External aggregation in Python Key: SPARK-2538 URL: https://issues.apache.org/jira/browse/SPARK-2538 Project: Spark Issue Type: Improvement

[jira] [Commented] (SPARK-2494) Hash of None is different across machines in CPython

2014-07-17 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065240#comment-14065240 ] Davies Liu commented on SPARK-2494: --- This bug only happens in cluster mode, so it can

[jira] [Commented] (SPARK-2494) Hash of None is different across machines in CPython

2014-07-17 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065317#comment-14065317 ] Davies Liu commented on SPARK-2494: --- The tip version already handles hash of None, but it
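For context on the two comments above: CPython derives hash(None) from the object's memory address, so the value differs between interpreter processes on different machines, which breaks hash partitioning of None keys in cluster mode. A minimal sketch of the usual fix, a hash helper that special-cases None (illustrative only; the exact helper in PySpark may differ):

{code}
def portable_hash(x):
    """Hash that avoids CPython's address-based default hash for None.

    hash(None) changes from process to process, so pinning it to a
    constant keeps None keys on the same partition everywhere.
    """
    if x is None:
        return 0
    if isinstance(x, tuple):
        h = 0x345678
        for item in x:
            h = (h * 1000003) ^ portable_hash(item)
        return h & 0x7fffffff
    return hash(x)

num_partitions = 4
print(portable_hash(None) % num_partitions)          # same partition on every machine
print(portable_hash(("key", None)) % num_partitions)
{code}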

[jira] [Created] (SPARK-2630) Input data size overflows when size is larger than 4G in one task

2014-07-22 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2630: - Summary: Input data size overflows when size is larger than 4G in one task Key: SPARK-2630 URL: https://issues.apache.org/jira/browse/SPARK-2630 Project: Spark

[jira] [Updated] (SPARK-2630) Input data size overflows when size is larger than 4G in one task

2014-07-22 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-2630: -- Attachment: overflow.tiff The input size is shown as 5.8MB, but the real input size is 4.3G. Input

[jira] [Updated] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-07-23 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-2630: -- Summary: Input data size of CoalescedRDD is incorrect (was: Input data size overflows when size

[jira] [Created] (SPARK-2652) Tuning default configurations for PySpark

2014-07-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2652: - Summary: Tuning default configurations for PySpark Key: SPARK-2652 URL: https://issues.apache.org/jira/browse/SPARK-2652 Project: Spark Issue Type: Improvement

[jira] [Created] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode

2014-07-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2653: - Summary: Heap size should be the sum of driver.memory and executor.memory in local mode Key: SPARK-2653 URL: https://issues.apache.org/jira/browse/SPARK-2653 Project:

[jira] [Created] (SPARK-2654) Leveled logging in PySpark

2014-07-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2654: - Summary: Leveled logging in PySpark Key: SPARK-2654 URL: https://issues.apache.org/jira/browse/SPARK-2654 Project: Spark Issue Type: Improvement

[jira] [Created] (SPARK-2655) Change the default logging level to WARN

2014-07-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2655: - Summary: Change the default logging level to WARN Key: SPARK-2655 URL: https://issues.apache.org/jira/browse/SPARK-2655 Project: Spark Issue Type: Improvement

[jira] [Created] (SPARK-2672) support compressed file in wholeFile()

2014-07-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2672: - Summary: support compressed file in wholeFile() Key: SPARK-2672 URL: https://issues.apache.org/jira/browse/SPARK-2672 Project: Spark Issue Type: Improvement

[jira] [Commented] (SPARK-2674) Add date and time types to inferSchema

2014-07-25 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075272#comment-14075272 ] Davies Liu commented on SPARK-2674: --- Date and time in Python will be converted into

[jira] [Commented] (SPARK-1687) Support NamedTuples in RDDs

2014-07-28 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076849#comment-14076849 ] Davies Liu commented on SPARK-1687: --- Dill is implemented in pure Python, so it will have

[jira] [Commented] (SPARK-2655) Change the default logging level to WARN

2014-07-28 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076863#comment-14076863 ] Davies Liu commented on SPARK-2655: --- [~pwendell] [~matei], what do you think about this?

[jira] [Commented] (SPARK-1343) PySpark OOMs without caching

2014-07-28 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077011#comment-14077011 ] Davies Liu commented on SPARK-1343: --- Maybe it's related to partitionBy() with small

[jira] [Resolved] (SPARK-1343) PySpark OOMs without caching

2014-07-28 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-1343. --- Resolution: Fixed Fix Version/s: 0.9.0 1.0.0 Target Version/s:

[jira] [Commented] (SPARK-1343) PySpark OOMs without caching

2014-07-28 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077012#comment-14077012 ] Davies Liu commented on SPARK-1343: --- https://github.com/apache/spark/pull/1460

[jira] [Commented] (SPARK-2023) PySpark reduce does a map-side reduce and then sends the results to the driver for the final reduce; instead, do this more like Scala Spark.

2014-07-28 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077075#comment-14077075 ] Davies Liu commented on SPARK-2023: --- In most cases, the result of reduce will be small,
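The scheme discussed in this comment — reduce each partition locally so only small per-partition results travel to the driver, where the final reduce happens — can be sketched in plain PySpark (a minimal illustration under that assumption, not the actual implementation):

{code}
from functools import reduce as freduce
from operator import add

from pyspark import SparkContext

sc = SparkContext("local[2]", "reduce-sketch")
rdd = sc.parallelize(range(1000), 4)

def partition_reduce(iterator):
    # Reduce within a single partition; an empty partition yields nothing.
    acc, empty = None, True
    for x in iterator:
        acc = x if empty else add(acc, x)
        empty = False
    if not empty:
        yield acc

# Map-side reduce first; the small partial results are then collected
# and combined on the driver.
partials = rdd.mapPartitions(partition_reduce).collect()
print(freduce(add, partials))  # 499500
sc.stop()
{code}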

[jira] [Commented] (SPARK-791) [pyspark] operator.getattr not serialized

2014-07-28 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077239#comment-14077239 ] Davies Liu commented on SPARK-791: -- This will be fixed by PR-1627[1] [1]

[jira] [Commented] (SPARK-1630) PythonRDDs don't handle nulls gracefully

2014-07-29 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077465#comment-14077465 ] Davies Liu commented on SPARK-1630: --- If an RDD is generated in Scala/Java by user code,

[jira] [Commented] (SPARK-2012) PySpark StatCounter with numpy arrays

2014-07-30 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080001#comment-14080001 ] Davies Liu commented on SPARK-2012: --- Yes, plz! PySpark StatCounter with numpy arrays

[jira] [Created] (SPARK-2789) Apply names to RDD to become a SchemaRDD

2014-08-01 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2789: - Summary: Apply names to RDD to become a SchemaRDD Key: SPARK-2789 URL: https://issues.apache.org/jira/browse/SPARK-2789 Project: Spark Issue Type: New Feature

[jira] [Created] (SPARK-2891) Daemon failed to launch worker

2014-08-06 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2891: - Summary: Daemon failed to launch worker Key: SPARK-2891 URL: https://issues.apache.org/jira/browse/SPARK-2891 Project: Spark Issue Type: Bug Components:

[jira] [Commented] (SPARK-2887) RDD.countApproxDistinct() is wrong when RDD has more than one partition

2014-08-06 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088822#comment-14088822 ] Davies Liu commented on SPARK-2887: --- Yes, it only happens in master and 1.1, thanks.

[jira] [Updated] (SPARK-2898) Failed to connect to daemon

2014-08-07 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-2898: -- Description: Java options:

[jira] [Updated] (SPARK-2898) Failed to connect to daemon

2014-08-07 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-2898: -- Description: Java options:

[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor

2014-08-11 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093137#comment-14093137 ] Davies Liu commented on SPARK-1284: --- [~jblomo], could you reproduce this on master or

[jira] [Resolved] (SPARK-2891) Daemon failed to launch worker

2014-08-11 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-2891. --- Resolution: Duplicate Fix Version/s: 1.1.0 duplicate of SPARK-2898 Daemon failed to launch

[jira] [Created] (SPARK-2983) improve performance of sortByKey()

2014-08-12 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2983: - Summary: improve performance of sortByKey() Key: SPARK-2983 URL: https://issues.apache.org/jira/browse/SPARK-2983 Project: Spark Issue Type: Improvement

[jira] [Commented] (SPARK-1065) PySpark runs out of memory with large broadcast variables

2014-08-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094947#comment-14094947 ] Davies Liu commented on SPARK-1065: --- The broadcast was not used correctly in the above

[jira] [Comment Edited] (SPARK-1065) PySpark runs out of memory with large broadcast variables

2014-08-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094947#comment-14094947 ] Davies Liu edited comment on SPARK-1065 at 8/13/14 12:33 AM: -

[jira] [Comment Edited] (SPARK-1065) PySpark runs out of memory with large broadcast variables

2014-08-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094947#comment-14094947 ] Davies Liu edited comment on SPARK-1065 at 8/13/14 12:32 AM: -

[jira] [Comment Edited] (SPARK-1065) PySpark runs out of memory with large broadcast variables

2014-08-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094947#comment-14094947 ] Davies Liu edited comment on SPARK-1065 at 8/13/14 12:34 AM: -

[jira] [Comment Edited] (SPARK-1065) PySpark runs out of memory with large broadcast variables

2014-08-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094947#comment-14094947 ] Davies Liu edited comment on SPARK-1065 at 8/13/14 12:33 AM: -

[jira] [Comment Edited] (SPARK-1065) PySpark runs out of memory with large broadcast variables

2014-08-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094947#comment-14094947 ] Davies Liu edited comment on SPARK-1065 at 8/13/14 12:34 AM: -

[jira] [Comment Edited] (SPARK-1065) PySpark runs out of memory with large broadcast variables

2014-08-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094947#comment-14094947 ] Davies Liu edited comment on SPARK-1065 at 8/13/14 12:34 AM: -

[jira] [Commented] (SPARK-1065) PySpark runs out of memory with large broadcast variables

2014-08-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094981#comment-14094981 ] Davies Liu commented on SPARK-1065: --- [~frol], I think broadcasting the RDD object is

[jira] [Commented] (SPARK-1065) PySpark runs out of memory with large broadcast variables

2014-08-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094986#comment-14094986 ] Davies Liu commented on SPARK-1065: --- After this patch, the above test can run

[jira] [Commented] (SPARK-1065) PySpark runs out of memory with large broadcast variables

2014-08-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095150#comment-14095150 ] Davies Liu commented on SPARK-1065: --- Cool, thanks for the tests. If we can compress the
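The broadcast discussion above turns on using the API as intended: wrap the large read-only value with sc.broadcast() once and dereference it through .value inside tasks, instead of capturing it in the task closure (which re-serializes it with every task). A minimal sketch with an illustrative lookup table:

{code}
from pyspark import SparkContext

sc = SparkContext("local[2]", "broadcast-sketch")

# A large, read-only lookup table every task needs.
lookup = {i: i * i for i in range(100000)}

# Broadcast once; each executor fetches it a single time.
blookup = sc.broadcast(lookup)

rdd = sc.parallelize(range(1000), 4)
# Read through .value inside the task instead of capturing `lookup` directly.
total = rdd.map(lambda k: blookup.value.get(k, 0)).sum()
print(total)
sc.stop()
{code}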

[jira] [Created] (SPARK-2999) Compress all the serialized data

2014-08-12 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2999: - Summary: Compress all the serialized data Key: SPARK-2999 URL: https://issues.apache.org/jira/browse/SPARK-2999 Project: Spark Issue Type: Improvement

[jira] [Created] (SPARK-3030) reuse python worker

2014-08-14 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3030: - Summary: reuse python worker Key: SPARK-3030 URL: https://issues.apache.org/jira/browse/SPARK-3030 Project: Spark Issue Type: Improvement Components:

[jira] [Created] (SPARK-3047) Use utf-8 for textFile() by default

2014-08-14 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3047: - Summary: Use utf-8 for textFile() by default Key: SPARK-3047 URL: https://issues.apache.org/jira/browse/SPARK-3047 Project: Spark Issue Type: Improvement

[jira] [Updated] (SPARK-3047) add an option to use str in textFileRDD()

2014-08-14 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3047: -- Summary: add an option to use str in textFileRDD() (was: Use utf-8 for textFile() by default) add
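The trade-off behind this rename: decoding every line to unicode is noticeably slower than keeping raw str, so an opt-out is useful for ASCII-only data. Later PySpark releases expose this as a use_unicode flag on textFile(); a small sketch (the file path is a placeholder, and the flag is unavailable on versions that predate this change):

{code}
from pyspark import SparkContext

sc = SparkContext("local[2]", "textfile-sketch")

path = "file:///path/to/data.txt"   # placeholder path

# Default behaviour: lines are decoded to unicode (utf-8).
lines_unicode = sc.textFile(path)

# Faster for byte/ASCII data: keep lines as raw str.
lines_raw = sc.textFile(path, use_unicode=False)

print(lines_unicode.take(1), lines_raw.take(1))
sc.stop()
{code}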

[jira] [Created] (SPARK-3073) improve large sort (external sort)

2014-08-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3073: - Summary: improve large sort (external sort) Key: SPARK-3073 URL: https://issues.apache.org/jira/browse/SPARK-3073 Project: Spark Issue Type: Improvement

[jira] [Created] (SPARK-3074) support groupByKey() with hot keys

2014-08-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3074: - Summary: support groupByKey() with hot keys Key: SPARK-3074 URL: https://issues.apache.org/jira/browse/SPARK-3074 Project: Spark Issue Type: Improvement

[jira] [Updated] (SPARK-3073) improve large sort (external sort) for PySpark

2014-08-15 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3073: -- Summary: improve large sort (external sort) for PySpark (was: improve large sort (external sort))

[jira] [Commented] (SPARK-3073) improve large sort (external sort) for PySpark

2014-08-15 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14098785#comment-14098785 ] Davies Liu commented on SPARK-3073: --- This is for PySpark; currently we do not support
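External sorting — the feature this issue tracks for PySpark — keeps memory bounded by sorting data in batches, spilling each sorted batch to disk, and merging the runs lazily. A minimal pure-Python sketch of the technique (not PySpark's actual ExternalSorter):

{code}
import heapq
import pickle
import random
import tempfile

def spill(sorted_batch):
    """Write one sorted run to a temp file and return an iterator over it."""
    f = tempfile.TemporaryFile()
    for item in sorted_batch:
        pickle.dump(item, f)
    f.seek(0)
    def read():
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
    return read()

def external_sort(iterator, batch_size=10000):
    """Sort an arbitrarily large iterator with bounded memory."""
    runs, batch = [], []
    for item in iterator:
        batch.append(item)
        if len(batch) >= batch_size:
            runs.append(spill(sorted(batch)))
            batch = []
    runs.append(iter(sorted(batch)))   # last (possibly empty) run stays in memory
    return heapq.merge(*runs)          # lazy k-way merge of the sorted runs

data = (random.randint(0, 10 ** 6) for _ in range(50000))
out = list(external_sort(data, batch_size=5000))
assert out == sorted(out)
{code}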

[jira] [Updated] (SPARK-3074) support groupByKey() with hot keys in PySpark

2014-08-15 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3074: -- Summary: support groupByKey() with hot keys in PySpark (was: support groupByKey() with hot keys)

[jira] [Created] (SPARK-3095) [PySpark] Speed up RDD.count()

2014-08-17 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3095: - Summary: [PySpark] Speed up RDD.count() Key: SPARK-3095 URL: https://issues.apache.org/jira/browse/SPARK-3095 Project: Spark Issue Type: Improvement

[jira] [Created] (SPARK-3141) sortByKey() breaks take()

2014-08-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3141: - Summary: sortByKey() breaks take() Key: SPARK-3141 URL: https://issues.apache.org/jira/browse/SPARK-3141 Project: Spark Issue Type: Bug Components:

[jira] [Created] (SPARK-3153) shuffle will run out of space when disks have different free space

2014-08-20 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3153: - Summary: shuffle will run out of space when disks have different free space Key: SPARK-3153 URL: https://issues.apache.org/jira/browse/SPARK-3153 Project: Spark

[jira] [Updated] (SPARK-3153) shuffle will run out of space when disks have different free space

2014-08-20 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3153: -- Description: If we have several disks in SPARK_LOCAL_DIRS, and one of them is much smaller than
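The truncated description points at the core problem: a fixed mapping from spill files to directories keeps hitting the smallest disk in SPARK_LOCAL_DIRS until it fills up. Spreading spills across directories, for example in proportion to free space, avoids that; a rough Python sketch of the idea (illustrative only, the real logic lives in Spark's Scala code):

{code}
import os
import random

def pick_spill_dir(local_dirs):
    """Pick a directory for the next spill file with probability
    proportional to its current free space (Unix-only sketch)."""
    frees = []
    for d in local_dirs:
        st = os.statvfs(d)
        frees.append(st.f_bavail * st.f_frsize)
    total = sum(frees)
    if total == 0:
        return random.choice(local_dirs)
    r = random.uniform(0, total)
    acc = 0
    for d, free in zip(local_dirs, frees):
        acc += free
        if r <= acc:
            return d
    return local_dirs[-1]

dirs = os.environ.get("SPARK_LOCAL_DIRS", "/tmp").split(",")
print(pick_spill_dir(dirs))
{code}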

[jira] [Commented] (SPARK-2871) Missing API in PySpark

2014-08-23 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108175#comment-14108175 ] Davies Liu commented on SPARK-2871: --- The fact is that these issues will be just a

[jira] [Created] (SPARK-3209) bump the version in banner

2014-08-25 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3209: - Summary: bump the version in banner Key: SPARK-3209 URL: https://issues.apache.org/jira/browse/SPARK-3209 Project: Spark Issue Type: Bug Components:

[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged

2014-08-25 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109512#comment-14109512 ] Davies Liu commented on SPARK-1764: --- This issue should be fixed in SPARK-2282 [1], I had

[jira] [Closed] (SPARK-3209) bump the version in banner

2014-08-25 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-3209. - Resolution: Invalid The version number is correct in branch-1.1. bump the version in banner

[jira] [Created] (SPARK-3239) Choose disks for spilling randomly

2014-08-26 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3239: - Summary: Choose disks for spilling randomly Key: SPARK-3239 URL: https://issues.apache.org/jira/browse/SPARK-3239 Project: Spark Issue Type: Improvement

[jira] [Created] (SPARK-3307) Fix doc string of SparkContext.broadcast()

2014-08-29 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3307: - Summary: Fix doc string of SparkContext.broadcast() Key: SPARK-3307 URL: https://issues.apache.org/jira/browse/SPARK-3307 Project: Spark Issue Type: Bug

[jira] [Created] (SPARK-3309) Put all public API in __all__

2014-08-29 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3309: - Summary: Put all public API in __all__ Key: SPARK-3309 URL: https://issues.apache.org/jira/browse/SPARK-3309 Project: Spark Issue Type: Improvement
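Listing the public names in each module's __all__ keeps `from module import *` and the generated API docs limited to the intended surface. A small illustration with a hypothetical module (not actual PySpark source):

{code}
# mypackage/utils.py -- hypothetical module

__all__ = ["PublicHelper", "public_function"]

class PublicHelper(object):
    """Part of the public API: listed in __all__."""

def public_function():
    """Also public."""
    return PublicHelper()

class _InternalCache(object):
    """Internal: underscore prefix and not listed in __all__."""

def _internal_helper():
    return _InternalCache()
{code}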

[jira] [Created] (SPARK-3316) Spark driver will not exit after the Python program finishes

2014-08-29 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3316: - Summary: Spark driver will not exit after the Python program finishes Key: SPARK-3316 URL: https://issues.apache.org/jira/browse/SPARK-3316 Project: Spark Issue Type:

[jira] [Comment Edited] (SPARK-3333) Large number of partitions causes OOM

2014-08-31 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116935#comment-14116935 ] Davies Liu edited comment on SPARK-3333 at 8/31/14 11:26 PM: -

[jira] [Comment Edited] (SPARK-3333) Large number of partitions causes OOM

2014-08-31 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116935#comment-14116935 ] Davies Liu edited comment on SPARK-3333 at 8/31/14 11:27 PM: -

[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-08-31 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117019#comment-14117019 ] Davies Liu commented on SPARK-3333: --- @joserosen This should not be the culprit, it just

[jira] [Commented] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF

2014-09-02 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119400#comment-14119400 ] Davies Liu commented on SPARK-3336: --- [~marmbrus], If we reverse the order of count() and

[jira] [Commented] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF

2014-09-03 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120141#comment-14120141 ] Davies Liu commented on SPARK-3336: --- [~kayfeng] The pyspark test case ran successfully

[jira] [Commented] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF

2014-09-03 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120145#comment-14120145 ] Davies Liu commented on SPARK-3336: --- [~marmbrus] Should we merge this patch into 1.1

[jira] [Commented] (SPARK-3358) PySpark worker fork()ing performance regression in m3.* / PVM instances

2014-09-03 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120792#comment-14120792 ] Davies Liu commented on SPARK-3358: --- I had created a PR to reuse Python workers:

[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-05 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123143#comment-14123143 ] Davies Liu commented on SPARK-3399: --- Could you give an example to show the problem?

[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-05 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123195#comment-14123195 ] Davies Liu commented on SPARK-3399: --- Thanks for the explanation. I still cannot reproduce

[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-05 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123214#comment-14123214 ] Davies Liu commented on SPARK-3399: --- Given fs.defaultFs as hdfs://, saveAsTextFile()
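To restate the issue in these three comments: when HADOOP_CONF_DIR or YARN_CONF_DIR is picked up and fs.defaultFs points at hdfs://, a test that calls saveAsTextFile() with a bare path writes to HDFS instead of the local filesystem. A hedged sketch of one way a test runner could shield itself (the test command is a placeholder):

{code}
import os
import subprocess

def run_pyspark_tests(test_cmd=("python", "python/pyspark/tests.py")):
    """Run the test suite with Hadoop/YARN configuration hidden so that
    relative output paths resolve against the local filesystem."""
    env = dict(os.environ)
    for var in ("HADOOP_CONF_DIR", "YARN_CONF_DIR"):
        env.pop(var, None)
    return subprocess.call(list(test_cmd), env=env)

# Alternative inside a test: pass an explicit local URI,
# e.g. rdd.saveAsTextFile("file:///tmp/out"), so fs.defaultFs is irrelevant.
{code}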

[jira] [Created] (SPARK-3420) Using Sphinx to generate API docs for PySpark

2014-09-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3420: - Summary: Using Sphinx to generate API docs for PySpark Key: SPARK-3420 URL: https://issues.apache.org/jira/browse/SPARK-3420 Project: Spark Issue Type:

[jira] [Created] (SPARK-3465) Task metrics are not aggregated correctly in local mode

2014-09-09 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3465: - Summary: Task metrics are not aggregated correctly in local mode Key: SPARK-3465 URL: https://issues.apache.org/jira/browse/SPARK-3465 Project: Spark Issue Type:

[jira] [Updated] (SPARK-3465) Task metrics are not aggregated correctly in local mode

2014-09-09 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3465: -- Description: In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the same object

[jira] [Updated] (SPARK-3465) Task metrics are not aggregated correctly in local mode

2014-09-09 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3465: -- Description: In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the same object

[jira] [Updated] (SPARK-3465) Task metrics are not aggregated correctly in local mode

2014-09-09 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3465: -- Description: In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the same object

[jira] [Updated] (SPARK-3465) Task metrics are not aggregated correctly in local mode

2014-09-09 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3465: -- Description: In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the same object
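The repeated description edits above all concern the same aliasing pitfall: if the listener keeps a reference to the live, mutable taskMetrics object instead of a snapshot, later updates silently rewrite what was already recorded and the aggregated totals come out wrong. A language-neutral illustration of the pitfall in Python (not Spark's actual Scala code):

{code}
import copy

class Metrics(object):
    def __init__(self):
        self.bytes_read = 0

live = Metrics()
by_reference, by_snapshot = [], []

for update in (100, 200, 300):
    live.bytes_read = update
    by_reference.append(live)             # every entry aliases the same object
    by_snapshot.append(copy.copy(live))   # each entry keeps its own values

print(sum(m.bytes_read for m in by_reference))  # 900: all entries show the final state
print(sum(m.bytes_read for m in by_snapshot))   # 600: per-update values preserved
{code}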

[jira] [Created] (SPARK-3491) Use pickle to serialize the data in MLlib Python

2014-09-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3491: - Summary: Use pickle to serialize the data in MLlib Python Key: SPARK-3491 URL: https://issues.apache.org/jira/browse/SPARK-3491 Project: Spark Issue Type:

[jira] [Resolved] (SPARK-1764) EOF reached before Python server acknowledged

2014-09-11 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-1764. --- Resolution: Fixed This is fixed by #2282 EOF reached before Python server acknowledged

[jira] [Updated] (SPARK-1764) EOF reached before Python server acknowledged

2014-09-11 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-1764: -- Fix Version/s: 1.1.0 EOF reached before Python server acknowledged

[jira] [Created] (SPARK-3500) SchemaRDD from jsonRDD() has no coalesce() method

2014-09-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3500: - Summary: SchemaRDD from jsonRDD() has no coalesce() method Key: SPARK-3500 URL: https://issues.apache.org/jira/browse/SPARK-3500 Project: Spark Issue Type: Bug

[jira] [Updated] (SPARK-3500) SchemaRDD from jsonRDD() has no coalesce() method

2014-09-11 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3500: -- Description: {code} sqlCtx.jsonRDD(sc.parallelize(['{"foo": "bar"}', '{"foo": "baz"}'])).coalesce(1)

[jira] [Updated] (SPARK-3500) SchemaRDD from jsonRDD() has no coalesce() method

2014-09-11 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3500: -- Description: ``` sqlCtx.jsonRDD(sc.parallelize(['{"foo": "bar"}', '{"foo": "baz"}'])).coalesce(1) Py4JError:

[jira] [Commented] (SPARK-3500) SchemaRDD from jsonRDD() has no coalesce() method

2014-09-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14131714#comment-14131714 ] Davies Liu commented on SPARK-3500: --- I think it's a bug; there is a workaround for it:
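The actual workaround is cut off above; one hypothetical workaround in the same spirit (an assumption, not necessarily the one the comment refers to) is to drop to a plain RDD, repartition there, and rebuild the SchemaRDD, reusing the sc and sqlCtx from the description:

{code}
# Hypothetical workaround sketch -- not necessarily the one referenced above.
srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo": "bar"}', '{"foo": "baz"}']))

plain = srdd.map(lambda row: row)      # plain Python RDD of Row objects
one_part = plain.coalesce(1)           # coalesce() works on the plain RDD
srdd2 = sqlCtx.inferSchema(one_part)   # turn it back into a SchemaRDD
print(srdd2.collect())
{code}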

[jira] [Commented] (SPARK-3500) SchemaRDD from jsonRDD() has no coalesce() method

2014-09-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14131869#comment-14131869 ] Davies Liu commented on SPARK-3500: --- repartition() and distinct(N) are also missing.

[jira] [Updated] (SPARK-3500) SchemaRDD from jsonRDD() has no coalesce() method

2014-09-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3500: -- Description: {code} sqlCtx.jsonRDD(sc.parallelize(['{"foo": "bar"}', '{"foo": "baz"}'])).coalesce(1)

[jira] [Updated] (SPARK-3500) coalesce() and repartition() of SchemaRDD are broken

2014-09-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3500: -- Summary: coalesce() and repartition() of SchemaRDD are broken (was: SchemaRDD from jsonRDD() has no

[jira] [Updated] (SPARK-3500) coalesce() and repartition() of SchemaRDD are broken

2014-09-12 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3500: -- Description: {code} sqlCtx.jsonRDD(sc.parallelize(['{"foo": "bar"}', '{"foo": "baz"}'])).coalesce(1)

[jira] [Created] (SPARK-3524) remove workaround to pickle array of float for Pyrolite

2014-09-14 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3524: - Summary: remove workaround to pickle array of float for Pyrolite Key: SPARK-3524 URL: https://issues.apache.org/jira/browse/SPARK-3524 Project: Spark Issue Type:

[jira] [Created] (SPARK-3554) handle large dataset in closure of PySpark

2014-09-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3554: - Summary: handle large dataset in closure of PySpark Key: SPARK-3554 URL: https://issues.apache.org/jira/browse/SPARK-3554 Project: Spark Issue Type: Improvement

[jira] [Created] (SPARK-3679) pickle the exact globals of functions

2014-09-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3679: - Summary: pickle the exact globals of functions Key: SPARK-3679 URL: https://issues.apache.org/jira/browse/SPARK-3679 Project: Spark Issue Type: Bug

[jira] [Created] (SPARK-3681) Failed to serialize ArrayType or MapType after accessing them in Python

2014-09-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3681: - Summary: Failed to serialize ArrayType or MapType after accessing them in Python Key: SPARK-3681 URL: https://issues.apache.org/jira/browse/SPARK-3681 Project: Spark

[jira] [Commented] (SPARK-2377) Create a Python API for Spark Streaming

2014-09-25 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148385#comment-14148385 ] Davies Liu commented on SPARK-2377: --- [~giwa] I also started to work on this (based on your

[jira] [Commented] (SPARK-3420) Using Sphinx to generate API docs for PySpark

2014-09-27 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150654#comment-14150654 ] Davies Liu commented on SPARK-3420: --- This is partially fixed by

[jira] [Created] (SPARK-3743) noisy logging when context is stopped

2014-09-30 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3743: - Summary: noisy logging when context is stopped Key: SPARK-3743 URL: https://issues.apache.org/jira/browse/SPARK-3743 Project: Spark Issue Type: Improvement

[jira] [Created] (SPARK-3749) Bugs in broadcast of large RDD

2014-09-30 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3749: - Summary: Bugs in broadcast of large RDD Key: SPARK-3749 URL: https://issues.apache.org/jira/browse/SPARK-3749 Project: Spark Issue Type: Bug

[jira] [Created] (SPARK-3762) clear all SparkEnv references after stop

2014-10-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3762: - Summary: clear all SparkEnv references after stop Key: SPARK-3762 URL: https://issues.apache.org/jira/browse/SPARK-3762 Project: Spark Issue Type: Bug

[jira] [Resolved] (SPARK-1284) pyspark hangs after IOError on Executor

2014-10-02 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-1284. --- Resolution: Fixed Fix Version/s: 1.1.0 I think this is a logging issue; it should be fixed by

[jira] [Resolved] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-10-03 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-2630. --- Resolution: Fixed Fix Version/s: 1.2.0 Input data size of CoalescedRDD is incorrect

[jira] [Reopened] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-10-03 Thread Davies Liu (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reopened SPARK-2630: --- not merged yet, sorry. Input data size of CoalescedRDD is incorrect
