Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Davies Liu
+1

On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin  wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.3
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v1.6.3-rc2
> (1e860747458d74a4ccbd081103a0542a2367b14b)
>
> This release candidate addresses 52 JIRA tickets:
> https://s.apache.org/spark-1.6.3-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1212/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-docs/
>
>
> ===
> == How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.6.2.
>
> 
> == What justifies a -1 vote for this release?
> 
> This is a maintenance release in the 1.6.x series.  Bugs already present in
> 1.6.2, missing features, or bugs related to new features will not
> necessarily block this release.
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-24 Thread Davies Liu
Using 0 for spark.mesos.mesosExecutor.cores is better than dynamic
allocation, but you have to pay a little more overhead for launching each
task, which should be OK if the tasks are not trivial.

Since the direct result (up to 1 MB by default) will also go through
Mesos, it's better to tune it lower, otherwise Mesos could become the
bottleneck:

spark.task.maxDirectResultSize
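
For example, a rough sketch of both settings together (the values are only
illustrative, and this assumes a Spark version that already has
spark.task.maxDirectResultSize):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.mesos.mesosExecutor.cores", "0")       # executor itself reserves no CPU
        .set("spark.task.maxDirectResultSize", "524288"))  # 512 KB, below the 1 MB default
sc = SparkContext(conf=conf)

With mesosExecutor.cores = 0, the CPU count in the Mesos UI should be
roughly running tasks x spark.task.cpus, without the extra one-CPU-per-executor
gap discussed below.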

On Mon, Dec 19, 2016 at 3:23 PM, Chawla,Sumit  wrote:
> Tim,
>
> We will try to run the application in coarse grain mode, and share the
> findings with you.
>
> Regards
> Sumit Chawla
>
>
> On Mon, Dec 19, 2016 at 3:11 PM, Timothy Chen  wrote:
>
>> Dynamic allocation works with coarse-grained mode only; we weren't aware of
>> a need for fine-grained mode after we enabled dynamic allocation support
>> in coarse-grained mode.
>>
>> What's the reason you're running fine-grained mode instead of
>> coarse-grained + dynamic allocation?
>>
>> Tim
>>
>> On Mon, Dec 19, 2016 at 2:45 PM, Mehdi Meziane
>>  wrote:
>> > We would be interested in the results if you give dynamic allocation
>> > with Mesos a try!
>> >
>> >
>> > ----- Original Message -----
>> > From: "Michael Gummelt" 
>> > To: "Sumit Chawla" 
>> > Cc: u...@mesos.apache.org, d...@mesos.apache.org, "User"
>> > , dev@spark.apache.org
>> > Sent: Monday, December 19, 2016 22:42:55 GMT +01:00 Amsterdam / Berlin /
>> > Bern / Rome / Stockholm / Vienna
>> > Subject: Re: Mesos Spark Fine Grained Execution - CPU count
>> >
>> >
>> >> Is this problem of idle executors sticking around solved in Dynamic
>> >> Resource Allocation?  Is there some timeout after which Idle executors
>> can
>> >> just shutdown and cleanup its resources.
>> >
>> > Yes, that's exactly what dynamic allocation does.  But again I have no
>> idea
>> > what the state of dynamic allocation + mesos is.
>> >
>> > On Mon, Dec 19, 2016 at 1:32 PM, Chawla,Sumit 
>> > wrote:
>> >>
>> >> Great.  Makes much better sense now.  What will be reason to have
>> >> spark.mesos.mesosExecutor.cores more than 1, as this number doesn't
>> include
>> >> the number of cores for tasks.
>> >>
>> >> So in my case it seems like 30 CPUs are allocated to executors.  And
>> there
>> >> are 48 tasks so 48 + 30 =  78 CPUs.  And i am noticing this gap of 30 is
>> >> maintained till the last task exits.  This explains the gap.   Thanks
>> >> everyone.  I am still not sure how this number 30 is calculated.  ( Is
>> it
>> >> dynamic based on current resources, or is it some configuration.  I
>> have 32
>> >> nodes in my cluster).
>> >>
>> >> Is this problem of idle executors sticking around solved in Dynamic
>> >> Resource Allocation?  Is there some timeout after which Idle executors
>> can
>> >> just shutdown and cleanup its resources.
>> >>
>> >>
>> >> Regards
>> >> Sumit Chawla
>> >>
>> >>
>> >> On Mon, Dec 19, 2016 at 12:45 PM, Michael Gummelt <
>> mgumm...@mesosphere.io>
>> >> wrote:
>> >>>
>> >>> >  I should preassume that No of executors should be less than number
>> of
>> >>> > tasks.
>> >>>
>> >>> No.  Each executor runs 0 or more tasks.
>> >>>
>> >>> Each executor consumes 1 CPU, and each task running on that executor
>> >>> consumes another CPU.  You can customize this via
>> >>> spark.mesos.mesosExecutor.cores
>> >>> (https://github.com/apache/spark/blob/v1.6.3/docs/running-on-mesos.md)
>> and
>> >>> spark.task.cpus
>> >>> (https://github.com/apache/spark/blob/v1.6.3/docs/configuration.md)
>> >>>
>> >>> On Mon, Dec 19, 2016 at 12:09 PM, Chawla,Sumit > >
>> >>> wrote:
>> 
>>  Ah thanks. looks like i skipped reading this "Neither will executors
>>  terminate when they’re idle."
>> 
>>  So in my job scenario,  I should preassume that No of executors should
>>  be less than number of tasks. Ideally one executor should execute 1
>> or more
>>  tasks.  But i am observing something strange instead.  I start my job
>> with
>>  48 partitions for a spark job. In mesos ui i see that number of tasks
>> is 48,
>>  but no. of CPUs is 78 which is way more than 48.  Here i am assuming
>> that 1
>>  CPU is 1 executor.   I am not specifying any configuration to set
>> number of
>>  cores per executor.
>> 
>>  Regards
>>  Sumit Chawla
>> 
>> 
>>  On Mon, Dec 19, 2016 at 11:35 AM, Joris Van Remoortere
>>   wrote:
>> >
>> > That makes sense. From the documentation it looks like the executors
>> > are not supposed to terminate:
>> >
>> > http://spark.apache.org/docs/latest/running-on-mesos.html#
>> fine-grained-deprecated
>> >>
>> >> Note that while Spark tasks in fine-grained will relinquish cores as
>> >> they terminate, they will not relinquish memory, as the JVM does
>> not give
>> >> memory back to the Operating System. Neither will executors
>> terminate when
>> >> they’re idle.
>> >
>> >
>> > I suppose your task to executor CPU ratio is low enough that it looks
>> > like most of the resources are not being reclaimed. If your tasks
>> were using
>> > s

Re: ShuffledHashJoin Possible Issue

2015-10-19 Thread Davies Liu
Can you reproduce it on master?

I can't reproduce it with the following code:

>>> t2 = sqlContext.range(50).selectExpr("concat('A', id) as id")
>>> t1 = sqlContext.range(10).selectExpr("concat('A', id) as id")
>>> t1.join(t2).where(t1.id == t2.id).explain()
ShuffledHashJoin [id#21], [id#19], BuildRight
 TungstenExchange hashpartitioning(id#21,200)
  TungstenProject [concat(A,cast(id#20L as string)) AS id#21]
   Scan PhysicalRDD[id#20L]
 TungstenExchange hashpartitioning(id#19,200)
  TungstenProject [concat(A,cast(id#18L as string)) AS id#19]
   Scan PhysicalRDD[id#18L]

>>> t1.join(t2).where(t1.id == t2.id).count()
10


On Mon, Oct 19, 2015 at 2:59 AM, gsvic  wrote:
> Hi Hao,
>
> Each table is created with the following python code snippet:
>
> from math import ceil
> from random import random
> import json
> data = [{'id': 'A%d' % i, 'value': ceil(random()*10)} for i in range(0, 50)]
> with open('A.json', 'w+') as output:
>     json.dump(data, output)
>
> The tables A and B contain 10 and 50 tuples respectively.
>
> In spark shell I type
>
> sqlContext.setConf("spark.sql.planner.sortMergeJoin", "false") to disable
> sortMergeJoin and
> sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "0") to disable
> BroadcastHashJoin, cause the tables are too small and this join will be
> selected.
>
> Finally I run the following query:
> t1.join(t2).where(t1("id").equalTo(t2("id"))).count
>
> and the result I get equals zero, while ShuffledHashJoin and
> SortMergeJoin return the right result (10).
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/ShuffledHashJoin-Possible-Issue-tp14672p14682.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: pyspark with pypy not work for spark-1.5.1

2015-11-13 Thread Davies Liu
We already test CPython 2.6, CPython 3.4, and PyPy 2.5; it takes more
than 30 minutes to run (without parallelization).
I think that should be enough.

PyPy 2.2 is too old, and we don't have enough resources to support it.
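
If you want to reproduce locally what Jenkins runs, something like this
should work (assuming I remember the run-tests flags correctly; point
--python-executables at whichever interpreter you want to test):

./python/run-tests --python-executables=/path/to/pypy/bin/pypy \
    --modules=pyspark-core,pyspark-sql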

On Fri, Nov 6, 2015 at 2:27 AM, Chang Ya-Hsuan  wrote:
> Hi, I ran ./python/run-tests to test the following modules of spark-1.5.1:
>
> ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql',
> 'pyspark-streaming']
>
> against the following PyPy versions:
>
> pypy-2.2.1  pypy-2.3  pypy-2.3.1  pypy-2.4.0  pypy-2.5.0  pypy-2.5.1
> pypy-2.6.0  pypy-2.6.1  pypy-4.0.0
>
> Except for pypy-2.2.1, all the others pass the tests.
>
> the error message of pypy-2.2.1 is:
>
> Traceback (most recent call last):
>   File "app_main.py", line 72, in run_toplevel
>   File "/home/yahsuan/.pyenv/versions/pypy-2.2.1/lib-python/2.7/runpy.py",
> line 151, in _run_module_as_main
> mod_name, loader, code, fname = _get_module_details(mod_name)
>   File "/home/yahsuan/.pyenv/versions/pypy-2.2.1/lib-python/2.7/runpy.py",
> line 101, in _get_module_details
> loader = get_loader(mod_name)
>   File "/home/yahsuan/.pyenv/versions/pypy-2.2.1/lib-python/2.7/pkgutil.py",
> line 465, in get_loader
> return find_loader(fullname)
>   File "/home/yahsuan/.pyenv/versions/pypy-2.2.1/lib-python/2.7/pkgutil.py",
> line 475, in find_loader
> for importer in iter_importers(fullname):
>   File "/home/yahsuan/.pyenv/versions/pypy-2.2.1/lib-python/2.7/pkgutil.py",
> line 431, in iter_importers
> __import__(pkg)
>   File "pyspark/__init__.py", line 41, in 
> from pyspark.context import SparkContext
>   File "pyspark/context.py", line 26, in 
> from pyspark import accumulators
>   File "pyspark/accumulators.py", line 98, in 
> from pyspark.serializers import read_int, PickleSerializer
>   File "pyspark/serializers.py", line 400, in 
> _hijack_namedtuple()
>   File "pyspark/serializers.py", line 378, in _hijack_namedtuple
> _old_namedtuple = _copy_func(collections.namedtuple)
>   File "pyspark/serializers.py", line 376, in _copy_func
> f.__defaults__, f.__closure__)
> AttributeError: 'function' object has no attribute '__closure__'
>
> p.s. would you want to test different pypy versions on your Jenkins? maybe I
> could help
>
> On Fri, Nov 6, 2015 at 2:23 AM, Josh Rosen  wrote:
>>
>> You could try running PySpark's own unit tests. Try ./python/run-tests
>> --help for instructions.
>>
>> On Thu, Nov 5, 2015 at 12:31 AM Chang Ya-Hsuan  wrote:
>>>
>>> I've tested the following PyPy versions against spark-1.5.1:
>>>
>>>   pypy-2.2.1
>>>   pypy-2.3
>>>   pypy-2.3.1
>>>   pypy-2.4.0
>>>   pypy-2.5.0
>>>   pypy-2.5.1
>>>   pypy-2.6.0
>>>   pypy-2.6.1
>>>
>>> I run
>>>
>>> $ PYSPARK_PYTHON=/path/to/pypy-xx.xx/bin/pypy
>>> /path/to/spark-1.5.1/bin/pyspark
>>>
>>> and only pypy-2.2.1 failed.
>>>
>>> Any suggestions for running more advanced tests?
>>>
>>> On Thu, Nov 5, 2015 at 4:14 PM, Chang Ya-Hsuan 
>>> wrote:

 Thanks for your quick reply.

 I will test several pypy versions and report the result later.

 On Thu, Nov 5, 2015 at 4:06 PM, Josh Rosen  wrote:
>
> I noticed that you're using PyPy 2.2.1, but it looks like Spark 1.5.1's
> docs say that we only support PyPy 2.3+. Could you try using a newer PyPy
> version to see if that works?
>
> I just checked and it looks like our Jenkins tests are running against
> PyPy 2.5.1, so that version is known to work. I'm not sure what the actual
> minimum supported PyPy version is. Would you be interested in helping to
> investigate so that we can update the documentation or produce a fix to
> restore compatibility with earlier PyPy builds?
>
> On Wed, Nov 4, 2015 at 11:56 PM, Chang Ya-Hsuan 
> wrote:
>>
>> Hi all,
>>
>> I am trying to run pyspark with pypy, and it is work when using
>> spark-1.3.1 but failed when using spark-1.4.1 and spark-1.5.1
>>
>> my pypy version:
>>
>> $ /usr/bin/pypy --version
>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>> [PyPy 2.2.1 with GCC 4.8.4]
>>
>> works with spark-1.3.1
>>
>> $ PYSPARK_PYTHON=/usr/bin/pypy
>> ~/Tool/spark-1.3.1-bin-hadoop2.6/bin/pyspark
>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>> [PyPy 2.2.1 with GCC 4.8.4] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> 15/11/05 15:50:30 WARN Utils: Your hostname, xx resolves to a
>> loopback address: 127.0.1.1; using xxx.xxx.xxx.xxx instead (on interface
>> eth0)
>> 15/11/05 15:50:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind
>> to another address
>> 15/11/05 15:50:31 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/__ / .__/

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-03 Thread Davies Liu
Is https://github.com/apache/spark/pull/10134 a valid fix?
(still worse than 1.5)

On Thu, Dec 3, 2015 at 8:45 AM, mkhaitman  wrote:
> I reported this in the 1.6 preview thread, but wouldn't mind if someone can
> confirm that ctrl-c is not keyboard interrupting / clearing the current line
> of input anymore in the pyspark shell. I saw the change that would kill the
> currently running job when using ctrl+c, but now the only way to clear the
> current line of input is to simply hit enter (throwing an exception). Anyone
> else seeing this?
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-6-0-RC1-tp15424p15450.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
+1

On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas
 wrote:
> +1
>
> Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python
> 2.6 is ancient history and the core Python developers stopped supporting it
> in 2013. RHEL 5 is not a good enough reason to continue support for Python
> 2.6 IMO.
>
> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
> currently do).
>
> Nick
>
> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:
>>
>> plus 1,
>>
>> we are currently using python 2.7.2 in production environment.
>>
>>
>>
>>
>>
>> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>>
>> +1
>> We use Python 2.7
>>
>> Regards,
>>
>> Meethu Mathew
>>
>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:
>>>
>>> Does anybody here care about us dropping support for Python 2.6 in Spark
>>> 2.0?
>>>
>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>> parsing) when compared with Python 2.7. Some libraries that Spark depends on
>>> stopped supporting 2.6. We can still convince the library maintainers to
>>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>>> Python 2.6 to run Spark.
>>>
>>> Thanks.
>>>
>>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
Created JIRA: https://issues.apache.org/jira/browse/SPARK-12661

On Tue, Jan 5, 2016 at 2:49 PM, Koert Kuipers  wrote:
> i do not think so.
>
> does the python 2.7 need to be installed on all slaves? if so, we do not
> have direct access to those.
>
> also, spark is easy for us to ship with our software since its apache 2
> licensed, and it only needs to be present on the machine that launches the
> app (thanks to yarn).
> even if python 2.7 was needed only on this one machine that launches the app
> we can not ship it with our software because its gpl licensed, so the client
> would have to download it and install it themselves, and this would mean its
> an independent install which has to be audited and approved and now you are
> in for a lot of fun. basically it will never happen.
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen  wrote:
>>
>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>> imagine that they're also capable of installing a standalone Python
>> alongside that Spark version (without changing Python systemwide). For
>> instance, Anaconda/Miniconda make it really easy to install Python 2.7.x/3.x
>> without impacting / changing the system Python and doesn't require any
>> special permissions to install (you don't need root / sudo access). Does
>> this address the Python versioning concerns for RHEL users?
>>
>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers  wrote:
>>>
>>> yeah, the practical concern is that we have no control over java or
>>> python version on large company clusters. our current reality for the vast
>>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>>
>>> i dont like it either, but i cannot change it.
>>>
>>> we currently don't use pyspark so i have no stake in this, but if we did
>>> i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>> dropped. no point in developing something that doesnt run for majority of
>>> customers.
>>>
>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas
>>>  wrote:

 As I pointed out in my earlier email, RHEL will support Python 2.6 until
 2020. So I'm assuming these large companies will have the option of riding
 out Python 2.6 until then.

 Are we seriously saying that Spark should likewise support Python 2.6
 for the next several years? Even though the core Python devs stopped
 supporting it in 2013?

 If that's not what we're suggesting, then when, roughly, can we drop
 support? What are the criteria?

 I understand the practical concern here. If companies are stuck using
 2.6, it doesn't matter to them that it is deprecated. But balancing that
 concern against the maintenance burden on this project, I would say that
 "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
 take. There are many tiny annoyances one has to put up with to support 2.6.

 I suppose if our main PySpark contributors are fine putting up with
 those annoyances, then maybe we don't need to drop support just yet...

 Nick
 On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente
 wrote:
>
> Unfortunately, Koert is right.
>
> I've been in a couple of projects using Spark (banking industry) where
> CentOS + Python 2.6 is the toolbox available.
>
> That said, I believe it should not be a concern for Spark. Python 2.6
> is old and busted, which is totally opposite to the Spark philosophy IMO.
>
>
> On Jan 5, 2016, at 20:07, Koert Kuipers wrote:
>
> rhel/centos 6 ships with python 2.6, doesnt it?
>
> if so, i still know plenty of large companies where python 2.6 is the
> only option. asking them for python 2.7 is not going to work
>
> so i think its a bad idea
>
> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland
>  wrote:
>>
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At
>> this point, Python 3 should be the default that is encouraged.
>> Most organizations acknowledge the 2.7 is common, but lagging behind
>> the version they should theoretically use. Dropping python 2.6
>> support sounds very reasonable to me.
>>
>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas
>>  wrote:
>>>
>>> +1
>>>
>>> Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes,
>>> Python 2.6 is ancient history and the core Python developers stopped
>>> supporting it in 2013. RHEL 5 is not a good enough reason to continue
>>> support for Python 2.6 IMO.
>>>
>>> We should aim to support Python 2.7 and Python 3.3+ (which I believe
>>> we currently do).
>>>
>>> Nick
>>>
>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang 
>>> wrote:

 plus 1,

 we are currently using python 2.7.2 in production environment.





 On 2016-01-05 18:11:45

Re: Making BatchPythonEvaluation actually Batch

2016-02-11 Thread Davies Liu
I had a quick look at your commit; I think that makes sense. Could you
send a PR for it, so we can review it?

In order to support 2), we need to change the serialized Python
function from `f(iter)` to `f(x)`, processing one row at a time (not a
partition); then we can easily combine them together:

For f1(f2(x)) and g1(g2(x)), we can do this in Python:

for row in reading_stream:
    x1, x2 = row
    y1 = f1(f2(x1))
    y2 = g1(g2(x2))
    yield (y1, y2)

For RDD, we still need to use `f(iter)`, but for SQL UDF, use `f(x)`.
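
To make that concrete, here is a rough sketch of the wire-format change
(the function names are made up for illustration; f1, f2, g1, g2 are the
UDFs from the example above):

# current format: the worker receives one function per partition iterator
def eval_partition(iterator):
    for x in iterator:
        yield f1(f2(x))

# proposed for SQL UDFs: the worker receives a per-row function, so several
# independent UDF columns can be driven by a single reading loop
def eval_row(x1, x2):
    return f1(f2(x1)), g1(g2(x2))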

On Sun, Jan 31, 2016 at 1:37 PM, Justin Uang  wrote:
> Hey guys,
>
> BLUF: sorry for the length of this email, trying to figure out how to batch
> Python UDF executions, and since this is my first time messing with
> catalyst, would like any feedback
>
> My team is starting to use PySpark UDFs quite heavily, and performance is a
> huge blocker. The extra roundtrip serialization from Java to Python is not a
> huge concern if we only incur it ~once per column for most workflows, since
> it'll be in the same order of magnitude as reading files from disk. However,
> right now each Python UDFs lead to a single roundtrip. There is definitely a
> lot we can do regarding this:
>
> (all the prototyping code is here:
> https://github.com/justinuang/spark/commit/8176749f8a6e6dc5a49fbbb952735ff40fb309fc)
>
> 1. We can't chain Python UDFs.
>
> df.select(python_times_2(python_times_2("col1")))
>
> throws an exception saying that the inner expression isn't evaluable. The
> workaround is to do
>
>
> df.select(python_times_2("col1").alias("tmp")).select(python_time_2("tmp"))
>
> This can be solved in ExtractPythonUDFs by always extracting the inner most
> Python UDF first.
>
>  // Pick the UDF we are going to evaluate (TODO: Support evaluating
> multiple UDFs at a time)
>  // If there is more than one, we will add another evaluation
> operator in a subsequent pass.
> -udfs.find(_.resolved) match {
> +udfs.find { udf =>
> +  udf.resolved && udf.children.map { child: Expression =>
> +child.find { // really hacky way to find if a child of a udf
> has the PythonUDF node
> +  case p: PythonUDF => true
> +  case _ => false
> +}.isEmpty
> +  }.reduce((x, y) => x && y)
> +} match {
>case Some(udf) =>
>  var evaluation: EvaluatePython = null
>
> 2. If we have a Python UDF applied to many different columns, where they
> don’t depend on each other, we can optimize them by collapsing them down
> into a single python worker. Although we have to serialize and send the same
> amount of data to the python interpreter, in the case where I am applying
> the same function to 20 columns, the overhead/context_switches of having 20
> interpreters run at the same time causes huge performance hits. I have
> confirmed this by manually taking the 20 columns, converting them to a
> struct, and then writing a UDF that processes the struct at the same time,
> and the speed difference is 2x. My approach to adding this to catalyst is
> basically to write an optimizer rule called CombinePython which joins
> adjacent EvaluatePython nodes that don’t depend on each other’s variables,
> and then having BatchPythonEvaluation run multiple lambdas at once. I would
> also like to be able to handle the case
> df.select(python_times_2(“col1”).alias(“col1x2”)).select(F.col(“col1x2”),
> python_times_2(“col1x2”).alias(“col1x4”)). To get around that, I add a
> PushDownPythonEvaluation optimizer that will push the optimization through a
> select/project, so that the CombinePython rule can join the two.
>
> 3. I would like CombinePython to be able to handle UDFs that chain off of
> each other.
>
> df.select(python_times_2(python_times_2(“col1”)))
>
> I haven’t prototyped this yet, since it’s a lot more complex. The way I’m
> thinking about this is to still have a rule called CombinePython, except
> that the BatchPythonEvaluation will need to be smart enough to build up the
> dag of dependencies, and then feed that information to the python
> interpreter, so it can compute things in the right order, and reuse the
> in-memory objects that it has already computed. Does this seem right? Should
> the code mainly be in BatchPythonEvaluation? In addition, we will need to
> change up the protocol between the java and python sides to support sending
> this information. What is acceptable?
>
> Any help would be much appreciated! Especially w.r.t where to the design
> choices such that the PR that has a chance of being accepted.
>
> Justin

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: HashedRelation Memory Pressure on Broadcast Joins

2016-03-02 Thread Davies Liu
UnsafeHashedRelation and HashedRelation can also be used in the executor
(for a non-broadcast hash join). In that case the UnsafeRow could come from
an UnsafeProjection, so we should copy the rows for safety.

We could have a smarter copy() for UnsafeRow (avoid the copy if it's
already copied), but I don't think this copy will increase the memory
pressure much. The total memory will be determined by how many rows are
stored in the hash tables.

In general, if you do not have enough memory, just don't increase
spark.sql.autoBroadcastJoinThreshold, or the performance could get worse
because of full GC.

Sometimes a table looks small as a compressed file (for example, a Parquet
file), but once it's loaded into memory it can require much more memory than
the size of the file on disk.
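
For example (a minimal sketch using the SQLContext API of this era; 10 MB
is the default threshold, and -1 disables broadcast joins entirely):

# keep the broadcast threshold at its 10 MB default instead of raising it
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
# or disable broadcast joins completely
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")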


On Tue, Mar 1, 2016 at 5:17 PM, Matt Cheah  wrote:
> Hi everyone,
>
> I had a quick question regarding our implementation of UnsafeHashedRelation
> and HashedRelation. It appears that we copy the rows that we’ve collected
> into memory upon inserting them into the hash table in
> UnsafeHashedRelation#apply(). I was wondering why we are copying the rows
> every time? I can’t imagine these rows being mutable in this scenario.
>
> The context is that I’m looking into a case where a small data frame should
> fit in the driver’s memory, but my driver ran out of memory after I
> increased the autoBroadcastJoinThreshold. YourKit is indicating that this
> logic is consuming more memory than my driver can handle.
>
> Thanks,
>
> -Matt Cheah

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: HashedRelation Memory Pressure on Broadcast Joins

2016-03-02 Thread Davies Liu
I see, we could reduce the memory by moving the copy out of HashedRelation;
then we would do the copy before calling HashedRelation for the shuffle hash
join.

Another thing is that when we do broadcasting, we will have another
serialized copy of the hash table.

For a table larger than 100 MB, we would not suggest using a broadcast join,
because it takes time to send it to every executor and it also takes the
same amount of memory on every executor.

On Wed, Mar 2, 2016 at 10:45 AM, Matt Cheah  wrote:
> I would expect the memory pressure to grow because not only are we storing
> the backing array to the iterator of the rows on the driver, but we’re
> also storing a copy of each of those rows in the hash table. Whereas if we
> didn’t do the copy on the drive side then the hash table would only have
> to store pointers to those rows in the array. Perhaps we can think about
> whether or not we want to be using the HashedRelation constructs in
> broadcast join physical plans?
>
> The file isn’t compressed - it was a 150MB CSV living on HDFS. I would
> expect it to fit in a 1GB heap, but I agree that it is difficult to reason
> about dataset size on disk vs. memory.
>
> -Matt Cheah
>
> On 3/2/16, 10:15 AM, "Davies Liu"  wrote:
>
>>UnsafeHashedRelation and HashedRelation could also be used in Executor
>>(for non-broadcast hash join), then the UnsafeRow could come from
>>UnsafeProjection,
>>so We should copy the rows for safety.
>>
>>We could have a smarter copy() for UnsafeRow (avoid the copy if it's
>>already copied),
>>but I don't think this copy here will increase the memory pressure.
>>The total memory
>>will be determined by how many rows are stored in the hash tables.
>>
>>In general, if you do not have enough memory, just don't increase
>>autoBroadcastJoinThreshold,
>>or the performance could be worse because of full GC.
>>
>>Sometimes the tables looks small as compressed files (for example,
>>parquet file),
>>once it's loaded into memory, it could required much more memory than the
>>size
>>of file on disk.
>>
>>
>>On Tue, Mar 1, 2016 at 5:17 PM, Matt Cheah  wrote:
>>> Hi everyone,
>>>
>>> I had a quick question regarding our implementation of
>>>UnsafeHashedRelation
>>> and HashedRelation. It appears that we copy the rows that we’ve
>>>collected
>>> into memory upon inserting them into the hash table in
>>> UnsafeHashedRelation#apply(). I was wondering why we are copying the
>>>rows
>>> every time? I can’t imagine these rows being mutable in this scenario.
>>>
>>> The context is that I’m looking into a case where a small data frame
>>>should
>>> fit in the driver’s memory, but my driver ran out of memory after I
>>> increased the autoBroadcastJoinThreshold. YourKit is indicating that
>>>this
>>> logic is consuming more memory than my driver can handle.
>>>
>>> Thanks,
>>>
>>> -Matt Cheah
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: HashedRelation Memory Pressure on Broadcast Joins

2016-03-07 Thread Davies Liu
The underlying buffer for UnsafeRow is reused in UnsafeProjection.

On Thu, Mar 3, 2016 at 9:11 PM, Rishi Mishra  wrote:
> Hi Davies,
> When you say "UnsafeRow could come from UnsafeProjection, so We should copy
> the rows for safety."  do you intend to say that the underlying state might
> change , because of some state update APIs ?
> Or its due to some other rationale ?
>
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>
> On Thu, Mar 3, 2016 at 3:59 AM, Davies Liu  wrote:
>>
>> I see, we could reduce the memory by moving the copy out of the
>> HashedRelation,
>> then we should do the copy before call HashedRelation for shuffle hash
>> join.
>>
>> Another things is that when we do broadcasting, we will have another
>> serialized copy
>> of hash table.
>>
>> For the table that's larger than 100M, we may not suggest to use Broadcast
>> join,
>> because it take time to send it to every executor also take the same
>> amount of
>> memory on every executor.
>>
>> On Wed, Mar 2, 2016 at 10:45 AM, Matt Cheah  wrote:
>> > I would expect the memory pressure to grow because not only are we
>> > storing
>> > the backing array to the iterator of the rows on the driver, but we’re
>> > also storing a copy of each of those rows in the hash table. Whereas if
>> > we
>> > didn’t do the copy on the drive side then the hash table would only have
>> > to store pointers to those rows in the array. Perhaps we can think about
>> > whether or not we want to be using the HashedRelation constructs in
>> > broadcast join physical plans?
>> >
>> > The file isn’t compressed - it was a 150MB CSV living on HDFS. I would
>> > expect it to fit in a 1GB heap, but I agree that it is difficult to
>> > reason
>> > about dataset size on disk vs. memory.
>> >
>> > -Matt Cheah
>> >
>> > On 3/2/16, 10:15 AM, "Davies Liu"  wrote:
>> >
>> >>UnsafeHashedRelation and HashedRelation could also be used in Executor
>> >>(for non-broadcast hash join), then the UnsafeRow could come from
>> >>UnsafeProjection,
>> >>so We should copy the rows for safety.
>> >>
>> >>We could have a smarter copy() for UnsafeRow (avoid the copy if it's
>> >>already copied),
>> >>but I don't think this copy here will increase the memory pressure.
>> >>The total memory
>> >>will be determined by how many rows are stored in the hash tables.
>> >>
>> >>In general, if you do not have enough memory, just don't increase
>> >>autoBroadcastJoinThreshold,
>> >>or the performance could be worse because of full GC.
>> >>
>> >>Sometimes the tables looks small as compressed files (for example,
>> >>parquet file),
>> >>once it's loaded into memory, it could required much more memory than
>> >> the
>> >>size
>> >>of file on disk.
>> >>
>> >>
>> >>On Tue, Mar 1, 2016 at 5:17 PM, Matt Cheah  wrote:
>> >>> Hi everyone,
>> >>>
>> >>> I had a quick question regarding our implementation of
>> >>>UnsafeHashedRelation
>> >>> and HashedRelation. It appears that we copy the rows that we’ve
>> >>>collected
>> >>> into memory upon inserting them into the hash table in
>> >>> UnsafeHashedRelation#apply(). I was wondering why we are copying the
>> >>>rows
>> >>> every time? I can’t imagine these rows being mutable in this scenario.
>> >>>
>> >>> The context is that I’m looking into a case where a small data frame
>> >>>should
>> >>> fit in the driver’s memory, but my driver ran out of memory after I
>> >>> increased the autoBroadcastJoinThreshold. YourKit is indicating that
>> >>>this
>> >>> logic is consuming more memory than my driver can handle.
>> >>>
>> >>> Thanks,
>> >>>
>> >>> -Matt Cheah
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Making BatchPythonEvaluation actually Batch

2016-03-31 Thread Davies Liu
@Justin, it's fixed by https://github.com/apache/spark/pull/12057

On Thu, Feb 11, 2016 at 11:26 AM, Davies Liu  wrote:
> Had a quick look in your commit, I think that make sense, could you
> send a PR for that, then we can review it.
>
> In order to support 2), we need to change the serialized Python
> function from `f(iter)` to `f(x)`, process one row at a time (not a
> partition),
> then we can easily combine them together:
>
> for f1(f2(x))  and g1(g2(x)), we can do this in Python:
>
> for row in reading_stream:
>x1, x2 = row
>y1 = f1(f2(x1))
>y2 = g1(g2(x2))
>yield (y1, y2)
>
> For RDD, we still need to use `f(iter)`, but for SQL UDF, use `f(x)`.
>
> On Sun, Jan 31, 2016 at 1:37 PM, Justin Uang  wrote:
>> Hey guys,
>>
>> BLUF: sorry for the length of this email, trying to figure out how to batch
>> Python UDF executions, and since this is my first time messing with
>> catalyst, would like any feedback
>>
>> My team is starting to use PySpark UDFs quite heavily, and performance is a
>> huge blocker. The extra roundtrip serialization from Java to Python is not a
>> huge concern if we only incur it ~once per column for most workflows, since
>> it'll be in the same order of magnitude as reading files from disk. However,
>> right now each Python UDFs lead to a single roundtrip. There is definitely a
>> lot we can do regarding this:
>>
>> (all the prototyping code is here:
>> https://github.com/justinuang/spark/commit/8176749f8a6e6dc5a49fbbb952735ff40fb309fc)
>>
>> 1. We can't chain Python UDFs.
>>
>> df.select(python_times_2(python_times_2("col1")))
>>
>> throws an exception saying that the inner expression isn't evaluable. The
>> workaround is to do
>>
>>
>> df.select(python_times_2("col1").alias("tmp")).select(python_time_2("tmp"))
>>
>> This can be solved in ExtractPythonUDFs by always extracting the inner most
>> Python UDF first.
>>
>>  // Pick the UDF we are going to evaluate (TODO: Support evaluating
>> multiple UDFs at a time)
>>  // If there is more than one, we will add another evaluation
>> operator in a subsequent pass.
>> -udfs.find(_.resolved) match {
>> +udfs.find { udf =>
>> +  udf.resolved && udf.children.map { child: Expression =>
>> +child.find { // really hacky way to find if a child of a udf
>> has the PythonUDF node
>> +  case p: PythonUDF => true
>> +  case _ => false
>> +}.isEmpty
>> +  }.reduce((x, y) => x && y)
>> +} match {
>>case Some(udf) =>
>>  var evaluation: EvaluatePython = null
>>
>> 2. If we have a Python UDF applied to many different columns, where they
>> don’t depend on each other, we can optimize them by collapsing them down
>> into a single python worker. Although we have to serialize and send the same
>> amount of data to the python interpreter, in the case where I am applying
>> the same function to 20 columns, the overhead/context_switches of having 20
>> interpreters run at the same time causes huge performance hits. I have
>> confirmed this by manually taking the 20 columns, converting them to a
>> struct, and then writing a UDF that processes the struct at the same time,
>> and the speed difference is 2x. My approach to adding this to catalyst is
>> basically to write an optimizer rule called CombinePython which joins
>> adjacent EvaluatePython nodes that don’t depend on each other’s variables,
>> and then having BatchPythonEvaluation run multiple lambdas at once. I would
>> also like to be able to handle the case
>> df.select(python_times_2(“col1”).alias(“col1x2”)).select(F.col(“col1x2”),
>> python_times_2(“col1x2”).alias(“col1x4”)). To get around that, I add a
>> PushDownPythonEvaluation optimizer that will push the optimization through a
>> select/project, so that the CombinePython rule can join the two.
>>
>> 3. I would like CombinePython to be able to handle UDFs that chain off of
>> each other.
>>
>> df.select(python_times_2(python_times_2(“col1”)))
>>
>> I haven’t prototyped this yet, since it’s a lot more complex. The way I’m
>> thinking about this is to still have a rule called CombinePython, except
>> that the BatchPythonEvaluation will need to be smart enough to build up the
>> dag of dependencies, and then feed that information to the python
>> interpreter, so it can compute things in the right order, and reuse th

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Davies Liu
+1 (non-binding)

On Mon, Sep 26, 2016 at 9:36 AM, Joseph Bradley  wrote:
> +1
>
> On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee  wrote:
>>
>> +1 (non-binding)
>> On Sun, Sep 25, 2016 at 23:20 Jeff Zhang  wrote:
>>>
>>> +1
>>>
>>> On Mon, Sep 26, 2016 at 2:03 PM, Shixiong(Ryan) Zhu
>>>  wrote:

 +1

 On Sun, Sep 25, 2016 at 10:43 PM, Pete Lee 
 wrote:
>
> +1
>
>
> On Sun, Sep 25, 2016 at 3:26 PM, Herman van Hövell tot Westerflier
>  wrote:
>>
>> +1 (non-binding)
>>
>> On Sun, Sep 25, 2016 at 2:05 PM, Ricardo Almeida
>>  wrote:
>>>
>>> +1 (non-binding)
>>>
>>> Built and tested on
>>> - Ubuntu 16.04 / OpenJDK 1.8.0_91
>>> - CentOS / Oracle Java 1.7.0_55
>>> (-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver
>>> -Pyarn)
>>>
>>>
>>> On 25 September 2016 at 22:35, Matei Zaharia
>>>  wrote:

 +1

 Matei

 On Sep 25, 2016, at 1:25 PM, Josh Rosen 
 wrote:

 +1

 On Sun, Sep 25, 2016 at 1:16 PM Yin Huai 
 wrote:
>
> +1
>
> On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun
>  wrote:
>>
>> +1 (non binding)
>>
>> RC3 is compiled and tested on the following two systems, too. All
>> tests passed.
>>
>> * CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1
>>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>> -Phive-thriftserver -Dsparkr
>> * CentOS 7.2 / Open JDK 1.8.0_102
>>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>> -Phive-thriftserver
>>
>> Cheers,
>> Dongjoon
>>
>>
>>
>> On Saturday, September 24, 2016, Reynold Xin 
>> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 
>>> PDT and
>>> passes if a majority of at least 3+1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.1
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v2.0.1-rc3
>>> (9d28cc10357a8afcfb2fa2e6eecb5c2cc2730d17)
>>>
>>> This release candidate resolves 290 issues:
>>> https://s.apache.org/spark-2.0.1-jira
>>>
>>> The release files, including signatures, digests, etc. can be
>>> found at:
>>>
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1201/
>>>
>>> The documentation corresponding to this release can be found at:
>>>
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/
>>>
>>>
>>> Q: How can I help test this release?
>>> A: If you are a Spark user, you can help us test this release by
>>> taking an existing Spark workload and running on this release 
>>> candidate,
>>> then reporting any regressions from 2.0.0.
>>>
>>> Q: What justifies a -1 vote for this release?
>>> A: This is a maintenance release in the 2.0.x series.  Bugs
>>> already present in 2.0.0, missing features, or bugs related to new 
>>> features
>>> will not necessarily block this release.
>>>
>>> Q: What fix version should I use for patches merging into
>>> branch-2.0 from now on?
>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a
>>> new RC (i.e. RC4) is cut, I will change the fix version of those 
>>> patches to
>>> 2.0.1.
>>>
>>>
>

>>>
>>
>

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Davies Liu
+1

On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin  wrote:
> Greetings from Spark Summit Europe at Brussels.
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc1
> (1c2908eeb8890fdc91413a3f5bad2bb3d114db6c)
>
> This release candidate resolves 75 issues:
> https://s.apache.org/spark-2.0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1208/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Davies Liu
+1 for Matei's point.

On Thu, Oct 27, 2016 at 8:36 AM, Matei Zaharia  wrote:
> Just to comment on this, I'm generally against removing these types of
> things unless they create a substantial burden on project contributors. It
> doesn't sound like Python 2.6 and Java 7 do that yet -- Scala 2.10 might,
> but then of course we need to wait for 2.12 to be out and stable.
>
> In general, this type of stuff only hurts users, and doesn't have a huge
> impact on Spark contributors' productivity (sure, it's a bit unpleasant, but
> that's life). If we break compatibility this way too quickly, we fragment
> the user community, and then either people have a crappy experience with
> Spark because their corporate IT doesn't yet have an environment that can
> run the latest version, or worse, they create more maintenance burden for us
> because they ask for more patches to be backported to old Spark versions
> (1.6.x, 2.0.x, etc). Python in particular is pretty fundamental to many
> Linux distros.
>
> In the future, rather than just looking at when some software came out, it
> may be good to have some criteria for when to drop support for something.
> For example, if there are really nice libraries in Python 2.7 or Java 8 that
> we're missing out on, that may be a good reason. The maintenance burden for
> multiple Scala versions is definitely painful but I also think we should
> always support the latest two Scala releases.
>
> Matei
>
> On Oct 27, 2016, at 12:15 PM, Reynold Xin  wrote:
>
> I created a JIRA ticket to track this:
> https://issues.apache.org/jira/browse/SPARK-18138
>
>
>
> On Thu, Oct 27, 2016 at 10:19 AM, Steve Loughran 
> wrote:
>>
>>
>> On 27 Oct 2016, at 10:03, Sean Owen  wrote:
>>
>> Seems OK by me.
>> How about Hadoop < 2.6, Python 2.6? Those seem more removeable. I'd like
>> to add that to a list of things that will begin to be unsupported 6 months
>> from now.
>>
>>
>> If you go to java 8 only, then hadoop 2.6+ is mandatory.
>>
>>
>> On Wed, Oct 26, 2016 at 8:49 PM Koert Kuipers  wrote:
>>>
>>> that sounds good to me
>>>
>>> On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin  wrote:

 We can do the following concrete proposal:

 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0
 (Mar/Apr 2017).

 2. In Spark 2.1.0 release, aggressively and explicitly announce the
 deprecation of Java 7 / Scala 2.10 support.

 (a) It should appear in release notes, documentations that mention how
 to build Spark

 (b) and a warning should be shown every time SparkContext is started
 using Scala 2.10 or Java 7.

>>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Adding third party jars to classpath used by pyspark

2014-12-30 Thread Davies Liu
On Mon, Dec 29, 2014 at 7:39 PM, Jeremy Freeman
 wrote:
> Hi Stephen, it should be enough to include
>
>> --jars /path/to/file.jar
>
> in the command line call to either pyspark or spark-submit, as in
>
>> spark-submit --master local --jars /path/to/file.jar myfile.py

Unfortunately, you also need '--driver-class-path /path/to/file.jar'
to make it accessible in the driver. (This may be fixed in 1.3.)
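
So the full invocation would look something like this (paths are
placeholders):

spark-submit --master local \
  --jars /path/to/file.jar \
  --driver-class-path /path/to/file.jar \
  myfile.py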

> and you can check the bottom of the Web UI’s “Environment" tab to make sure 
> the jar gets on your classpath. Let me know if you still see errors related 
> to this.
>
> — Jeremy
>
> -
> jeremyfreeman.net
> @thefreemanlab
>
> On Dec 29, 2014, at 7:55 PM, Stephen Boesch  wrote:
>
>> What is the recommended way to do this?  We have some native database
>> client libraries for which we are adding pyspark bindings.
>>
>> The pyspark invokes spark-submit.   Do we add our libraries to
>> the SPARK_SUBMIT_LIBRARY_PATH ?
>>
>> This issue relates back to an error we have been seeing "Py4jError: Trying
>> to call a package" - the suspicion being that the third party libraries may
>> not be available on the jvm side.
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Help, pyspark.sql.List flatMap results become tuple

2014-12-30 Thread Davies Liu
This should be fixed in 1.2, could you try it?
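
For example, on 1.2 something like this should give you Rows back and let
you access fields by name (based on the schema in your mail; the expected
output is my assumption):

X = A400.flatMap(lambda i: i.INTEREST)
X.map(lambda r: (r.INTEREST_NO, r.INFO)).take(2)
# expected: [(1, u'x'), (2, u'y')]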

On Mon, Dec 29, 2014 at 8:04 PM, guoxu1231  wrote:
> Hi pyspark guys,
>
> I have a json file, and its structure is like below:
>
> {"NAME":"George", "AGE":35, "ADD_ID":1212, "POSTAL_AREA":1,
> "TIME_ZONE_ID":1, "INTEREST":[{"INTEREST_NO":1, "INFO":"x"},
> {"INTEREST_NO":2, "INFO":"y"}]}
> {"NAME":"John", "AGE":45, "ADD_ID":1213, "POSTAL_AREA":1, "TIME_ZONE_ID":1,
> "INTEREST":[{"INTEREST_NO":2, "INFO":"x"}, {"INTEREST_NO":3, "INFO":"y"}]}
>
> I'm using spark sql api to manipulate the json data in pyspark shell,
>
> *sqlContext = SQLContext(sc)
> A400= sqlContext.jsonFile('jason_file_path')*
> /Row(ADD_ID=1212, AGE=35, INTEREST=[Row(INFO=u'x', INTEREST_NO=1),
> Row(INFO=u'y', INTEREST_NO=2)], NAME=u'George', POSTAL_AREA=1,
> TIME_ZONE_ID=1)
> Row(ADD_ID=1213, AGE=45, INTEREST=[Row(INFO=u'x', INTEREST_NO=2),
> Row(INFO=u'y', INTEREST_NO=3)], NAME=u'John', POSTAL_AREA=1,
> TIME_ZONE_ID=1)/
> *X = A400.flatMap(lambda i: i.INTEREST)*
> The flatMap results are like below; each element in the json array was
> flattened to a tuple, not the pyspark.sql.Row I expected. I can only access
> the flattened results by index, but it is supposed to be flattened to a
> Row (namedtuple) and support access by name.
> (u'x', 1)
> (u'y', 2)
> (u'x', 2)
> (u'y', 3)
>
> My spark version is 1.1.
>
>
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Help-pyspark-sql-List-flatMap-results-become-tuple-tp9961.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Python to Java object conversion of numpy array

2015-01-09 Thread Davies Liu
Hey Meethu,

The Java API accepts only Vector, so you should convert the numpy array into
pyspark.mllib.linalg.DenseVector.

BTW, which class are you using? KMeansModel.predict() accepts numpy.array;
it will do the conversion for you.
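
A minimal sketch of the conversion (the values are just taken from the
example in this thread):

import numpy as np
from pyspark.mllib.linalg import Vectors

arr = np.array([0.8786, -0.7855])
vec = Vectors.dense(arr)   # DenseVector, which the Java API accepts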

Davies

On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew  wrote:
> Hi,
> I am trying to send a numpy array as an argument to a function predict() in
> a class in spark/python/pyspark/mllib/clustering.py which is passed to the
> function callMLlibFunc(name, *args)  in
> spark/python/pyspark/mllib/common.py.
>
> Now the value is passed to the function _py2java(sc, obj). Here I am
> getting an exception:
>
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.mllib.api.python.SerDe.loads.
> : net.razorvine.pickle.PickleException: expected zero arguments for
> construction of ClassDict (for numpy.core.multiarray._reconstruct)
> at
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
>
>
> Why common._py2java(sc, obj) is not handling numpy array type?
>
> Please help..
>
>
> --
>
> Regards,
>
> *Meethu Mathew*
>
> *Engineer*
>
> *Flytxt*
>
> www.flytxt.com | Visit our blog  | Follow us
>  | _Connect on Linkedin
> _
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Python to Java object conversion of numpy array

2015-01-11 Thread Davies Liu
Could you post a piece of code here?

On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew  wrote:
> Hi,
> Thanks Davies .
>
> I added a new class GaussianMixtureModel in clustering.py and the method
> predict in it, and I am trying to pass a numpy array from this method. I
> converted it to DenseVector and it's solved now.
>
> Similarly I tried passing a List  of more than one dimension to the function
> _py2java , but now the exception is
>
> 'list' object has no attribute '_get_object_id'
>
> and when I give a tuple input (Vectors.dense([0.8786,
> -0.7855]),Vectors.dense([-0.1863, 0.7799])) exception is like
>
> 'numpy.ndarray' object has no attribute '_get_object_id'
>
> Regards,
>
>
>
> Meethu Mathew
>
> Engineer
>
> Flytxt
>
> www.flytxt.com | Visit our blog  |  Follow us | Connect on Linkedin
>
>
>
> On Friday 09 January 2015 11:37 PM, Davies Liu wrote:
>
> Hey Meethu,
>
> The Java API accepts only Vector, so you should convert the numpy array into
> pyspark.mllib.linalg.DenseVector.
>
> BTW, which class are you using? the KMeansModel.predict() accept
> numpy.array,
> it will do the conversion for you.
>
> Davies
>
> On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew 
> wrote:
>
> Hi,
> I am trying to send a numpy array as an argument to a function predict() in
> a class in spark/python/pyspark/mllib/clustering.py which is passed to the
> function callMLlibFunc(name, *args)  in
> spark/python/pyspark/mllib/common.py.
>
> Now the value is passed to the function  _py2java(sc, obj) .Here I am
> getting an exception
>
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.mllib.api.python.SerDe.loads.
> : net.razorvine.pickle.PickleException: expected zero arguments for
> construction of ClassDict (for numpy.core.multiarray._reconstruct)
> at
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
>
>
> Why common._py2java(sc, obj) is not handling numpy array type?
>
> Please help..
>
>
> --
>
> Regards,
>
> *Meethu Mathew*
>
> *Engineer*
>
> *Flytxt*
>
> www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us
> <http://www.twitter.com/flytxt> | _Connect on Linkedin
> <http://www.linkedin.com/home?trk=hb_tab_home_top>_
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Python to Java object conversion of numpy array

2015-01-12 Thread Davies Liu
On Sun, Jan 11, 2015 at 10:21 PM, Meethu Mathew
 wrote:
> Hi,
>
> This is the code I am running.
>
> mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799]))
>
> membershipMatrix = callMLlibFunc("findPredict", rdd.map(_convert_to_vector),
> mu)

What does the Java API look like? All the arguments of findPredict
should be converted into Java objects, so what should `mu` be
converted to?

> Regards,
> Meethu
> On Monday 12 January 2015 11:46 AM, Davies Liu wrote:
>
> Could you post a piece of code here?
>
> On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew 
> wrote:
>
> Hi,
> Thanks Davies .
>
> I added a new class GaussianMixtureModel in clustering.py and the method
> predict in it and trying to pass numpy array from this method.I converted it
> to DenseVector and its solved now.
>
> Similarly I tried passing a List  of more than one dimension to the function
> _py2java , but now the exception is
>
> 'list' object has no attribute '_get_object_id'
>
> and when I give a tuple input (Vectors.dense([0.8786,
> -0.7855]),Vectors.dense([-0.1863, 0.7799])) exception is like
>
> 'numpy.ndarray' object has no attribute '_get_object_id'
>
> Regards,
>
>
>
> Meethu Mathew
>
> Engineer
>
> Flytxt
>
> www.flytxt.com | Visit our blog  |  Follow us | Connect on Linkedin
>
>
>
> On Friday 09 January 2015 11:37 PM, Davies Liu wrote:
>
> Hey Meethu,
>
> The Java API accepts only Vector, so you should convert the numpy array into
> pyspark.mllib.linalg.DenseVector.
>
> BTW, which class are you using? the KMeansModel.predict() accept
> numpy.array,
> it will do the conversion for you.
>
> Davies
>
> On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew 
> wrote:
>
> Hi,
> I am trying to send a numpy array as an argument to a function predict() in
> a class in spark/python/pyspark/mllib/clustering.py which is passed to the
> function callMLlibFunc(name, *args)  in
> spark/python/pyspark/mllib/common.py.
>
> Now the value is passed to the function  _py2java(sc, obj) .Here I am
> getting an exception
>
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.mllib.api.python.SerDe.loads.
> : net.razorvine.pickle.PickleException: expected zero arguments for
> construction of ClassDict (for numpy.core.multiarray._reconstruct)
> at
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
>
>
> Why common._py2java(sc, obj) is not handling numpy array type?
>
> Please help..
>
>
> --
>
> Regards,
>
> *Meethu Mathew*
>
> *Engineer*
>
> *Flytxt*
>
> www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us
> <http://www.twitter.com/flytxt> | _Connect on Linkedin
> <http://www.linkedin.com/home?trk=hb_tab_home_top>_
>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Python to Java object conversion of numpy array

2015-01-13 Thread Davies Liu
On Mon, Jan 12, 2015 at 8:14 PM, Meethu Mathew  wrote:
> Hi,
>
> This is the function defined in PythonMLLibAPI.scala
> def findPredict(
>   data: JavaRDD[Vector],
>   wt: Object,
>   mu: Array[Object],
>   si: Array[Object]):  RDD[Array[Double]]  = {
> }
>
> So the parameter mu should be converted to Array[object].
>
> mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799]))
>
> def _py2java(sc, obj):
>
> if isinstance(obj, RDD):
> ...
> elif isinstance(obj, SparkContext):
>   ...
> elif isinstance(obj, dict):
>...
> elif isinstance(obj, (list, tuple)):
> obj = ListConverter().convert(obj, sc._gateway._gateway_client)
> elif isinstance(obj, JavaObject):
> pass
> elif isinstance(obj, (int, long, float, bool, basestring)):
> pass
> else:
> bytes = bytearray(PickleSerializer().dumps(obj))
> obj = sc._jvm.SerDe.loads(bytes)
> return obj
>
> Since its a tuple of Densevectors, in _py2java() its entering the
> isinstance(obj, (list, tuple)) condition and throwing exception(happens
> because the dimension of tuple >1). However the conversion occurs correctly
> if the Pickle conversion is done (last else part).

I see. We should remove the special case for list and tuple; pickle should work
more reliably for them. I tried removing it, and it did not break any tests.

Could you do it in your PR, or should I create a PR for it separately?
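
For reference, here is a rough sketch (not a final patch) of what _py2java could look like with that special case removed, so that lists, tuples, and whatever they contain all go through pickle. The elided branches stand for the existing handling quoted above:

from py4j.java_gateway import JavaObject
from pyspark import RDD, SparkContext
from pyspark.serializers import PickleSerializer

def _py2java(sc, obj):
    if isinstance(obj, RDD):
        pass  # existing RDD handling, unchanged
    elif isinstance(obj, SparkContext):
        pass  # existing SparkContext handling, unchanged
    elif isinstance(obj, dict):
        pass  # existing dict handling, unchanged
    elif isinstance(obj, JavaObject):
        pass
    elif isinstance(obj, (int, long, float, bool, basestring)):
        pass
    else:
        # lists, tuples, numpy arrays, Vectors, ... all fall through to pickle
        data = bytearray(PickleSerializer().dumps(obj))
        obj = sc._jvm.SerDe.loads(data)
    return obj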

> Hope its clear now.
>
> Regards,
> Meethu
>
> On Monday 12 January 2015 11:35 PM, Davies Liu wrote:
>
> On Sun, Jan 11, 2015 at 10:21 PM, Meethu Mathew
>  wrote:
>
> Hi,
>
> This is the code I am running.
>
> mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799]))
>
> membershipMatrix = callMLlibFunc("findPredict", rdd.map(_convert_to_vector),
> mu)
>
> What's the Java API looks like? all the arguments of findPredict
> should be converted
> into java objects, so what should `mu` be converted to?
>
> Regards,
> Meethu
> On Monday 12 January 2015 11:46 AM, Davies Liu wrote:
>
> Could you post a piece of code here?
>
> On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew 
> wrote:
>
> Hi,
> Thanks Davies .
>
> I added a new class GaussianMixtureModel in clustering.py and the method
> predict in it and trying to pass numpy array from this method.I converted it
> to DenseVector and its solved now.
>
> Similarly I tried passing a List  of more than one dimension to the function
> _py2java , but now the exception is
>
> 'list' object has no attribute '_get_object_id'
>
> and when I give a tuple input (Vectors.dense([0.8786,
> -0.7855]),Vectors.dense([-0.1863, 0.7799])) exception is like
>
> 'numpy.ndarray' object has no attribute '_get_object_id'
>
> Regards,
>
>
>
> Meethu Mathew
>
> Engineer
>
> Flytxt
>
> www.flytxt.com | Visit our blog  |  Follow us | Connect on Linkedin
>
>
>
> On Friday 09 January 2015 11:37 PM, Davies Liu wrote:
>
> Hey Meethu,
>
> The Java API accepts only Vector, so you should convert the numpy array into
> pyspark.mllib.linalg.DenseVector.
>
> BTW, which class are you using? the KMeansModel.predict() accept
> numpy.array,
> it will do the conversion for you.
>
> Davies
>
> On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew 
> wrote:
>
> Hi,
> I am trying to send a numpy array as an argument to a function predict() in
> a class in spark/python/pyspark/mllib/clustering.py which is passed to the
> function callMLlibFunc(name, *args)  in
> spark/python/pyspark/mllib/common.py.
>
> Now the value is passed to the function  _py2java(sc, obj) .Here I am
> getting an exception
>
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.mllib.api.python.SerDe.loads.
> : net.razorvine.pickle.PickleException: expected zero arguments for
> construction of ClassDict (for numpy.core.multiarray._reconstruct)
> at
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
>
>
> Why common._py2java(sc, obj) is not handling numpy array type?
>
> Please help..
>
>
> --
>
> Regards,
>
> *Meethu Mathew*
>
> *Engineer*
>
> *Flytxt*
>
> www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us
> <http://www.twitter.com/flytxt> | _Connect on Linkedin
> <http://www.linkedin.com/home?trk=hb_tab_home_top>_
>
>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Use of MapConverter, ListConverter in python to java object conversion

2015-01-13 Thread Davies Liu
It's not necessary; I will create a PR to remove them.

For larger dicts/lists/tuples, the pickle approach may need fewer RPC
calls and give better performance.

Davies

On Tue, Jan 13, 2015 at 4:53 AM, Meethu Mathew  wrote:
> Hi all,
>
> In the python object to java conversion done in the method _py2java in
> spark/python/pyspark/mllib/common.py, why  we are doing individual
> conversion  using MpaConverter,ListConverter? The same can be acheived using
>
> bytearray(PickleSerializer().dumps(obj))
> obj = sc._jvm.SerDe.loads(bytes)
>
> Is there any performance gain or something in using individual converters
> rather than PickleSerializer?
>
> --
>
> Regards,
>
> *Meethu*

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Davies Liu
Hey,

Without making Python as fast as Scala/Java, I think it's impossible to get similar
performance in PySpark as in Scala/Java. Jython is also much slower than
Scala/Java.

With Jython, we can avoid the cost of managing multiple processes and RPC, but
we may still need to do the data conversion between Java and Python.
Given the fact that Jython is not widely used in production, it may introduce
more trouble than the performance gain is worth.

Spark jobs can easily be sped up by scaling out (adding more resources).
I think the biggest advantage of PySpark is that it lets you prototype quickly.
Once you have your ETL finalized, it's not that hard to translate your
pure Python jobs into Scala to reduce the cost (and that step is optional).

Nowadays, engineer time is much more expensive than CPU time, so I think we
should focus more on the former.

That's my 2 cents.

Davies

On Thu, Jan 29, 2015 at 12:45 PM, rtshadow
 wrote:
> Hi,
>
> In my company, we've been trying to use PySpark to run ETLs on our data.
> Alas, it turned out to be terribly slow compared to Java or Scala API (which
> we ended up using to meet performance criteria).
>
> To be more quantitative, let's consider simple case:
> I've generated test file (848MB): /seq 1 1 > /tmp/test/
>
> and tried to run simple computation on it, which includes three steps: read
> -> multiply each row by 2 -> take max
> Code in python: /sc.textFile("/tmp/test").map(lambda x: x * 2).max()/
> Code in scala: /sc.textFile("/tmp/test").map(x => x * 2).max()/
>
> Here are the results of this simple benchmark:
> CPython - 59s
> PyPy - 26s
> Scala version - 7s
>
> I didn't dig into what exactly contributes to execution times of CPython /
> PyPy, but it seems that serialization / deserialization, when sending data
> to the worker may be the issue.
> I know some guys already have been asking about using Jython
> (http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658,
> http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html),
> but it seems, that no one have really done this with Spark.
> It looks like performance gain from using jython can be huge - you wouldn't
> need to spawn PythonWorkers, all the code would be just executed inside
> SparkExecutor JVM, using python code compiled to java bytecode. Do you think
> that's possible to achieve? Do you see any obvious obstacles? Of course,
> jython doesn't have C extensions, but if one doesn't need them, then it
> should fit here nicely.
>
> I'm willing to try to marry Spark with Jython and see how it goes.
>
> What do you think about this?
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Davies Liu
The CallbackServer is part of Py4J; it's only used in the driver, not
in slaves or workers.

On Wed, Feb 11, 2015 at 12:32 AM, Todd Gao
 wrote:
> Hi all,
>
> I am reading the code of PySpark and its Streaming module.
>
> In PySpark Streaming, when the `compute` method of the instance of
> PythonTransformedDStream is invoked, a connection to the CallbackServer
> is created internally.
> I wonder where is the CallbackServer for each PythonTransformedDStream
> instance on the slave nodes in distributed environment.
> Is there a CallbackServer running on every slave node?
>
> thanks
> Todd

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Davies Liu
Yes.

On Wed, Feb 11, 2015 at 5:44 PM, Todd Gao  wrote:
> Thanks Davies.
> I am not quite familiar with Spark Streaming. Do you mean that the compute
> routine of DStream is only invoked in the driver node,
> while only the compute routines of RDD are distributed to the slaves?
>
> On Thu, Feb 12, 2015 at 2:38 AM, Davies Liu  wrote:
>>
>> The CallbackServer is part of Py4j, it's only used in driver, not used
>> in slaves or workers.
>>
>> On Wed, Feb 11, 2015 at 12:32 AM, Todd Gao
>>  wrote:
>> > Hi all,
>> >
>> > I am reading the code of PySpark and its Streaming module.
>> >
>> > In PySpark Streaming, when the `compute` method of the instance of
>> > PythonTransformedDStream is invoked, a connection to the CallbackServer
>> > is created internally.
>> > I wonder where is the CallbackServer for each PythonTransformedDStream
>> > instance on the slave nodes in distributed environment.
>> > Is there a CallbackServer running on every slave node?
>> >
>> > thanks
>> > Todd
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: functools.partial as UserDefinedFunction

2015-03-25 Thread Davies Liu
It’s good to support functools.partial, could you file a JIRA for it?
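
In the meantime, one possible workaround (just a sketch, not part of PySpark; the function and column names are made up) is to wrap the partial in a plain function that carries the __name__ attribute that functions.py relies on:

from functools import partial
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def add(a, b):
    return a + b

add_two = partial(add, 2)

def named(f, name):
    # plain wrapper so the callable exposes a __name__ attribute
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    wrapper.__name__ = name
    return wrapper

add_two_udf = udf(named(add_two, "add_two"), IntegerType())
# usage, assuming an existing DataFrame df with an integer column "x":
# df.select(add_two_udf(df["x"]))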


On Wednesday, March 25, 2015 at 5:42 AM, Karlson wrote:

>  
> Hi all,
>  
> passing a functools.partial-function as a UserDefinedFunction to  
> DataFrame.select raises an AttributeException, because functools.partial  
> does not have the attribute __name__. Is there any alternative to  
> relying on __name__ in pyspark/sql/functions.py:126 ?
>  
>  
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> (mailto:dev-unsubscr...@spark.apache.org)
> For additional commands, e-mail: dev-h...@spark.apache.org 
> (mailto:dev-h...@spark.apache.org)
>  
>  




Re: Iterative pyspark / scala codebase development

2015-03-27 Thread Davies Liu
see https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools

On Fri, Mar 27, 2015 at 10:02 AM, Stephen Boesch  wrote:
> I am iteratively making changes to the scala side of some new pyspark code
> and re-testing from the python/pyspark side.
>
> Presently my only solution is to rebuild completely
>
>   sbt assembly
>
> after any scala side change - no matter how small.
>
> Any better / expedited way for pyspark to see small scala side updates?

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Iterative pyspark / scala codebase development

2015-03-27 Thread Davies Liu
I usually just open a terminal to do `build/sbt ~compile`, coding in
IntelliJ, then run python tests in another terminal once it compiled
successfully.

On Fri, Mar 27, 2015 at 10:11 AM, Reynold Xin  wrote:
> Python is tough if you need to change Scala at the same time.
>
> sbt/sbt assembly/assembly
>
> can be slightly faster than just assembly.
>
>
> On Fri, Mar 27, 2015 at 10:02 AM, Stephen Boesch  wrote:
>
>> I am iteratively making changes to the scala side of some new pyspark code
>> and re-testing from the python/pyspark side.
>>
>> Presently my only solution is to rebuild completely
>>
>>   sbt assembly
>>
>> after any scala side change - no matter how small.
>>
>> Any better / expedited way for pyspark to see small scala side updates?
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Iterative pyspark / scala codebase development

2015-03-27 Thread Davies Liu
put these lines in your ~/.bash_profile

export SPARK_PREPEND_CLASSES=true
export SPARK_HOME=path_to_spark
export 
PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip:${SPARK_HOME}/python:${PYTHONPATH}"

$ source ~/.bash_profile
$ build/sbt assembly
$ build/sbt ~compile  # do not stop this

Then in another terminal you could run python tests as
$ cd python/pyspark/
$  python rdd.py


cc to dev list


On Fri, Mar 27, 2015 at 10:15 AM, Stephen Boesch  wrote:
> Which aspect of that page are you suggesting provides a more optimized
> alternative?
>
> 2015-03-27 10:13 GMT-07:00 Davies Liu :
>
>> see
>> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
>>
>> On Fri, Mar 27, 2015 at 10:02 AM, Stephen Boesch 
>> wrote:
>> > I am iteratively making changes to the scala side of some new pyspark
>> > code
>> > and re-testing from the python/pyspark side.
>> >
>> > Presently my only solution is to rebuild completely
>> >
>> >   sbt assembly
>> >
>> > after any scala side change - no matter how small.
>> >
>> > Any better / expedited way for pyspark to see small scala side updates?
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Iterative pyspark / scala codebase development

2015-03-27 Thread Davies Liu
On Fri, Mar 27, 2015 at 4:16 PM, Stephen Boesch  wrote:
> Thx much!  This works.
>
> My workflow is making changes to files in Intelij and running ipython to
> execute pyspark.
>
> Is there any way for ipython to "see the updated class files without first
> exiting?

No, the IPython shell is stateful; it will have unexpected behavior if
you reload the library.

> 2015-03-27 10:21 GMT-07:00 Davies Liu :
>
>> put these lines in your ~/.bash_profile
>>
>> export SPARK_PREPEND_CLASSES=true
>> export SPARK_HOME=path_to_spark
>> export
>> PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip:${SPARK_HOME}/python:${PYTHONPATH}"
>>
>> $ source ~/.bash_profile
>> $ build/sbt assembly
>> $ build/sbt ~compile  # do not stop this
>>
>> Then in another terminal you could run python tests as
>> $ cd python/pyspark/
>> $  python rdd.py
>>
>>
>> cc to dev list
>>
>>
>> On Fri, Mar 27, 2015 at 10:15 AM, Stephen Boesch 
>> wrote:
>> > Which aspect of that page are you suggesting provides a more optimized
>> > alternative?
>> >
>> > 2015-03-27 10:13 GMT-07:00 Davies Liu :
>> >
>> >> see
>> >>
>> >> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
>> >>
>> >> On Fri, Mar 27, 2015 at 10:02 AM, Stephen Boesch 
>> >> wrote:
>> >> > I am iteratively making changes to the scala side of some new pyspark
>> >> > code
>> >> > and re-testing from the python/pyspark side.
>> >> >
>> >> > Presently my only solution is to rebuild completely
>> >> >
>> >> >   sbt assembly
>> >> >
>> >> > after any scala side change - no matter how small.
>> >> >
>> >> > Any better / expedited way for pyspark to see small scala side
>> >> > updates?
>> >
>> >
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Haskell language Spark support

2015-04-03 Thread Davies Liu
The PR for integrating SparkR into Spark may help:
https://github.com/apache/spark/pull/5096

-- 
Davies Liu
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, March 25, 2015 at 7:35 PM, danilo2 wrote:

> Hi!
> I'm a haskell developer and I have created many haskell libraries in my life
> and some GHC extensions.
> I would like to create Haskell binding for Spark. Where can I find any
> documentation / sources describing the first steps in creation of a new
> language binding?
> 
> I would be very thankful for any help! :)
> 
> Al the best,
> Wojciech
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Haskell-language-Spark-support-tp11257.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> (mailto:dev-unsubscr...@spark.apache.org)
> For additional commands, e-mail: dev-h...@spark.apache.org 
> (mailto:dev-h...@spark.apache.org)
> 
> 




Re: Query regarding infering data types in pyspark

2015-04-10 Thread Davies Liu
What's the format you have in json file?

On Fri, Apr 10, 2015 at 6:57 PM, Suraj Shetiya  wrote:
> Hi,
>
> In pyspark when if I read a json file using sqlcontext I find that the date
> field is not infered as date instead it is converted to string. And when I
> try to convert it to date using df.withColumn(df.DateCol.cast("timestamp"))
> it does not parse it successfuly and adds a null instead there. Should I
> use UDF to convert the date ? Is this expected behaviour (not throwing an
> error after failure to cast all fields)?
>
> --
> Regards,
> Suraj

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Query regarding infering data types in pyspark

2015-04-13 Thread Davies Liu
Hey Suraj,

You should use "date" for the DataType:

df.withColumn("DateCol", df.DateCol.cast("date"))
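
For illustration, a minimal sketch using the FL_DATE column from the sample record quoted below (assuming `df` was loaded from that JSON file; the new column name is made up):

# FL_DATE is inferred as a string; add a casted copy and check the schema
df2 = df.withColumn("FL_DATE_parsed", df["FL_DATE"].cast("date"))
df2.printSchema()   # FL_DATE_parsed should now show up as a date column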

Davies

On Sat, Apr 11, 2015 at 10:57 PM, Suraj Shetiya  wrote:
> Humble reminder
>
> On Sat, Apr 11, 2015 at 12:16 PM, Suraj Shetiya 
> wrote:
>>
>> Hi,
>>
>> Below is one line from the json file.
>> I have highlighted the field that represents the date.
>>
>>
>> "YEAR":2015,"QUARTER":1,"MONTH":1,"DAY_OF_MONTH":31,"DAY_OF_WEEK":6,"FL_DATE":"2015-01-31","UNIQUE_CARRIER":"NK","AI
>> RLINE_ID":20416,"CARRIER":"NK","TAIL_NUM":"N614NK","FL_NUM":126,"ORIGIN_AIRPORT_ID":11697,"ORIGIN_AIRPORT_SEQ_ID":1169
>> 703,"ORIGIN_CITY_MARKET_ID":32467,"ORIGIN":"FLL","ORIGIN_CITY_NAME":"Fort
>> Lauderdale, FL","ORIGIN_STATE_ABR":"FL","ORI
>> GIN_STATE_FIPS":12,"ORIGIN_STATE_NM":"Florida","ORIGIN_WAC":33,"DEST_AIRPORT_ID":13577,"DEST_AIRPORT_SEQ_ID":1357702,"
>> DEST_CITY_MARKET_ID":31135,"DEST":"MYR","DEST_CITY_NAME":"Myrtle Beach,
>> SC","DEST_STATE_ABR":"SC","DEST_STATE_FIPS":45,"DEST_STATE_NM":"South
>> Carolina","DEST_WAC":37,"CRS_DEP_TIME":2010,"DEP_TIME":2009.0,"DEP_DELAY":-1.0,"DEP_DELAY_NEW"
>> :0.0,"DEP_DEL15":0.0,"DEP_DELAY_GROUP":-1.0,"DEP_TIME_BLK":"2000-2059","TAXI_OUT":17.0,"WHEELS_OFF":2026.0,"WHEELS_ON"
>> :2147.0,"TAXI_IN":5.0,"CRS_ARR_TIME":2149,"ARR_TIME":2152.0,"ARR_DELAY":3.0,"ARR_DELAY_NEW":3.0,"ARR_DEL15":0.0,"ARR_DELAY_GROUP":0.0,"ARR_TIME_BLK":"2100-2159","Unnamed:
>> 47":null}
>>
>> Please let me know if you need access to the dataset.
>>
>> On Sat, Apr 11, 2015 at 11:56 AM, Davies Liu 
>> wrote:
>>>
>>> What's the format you have in json file?
>>>
>>> On Fri, Apr 10, 2015 at 6:57 PM, Suraj Shetiya 
>>> wrote:
>>> > Hi,
>>> >
>>> > In pyspark when if I read a json file using sqlcontext I find that the
>>> > date
>>> > field is not infered as date instead it is converted to string. And
>>> > when I
>>> > try to convert it to date using
>>> > df.withColumn(df.DateCol.cast("timestamp"))
>>> > it does not parse it successfuly and adds a null instead there. Should
>>> > I
>>> > use UDF to convert the date ? Is this expected behaviour (not throwing
>>> > an
>>> > error after failure to cast all fields)?
>>> >
>>> > --
>>> > Regards,
>>> > Suraj
>>
>>
>>
>>
>> --
>> Regards,
>> Suraj
>
>
>
>
> --
> Regards,
> Suraj

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: extended jenkins downtime, thursday april 9th 7am-noon PDT (moving to anaconda python & more)

2015-04-14 Thread Davies Liu
Hey Shane,

Have you updated all the jenkins slaves?

There is a run with old configurations (no Python 3, with 130 minutes
timeout), see 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/666/consoleFull

Davies

On Thu, Apr 9, 2015 at 10:18 AM, shane knapp  wrote:
> ok, we're looking good.  i'll keep an eye on this for the rest of the day,
> and if you happen to notice any infrastructure failures before i do (i
> updated a LOT), please let me know immediately!  :)
>
> On Thu, Apr 9, 2015 at 8:38 AM, shane knapp  wrote:
>
>> things are looking pretty good and i expect to be done within an hour.
>>  i've got some test builds running right now, and will give the green light
>> when they successfully complete.
>>
>> On Thu, Apr 9, 2015 at 7:29 AM, shane knapp  wrote:
>>
>>> and this is now happening.
>>>
>>> On Tue, Apr 7, 2015 at 4:38 PM, shane knapp  wrote:
>>>
>>>> reminder!  this is happening thurday morning.
>>>>
>>>> On Fri, Apr 3, 2015 at 9:59 AM, shane knapp  wrote:
>>>>
>>>>> welcome to python2.7+, java 8 and more!  :)
>>>>>
>>>>> i'll be doing a major upgrade to our build system next thursday
>>>>> morning.  here's a quick list of what's going on:
>>>>>
>>>>> * installation of anaconda python on all worker nodes
>>>>>
>>>>> * installation of pypy 2.5.1 (python 2.7) on all nodes
>>>>>
>>>>> * matching installation of python modules for the current system python
>>>>> (2.6), and anaconda python (2.6, 2.7 and 3.4)
>>>>>   - anaconda python 2.7 will be the default for all workers (this has
>>>>> stealthily been the case on amp-jenkins-worker-01 for the past two weeks,
>>>>> and i've noticed no test failures)
>>>>>   - you can now use anaconda environments to specify which version of
>>>>> python to use in your tests:  http://www.continuum.io/blog/conda
>>>>>
>>>>> * installation of new python 2.7 modules:  pymongo requests six pymongo
>>>>> requests six python-crontab
>>>>>
>>>>> * bare-bones mongodb installation on all workers
>>>>>
>>>>> * installation of java 1.6 and 1.8 internal to jenkins
>>>>>   - jobs will default to the system java, which is 1.7.0_75
>>>>>   - if you want to run your tests w/java 6 or 8, you can select the JDK
>>>>> version of your choice in the job configuration page (it'll be towards the
>>>>> top)
>>>>>
>>>>> these changes have actually all been tested against a variety of builds
>>>>> (yay staging!) and while i'm certain that i have all of the kinks worked
>>>>> out, i'm going to schedule a longer downtime so that i have a chance to
>>>>> identify and squash any problems that surface.
>>>>>
>>>>> thanks to josh rosen, k. shankari and davies liu for helping me test
>>>>> all of this and get it working.
>>>>>
>>>>> shane
>>>>>
>>>>
>>>>
>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Query regarding infering data types in pyspark

2015-04-15 Thread Davies Liu
It does not work right now; could you file a JIRA for it?

On Wed, Apr 15, 2015 at 9:29 AM, Suraj Shetiya  wrote:
> Thank you :)
>
> That worked. I had another query regarding date being used as filter.
>
> With the new df which has the column cast as date I am unable to apply a
> filter that compares the dates.
> The query I am using is :
> df.filter(df.Datecol > datetime.date(2015,1,1)).show()
>
> I do not want to use date as a string to compare them. Please suggest.
>
>
> On Tue, Apr 14, 2015 at 4:59 AM, Davies Liu  wrote:
>>
>> Hey Suraj,
>>
>> You should use "date" for DataType:
>>
>> df.withColumn(df.DateCol.cast("date"))
>>
>> Davies
>>
>> On Sat, Apr 11, 2015 at 10:57 PM, Suraj Shetiya 
>> wrote:
>> > Humble reminder
>> >
>> > On Sat, Apr 11, 2015 at 12:16 PM, Suraj Shetiya 
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> Below is one line from the json file.
>> >> I have highlighted the field that represents the date.
>> >>
>> >>
>> >>
>> >> "YEAR":2015,"QUARTER":1,"MONTH":1,"DAY_OF_MONTH":31,"DAY_OF_WEEK":6,"FL_DATE":"2015-01-31","UNIQUE_CARRIER":"NK","AI
>> >>
>> >> RLINE_ID":20416,"CARRIER":"NK","TAIL_NUM":"N614NK","FL_NUM":126,"ORIGIN_AIRPORT_ID":11697,"ORIGIN_AIRPORT_SEQ_ID":1169
>> >>
>> >> 703,"ORIGIN_CITY_MARKET_ID":32467,"ORIGIN":"FLL","ORIGIN_CITY_NAME":"Fort
>> >> Lauderdale, FL","ORIGIN_STATE_ABR":"FL","ORI
>> >>
>> >> GIN_STATE_FIPS":12,"ORIGIN_STATE_NM":"Florida","ORIGIN_WAC":33,"DEST_AIRPORT_ID":13577,"DEST_AIRPORT_SEQ_ID":1357702,"
>> >> DEST_CITY_MARKET_ID":31135,"DEST":"MYR","DEST_CITY_NAME":"Myrtle Beach,
>> >> SC","DEST_STATE_ABR":"SC","DEST_STATE_FIPS":45
>> >> ,"DEST_STATE_NM":"South
>> >>
>> >> Carolina","DEST_WAC":37,"CRS_DEP_TIME":2010,"DEP_TIME":2009.0,"DEP_DELAY":-1.0,"DEP_DELAY_NEW"
>> >>
>> >> :0.0,"DEP_DEL15":0.0,"DEP_DELAY_GROUP":-1.0,"DEP_TIME_BLK":"2000-2059","TAXI_OUT":17.0,"WHEELS_OFF":2026.0,"WHEELS_ON"
>> >>
>> >> :2147.0,"TAXI_IN":5.0,"CRS_ARR_TIME":2149,"ARR_TIME":2152.0,"ARR_DELAY":3.0,"ARR_DELAY_NEW":3.0,"ARR_DEL15":0.0,"ARR_DELAY_GROUP":0.0,"ARR_TIME_BLK":"2100-2159","Unnamed:
>> >> 47":null}
>> >>
>> >> Please let me know if you need access to the dataset.
>> >>
>> >> On Sat, Apr 11, 2015 at 11:56 AM, Davies Liu 
>> >> wrote:
>> >>>
>> >>> What's the format you have in json file?
>> >>>
>> >>> On Fri, Apr 10, 2015 at 6:57 PM, Suraj Shetiya
>> >>> 
>> >>> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > In pyspark when if I read a json file using sqlcontext I find that
>> >>> > the
>> >>> > date
>> >>> > field is not infered as date instead it is converted to string. And
>> >>> > when I
>> >>> > try to convert it to date using
>> >>> > df.withColumn(df.DateCol.cast("timestamp"))
>> >>> > it does not parse it successfuly and adds a null instead there.
>> >>> > Should
>> >>> > I
>> >>> > use UDF to convert the date ? Is this expected behaviour (not
>> >>> > throwing
>> >>> > an
>> >>> > error after failure to cast all fields)?
>> >>> >
>> >>> > --
>> >>> > Regards,
>> >>> > Suraj
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >> Suraj
>> >
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Suraj
>
>
>
>
> --
> Regards,
> Suraj

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [PySpark DataFrame] When a Row is not a Row

2015-05-12 Thread Davies Liu
The class (called Row) for rows from Spark SQL is created on the fly, is 
different from pyspark.sql.Row (is an public API to create Row by users).  

The reason we done it in this way is that we want to have better performance 
when accessing the columns. Basically, the rows are just named tuples (called 
`Row`).  
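
In practice that means rows returned from SQL should be treated like named tuples rather than compared against pyspark.sql.Row. A small sketch, using the `results` DataFrame from the quoted message (the field name "name" is hypothetical):

row = results.take(1)[0]
print row.name        # access by field name
print row[0]          # or by position, like a tuple
print row.asDict()    # or convert to a plain dict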

--  
Davies Liu
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Tuesday, May 12, 2015 at 4:49 AM, Nicholas Chammas wrote:

> This is really strange.
>  
> >>> # Spark 1.3.1
> >>> print type(results)
>
> >>> a = results.take(1)[0]
> >>> print type(a)
>
> >>> print pyspark.sql.types.Row
>
> >>> print type(a) == pyspark.sql.types.Row
> False
> >>> print isinstance(a, pyspark.sql.types.Row)
> False
>  
> If I set a as follows, then the type checks pass fine.
>  
> a = pyspark.sql.types.Row('name')('Nick')
>  
> Is this a bug? What can I do to narrow down the source?
>  
> results is a massive DataFrame of spark-perf results.
>  
> Nick
>  
>  




Re: [SparkR] is toDF() necessary

2015-05-17 Thread Davies Liu
toDF() was first introduced in Scala and Python (because
createDataFrame is too long to type); it is used in lots of places, so I
think it's useful.
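
For example, in PySpark (a quick sketch in the shell; the data and column names are made up):

rdd = sc.parallelize([(1, "a"), (2, "b")])

# the two lines below are equivalent, but toDF() is shorter to type
df1 = sqlContext.createDataFrame(rdd, ["id", "letter"])
df2 = rdd.toDF(["id", "letter"])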

On Fri, May 8, 2015 at 11:03 AM, Shivaram Venkataraman
 wrote:
> Agree that toDF is not very useful. In fact it was removed from the
> namespace in a recent change
> https://github.com/apache/spark/commit/4e930420c19ae7773b138dfc7db8fc03b4660251
>
> Thanks
> Shivaram
>
> On Fri, May 8, 2015 at 1:10 AM, Sun, Rui  wrote:
>
>> toDF() is defined to convert an RDD to a DataFrame. But it is just a very
>> thin wrapper of createDataFrame() by help the caller avoid input of
>> SQLContext.
>>
>> Since Scala/pySpark does not have toDF(), and we'd better keep API as
>> narrow and simple as possible. Is toDF() really necessary? Could we
>> eliminate it?
>>
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Tungsten's Vectorized Execution

2015-05-21 Thread Davies Liu
We have not started to prototype the vectorized one yet; it will be evaluated
in 1.5 and may be targeted for 1.6.

We're glad to hear feedback/suggestions/comments from your side!

On Thu, May 21, 2015 at 9:37 AM, Yijie Shen  wrote:
> Hi all,
>
> I’ve seen the Blog of Project Tungsten here, it sounds awesome to me!
>
> I’ve also noticed there is a plan to change the code generation from
> record-at-a-time evaluation to a vectorized one, which interests me most.
>
> What’s the status of vectorized evaluation?  Is this an inner effort of
> Databricks or welcome to be involved?
>
> Since I’ve done similar stuffs on Spark SQL, I would like to get involved if
> that’s possible.
>
>
> Yours,
>
> Yijie

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
There is a module called 'types' in python 3:

davies@localhost:~/work/spark$ python3
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import types
>>> types
<module 'types' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'>

Without renaming, our `types.py` will conflict with it when you run
unittests in pyspark/sql/ .

On Tue, May 26, 2015 at 11:57 AM, Justin Uang  wrote:
> In commit 04e44b37, the migration to Python 3, pyspark/sql/types.py was
> renamed to pyspark/sql/_types.py and then some magic in
> pyspark/sql/__init__.py dynamically renamed the module back to types. I
> imagine that this is some naming conflict with Python 3, but what was the
> error that showed up?
>
> The reason why I'm asking about this is because it's messing with pylint,
> since pylint cannot now statically find the module. I tried also importing
> the package so that __init__ would be run in a init-hook, but that isn't
> what the discovery mechanism is using. I imagine it's probably just crawling
> the directory structure.
>
> One way to work around this would be something akin to this
> (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
> where I would have to create a fake module, but I would probably be missing
> a ton of pylint features on users of that module, and it's pretty hacky.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
When you run the test in python/pyspark/sql/ by

bin/spark-submit python/pyspark/sql/dataframe.py

the current directory is the first item in sys.path, so sql/types.py
will take priority over the stdlib python3.4/types.py and the tests will
fail.
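
A quick way to see the shadowing (a sketch, assuming sql/types.py had not been renamed): add these lines to any script inside python/pyspark/sql/ and run it directly.

# e.g. run with:  python python/pyspark/sql/dataframe.py
import sys
import types

print(sys.path[0])      # .../python/pyspark/sql -- the script's directory comes first
print(types.__file__)   # .../python/pyspark/sql/types.py, not the stdlib types.py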

On Tue, May 26, 2015 at 12:08 PM, Justin Uang  wrote:
> Thanks for clarifying! I don't understand python package and modules names
> that well, but I thought that the package namespacing would've helped, since
> you are in pyspark.sql.types. I guess not?
>
> On Tue, May 26, 2015 at 3:03 PM Davies Liu  wrote:
>>
>> There is a module called 'types' in python 3:
>>
>> davies@localhost:~/work/spark$ python3
>> Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
>> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import types
>> >>> types
>> <module 'types' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'>
>>
>> Without renaming, our `types.py` will conflict with it when you run
>> unittests in pyspark/sql/ .
>>
>> On Tue, May 26, 2015 at 11:57 AM, Justin Uang 
>> wrote:
>> > In commit 04e44b37, the migration to Python 3, pyspark/sql/types.py was
>> > renamed to pyspark/sql/_types.py and then some magic in
>> > pyspark/sql/__init__.py dynamically renamed the module back to types. I
>> > imagine that this is some naming conflict with Python 3, but what was
>> > the
>> > error that showed up?
>> >
>> > The reason why I'm asking about this is because it's messing with
>> > pylint,
>> > since pylint cannot now statically find the module. I tried also
>> > importing
>> > the package so that __init__ would be run in a init-hook, but that isn't
>> > what the discovery mechanism is using. I imagine it's probably just
>> > crawling
>> > the directory structure.
>> >
>> > One way to work around this would be something akin to this
>> >
>> > (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
>> > where I would have to create a fake module, but I would probably be
>> > missing
>> > a ton of pylint features on users of that module, and it's pretty hacky.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
I think relative imports cannot help in this case.

When you run a script in pyspark/sql directly, Python doesn't know anything
about the pyspark.sql package; it just sees types.py as a separate,
top-level module.
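
For example (a sketch), putting a relative import at the top of one of those files and running it directly fails before any test code runs:

# hypothetical first line added to pyspark/sql/dataframe.py
from . import types

# $ python pyspark/sql/dataframe.py
# ValueError: Attempted relative import in non-package
# (the exact message differs between Python 2 and Python 3)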

On Tue, May 26, 2015 at 12:44 PM, Punyashloka Biswal
 wrote:
> Davies: Can we use relative imports (import .types) in the unit tests in
> order to disambiguate between the global and local module?
>
> Punya
>
> On Tue, May 26, 2015 at 3:09 PM Justin Uang  wrote:
>>
>> Thanks for clarifying! I don't understand python package and modules names
>> that well, but I thought that the package namespacing would've helped, since
>> you are in pyspark.sql.types. I guess not?
>>
>> On Tue, May 26, 2015 at 3:03 PM Davies Liu  wrote:
>>>
>>> There is a module called 'types' in python 3:
>>>
>>> davies@localhost:~/work/spark$ python3
>>> Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
>>> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> >>> import types
>>> >>> types
>>> <module 'types' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'>
>>>
>>> Without renaming, our `types.py` will conflict with it when you run
>>> unittests in pyspark/sql/ .
>>>
>>> On Tue, May 26, 2015 at 11:57 AM, Justin Uang 
>>> wrote:
>>> > In commit 04e44b37, the migration to Python 3, pyspark/sql/types.py was
>>> > renamed to pyspark/sql/_types.py and then some magic in
>>> > pyspark/sql/__init__.py dynamically renamed the module back to types. I
>>> > imagine that this is some naming conflict with Python 3, but what was
>>> > the
>>> > error that showed up?
>>> >
>>> > The reason why I'm asking about this is because it's messing with
>>> > pylint,
>>> > since pylint cannot now statically find the module. I tried also
>>> > importing
>>> > the package so that __init__ would be run in a init-hook, but that
>>> > isn't
>>> > what the discovery mechanism is using. I imagine it's probably just
>>> > crawling
>>> > the directory structure.
>>> >
>>> > One way to work around this would be something akin to this
>>> >
>>> > (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
>>> > where I would have to create a fake module, but I would probably be
>>> > missing
>>> > a ton of pylint features on users of that module, and it's pretty
>>> > hacky.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Python UDF performance at large scale

2015-06-23 Thread Davies Liu
Thanks for looking into it. I like the idea of having a
ForkingIterator. If we give it an unlimited buffer, then we will not have
the deadlock problem, I think. The writing thread will be blocked by the
Python process, so there should not be many rows buffered (though it is
still a potential cause of OOM). At least, this approach is better than
the current one.

Could you create a JIRA and send out the PR?
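
The unbounded-buffer idea is roughly what itertools.tee does in pure Python; the sketch below is only to illustrate the concept (the real change is on the JVM side, with the LinkedBlockingDeque you describe):

import itertools

rows = iter(range(10))   # stand-in for the child RDD's row iterator
to_python, passthrough = itertools.tee(rows)

# one copy would feed the Python worker, the other gets joined back with the
# original rows; whichever side runs ahead is buffered internally (unbounded)
assert list(to_python) == list(passthrough)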

On Tue, Jun 23, 2015 at 3:27 PM, Justin Uang  wrote:
> BLUF: BatchPythonEvaluation's implementation is unusable at large scale, but
> I have a proof-of-concept implementation that avoids caching the entire
> dataset.
>
> Hi,
>
> We have been running into performance problems using Python UDFs with
> DataFrames at large scale.
>
> From the implementation of BatchPythonEvaluation, it looks like the goal was
> to reuse the PythonRDD code. It caches the entire child RDD so that it can
> do two passes over the data. One to give to the PythonRDD, then one to join
> the python lambda results with the original row (which may have java objects
> that should be passed through).
>
> In addition, it caches all the columns, even the ones that don't need to be
> processed by the Python UDF. In the cases I was working with, I had a 500
> column table, and i wanted to use a python UDF for one column, and it ended
> up caching all 500 columns.
>
> I have a working solution over here that does it in one pass over the data,
> avoiding caching
> (https://github.com/justinuang/spark/commit/c1a415a18d31226ac580f1a9df7985571d03199b).
> With this patch, I go from a job that takes 20 minutes then OOMs, to a job
> that finishes completely in 3 minutes. It is indeed quite hacky and prone to
> deadlocks since there is buffering in many locations:
>
> - NEW: the ForkingIterator LinkedBlockingDeque
> - batching the rows before pickling them
> - os buffers on both sides
> - pyspark.serializers.BatchedSerializer
>
> We can avoid deadlock by being very disciplined. For example, we can have
> the ForkingIterator instead always do a check of whether the
> LinkedBlockingDeque is full and if so:
>
> Java
> - flush the java pickling buffer
> - send a flush command to the python process
> - os.flush the java side
>
> Python
> - flush BatchedSerializer
> - os.flush()
>
> I haven't added this yet. This is getting very complex however. Another
> model would just be to change the protocol between the java side and the
> worker to be a synchronous request/response. This has the disadvantage that
> the CPU isn't doing anything when the batch is being sent across, but it has
> the huge advantage of simplicity. In addition, I imagine that the actual IO
> between the processes isn't that slow, but rather the serialization of java
> objects into pickled bytes, and the deserialization/serialization + python
> loops on the python side. Another advantage is that we won't be taking more
> than 100% CPU since only one thread is doing CPU work at a time between the
> executor and the python interpreter.
>
> Any thoughts would be much appreciated =)
>
> Other improvements:
> - extract some code of the worker out of PythonRDD so that we can do a
> mapPartitions directly in BatchedPythonEvaluation without resorting to the
> hackery in ForkedRDD.compute(), which uses a cache to ensure that the other
> RDD can get a handle to the same iterator.
> - read elements and use a size estimator to create the BlockingQueue to
> make sure that we don't store too many things in memory when batching
> - patch Unpickler to not use StopException for control flow, which is
> slowing down the java side
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Python UDF performance at large scale

2015-06-23 Thread Davies Liu
Fair points, I also like simpler solutions.

The overhead of a Python task could be a few milliseconds, which
means we should also evaluate the UDFs in batches (one Python task per batch).

Decreasing the batch size for UDFs sounds reasonable to me, together
with other tricks to reduce the data sitting in the socket/pipe buffers.

BTW, what does your UDF look like? How about using Jython to run
simple Python UDFs (those without external libraries)?
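
For context, by "simple Python UDF" I mean something like the sketch below, a pure-Python function with no C-extension dependencies (the column names are made up, and `df` is assumed to be an existing DataFrame):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

to_fahrenheit = udf(lambda c: c * 9.0 / 5 + 32, DoubleType())
df2 = df.withColumn("temp_f", to_fahrenheit(df["temp_c"]))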

On Tue, Jun 23, 2015 at 8:21 PM, Justin Uang  wrote:
> // + punya
>
> Thanks for your quick response!
>
> I'm not sure that using an unbounded buffer is a good solution to the
> locking problem. For example, in the situation where I had 500 columns, I am
> in fact storing 499 extra columns on the java side, which might make me OOM
> if I have to store many rows. In addition, if I am using an
> AutoBatchedSerializer, the java side might have to write 1 << 16 == 65536
> rows before python starts outputting elements, in which case, the Java side
> has to buffer 65536 complete rows. In general it seems fragile to rely on
> blocking behavior in the Python coprocess. By contrast, it's very easy to
> verify the correctness and performance characteristics of the synchronous
> blocking solution.
>
>
> On Tue, Jun 23, 2015 at 7:21 PM Davies Liu  wrote:
>>
>> Thanks for looking into it, I'd like the idea of having
>> ForkingIterator. If we have unlimited buffer in it, then will not have
>> the problem of deadlock, I think. The writing thread will be blocked
>> by Python process, so there will be not much rows be buffered(still be
>> a reason to OOM). At least, this approach is better than current one.
>>
>> Could you create a JIRA and sending out the PR?
>>
>> On Tue, Jun 23, 2015 at 3:27 PM, Justin Uang 
>> wrote:
>> > BLUF: BatchPythonEvaluation's implementation is unusable at large scale,
>> > but
>> > I have a proof-of-concept implementation that avoids caching the entire
>> > dataset.
>> >
>> > Hi,
>> >
>> > We have been running into performance problems using Python UDFs with
>> > DataFrames at large scale.
>> >
>> > From the implementation of BatchPythonEvaluation, it looks like the goal
>> > was
>> > to reuse the PythonRDD code. It caches the entire child RDD so that it
>> > can
>> > do two passes over the data. One to give to the PythonRDD, then one to
>> > join
>> > the python lambda results with the original row (which may have java
>> > objects
>> > that should be passed through).
>> >
>> > In addition, it caches all the columns, even the ones that don't need to
>> > be
>> > processed by the Python UDF. In the cases I was working with, I had a
>> > 500
>> > column table, and i wanted to use a python UDF for one column, and it
>> > ended
>> > up caching all 500 columns.
>> >
>> > I have a working solution over here that does it in one pass over the
>> > data,
>> > avoiding caching
>> >
>> > (https://github.com/justinuang/spark/commit/c1a415a18d31226ac580f1a9df7985571d03199b).
>> > With this patch, I go from a job that takes 20 minutes then OOMs, to a
>> > job
>> > that finishes completely in 3 minutes. It is indeed quite hacky and
>> > prone to
>> > deadlocks since there is buffering in many locations:
>> >
>> > - NEW: the ForkingIterator LinkedBlockingDeque
>> > - batching the rows before pickling them
>> > - os buffers on both sides
>> > - pyspark.serializers.BatchedSerializer
>> >
>> > We can avoid deadlock by being very disciplined. For example, we can
>> > have
>> > the ForkingIterator instead always do a check of whether the
>> > LinkedBlockingDeque is full and if so:
>> >
>> > Java
>> > - flush the java pickling buffer
>> > - send a flush command to the python process
>> > - os.flush the java side
>> >
>> > Python
>> > - flush BatchedSerializer
>> > - os.flush()
>> >
>> > I haven't added this yet. This is getting very complex however. Another
>> > model would just be to change the protocol between the java side and the
>> > worker to be a synchronous request/response. This has the disadvantage
>> > that
>> > the CPU isn't doing anything when the batch is being sent across, but it
>> > has
>> > the huge advantage of simplicity. In addition, I imagine that the actual
>> > IO
>> > bet

Re: Python UDF performance at large scale

2015-06-24 Thread Davies Liu
From your comment, the 2x improvement only happens when you have a
batch size of 1, right?

On Wed, Jun 24, 2015 at 12:11 PM, Justin Uang  wrote:
> FYI, just submitted a PR to Pyrolite to remove their StopException.
> https://github.com/irmen/Pyrolite/pull/30
>
> With my benchmark, removing it basically made it about 2x faster.
>
> On Wed, Jun 24, 2015 at 8:33 AM Punyashloka Biswal 
> wrote:
>>
>> Hi Davies,
>>
>> In general, do we expect people to use CPython only for "heavyweight" UDFs
>> that invoke an external library? Are there any examples of using Jython,
>> especially performance comparisons to Java/Scala and CPython? When using
>> Jython, do you expect the driver to send code to the executor as a string,
>> or is there a good way to serialized Jython lambdas?
>>
>> (For context, I was unable to serialize Nashorn lambdas when I tried to
>> use them in Spark.)
>>
>> Punya
>> On Wed, Jun 24, 2015 at 2:26 AM Davies Liu  wrote:
>>>
>>> Fare points, I also like simpler solutions.
>>>
>>> The overhead of Python task could be a few of milliseconds, which
>>> means we also should eval them as batches (one Python task per batch).
>>>
>>> Decreasing the batch size for UDF sounds reasonable to me, together
>>> with other tricks to reduce the data in socket/pipe buffer.
>>>
>>> BTW, what do your UDF looks like? How about to use Jython to run
>>> simple Python UDF (without some external libraries).
>>>
>>> On Tue, Jun 23, 2015 at 8:21 PM, Justin Uang 
>>> wrote:
>>> > // + punya
>>> >
>>> > Thanks for your quick response!
>>> >
>>> > I'm not sure that using an unbounded buffer is a good solution to the
>>> > locking problem. For example, in the situation where I had 500 columns,
>>> > I am
>>> > in fact storing 499 extra columns on the java side, which might make me
>>> > OOM
>>> > if I have to store many rows. In addition, if I am using an
>>> > AutoBatchedSerializer, the java side might have to write 1 << 16 ==
>>> > 65536
>>> > rows before python starts outputting elements, in which case, the Java
>>> > side
>>> > has to buffer 65536 complete rows. In general it seems fragile to rely
>>> > on
>>> > blocking behavior in the Python coprocess. By contrast, it's very easy
>>> > to
>>> > verify the correctness and performance characteristics of the
>>> > synchronous
>>> > blocking solution.
>>> >
>>> >
>>> > On Tue, Jun 23, 2015 at 7:21 PM Davies Liu 
>>> > wrote:
>>> >>
>>> >> Thanks for looking into it, I'd like the idea of having
>>> >> ForkingIterator. If we have unlimited buffer in it, then will not have
>>> >> the problem of deadlock, I think. The writing thread will be blocked
>>> >> by Python process, so there will be not much rows be buffered(still be
>>> >> a reason to OOM). At least, this approach is better than current one.
>>> >>
>>> >> Could you create a JIRA and sending out the PR?
>>> >>
>>> >> On Tue, Jun 23, 2015 at 3:27 PM, Justin Uang 
>>> >> wrote:
>>> >> > BLUF: BatchPythonEvaluation's implementation is unusable at large
>>> >> > scale,
>>> >> > but
>>> >> > I have a proof-of-concept implementation that avoids caching the
>>> >> > entire
>>> >> > dataset.
>>> >> >
>>> >> > Hi,
>>> >> >
>>> >> > We have been running into performance problems using Python UDFs
>>> >> > with
>>> >> > DataFrames at large scale.
>>> >> >
>>> >> > From the implementation of BatchPythonEvaluation, it looks like the
>>> >> > goal
>>> >> > was
>>> >> > to reuse the PythonRDD code. It caches the entire child RDD so that
>>> >> > it
>>> >> > can
>>> >> > do two passes over the data. One to give to the PythonRDD, then one
>>> >> > to
>>> >> > join
>>> >> > the python lambda results with the original row (which may have java
>>> >> > objects
>>> >> > that should be passed through).
>>> >> >
>>> >> > In addition,

Re: Python UDF performance at large scale

2015-06-25 Thread Davies Liu
I'm thinking that the batched synchronous version will be too slow
(with a small batch size) or too easy to OOM (with a large batch size). If
it's not that hard, you can give it a try.

On Wed, Jun 24, 2015 at 4:39 PM, Justin Uang  wrote:
> Correct, I was running with a batch size of about 100 when I did the tests,
> because I was worried about deadlocks. Do you have any concerns regarding
> the batched synchronous version of communication between the Java and Python
> processes, and if not, should I file a ticket and starting writing it?
>
> On Wed, Jun 24, 2015 at 7:27 PM Davies Liu  wrote:
>>
>> From you comment, the 2x improvement only happens when you have the
>> batch size as 1, right?
>>
>> On Wed, Jun 24, 2015 at 12:11 PM, Justin Uang 
>> wrote:
>> > FYI, just submitted a PR to Pyrolite to remove their StopException.
>> > https://github.com/irmen/Pyrolite/pull/30
>> >
>> > With my benchmark, removing it basically made it about 2x faster.
>> >
>> > On Wed, Jun 24, 2015 at 8:33 AM Punyashloka Biswal
>> > 
>> > wrote:
>> >>
>> >> Hi Davies,
>> >>
>> >> In general, do we expect people to use CPython only for "heavyweight"
>> >> UDFs
>> >> that invoke an external library? Are there any examples of using
>> >> Jython,
>> >> especially performance comparisons to Java/Scala and CPython? When
>> >> using
>> >> Jython, do you expect the driver to send code to the executor as a
>> >> string,
>> >> or is there a good way to serialized Jython lambdas?
>> >>
>> >> (For context, I was unable to serialize Nashorn lambdas when I tried to
>> >> use them in Spark.)
>> >>
>> >> Punya
>> >> On Wed, Jun 24, 2015 at 2:26 AM Davies Liu 
>> >> wrote:
>> >>>
>> >>> Fare points, I also like simpler solutions.
>> >>>
>> >>> The overhead of Python task could be a few of milliseconds, which
>> >>> means we also should eval them as batches (one Python task per batch).
>> >>>
>> >>> Decreasing the batch size for UDF sounds reasonable to me, together
>> >>> with other tricks to reduce the data in socket/pipe buffer.
>> >>>
>> >>> BTW, what do your UDF looks like? How about to use Jython to run
>> >>> simple Python UDF (without some external libraries).
>> >>>
>> >>> On Tue, Jun 23, 2015 at 8:21 PM, Justin Uang 
>> >>> wrote:
>> >>> > // + punya
>> >>> >
>> >>> > Thanks for your quick response!
>> >>> >
>> >>> > I'm not sure that using an unbounded buffer is a good solution to
>> >>> > the
>> >>> > locking problem. For example, in the situation where I had 500
>> >>> > columns,
>> >>> > I am
>> >>> > in fact storing 499 extra columns on the java side, which might make
>> >>> > me
>> >>> > OOM
>> >>> > if I have to store many rows. In addition, if I am using an
>> >>> > AutoBatchedSerializer, the java side might have to write 1 << 16 ==
>> >>> > 65536
>> >>> > rows before python starts outputting elements, in which case, the
>> >>> > Java
>> >>> > side
>> >>> > has to buffer 65536 complete rows. In general it seems fragile to
>> >>> > rely
>> >>> > on
>> >>> > blocking behavior in the Python coprocess. By contrast, it's very
>> >>> > easy
>> >>> > to
>> >>> > verify the correctness and performance characteristics of the
>> >>> > synchronous
>> >>> > blocking solution.
>> >>> >
>> >>> >
>> >>> > On Tue, Jun 23, 2015 at 7:21 PM Davies Liu 
>> >>> > wrote:
>> >>> >>
>> >>> >> Thanks for looking into it, I'd like the idea of having
>> >>> >> ForkingIterator. If we have unlimited buffer in it, then will not
>> >>> >> have
>> >>> >> the problem of deadlock, I think. The writing thread will be
>> >>> >> blocked
>> >>> >> by Python process, so there will be not much rows be buffered(still
>> >>> >> be
>> >>> >> a reason t

Re: [PySpark DataFrame] When a Row is not a Row

2015-07-12 Thread Davies Liu
We finally fixed this in 1.5 (the next release); see
https://github.com/apache/spark/pull/7301

On Sat, Jul 11, 2015 at 10:32 PM, Jerry Lam  wrote:
> Hi guys,
>
> I just hit the same problem. It is very confusing when Row is not the same
> Row type at runtime. The worst thing is that when I use Spark in local mode,
> the Row is the same Row type! so it passes the test cases but it fails when
> I deploy the application.
>
> Can someone suggest a workaround?
>
> Best Regards,
>
> Jerry
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-DataFrame-When-a-Row-is-not-a-Row-tp12210p13153.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: pyspark.sql.tests: is test_time_with_timezone a flaky test?

2015-07-12 Thread Davies Liu
Thanks for reporting this, I'm working on it. It turned out to be
a bug when run with Python 3.4; I will send out a fix soon.

On Sun, Jul 12, 2015 at 1:33 PM, Cheolsoo Park  wrote:
> Hi devs,
>
> For some reason, I keep getting this test failure (3 out of 4 builds) in my
> PR-
>
> ==
> FAIL: test_time_with_timezone (__main__.SQLTests)
> --
> Traceback (most recent call last):
>   File
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests.py",
> line 718, in test_time_with_timezone
> self.assertEqual(now, now1)
> AssertionError: datetime.datetime(2015, 7, 12, 13, 18, 46, 504366) !=
> datetime.datetime(2015, 7, 12, 13, 18, 46, 504365)
>
> Jenkins builds-
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37100/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37092/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37081/console
>
> I am aware that there was a hot fix for this test case, and I already have
> it in the commit log-
>
> commit 05ac023dc8d9004a27c2f06ee875b0ff3743ccdd
>
> Author: Davies Liu 
> Date:   Fri Jul 10 13:05:23 2015 -0700
> [HOTFIX] fix flaky test in PySpark SQL
>
> I looked at the test code, and it seems that precision in microseconds is
> lost somewhere in a round trip from Python to DataFrame. Can someone please
> help me debug this error?
>
> Thanks!
> Cheolsoo
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: pyspark.sql.tests: is test_time_with_timezone a flaky test?

2015-07-12 Thread Davies Liu
Will be fixed by https://github.com/apache/spark/pull/7363

On Sun, Jul 12, 2015 at 7:45 PM, Davies Liu  wrote:
> Thanks for reporting this, I'm working on it. It turned out that it's
> a bug in when run with Python3.4, will sending out a fix soon.
>
> On Sun, Jul 12, 2015 at 1:33 PM, Cheolsoo Park  wrote:
>> Hi devs,
>>
>> For some reason, I keep getting this test failure (3 out of 4 builds) in my
>> PR-
>>
>> ==
>> FAIL: test_time_with_timezone (__main__.SQLTests)
>> --
>> Traceback (most recent call last):
>>   File
>> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests.py",
>> line 718, in test_time_with_timezone
>> self.assertEqual(now, now1)
>> AssertionError: datetime.datetime(2015, 7, 12, 13, 18, 46, 504366) !=
>> datetime.datetime(2015, 7, 12, 13, 18, 46, 504365)
>>
>> Jenkins builds-
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37100/console
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37092/console
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37081/console
>>
>> I am aware that there was a hot fix for this test case, and I already have
>> it in the commit log-
>>
>> commit 05ac023dc8d9004a27c2f06ee875b0ff3743ccdd
>>
>> Author: Davies Liu 
>> Date:   Fri Jul 10 13:05:23 2015 -0700
>> [HOTFIX] fix flaky test in PySpark SQL
>>
>> I looked at the test code, and it seems that precision in microseconds is
>> lost somewhere in a round trip from Python to DataFrame. Can someone please
>> help me debug this error?
>>
>> Thanks!
>> Cheolsoo
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: PySpark GroupByKey implementation question

2015-07-15 Thread Davies Liu
The map-side combine is not that necessary for reducing shuffle size: it cannot
shrink the shuffled data by much (the only saving is the key that would otherwise
be serialized with each value), but it can reduce the number of key-value pairs
and potentially the number of operations later (repartition and group-by).
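
To make the trade-off concrete, here is a minimal PySpark sketch with toy data in
local mode (illustrative only; the real groupByKey path also handles spilling):

```
# A minimal sketch of the trade-off: shuffling raw pairs vs. locally combined groups.
from pyspark import SparkContext

sc = SparkContext("local[2]", "groupby-sketch")
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3), ("a", 4)], 2)

# Without a local combine, every (key, value) pair crosses the shuffle.
grouped = pairs.groupByKey()

# With a local combine, at most one (key, values) pair per key per partition
# crosses the shuffle, at the cost of buffering the values in memory first.
combined = pairs.combineByKey(lambda v: [v],
                              lambda acc, v: acc + [v],
                              lambda a, b: a + b)

print(sorted((k, sorted(vs)) for k, vs in grouped.collect()))
print(sorted((k, sorted(vs)) for k, vs in combined.collect()))
sc.stop()
```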

On Tue, Jul 14, 2015 at 7:11 PM, Matt Cheah  wrote:
> Hi everyone,
>
> I was examining the Pyspark implementation of groupByKey in rdd.py. I would
> like to submit a patch improving Scala RDD’s groupByKey that has a similar
> robustness against large groups, as Pyspark’s implementation has logic to
> spill part of a single group to disk along the way.
>
> Its implementation appears to do the following:
>
> Combine and group-by-key per partition locally, potentially spilling
> individual groups to disk
> Shuffle the data explicitly using partitionBy
> After the shuffle, do another local groupByKey to get the final result,
> again potentially spilling individual groups to disk
>
> My question is: what does the explicit map-side-combine step (#1)
> specifically benefit here? I was under the impression that map-side-combine
> for groupByKey was not optimal and is turned off in the Scala implementation
> – Scala PairRDDFunctions.groupByKey calls to combineByKey with
> map-side-combine set to false. Is it something specific to how Pyspark can
> potentially spill the individual groups to disk?
>
> Thanks,
>
> -Matt Cheah
>
> P.S. Relevant Links:
>
> https://issues.apache.org/jira/browse/SPARK-3074
> https://github.com/apache/spark/pull/1977
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: PySpark GroupByKey implementation question

2015-07-15 Thread Davies Liu
I think we should start without map-side-combine for Scala, because it's
easier to OOM in the JVM than in Python (we don't have a hard limit in
Python yet).
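
For reference, the Python side today only has a soft limit; a sketch of it is
below (this assumes the spark.python.worker.memory setting, the threshold at
which PySpark aggregation spills to disk rather than failing the worker):

```
# A sketch of the soft limit on the Python side: spark.python.worker.memory is
# the spill threshold for PySpark aggregation, not a hard cap on the process.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("python-worker-memory-sketch")
        .set("spark.python.worker.memory", "512m"))
sc = SparkContext(conf=conf)
```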

On Wed, Jul 15, 2015 at 9:52 AM, Matt Cheah  wrote:
> Should we actually enable map-side-combine for groupByKey in Scala RDD as
> well, then? If we implement external-group-by should we implement it with
> the map-side-combine semantics that Pyspark does?

> -Matt Cheah
>
> On 7/15/15, 8:21 AM, "Davies Liu"  wrote:
>
>>If the map-side-combine is not that necessary, given the fact that it
>>cannot
>>reduce the size of data for shuffling much (do need to serialized the key
>>for
>>each value), but can reduce the number of key-value pairs, and potential
>>reduce
>>the number of operations later (repartition and groupby).
>>
>>On Tue, Jul 14, 2015 at 7:11 PM, Matt Cheah  wrote:
>>> Hi everyone,
>>>
>>> I was examining the Pyspark implementation of groupByKey in rdd.py. I
>>>would
>>> like to submit a patch improving Scala RDD¹s groupByKey that has a
>>>similar
>>> robustness against large groups, as Pyspark¹s implementation has logic
>>>to
>>> spill part of a single group to disk along the way.
>>>
>>> Its implementation appears to do the following:
>>>
>>> Combine and group-by-key per partition locally, potentially spilling
>>> individual groups to disk
>>> Shuffle the data explicitly using partitionBy
>>> After the shuffle, do another local groupByKey to get the final result,
>>> again potentially spilling individual groups to disk
>>>
>>> My question is: what does the explicit map-side-combine step (#1)
>>> specifically benefit here? I was under the impression that
>>>map-side-combine
>>> for groupByKey was not optimal and is turned off in the Scala
>>>implementation
>>> ­ Scala PairRDDFunctions.groupByKey calls to combineByKey with
>>> map-side-combine set to false. Is it something specific to how Pyspark
>>>can
>>> potentially spill the individual groups to disk?
>>>
>>> Thanks,
>>>
>>> -Matt Cheah
>>>
>>> P.S. Relevant Links:
>>>
>>>
>>>https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_ji
>>>ra_browse_SPARK-2D3074&d=BQIFaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oO
>>>nmz8&r=hzwIMNQ9E99EMYGuqHI0kXhVbvX3nU3OSDadUnJxjAs&m=J7Ee0rvGlgwL83LVX3ZI
>>>fTWTdAjACOGi3ozEffRaiBo&s=Onqi4oR_J4X2tV5u5NLiSnGdt31rRhHtD8R4KjBCQ9g&e=
>>>
>>>https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_sp
>>>ark_pull_1977&d=BQIFaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=hz
>>>wIMNQ9E99EMYGuqHI0kXhVbvX3nU3OSDadUnJxjAs&m=J7Ee0rvGlgwL83LVX3ZIfTWTdAjAC
>>>OGi3ozEffRaiBo&s=weq4Epxezp-hx8AdFlbd4dWSqllNppF5HNhJC1KhTCI&e=
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: PySpark on PyPi

2015-08-06 Thread Davies Liu
We could do that after 1.5 is released; it will have the same release cycle
as Spark in the future.

On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
 wrote:
> +1 (once again :) )
>
> 2015-07-28 14:51 GMT+02:00 Justin Uang :
>>
>> // ping
>>
>> do we have any signoff from the pyspark devs to submit a PR to publish to
>> PyPI?
>>
>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman 
>> wrote:
>>>
>>> Hey all, great discussion, just wanted to +1 that I see a lot of value in
>>> steps that make it easier to use PySpark as an ordinary python library.
>>>
>>> You might want to check out this (https://github.com/minrk/findspark),
>>> started by Jupyter project devs, that offers one way to facilitate this
>>> stuff. I’ve also cced them here to join the conversation.
>>>
>>> Also, @Jey, I can also confirm that at least in some scenarios (I’ve done
>>> it in an EC2 cluster in standalone mode) it’s possible to run PySpark jobs
>>> just using `from pyspark import SparkContext; sc = SparkContext(master=“X”)`
>>> so long as the environmental variables (PYTHONPATH and PYSPARK_PYTHON) are
>>> set correctly on *both* workers and driver. That said, there’s definitely
>>> additional configuration / functionality that would require going through
>>> the proper submit scripts.
>>>
>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal 
>>> wrote:
>>>
>>> I agree with everything Justin just said. An additional advantage of
>>> publishing PySpark's Python code in a standards-compliant way is the fact
>>> that we'll be able to declare transitive dependencies (Pandas, Py4J) in a
>>> way that pip can use. Contrast this with the current situation, where
>>> df.toPandas() exists in the Spark API but doesn't actually work until you
>>> install Pandas.
>>>
>>> Punya
>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang 
>>> wrote:

 // + Davies for his comments
 // + Punya for SA

 For development and CI, like Olivier mentioned, I think it would be
 hugely beneficial to publish pyspark (only code in the python/ dir) on 
 PyPI.
 If anyone wants to develop against PySpark APIs, they need to download the
 distribution and do a lot of PYTHONPATH munging for all the tools (pylint,
 pytest, IDE code completion). Right now that involves adding python/ and
 python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more
 dependencies, we would have to manually mirror all the PYTHONPATH munging 
 in
 the ./pyspark script. With a proper pyspark setup.py which declares its
 dependencies, and a published distribution, depending on pyspark will just
 be adding pyspark to my setup.py dependencies.

 Of course, if we actually want to run parts of pyspark that is backed by
 Py4J calls, then we need the full spark distribution with either ./pyspark
 or ./spark-submit, but for things like linting and development, the
 PYTHONPATH munging is very annoying.

 I don't think the version-mismatch issues are a compelling reason to not
 go ahead with PyPI publishing. At runtime, we should definitely enforce 
 that
 the version has to be exact, which means there is no backcompat nightmare 
 as
 suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267.
 This would mean that even if the user got his pip installed pyspark to
 somehow get loaded before the spark distribution provided pyspark, then the
 user would be alerted immediately.

 Davies, if you buy this, should me or someone on my team pick up
 https://issues.apache.org/jira/browse/SPARK-1267 and
 https://github.com/apache/spark/pull/464?

 On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
  wrote:
>
> Ok, I get it. Now what can we do to improve the current situation,
> because right now if I want to set-up a CI env for PySpark, I have to :
> 1- download a pre-built version of pyspark and unzip it somewhere on
> every agent
> 2- define the SPARK_HOME env
> 3- symlink this distribution pyspark dir inside the python install dir
> site-packages/ directory
> and if I rely on additional packages (like databricks' Spark-CSV
> project), I have to (except if I'm mistaken)
> 4- compile/assembly spark-csv, deploy the jar in a specific directory
> on every agent
> 5- add this jar-filled directory to the Spark distribution's additional
> classpath using the conf/spark-default file
>
> Then finally we can launch our unit/integration-tests.
> Some issues are related to spark-packages, some to the lack of
> python-based dependency, and some to the way SparkContext are launched 
> when
> using pyspark.
> I think step 1 and 2 are fair enough
> 4 and 5 may already have solutions, I didn't check and considering
> spark-shell is downloading such dependencies automatically, I think if
> nothing's done yet it will (I guess ?).
>
> For step 3, maybe just adding a setup.py
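
A minimal, hypothetical sketch of the kind of setup.py being discussed in this
thread (the name, version, package layout, and py4j pin are all assumptions, not
the eventual published package):

```
# Hypothetical sketch only: how a pip-installable pyspark might declare itself.
# The version would have to match the Spark distribution exactly, as discussed above.
from setuptools import setup, find_packages

setup(
    name="pyspark",
    version="1.5.0",                      # placeholder
    package_dir={"": "python"},
    packages=find_packages(where="python"),
    install_requires=["py4j==0.8.2.1"],   # declared so pip can resolve it for users
)
```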

Re: PySpark on PyPi

2015-08-10 Thread Davies Liu
I think so; any contributions on this are welcome.

On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger  wrote:
> Sorry, trying to follow the context here. Does it look like there is
> support for the idea of creating a setup.py file and pypi package for
> pyspark?
>
> Cheers,
>
> Brian
>
> On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu  wrote:
>> We could do that after 1.5 released, it will have same release cycle
>> as Spark in the future.
>>
>> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
>>  wrote:
>>> +1 (once again :) )
>>>
>>> 2015-07-28 14:51 GMT+02:00 Justin Uang :
>>>>
>>>> // ping
>>>>
>>>> do we have any signoff from the pyspark devs to submit a PR to publish to
>>>> PyPI?
>>>>
>>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman 
>>>> wrote:
>>>>>
>>>>> Hey all, great discussion, just wanted to +1 that I see a lot of value in
>>>>> steps that make it easier to use PySpark as an ordinary python library.
>>>>>
>>>>> You might want to check out this (https://github.com/minrk/findspark),
>>>>> started by Jupyter project devs, that offers one way to facilitate this
>>>>> stuff. I’ve also cced them here to join the conversation.
>>>>>
>>>>> Also, @Jey, I can also confirm that at least in some scenarios (I’ve done
>>>>> it in an EC2 cluster in standalone mode) it’s possible to run PySpark jobs
>>>>> just using `from pyspark import SparkContext; sc = 
>>>>> SparkContext(master=“X”)`
>>>>> so long as the environmental variables (PYTHONPATH and PYSPARK_PYTHON) are
>>>>> set correctly on *both* workers and driver. That said, there’s definitely
>>>>> additional configuration / functionality that would require going through
>>>>> the proper submit scripts.
>>>>>
>>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal 
>>>>> wrote:
>>>>>
>>>>> I agree with everything Justin just said. An additional advantage of
>>>>> publishing PySpark's Python code in a standards-compliant way is the fact
>>>>> that we'll be able to declare transitive dependencies (Pandas, Py4J) in a
>>>>> way that pip can use. Contrast this with the current situation, where
>>>>> df.toPandas() exists in the Spark API but doesn't actually work until you
>>>>> install Pandas.
>>>>>
>>>>> Punya
>>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang 
>>>>> wrote:
>>>>>>
>>>>>> // + Davies for his comments
>>>>>> // + Punya for SA
>>>>>>
>>>>>> For development and CI, like Olivier mentioned, I think it would be
>>>>>> hugely beneficial to publish pyspark (only code in the python/ dir) on 
>>>>>> PyPI.
>>>>>> If anyone wants to develop against PySpark APIs, they need to download 
>>>>>> the
>>>>>> distribution and do a lot of PYTHONPATH munging for all the tools 
>>>>>> (pylint,
>>>>>> pytest, IDE code completion). Right now that involves adding python/ and
>>>>>> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more
>>>>>> dependencies, we would have to manually mirror all the PYTHONPATH 
>>>>>> munging in
>>>>>> the ./pyspark script. With a proper pyspark setup.py which declares its
>>>>>> dependencies, and a published distribution, depending on pyspark will 
>>>>>> just
>>>>>> be adding pyspark to my setup.py dependencies.
>>>>>>
>>>>>> Of course, if we actually want to run parts of pyspark that is backed by
>>>>>> Py4J calls, then we need the full spark distribution with either 
>>>>>> ./pyspark
>>>>>> or ./spark-submit, but for things like linting and development, the
>>>>>> PYTHONPATH munging is very annoying.
>>>>>>
>>>>>> I don't think the version-mismatch issues are a compelling reason to not
>>>>>> go ahead with PyPI publishing. At runtime, we should definitely enforce 
>>>>>> that
>>>>>> the version has to be exact, which means there is no backcompat 
>>>>>> nightmare as
>>>>>> suggested by Davies in https://issues.a

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread Davies Liu
+1. Built 1.5 from source and ran TPC-DS locally and on clusters, and ran
performance benchmarks for aggregation and join at different scales; all
worked well.

On Thu, Sep 3, 2015 at 10:05 AM, Michael Armbrust
 wrote:
> +1 Ran TPC-DS and ported several jobs over to 1.5
>
> On Thu, Sep 3, 2015 at 9:57 AM, Burak Yavuz  wrote:
>>
>> +1. Tested complex R package support (Scala + R code), BLAS and DataFrame
>> fixes good.
>>
>> Burak
>>
>> On Thu, Sep 3, 2015 at 8:56 AM, mkhaitman 
>> wrote:
>>>
>>> Built and tested on CentOS 7, Hadoop 2.7.1 (Built for 2.6 profile),
>>> Standalone without any problems. Re-tested dynamic allocation
>>> specifically.
>>>
>>> "Lost executor" messages are still an annoyance since they're expected to
>>> occur with dynamic allocation, and shouldn't WARN/ERROR as they do now,
>>> however there's already a JIRA ticket for it:
>>> https://issues.apache.org/jira/browse/SPARK-4134 . Will probably have to
>>> filter these messages out in log4j properties for this release!
>>>
>>> Mark.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC3-tp13928p13948.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-04 Thread Davies Liu
Could you update the notebook to use the built-in SQL functions month and year
instead of Python UDFs? (They were introduced in 1.5.)

Once those two UDFs are removed, it runs successfully, and is also much faster.
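
For reference, a hedged sketch of the suggestion, assuming the sqlContext of the
pyspark shell; the orders DataFrame and its column names below are made-up
stand-ins for the notebook's data:

```
# Derive Year/Month with the built-in SQL functions added in 1.5 instead of
# Python UDFs. The DataFrame and column names are illustrative assumptions.
from pyspark.sql import functions as F

orders = sqlContext.createDataFrame(
    [("ALFKI", "2015-07-04", 48.0), ("ANATR", "2015-08-16", 88.0)],
    ["CustomerID", "OrderDate", "Total"])

orders_3 = (orders
            .withColumn("Year", F.year(F.col("OrderDate").cast("date")))
            .withColumn("Month", F.month(F.col("OrderDate").cast("date"))))
orders_3.groupBy("Year", "Month").sum("Total").show()
```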

On Fri, Sep 4, 2015 at 2:22 PM, Krishna Sankar  wrote:
> Yin,
>It is the
> https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
> Cheers
> 
>
> On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai  wrote:
>>
>> Hi Krishna,
>>
>> Can you share your code to reproduce the memory allocation issue?
>>
>> Thanks,
>>
>> Yin
>>
>> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar 
>> wrote:
>>>
>>> Thanks Tom.  Interestingly it happened between RC2 and RC3.
>>> Now my vote is +1/2 unless the memory error is known and has a
>>> workaround.
>>>
>>> Cheers
>>> 
>>>
>>>
>>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves  wrote:

 The upper/lower case thing is known.
 https://issues.apache.org/jira/browse/SPARK-9550
 I assume it was decided to be ok and its going to be in the release
 notes  but Reynold or Josh can probably speak to it more.

 Tom



 On Thursday, September 3, 2015 10:21 PM, Krishna Sankar
  wrote:


 +?

 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
 2. Tested pyspark, mllib
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Laso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word
 count)
 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. Scala - MLlib
 3.1. statistics (min,max,mean,Pearson,Spearman) OK
 3.2. LinearRegressionWithSGD OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
 3.6. saveAsParquetFile OK
 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
 registerTempTable, sql OK
 3.8. result = sqlContext.sql("SELECT
 OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
 JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
 4.0. Spark SQL from Python OK
 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
 OK
 5.0. Packages
 5.1. com.databricks.spark.csv - read/write OK
 (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
 com.databricks:spark-csv_2.11:1.2.0 worked)
 6.0. DataFrames
 6.1. cast,dtypes OK
 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
 6.3. All joins,sql,set operations,udf OK

 Two Problems:

 1. The synthetic column names are lowercase ( i.e. now
 ‘sum(OrderPrice)’; previously ‘SUM(OrderPrice)’, now ‘avg(Total)’;
 previously 'AVG(Total)'). So programs that depend on the case of the
 synthetic column names would fail.
 2. orders_3.groupBy("Year","Month").sum('Total').show()
 fails with the error ‘java.io.IOException: Unable to acquire 4194304
 bytes of memory’
 orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails
 with the same error
 Is this a known bug ?
 Cheers
 
 P.S: Sorry for the spam, forgot Reply All

 On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin  wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes 
 if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.5.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/


 The tag to be voted on is v1.5.0-rc3:

 https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release (published as 1.5.0-rc3) can be
 found at:
 https://repository.apache.org/content/repositories/orgapachespark-1143/

 The staging repository for this release (published as 1.5.0) can be
 found at:
 https://repository.apache.org/content/repositories/orgapachespark-1142/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/


 ===
 How can I help test this release?
 ===
 If you are

Re: Pyspark DataFrame TypeError

2015-09-08 Thread Davies Liu
I tried with Python 2.7/3.4 and Spark 1.4.1/1.5-RC3; they all work as expected:

```
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([(1.0, Vectors.dense([1.0])), (0.0, 
>>> Vectors.sparse(1, [], []))], ["label", "featuers"])
>>> df.show()
+-----+---------+
|label| featuers|
+-----+---------+
|  1.0|    [1.0]|
|  0.0|(1,[],[])|
+-----+---------+

>>> df.columns
['label', 'featuers']
```

On Tue, Sep 8, 2015 at 1:45 AM, Prabeesh K.  wrote:
> I am trying to run the code RandomForestClassifier example in the PySpark
> 1.4.1 documentation,
> https://spark.apache.org/docs/1.4.1/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier.
>
> Below is screen shot of ipython notebook
>
>
>
> But for df.columns. It shows following error.
>
>
> TypeError Traceback (most recent call last)
>  in ()
> > 1 df.columns
>
> /home/datasci/src/spark/python/pyspark/sql/dataframe.pyc in columns(self)
> 484 ['age', 'name']
> 485 """
> --> 486 return [f.name for f in self.schema.fields]
> 487
> 488 @ignore_unicode_prefix
>
> /home/datasci/src/spark/python/pyspark/sql/dataframe.pyc in schema(self)
> 194 """
> 195 if self._schema is None:
> --> 196 self._schema =
> _parse_datatype_json_string(self._jdf.schema().json())
> 197 return self._schema
> 198
>
> /home/datasci/src/spark/python/pyspark/sql/types.pyc in
> _parse_datatype_json_string(json_string)
> 519 >>> check_datatype(structtype_with_udt)
> 520 """
> --> 521 return _parse_datatype_json_value(json.loads(json_string))
> 522
> 523
>
> /home/datasci/src/spark/python/pyspark/sql/types.pyc in
> _parse_datatype_json_value(json_value)
> 539 tpe = json_value["type"]
> 540 if tpe in _all_complex_types:
> --> 541 return _all_complex_types[tpe].fromJson(json_value)
> 542 elif tpe == 'udt':
> 543 return UserDefinedType.fromJson(json_value)
>
> /home/datasci/src/spark/python/pyspark/sql/types.pyc in fromJson(cls, json)
> 386 @classmethod
> 387 def fromJson(cls, json):
> --> 388 return StructType([StructField.fromJson(f) for f in
> json["fields"]])
> 389
> 390
>
> /home/datasci/src/spark/python/pyspark/sql/types.pyc in fromJson(cls, json)
> 347 def fromJson(cls, json):
> 348 return StructField(json["name"],
> --> 349_parse_datatype_json_value(json["type"]),
> 350json["nullable"],
> 351json["metadata"])
>
> /home/datasci/src/spark/python/pyspark/sql/types.pyc in
> _parse_datatype_json_value(json_value)
> 541 return _all_complex_types[tpe].fromJson(json_value)
> 542 elif tpe == 'udt':
> --> 543 return UserDefinedType.fromJson(json_value)
> 544 else:
> 545 raise ValueError("not supported type: %s" % tpe)
>
> /home/datasci/src/spark/python/pyspark/sql/types.pyc in fromJson(cls, json)
> 453 pyModule = pyUDT[:split]
> 454 pyClass = pyUDT[split+1:]
> --> 455 m = __import__(pyModule, globals(), locals(), [pyClass])
> 456 UDT = getattr(m, pyClass)
> 457 return UDT()
>
> TypeError: Item in ``from list'' not a string
>
>
>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: pyspark streaming DStream compute

2015-09-15 Thread Davies Liu
On Tue, Sep 15, 2015 at 1:46 PM, Renyi Xiong  wrote:
> Can anybody help understand why pyspark streaming uses py4j callback to
> execute python code while pyspark batch uses worker.py?

There are two kinds of callbacks in PySpark Streaming:
1) one operates on RDDs: it takes an RDD and returns a new RDD, and it uses the
py4j callback, because the SparkContext and RDDs are not accessible in worker.py
2) one operates on the records of an RDD: it takes a record and returns new
records, and it runs in worker.py
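
A minimal sketch of the two paths (the socket source and the lambdas are
arbitrary examples):

```
# `transform` takes an RDD-to-RDD function, invoked on the driver through the
# Py4J callback server; `map` takes a per-record function that is shipped to
# the executors and run in worker.py.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-callback-sketch")
ssc = StreamingContext(sc, 1)                        # 1-second batches
lines = ssc.socketTextStream("localhost", 9999)

dedup = lines.transform(lambda rdd: rdd.distinct())  # RDD-level: py4j callback
upper = lines.map(lambda s: s.upper())               # record-level: worker.py
upper.pprint()
```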

> regarding pyspark streaming, is py4j callback only used for DStream,
> worker.py still used for RDD?

Yes.

> thanks,
> Renyi.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: StructType has more rows, than corresponding Row has objects.

2015-10-05 Thread Davies Liu
Could you tell us a way to reproduce this failure? Reading from JSON or Parquet?
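
In case it helps others reproduce it, a minimal sketch of the pattern described
below (explicit rows plus a hand-built StructType); the field names are made up
for illustration:

```
# Building a DataFrame from rows and a hand-built StructType; each row has to
# match the schema's length, which is what appears to go wrong in the report.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = SparkContext("local[2]", "rows-and-structtype")
sqlContext = SQLContext(sc)

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
rows = [("alice", 30), ("bob", None)]
df = sqlContext.createDataFrame(rows, schema)
df.printSchema()
```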

On Mon, Oct 5, 2015 at 4:28 AM, Eugene Morozov
 wrote:
> Hi,
>
> We're building our own framework on top of spark and we give users pretty
> complex schema to work with. That requires from us to build dataframes by
> ourselves: we transform business objects to rows and struct types and uses
> these two to create dataframe.
>
> Everything was fine until I started to upgrade to spark 1.5.0 (from 1.3.1).
> Seems to be catalyst engine has been changed and now using almost the same
> code to produce rows and struct types I have the following:
> http://ibin.co/2HzUsoe9O96l, some of rows in the end result have different
> number of values and corresponding struct types.
>
> I'm almost sure it's my own fault, but there is always a small chance, that
> something is wrong in spark codebase. If you've seen something similar or if
> there is a jira for smth similar, I'd be glad to know. Thanks.
> --
> Be well!
> Jean Morozov

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Davies Liu
Maybe we could try LZ4 [1], which has better performance and a smaller footprint
than LZF and Snappy. In fast scan mode, its performance is 1.5 - 2x higher than
LZF's [2], and the memory used is about 10x smaller than LZF's (16k vs 190k).

[1] https://github.com/jpountz/lz4-java
[2] http://ning.github.io/jvm-compressor-benchmark/results/calgary/roundtrip-2013-06-06/index.html
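
If an LZ4 codec is wired in, switching is just a configuration change; a hedged
sketch is below (the "lz4" short name is an assumption here, and older builds
may require the full codec class name instead):

```
# Selecting the codec used for shuffle/broadcast compression via configuration.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("compression-codec-sketch")
        .set("spark.io.compression.codec", "lz4"))
sc = SparkContext(conf=conf)
```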


On Mon, Jul 14, 2014 at 12:01 AM, Reynold Xin  wrote:
>
> Hi Spark devs,
>
> I was looking into the memory usage of shuffle and one annoying thing is
> the default compression codec (LZF) is that the implementation we use
> allocates buffers pretty generously. I did a simple experiment and found
> that creating 1000 LZFOutputStream allocated 198976424 bytes (~190MB). If
> we have a shuffle task that uses 10k reducers and 32 threads running
> currently, the memory used by the lzf stream alone would be ~ 60GB.
>
> In comparison, Snappy only allocates ~ 65MB for every
> 1k SnappyOutputStream. However, Snappy's compression is slightly lower than
> LZF's. In my experience, it leads to 10 - 20% increase in size. Compression
> ratio does matter here because we are sending data across the network.
>
> In future releases we will likely change the shuffle implementation to open
> less streams. Until that happens, I'm looking for compression codec
> implementations that are fast, allocate small buffers, and have decent
> compression ratio.
>
> Does anybody on this list have any suggestions? If not, I will submit a
> patch for 1.1 that replaces LZF with Snappy for the default compression
> codec to lower memory usage.
>
>
> allocation data here: https://gist.github.com/rxin/ad7217ea60e3fb36c567


Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas
 wrote:
> Yep, I thought it was a bogus comparison.
>
> I should rephrase my question as it was poorly phrased: on average, how
> much faster is Spark v. PySpark (I didn't really mean Scala v. Python)?
> I've only used Spark and don't have a chance to test this at the moment so
> if anybody has these numbers or general estimates (10x, etc), that'd be
> great.

A quick comparison using word count on a 4.3 GB text file (local mode):

Spark:  40 seconds
PySpark: 2 minutes and 16 seconds

So PySpark is 3.4x slower than Spark.
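
For reference, a minimal sketch of the kind of word count used for this
comparison (the input path is a placeholder; the exact benchmark script is not
part of the thread):

```
# Word count over a large text file in local mode; the path is a placeholder.
from operator import add
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-comparison")
counts = (sc.textFile("/path/to/4.3GB-text-file")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))
print(counts.count())   # force evaluation
sc.stop()
```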

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:31 PM, Davies Liu  wrote:
> On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas
>  wrote:
>> Yep, I thought it was a bogus comparison.
>>
>> I should rephrase my question as it was poorly phrased: on average, how
>> much faster is Spark v. PySpark (I didn't really mean Scala v. Python)?
>> I've only used Spark and don't have a chance to test this at the moment so
>> if anybody has these numbers or general estimates (10x, etc), that'd be
>> great.
>
> A quick comparison by word count on 4.3G text file (local mode),
>
> Spark:  40 seconds
> PySpark: 2 minutes and 16 seconds
>
> So PySpark is 3.4x slower than Spark.

I also tried DPark, which is a pure Python clone of Spark:

DPark: 53 seconds

so it's more than 2x faster than PySpark, because it does not have
the overhead of passing data between the JVM and Python.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: TorrentBroadcast slow performance

2014-10-07 Thread Davies Liu
Could you create a JIRA for it? Maybe it's a regression after
https://issues.apache.org/jira/browse/SPARK-3119.

We would appreciate it if you could tell us how to reproduce it.
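
For anyone hitting the same thing, the workaround mentioned below is just a
broadcast factory setting; a sketch follows (the class name is as of Spark 1.1
and should be treated as an assumption to verify against your version):

```
# Forcing HttpBroadcast instead of the default TorrentBroadcast via configuration.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("http-broadcast-workaround")
        .set("spark.broadcast.factory",
             "org.apache.spark.broadcast.HttpBroadcastFactory"))
sc = SparkContext(conf=conf)
```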

On Mon, Oct 6, 2014 at 1:27 AM, Guillaume Pitel
 wrote:
> Hi,
>
> I've had no answer to this on u...@spark.apache.org, so I post it on dev
> before filing a JIRA (in case the problem or solution is already identified)
>
> We've had some performance issues since switching to 1.1.0, and we finally
> found the origin : TorrentBroadcast seems to be very slow in our setting
> (and it became default with 1.1.0)
>
> The logs of a 4MB variable with TorrentBroadcast : (15s)
>
> 14/10/01 15:47:13 INFO storage.MemoryStore: Block broadcast_84_piece1 stored
> as bytes in memory (estimated size 171.6 KB, free 7.2 GB)
> 14/10/01 15:47:13 INFO storage.BlockManagerMaster: Updated info of block
> broadcast_84_piece1
> 14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4194304) called
> with curMem=1401611984, maxMem=9168696115
> 14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84_piece0 stored
> as bytes in memory (estimated size 4.0 MB, free 7.2 GB)
> 14/10/01 15:47:23 INFO storage.BlockManagerMaster: Updated info of block
> broadcast_84_piece0
> 14/10/01 15:47:23 INFO broadcast.TorrentBroadcast: Reading broadcast
> variable 84 took 15.202260006 s
> 14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4371392) called
> with curMem=1405806288, maxMem=9168696115
> 14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84 stored as
> values in memory (estimated size 4.2 MB, free 7.2 GB)
>
> (notice that a 10s lag happens after the "Updated info of block
> broadcast_..." and before the MemoryStore log
>
> And with HttpBroadcast (0.3s):
>
> 14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Started reading broadcast
> variable 147
> 14/10/01 16:05:58 INFO storage.MemoryStore: ensureFreeSpace(4369376) called
> with curMem=1373493232, maxMem=9168696115
> 14/10/01 16:05:58 INFO storage.MemoryStore: Block broadcast_147 stored as
> values in memory (estimated size 4.2 MB, free 7.3 GB)
> 14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Reading broadcast variable
> 147 took 0.320907112 s 14/10/01 16:05:58 INFO storage.BlockManager: Found
> block broadcast_147 locally
>
> Since Torrent is supposed to perform much better than Http, we suspect a
> configuration error from our side, but are unable to pin it down. Does
> someone have any idea of the origin of the problem ?
>
> For now we're sticking with the HttpBroadcast workaround.
>
> Guillaume
> --
> Guillaume PITEL, Président
> +33(0)626 222 431
>
> eXenSa S.A.S.
> 41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)184 163 677 / Fax +33(0)972 283 705

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-17 Thread Davies Liu
One finding is that all the timeouts happened with this command:

git fetch --tags --progress https://github.com/apache/spark.git
+refs/pull/*:refs/remotes/origin/pr/*

I'm thinking this may be an expensive call; we could try a cheaper one:

git fetch --tags --progress https://github.com/apache/spark.git
+refs/pull/XXX/*:refs/remotes/origin/pr/XXX/*

where XXX is the pull request ID.

The configuration supports parameters [1], so we could put this in:

+refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*

I have not tested this yet; could you give it a try?

Davies


[1] 
https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin

On Fri, Oct 17, 2014 at 5:00 PM, shane knapp  wrote:
> actually, nvm, you have to be run that command from our servers to affect
> our limit.  run it all you want from your own machines!  :P
>
> On Fri, Oct 17, 2014 at 4:59 PM, shane knapp  wrote:
>
>> yep, and i will tell you guys ONLY if you promise to NOT try this
>> yourselves...  checking the rate limit also counts as a hit and increments
>> our numbers:
>>
>> # curl -i https://api.github.com/users/whatever 2> /dev/null | egrep
>> ^X-Rate
>> X-RateLimit-Limit: 60
>> X-RateLimit-Remaining: 51
>> X-RateLimit-Reset: 1413590269
>>
>> (yes, that is the exact url that they recommended on the github site lol)
>>
>> so, earlier today, we had a spark build fail w/a git timeout at 10:57am,
>> but there were only ~7 builds run that hour, so that points to us NOT
>> hitting the rate limit...  at least for this fail.  whee!
>>
>> is it beer-thirty yet?
>>
>> shane
>>
>>
>>
>> On Fri, Oct 17, 2014 at 4:52 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Wow, thanks for this deep dive Shane. Is there a way to check if we are
>>> getting hit by rate limiting directly, or do we need to contact GitHub
>>> for that?
>>>
>>> 2014년 10월 17일 금요일, shane knapp님이 작성한 메시지:
>>>
>>> quick update:

 here are some stats i scraped over the past week of ALL pull request
 builder projects and timeout failures.  due to the large number of spark
 ghprb jobs, i don't have great records earlier than oct 7th.  the data is
 current up until ~230pm today:

 spark and new spark ghprb total builds vs git fetch timeouts:
 $ for x in 10-{09..17}; do passed=$(grep $x SORTED.passed | grep -i
 spark | wc -l); failed=$(grep $x SORTED | grep -i spark | wc -l); let
 total=passed+failed; fail_percent=$(echo "scale=2; $failed/$total" | bc |
 sed "s/^\.//g"); line="$x -- total builds: $total\tp/f:
  $passed/$failed\tfail%: $fail_percent%"; echo -e $line; done
 10-09 -- total builds: 140 p/f: 92/48 fail%: 34%
 10-10 -- total builds: 65 p/f: 59/6 fail%: 09%
 10-11 -- total builds: 29 p/f: 29/0 fail%: 0%
 10-12 -- total builds: 24 p/f: 21/3 fail%: 12%
 10-13 -- total builds: 39 p/f: 35/4 fail%: 10%
 10-14 -- total builds: 7 p/f: 5/2 fail%: 28%
 10-15 -- total builds: 37 p/f: 34/3 fail%: 08%
 10-16 -- total builds: 71 p/f: 59/12 fail%: 16%
 10-17 -- total builds: 26 p/f: 20/6 fail%: 23%

 all other ghprb builds vs git fetch timeouts:
 $ for x in 10-{09..17}; do passed=$(grep $x SORTED.passed | grep -vi
 spark | wc -l); failed=$(grep $x SORTED | grep -vi spark | wc -l); let
 total=passed+failed; fail_percent=$(echo "scale=2; $failed/$total" | bc |
 sed "s/^\.//g"); line="$x -- total builds: $total\tp/f:
  $passed/$failed\tfail%: $fail_percent%"; echo -e $line; done
 10-09 -- total builds: 16 p/f: 16/0 fail%: 0%
 10-10 -- total builds: 46 p/f: 40/6 fail%: 13%
 10-11 -- total builds: 4 p/f: 4/0 fail%: 0%
 10-12 -- total builds: 2 p/f: 2/0 fail%: 0%
 10-13 -- total builds: 2 p/f: 2/0 fail%: 0%
 10-14 -- total builds: 10 p/f: 10/0 fail%: 0%
 10-15 -- total builds: 5 p/f: 5/0 fail%: 0%
 10-16 -- total builds: 5 p/f: 5/0 fail%: 0%
 10-17 -- total builds: 0 p/f: 0/0 fail%: 0%

 note:  the 15th was the day i rolled back to the earlier version of the
 git plugin.  it doesn't seem to have helped much, so i'll probably bring us
 back up to the latest version soon.
 also note:  rocking some floating point math on the CLI!  ;)

 i also compared the distribution of git timeout failures vs time of day,
 and there appears to be no correlation.  the failures are pretty evenly
 distributed over each hour of the day.

 we could be hitting the rate limit due to the ghprb hitting github a
 couple of times for each build, but we're averaging ~10-20 builds per hour
 (a build hits github 2-4 times, from what i can tell).  i'll have to look
 more in to this on monday, but suffice to say we may need to move from
 unauthorized https fetches to authorized requests.  this means retrofitting
 all of our jobs.  yay!  fun!  :)

 another option is to have local mirrors of all of the repos.  the
 problem w/this is that there migh

Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-17 Thread Davies Liu
How can we know the changes have been applied? I checked several recent
builds; they all use the original configs.

Davies

On Fri, Oct 17, 2014 at 6:17 PM, Josh Rosen  wrote:
> FYI, I edited the Spark Pull Request Builder job to try this out.  Let’s see
> if it works (I’ll be around to revert if it doesn’t).
>
> On October 17, 2014 at 5:26:56 PM, Davies Liu (dav...@databricks.com) wrote:
>
> One finding is that all the timeout happened with this command:
>
> git fetch --tags --progress https://github.com/apache/spark.git
> +refs/pull/*:refs/remotes/origin/pr/*
>
> I'm thinking that maybe this may be a expensive call, we could try to
> use a more cheap one:
>
> git fetch --tags --progress https://github.com/apache/spark.git
> +refs/pull/XXX/*:refs/remotes/origin/pr/XXX/*
>
> XXX is the PullRequestID,
>
> The configuration support parameters [1], so we could put this in :
>
> +refs/pull//${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*
>
> I have not tested this yet, could you give this a try?
>
> Davies
>
>
> [1]
> https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin
>
> On Fri, Oct 17, 2014 at 5:00 PM, shane knapp  wrote:
>> actually, nvm, you have to be run that command from our servers to affect
>> our limit. run it all you want from your own machines! :P
>>
>> On Fri, Oct 17, 2014 at 4:59 PM, shane knapp  wrote:
>>
>>> yep, and i will tell you guys ONLY if you promise to NOT try this
>>> yourselves... checking the rate limit also counts as a hit and increments
>>> our numbers:
>>>
>>> # curl -i https://api.github.com/users/whatever 2> /dev/null | egrep
>>> ^X-Rate
>>> X-RateLimit-Limit: 60
>>> X-RateLimit-Remaining: 51
>>> X-RateLimit-Reset: 1413590269
>>>
>>> (yes, that is the exact url that they recommended on the github site lol)
>>>
>>> so, earlier today, we had a spark build fail w/a git timeout at 10:57am,
>>> but there were only ~7 builds run that hour, so that points to us NOT
>>> hitting the rate limit... at least for this fail. whee!
>>>
>>> is it beer-thirty yet?
>>>
>>> shane
>>>
>>>
>>>
>>> On Fri, Oct 17, 2014 at 4:52 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Wow, thanks for this deep dive Shane. Is there a way to check if we are
>>>> getting hit by rate limiting directly, or do we need to contact GitHub
>>>> for that?
>>>>
>>>> 2014년 10월 17일 금요일, shane knapp님이 작성한 메시지:
>>>>
>>>> quick update:
>>>>>
>>>>> here are some stats i scraped over the past week of ALL pull request
>>>>> builder projects and timeout failures. due to the large number of spark
>>>>> ghprb jobs, i don't have great records earlier than oct 7th. the data
>>>>> is
>>>>> current up until ~230pm today:
>>>>>
>>>>> spark and new spark ghprb total builds vs git fetch timeouts:
>>>>> $ for x in 10-{09..17}; do passed=$(grep $x SORTED.passed | grep -i
>>>>> spark | wc -l); failed=$(grep $x SORTED | grep -i spark | wc -l); let
>>>>> total=passed+failed; fail_percent=$(echo "scale=2; $failed/$total" | bc
>>>>> |
>>>>> sed "s/^\.//g"); line="$x -- total builds: $total\tp/f:
>>>>> $passed/$failed\tfail%: $fail_percent%"; echo -e $line; done
>>>>> 10-09 -- total builds: 140 p/f: 92/48 fail%: 34%
>>>>> 10-10 -- total builds: 65 p/f: 59/6 fail%: 09%
>>>>> 10-11 -- total builds: 29 p/f: 29/0 fail%: 0%
>>>>> 10-12 -- total builds: 24 p/f: 21/3 fail%: 12%
>>>>> 10-13 -- total builds: 39 p/f: 35/4 fail%: 10%
>>>>> 10-14 -- total builds: 7 p/f: 5/2 fail%: 28%
>>>>> 10-15 -- total builds: 37 p/f: 34/3 fail%: 08%
>>>>> 10-16 -- total builds: 71 p/f: 59/12 fail%: 16%
>>>>> 10-17 -- total builds: 26 p/f: 20/6 fail%: 23%
>>>>>
>>>>> all other ghprb builds vs git fetch timeouts:
>>>>> $ for x in 10-{09..17}; do passed=$(grep $x SORTED.passed | grep -vi
>>>>> spark | wc -l); failed=$(grep $x SORTED | grep -vi spark | wc -l); let
>>>>> total=passed+failed; fail_percent=$(echo "scale=2; $failed/$total" | bc
>>>>> |
>>>>> sed "s/^\.//g"); line="$x -- total builds: $total\tp/f:
>>>>> $passed/$failed\tfail%: $fail_perce

Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-18 Thread Davies Liu
Cool, the 4 most recent builds used the new configs, thanks!

Let's run more builds.

Davies

On Fri, Oct 17, 2014 at 11:06 PM, Josh Rosen  wrote:
> I think that the fix was applied.  Take a look at
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21874/consoleFull
>
> Here, I see a fetch command that mentions this specific PR branch rather
> than the wildcard that we had before:
>
>  > git fetch --tags --progress https://github.com/apache/spark.git
> +refs/pull/2840/*:refs/remotes/origin/pr/2840/* # timeout=15
>
>
> Do you have an example of a Spark PRB build that’s still failing with the
> old fetch failure?
>
> - Josh
>
> On October 17, 2014 at 11:03:14 PM, Davies Liu (dav...@databricks.com)
> wrote:
>
> How can we know the changes has been applied? I had checked several
> recent builds, they all use the original configs.
>
> Davies
>
> On Fri, Oct 17, 2014 at 6:17 PM, Josh Rosen  wrote:
>> FYI, I edited the Spark Pull Request Builder job to try this out. Let’s
>> see
>> if it works (I’ll be around to revert if it doesn’t).
>>
>> On October 17, 2014 at 5:26:56 PM, Davies Liu (dav...@databricks.com)
>> wrote:
>>
>> One finding is that all the timeout happened with this command:
>>
>> git fetch --tags --progress https://github.com/apache/spark.git
>> +refs/pull/*:refs/remotes/origin/pr/*
>>
>> I'm thinking that maybe this may be a expensive call, we could try to
>> use a more cheap one:
>>
>> git fetch --tags --progress https://github.com/apache/spark.git
>> +refs/pull/XXX/*:refs/remotes/origin/pr/XXX/*
>>
>> XXX is the PullRequestID,
>>
>> The configuration support parameters [1], so we could put this in :
>>
>> +refs/pull//${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*
>>
>> I have not tested this yet, could you give this a try?
>>
>> Davies
>>
>>
>> [1]
>>
>> https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin
>>
>> On Fri, Oct 17, 2014 at 5:00 PM, shane knapp  wrote:
>>> actually, nvm, you have to be run that command from our servers to affect
>>> our limit. run it all you want from your own machines! :P
>>>
>>> On Fri, Oct 17, 2014 at 4:59 PM, shane knapp  wrote:
>>>
>>>> yep, and i will tell you guys ONLY if you promise to NOT try this
>>>> yourselves... checking the rate limit also counts as a hit and
>>>> increments
>>>> our numbers:
>>>>
>>>> # curl -i https://api.github.com/users/whatever 2> /dev/null | egrep
>>>> ^X-Rate
>>>> X-RateLimit-Limit: 60
>>>> X-RateLimit-Remaining: 51
>>>> X-RateLimit-Reset: 1413590269
>>>>
>>>> (yes, that is the exact url that they recommended on the github site
>>>> lol)
>>>>
>>>> so, earlier today, we had a spark build fail w/a git timeout at 10:57am,
>>>> but there were only ~7 builds run that hour, so that points to us NOT
>>>> hitting the rate limit... at least for this fail. whee!
>>>>
>>>> is it beer-thirty yet?
>>>>
>>>> shane
>>>>
>>>>
>>>>
>>>> On Fri, Oct 17, 2014 at 4:52 PM, Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> Wow, thanks for this deep dive Shane. Is there a way to check if we are
>>>>> getting hit by rate limiting directly, or do we need to contact GitHub
>>>>> for that?
>>>>>
>>>>> 2014년 10월 17일 금요일, shane knapp님이 작성한 메시지:
>>>>>
>>>>> quick update:
>>>>>>
>>>>>> here are some stats i scraped over the past week of ALL pull request
>>>>>> builder projects and timeout failures. due to the large number of
>>>>>> spark
>>>>>> ghprb jobs, i don't have great records earlier than oct 7th. the data
>>>>>> is
>>>>>> current up until ~230pm today:
>>>>>>
>>>>>> spark and new spark ghprb total builds vs git fetch timeouts:
>>>>>> $ for x in 10-{09..17}; do passed=$(grep $x SORTED.passed | grep -i
>>>>>> spark | wc -l); failed=$(grep $x SORTED | grep -i spark | wc -l); let
>>>>>> total=passed+failed; fail_percent=$(echo "scale=2; $failed/$total" |
>>>>>> bc
>>>>>> |
>>>>>> sed "s/^\.//g"); line="$x -- total builds: $total\tp

Re: [VOTE] Designating maintainers for some Spark components

2014-11-07 Thread Davies Liu
-1 (not binding, +1 for maintainer, -1 for sign off)

Agree with Greg and Vinod. In the beginning everything is better (more
efficient, more focused), but after some time the fighting begins.

Code style is the hottest topic to fight over (we have already seen it in some
PRs). If two committers (one of them a maintainer) cannot agree on code style,
then before this process they would ask for comments from other committers, but
after this process the maintainer has a higher-priority -1, so the maintainer
will keep his/her personal preference and it's hard to reach an agreement.
Eventually, different components will end up with different code styles (or
other divergences).

Right now, maintainers are a kind of first contact or best contact: the best
person to review a PR in that component. We could announce that, so new
contributors can easily find the right person to review.

My 2 cents.

Davies


On Thu, Nov 6, 2014 at 11:43 PM, Vinod Kumar Vavilapalli
 wrote:
>> With the maintainer model, the process is as follows:
>>
>> - Any committer could review the patch and merge it, but they would need to 
>> forward it to me (or another core API maintainer) to make sure we also 
>> approve
>> - At any point during this process, I could come in and -1 it, or give 
>> feedback
>> - In addition, any other committer beyond me is still allowed to -1 this 
>> patch
>>
>> The only change in this model is that committers are responsible to forward 
>> patches in these areas to certain other committers. If every committer had 
>> perfect oversight of the project, they could have also seen every patch to 
>> their component on their own, but this list ensures that they see it even if 
>> they somehow overlooked it.
>
>
> Having done the job of playing an informal 'maintainer' of a project myself, 
> this is what I think you really need:
>
> The so called 'maintainers' do one of the below
>  - Actively poll the lists and watch over contributions. And follow what is 
> repeated often around here: Trust but verify.
>  - Setup automated mechanisms to send all bug-tracker updates of a specific 
> component to a list that people can subscribe to
>
> And/or
>  - Individual contributors send review requests to unofficial 'maintainers' 
> over dev-lists or through tools. Like many projects do with review boards and 
> other tools.
>
> Note that none of the above is a required step. It must not be, that's the 
> point. But once set as a convention, they will all help you address your 
> concerns with project scalability.
>
> Anything else that you add is bestowing privileges to a select few and 
> forming dictatorships. And contrary to what the proposal claims, this is 
> neither scalable nor confirming to Apache governance rules.
>
> +Vinod

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Designating maintainers for some Spark components

2014-11-07 Thread Davies Liu
Sorry for my last email; I misunderstood the proposal. All committers still
have an equal -1 on all code changes.

Also, as mentioned in the proposal, the sign-off only applies to public APIs
and architecture; things like code style discussions stay the same.

So, I'd revert my vote to +1. Sorry for this.

Davies


On Fri, Nov 7, 2014 at 3:18 PM, Davies Liu  wrote:
> -1 (not binding, +1 for maintainer, -1 for sign off)
>
> Agree with Greg and Vinod. In the beginning, everything is better
> (more efficient, more focus), but after some time, fighting begins.
>
> Code style is the most hot topic to fight (we already saw it in some
> PRs). If two committers (one of them is maintainer) have not got a
> agreement on code style, before this process, they will ask comments
> from other committers, but after this process, the maintainer have
> higher priority to -1, then maintainer will keep his/her personal
> preference, it's hard to make a agreement. Finally, different
> components will have different code style (or others).
>
> Right now, maintainers are kind of first contact or best contacts, the
> best person to review the PR in that component. We could announce it,
> then new contributors can easily find the right one to review.
>
> My 2 cents.
>
> Davies
>
>
> On Thu, Nov 6, 2014 at 11:43 PM, Vinod Kumar Vavilapalli
>  wrote:
>>> With the maintainer model, the process is as follows:
>>>
>>> - Any committer could review the patch and merge it, but they would need to 
>>> forward it to me (or another core API maintainer) to make sure we also 
>>> approve
>>> - At any point during this process, I could come in and -1 it, or give 
>>> feedback
>>> - In addition, any other committer beyond me is still allowed to -1 this 
>>> patch
>>>
>>> The only change in this model is that committers are responsible to forward 
>>> patches in these areas to certain other committers. If every committer had 
>>> perfect oversight of the project, they could have also seen every patch to 
>>> their component on their own, but this list ensures that they see it even 
>>> if they somehow overlooked it.
>>
>>
>> Having done the job of playing an informal 'maintainer' of a project myself, 
>> this is what I think you really need:
>>
>> The so called 'maintainers' do one of the below
>>  - Actively poll the lists and watch over contributions. And follow what is 
>> repeated often around here: Trust but verify.
>>  - Setup automated mechanisms to send all bug-tracker updates of a specific 
>> component to a list that people can subscribe to
>>
>> And/or
>>  - Individual contributors send review requests to unofficial 'maintainers' 
>> over dev-lists or through tools. Like many projects do with review boards 
>> and other tools.
>>
>> Note that none of the above is a required step. It must not be, that's the 
>> point. But once set as a convention, they will all help you address your 
>> concerns with project scalability.
>>
>> Anything else that you add is bestowing privileges to a select few and 
>> forming dictatorships. And contrary to what the proposal claims, this is 
>> neither scalable nor confirming to Apache governance rules.
>>
>> +Vinod

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org