Re: non-deprecation compiler warnings are upgraded to build errors now

2015-07-24 Thread Punyashloka Biswal
Would it make sense to isolate the use of deprecated APIs, and hence the
deprecation warnings, to a subset of projects? That way we could turn on more
stringent checks for the other ones.

Punya

On Thu, Jul 23, 2015 at 12:08 AM Reynold Xin r...@databricks.com wrote:

 Hi all,

 FYI, we just merged a patch that fails the build if there is a Scala
 compiler warning (unless it is a deprecation warning).

 In the past, many compiler warnings were actually caused by legitimate bugs
 that we needed to address. However, if we don't fail the build on warnings,
 people don't pay attention to them at all (it is also tough to pay
 attention, since there are a lot of deprecation warnings due to unit tests
 exercising deprecated APIs and our reliance on deprecated Hadoop APIs).

 Note that ideally we should be able to mark deprecation warnings as errors
 as well. However, due to the lack of ability to suppress individual warning
 messages in the Scala compiler, we cannot do that (since we do need to
 access deprecated APIs in Hadoop).





Re: PySpark on PyPi

2015-07-22 Thread Punyashloka Biswal
I agree with everything Justin just said. An additional advantage of
publishing PySpark's Python code in a standards-compliant way is the fact
that we'll be able to declare transitive dependencies (Pandas, Py4J) in a
way that pip can use. Contrast this with the current situation, where
df.toPandas() exists in the Spark API but doesn't actually work until you
install Pandas.

Punya
On Wed, Jul 22, 2015 at 12:49 PM Justin Uang justin.u...@gmail.com wrote:

 // + *Davies* for his comments
 // + Punya for SA

 For development and CI, like Olivier mentioned, I think it would be hugely
 beneficial to publish pyspark (only code in the python/ dir) on PyPI. If
 anyone wants to develop against PySpark APIs, they need to download the
 distribution and do a lot of PYTHONPATH munging for all the tools (pylint,
 pytest, IDE code completion). Right now that involves adding python/ and
 python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more
 dependencies, we would have to manually mirror all the PYTHONPATH munging
 in the ./pyspark script. With a proper pyspark setup.py which declares its
 dependencies, and a published distribution, depending on pyspark will just
 be adding pyspark to my setup.py dependencies.

 Of course, if we actually want to run parts of pyspark that are backed by
 Py4J calls, then we need the full spark distribution with either ./pyspark
 or ./spark-submit, but for things like linting and development, the
 PYTHONPATH munging is very annoying.
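 To make this concrete, here's a rough sketch of what a setup.py for the
 python/ dir might look like (the version scheme and dependency pins here are
 hypothetical, not a settled proposal):

     # hypothetical setup.py living next to the pyspark package in python/
     from setuptools import setup, find_packages

     setup(
         name="pyspark",
         version="1.4.0",  # would have to track the Spark release exactly
         packages=find_packages(exclude=["*.tests", "*.tests.*"]),
         install_requires=[
             "py4j==0.8.2.1",  # today this ships as a zip under python/lib/
         ],
         extras_require={
             # so df.toPandas() works out of the box for those who want it
             "pandas": ["pandas"],
         },
     )

 With something like that published, depending on pyspark is just another
 install_requires entry, and pip pulls in Py4J for us instead of us pointing
 PYTHONPATH at python/lib/py4j-0.8.2.1-src.zip.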

 I don't think the version-mismatch issues are a compelling reason to not
 go ahead with PyPI publishing. At runtime, we should definitely enforce
 that the version has to be exact, which means there is no backcompat
 nightmare as suggested by Davies in
 https://issues.apache.org/jira/browse/SPARK-1267. This would mean that
 even if the user's pip-installed pyspark somehow got loaded before the
 pyspark provided by the Spark distribution, the user would be alerted
 immediately.
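 To sketch the kind of check I mean (the function and message here are made
 up, not an actual pyspark API):

     # hypothetical fail-fast guard run during SparkContext initialization,
     # comparing the pip-installed pyspark version with the JVM-side version
     def check_version_match(python_version, jvm_version):
         if python_version != jvm_version:
             raise RuntimeError(
                 "pyspark %s does not match the Spark distribution (%s); "
                 "please install the matching pyspark release"
                 % (python_version, jvm_version))

 The point is just that a mismatch surfaces as one clear error at startup
 rather than as subtle protocol breakage later.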

 *Davies*, if you buy this, should I or someone on my team pick up
 https://issues.apache.org/jira/browse/SPARK-1267 and
 https://github.com/apache/spark/pull/464?

 On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

  Ok, I get it. Now what can we do to improve the current situation?
  Because right now, if I want to set up a CI env for PySpark, I have to:
  1- download a pre-built version of pyspark and unzip it somewhere on
  every agent
  2- define the SPARK_HOME env variable
  3- symlink this distribution's pyspark dir into the Python install's
  site-packages/ directory
  and if I rely on additional packages (like databricks' Spark-CSV
  project), I have to (unless I'm mistaken)
  4- compile/assemble spark-csv and deploy the jar in a specific directory on
  every agent
  5- add this jar-filled directory to the Spark distribution's additional
  classpath using the conf/spark-defaults.conf file

  Then finally we can launch our unit/integration tests.
  Some issues are related to spark-packages, some to the lack of
  Python-based dependency management, and some to the way SparkContexts are
  launched when using pyspark.
  I think steps 1 and 2 are fair enough.
  Steps 4 and 5 may already have solutions; I didn't check, and considering
  spark-shell downloads such dependencies automatically, I think if
  nothing's done yet, it will be (I guess?).

  For step 3, maybe just adding a setup.py to the distribution would be
  enough. I'm not exactly advocating distributing a full 300MB Spark
  distribution on PyPI; maybe there's a better compromise?

 Regards,

 Olivier.

 Le ven. 5 juin 2015 à 22:12, Jey Kottalam j...@cs.berkeley.edu a écrit :

 Couldn't we have a pip installable pyspark package that just serves as
 a shim to an existing Spark installation? Or it could even download the
 latest Spark binary if SPARK_HOME isn't set during installation. Right now,
 Spark doesn't play very well with the usual Python ecosystem. For example,
 why do I need to use a strange incantation when booting up IPython if I
 want to use PySpark in a notebook with MASTER=local[4]? It would be much
 nicer to just type `from pyspark import SparkContext; sc =
 SparkContext("local[4]")` in my notebook.
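 A minimal sketch of what such a shim could do at import time (the function
 name and error message are hypothetical; the py4j zip path matches the
 layout mentioned elsewhere in this thread):

     # hypothetical shim: wire sys.path up to an existing Spark installation
     import glob
     import os
     import sys

     def add_spark_to_path():
         spark_home = os.environ.get("SPARK_HOME")
         if not spark_home:
             raise RuntimeError("SPARK_HOME is not set; cannot find Spark")
         sys.path.insert(0, os.path.join(spark_home, "python"))
         # pick up whatever py4j zip the distribution ships
         for zip_path in glob.glob(
                 os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
             sys.path.insert(0, zip_path)

     add_spark_to_path()
     from pyspark import SparkContext  # now importable without manual munging
     sc = SparkContext("local[4]")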

 I did a test and it seems like PySpark's basic unit-tests do pass when
 SPARK_HOME is set and Py4J is on the PYTHONPATH:


 PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
 python $SPARK_HOME/python/pyspark/rdd.py

 -Jey


 On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen rosenvi...@gmail.com
 wrote:

 This has been proposed before:
 https://issues.apache.org/jira/browse/SPARK-1267

 There's currently tighter coupling between the Python and Java halves
 of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
 we'd run into tons of issues when users try to run a newer version of the
 Python half of PySpark against an older set of Java components or
 vice-versa.

 On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot 
 o.girar...@lateral-thoughts.com 

Re: Python UDF performance at large scale

2015-06-24 Thread Punyashloka Biswal
Hi Davies,

In general, do we expect people to use CPython only for heavyweight UDFs
that invoke an external library? Are there any examples of using Jython,
especially performance comparisons to Java/Scala and CPython? When using
Jython, do you expect the driver to send code to the executor as a string,
or is there a good way to serialize Jython lambdas?

(For context, I was unable to serialize Nashorn lambdas when I tried to use
them in Spark.)

Punya
On Wed, Jun 24, 2015 at 2:26 AM Davies Liu dav...@databricks.com wrote:

 Fair points; I also like simpler solutions.

 The overhead of a Python task could be a few milliseconds, which
 means we should also evaluate them in batches (one Python task per batch).

 Decreasing the batch size for UDFs sounds reasonable to me, together
 with other tricks to reduce the data in the socket/pipe buffer.

 BTW, what do your UDFs look like? How about using Jython to run
 simple Python UDFs (ones without external libraries)?
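 For example (these are just illustrative functions, not from any real
 workload), a pure-Python UDF like this could in principle run on Jython:

     def normalize_name(s):
         # standard library only, so Jython could execute it
         return s.strip().lower() if s is not None else None

 while a UDF that needs a C-extension library has to stay on CPython:

     import numpy as np

     def log_scale(x):
         # numpy is a C extension, so this requires CPython workers
         return float(np.log1p(x))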

 On Tue, Jun 23, 2015 at 8:21 PM, Justin Uang justin.u...@gmail.com
 wrote:
  // + punya
 
  Thanks for your quick response!
 
  I'm not sure that using an unbounded buffer is a good solution to the
  locking problem. For example, in the situation where I had 500 columns,
 I am
  in fact storing 499 extra columns on the java side, which might make me
 OOM
  if I have to store many rows. In addition, if I am using an
  AutoBatchedSerializer, the java side might have to write 1 << 16 == 65536
  rows before python starts outputting elements, in which case, the Java
 side
  has to buffer 65536 complete rows. In general it seems fragile to rely on
  blocking behavior in the Python coprocess. By contrast, it's very easy to
  verify the correctness and performance characteristics of the synchronous
  blocking solution.
 
 
  On Tue, Jun 23, 2015 at 7:21 PM Davies Liu dav...@databricks.com
 wrote:
 
   Thanks for looking into it. I like the idea of having a
   ForkingIterator. If we have an unlimited buffer in it, then we will not
   have the deadlock problem, I think. The writing thread will be blocked
   by the Python process, so there will not be many rows buffered (though
   that could still be a reason to OOM). At least, this approach is better
   than the current one.
  
   Could you create a JIRA and send out the PR?
 
  On Tue, Jun 23, 2015 at 3:27 PM, Justin Uang justin.u...@gmail.com
  wrote:
   BLUF: BatchPythonEvaluation's implementation is unusable at large
 scale,
   but
   I have a proof-of-concept implementation that avoids caching the
 entire
   dataset.
  
   Hi,
  
   We have been running into performance problems using Python UDFs with
   DataFrames at large scale.
  
   From the implementation of BatchPythonEvaluation, it looks like the
 goal
   was
   to reuse the PythonRDD code. It caches the entire child RDD so that it
   can
   do two passes over the data. One to give to the PythonRDD, then one to
   join
   the python lambda results with the original row (which may have java
   objects
   that should be passed through).
  
   In addition, it caches all the columns, even the ones that don't need
 to
   be
   processed by the Python UDF. In the cases I was working with, I had a
   500
    500-column table, and I wanted to use a Python UDF for one column, and it
   ended
   up caching all 500 columns.
  
   I have a working solution over here that does it in one pass over the
   data,
   avoiding caching
  
   (
 https://github.com/justinuang/spark/commit/c1a415a18d31226ac580f1a9df7985571d03199b
 ).
   With this patch, I go from a job that takes 20 minutes then OOMs, to a
   job
   that finishes completely in 3 minutes. It is indeed quite hacky and
   prone to
   deadlocks since there is buffering in many locations:
  
   - NEW: the ForkingIterator LinkedBlockingDeque
   - batching the rows before pickling them
   - os buffers on both sides
   - pyspark.serializers.BatchedSerializer
  
   We can avoid deadlock by being very disciplined. For example, we can
   have
   the ForkingIterator instead always do a check of whether the
   LinkedBlockingDeque is full and if so:
  
   Java
   - flush the java pickling buffer
   - send a flush command to the python process
   - os.flush the java side
  
   Python
   - flush BatchedSerializer
   - os.flush()
  
   I haven't added this yet. This is getting very complex however.
 Another
   model would just be to change the protocol between the java side and
 the
   worker to be a synchronous request/response. This has the disadvantage
   that
   the CPU isn't doing anything when the batch is being sent across, but
 it
   has
   the huge advantage of simplicity. In addition, I imagine that the
 actual
   IO
   between the processes isn't that slow, but rather the serialization of
   java
   objects into pickled bytes, and the deserialization/serialization +
   python
   loops on the python side. Another advantage is that we won't be taking
   more
   than 100% CPU since only one thread is doing CPU work at a 

Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Punyashloka Biswal
Davies: Can we use relative imports (e.g. `from . import types`) in the unit
tests in order to disambiguate between the global and local modules?
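Concretely, I mean something like this inside a test module under
pyspark/sql/ (a sketch; it only works when the module is imported as part of
the package, not when the file is run directly as a script):

    from . import types as sql_types  # pyspark.sql.types, the local module
    import types as std_types         # the standard-library 'types' module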

Punya

On Tue, May 26, 2015 at 3:09 PM Justin Uang justin.u...@gmail.com wrote:

 Thanks for clarifying! I don't understand Python package and module names
 that well, but I thought that the package namespacing would've helped,
 since you are in pyspark.sql.types. I guess not?

 On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote:

 There is a module called 'types' in python 3:

 davies@localhost:~/work/spark$ python3
 Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
 [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import types
  >>> types
  <module 'types' from
  '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'>

 Without renaming, our `types.py` will conflict with it when you run
 unittests in pyspark/sql/ .
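 A quick way to see the shadowing (paths are illustrative): when a test file
 is run as a script from inside python/pyspark/sql/, Python puts that
 directory first on sys.path, so a plain import picks up our file instead of
 the stdlib one:

     import sys
     print(sys.path[0])     # the script's directory, i.e. .../pyspark/sql
     import types
     print(types.__file__)  # .../pyspark/sql/types.py, not the stdlib types.py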

 On Tue, May 26, 2015 at 11:57 AM, Justin Uang justin.u...@gmail.com
 wrote:
  In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was
  renamed to pyspark/sql/_types.py, and then some magic in
  pyspark/sql/__init__.py dynamically renamed the module back to types. I
  imagine this is due to some naming conflict with Python 3, but what was the
  error that showed up?
 
  The reason I'm asking is that it's messing with pylint,
  since pylint can no longer statically find the module. I also tried
  importing the package so that __init__ would be run in an init-hook, but
  that isn't what the discovery mechanism is using. I imagine it's probably
  just crawling the directory structure.
 
  One way to work around this would be something akin to this
  (
 http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports
 ),
  where I would have to create a fake module, but I would probably be
 missing
  a ton of pylint features on users of that module, and it's pretty hacky.




Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Punyashloka Biswal
Thanks! I realize that manipulating the published version in the pom is a
bit inconvenient but it's really useful to have clear version identifiers
when we're juggling different versions and testing them out. For example,
this will come in handy when we compare 1.4.0-rc1 and 1.4.0-rc2 in a couple
of weeks :)

Punya

On Tue, May 19, 2015 at 12:39 PM Patrick Wendell pwend...@gmail.com wrote:

 Punya,

 Let me see if I can publish these under rc1 as well. In the future
 this will all be automated but currently it's a somewhat manual task.

 - Patrick

 On Tue, May 19, 2015 at 9:32 AM, Punyashloka Biswal
 punya.bis...@gmail.com wrote:
  When publishing future RCs to the staging repository, would it be
 possible
  to use a version number that includes the rc1 designation? In the
 current
  setup, when I run a build against the artifacts at
 
 https://repository.apache.org/content/repositories/orgapachespark-1092/org/apache/spark/spark-core_2.10/1.4.0/
 ,
  my local Maven cache will get polluted with things that claim to be 1.4.0
  but aren't. It would be preferable for the version number to be 1.4.0-rc1
  instead.
 
  Thanks!
  Punya
 
 
  On Tue, May 19, 2015 at 12:20 PM Sean Owen so...@cloudera.com wrote:
 
  Before I vote, I wanted to point out there are still 9 Blockers for
 1.4.0.
  I'd like to use this status to really mean "must happen before the
 release".
  Many of these may be already fixed, or aren't really blockers -- can
 just be
  updated accordingly.
 
  I bet at least one will require further work if it's really meant for
 1.4,
  so all this means is there is likely to be another RC. We should still
 kick
  the tires on RC1.
 
  (I also assume we should be extra conservative about what is merged into
  1.4 at this point.)
 
 
  SPARK-6784 SQL Clean up all the inbound/outbound conversions for
 DateType
  Adrian Wang
 
  SPARK-6811 SparkR Building binary R packages for SparkR Shivaram
  Venkataraman
 
  SPARK-6941 SQL Provide a better error message to explain that tables
  created from RDDs are immutable
  SPARK-7158 SQL collect and take return different results
  SPARK-7478 SQL Add a SQLContext.getOrCreate to maintain a singleton
  instance of SQLContext Tathagata Das
 
  SPARK-7616 SQL Overwriting a partitioned parquet table corrupt data
 Cheng
  Lian
 
  SPARK-7654 SQL DataFrameReader and DataFrameWriter for input/output API
  Reynold Xin
 
  SPARK-7662 SQL Exception of multi-attribute generator anlysis in
  projection
 
  SPARK-7713 SQL Use shared broadcast hadoop conf for partitioned table
  scan. Yin Huai
 
 
  On Tue, May 19, 2015 at 5:10 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Please vote on releasing the following candidate as Apache Spark
 version
  1.4.0!
 
  The tag to be voted on is v1.4.0-rc1 (commit 777a081):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.4.0-rc1/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
 
 https://repository.apache.org/content/repositories/orgapachespark-1092/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/
 
  Please vote on releasing this package as Apache Spark 1.4.0!
 
  The vote is open until Friday, May 22, at 17:03 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.4.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == How can I help test this release? ==
  If you are a Spark user, you can help us test this release by
  taking a Spark 1.3 workload and running on this release candidate,
  then reporting any regressions.
 
  == What justifies a -1 vote for this release? ==
  This vote is happening towards the end of the 1.4 QA period,
  so -1 votes should only occur for significant regressions from 1.3.1.
  Bugs already present in 1.3.X, minor regressions, or bugs related
  to new features will not block this release.
 
 
 
 



Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Punyashloka Biswal
Is there a foolproof way to access methods exclusively (instead of picking
between columns and methods at runtime)? Here are two ideas, neither of
which seems particularly Pythonic

   - pyspark.sql.methods(df).name()
   - df.__methods__.name()

Punya

On Fri, May 8, 2015 at 10:06 AM Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 And a link to SPARK-7035
 https://issues.apache.org/jira/browse/SPARK-7035 (which
 Xiangrui mentioned in his initial email) for the lazy.

 On Fri, May 8, 2015 at 3:41 AM Xiangrui Meng men...@gmail.com wrote:

  On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
    I don't know much about Python style, but I think the point Wes made
    about usability on the JIRA is pretty powerful. IMHO the number of
    methods on a Spark DataFrame might not be much higher than on a Pandas
    one. Given that users seem to be okay with the possibility of collisions
    in Pandas, I think sticking with (1) is not a bad idea.
  
 
  This is true for interactive work. Spark's DataFrames can handle
  really large datasets, which might be used in production workflows. So
  I think it is reasonable for us to care more about compatibility
   issues than Pandas does.
 
   Also is it possible to detect such collisions in Python ? A (4)th
 option
   might be to detect that `df` contains a column named `name` and print a
   warning in `df.name` which tells the user that the method is
 overriding
  the
   column.
 
   Maybe we can inspect the frame in which `df.name` gets called and warn
   users in `df.select(df.name)` but not in `name = df.name`. This could be
   tricky to implement.
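   For option (4), here is a standalone sketch of the idea (this toy class is
   only an illustration, not Spark's DataFrame):
  
       import warnings
  
       class DataFrame(object):
           def __init__(self, columns):
               self.columns = list(columns)
  
           @property
           def name(self):
               # hypothetical attribute added in a later release
               if "name" in self.columns:
                   warnings.warn(
                       "this DataFrame also has a column called 'name'; "
                       "df.name now refers to the attribute, so use "
                       "df['name'] to select the column", UserWarning)
               return "my_dataframe"
  
           def __getitem__(self, col):
               return "Column(%s)" % col
  
       df = DataFrame(columns=["name", "age"])
       df.name      # triggers the warning
       df["name"]   # unambiguous column reference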
 
  -Xiangrui
 
  
   Thanks
   Shivaram
  
  
   On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng men...@gmail.com
 wrote:
  
   Hi all,
  
    In PySpark, a DataFrame column can be referenced using df["abcd"]
   (__getitem__) and df.abcd (__getattr__). There is a discussion on
   SPARK-7035 on compatibility issues with the __getattr__ approach, and
   I want to collect more inputs on this.
  
   Basically, if in the future we introduce a new method to DataFrame, it
   may break user code that uses the same attr to reference a column or
   silently changes its behavior. For example, if we add name() to
   DataFrame in the next release, all existing code using `df.name` to
    reference a column called "name" will break. If we add `name()` as a
   property instead of a method, all existing code using `df.name` may
   still work but with a different meaning. `df.select(df.name)` no
    longer selects the column called "name" but the column that has the
   same name as `df.name`.
  
   There are several proposed solutions:
  
    1. Keep both df.abcd and df["abcd"], and encourage users to use the
    latter, which is future proof. This is the current solution in master
    (https://github.com/apache/spark/pull/5971). But I think users may
    still be unaware of the compatibility issue and prefer `df.abcd` to
    `df["abcd"]` because the former can be auto-completed.
    2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
    JIRA page: "I actually dragged my feet on the __getattr__ issue for
    several months back in the day, then finally added it (and tab
    completion in IPython with __dir__), and immediately noticed a huge
    quality-of-life improvement when using pandas for actual (esp.
    interactive) work."
    3. Replace df.abcd by df.abcd_ (with a suffix _). Both df.abcd_ and
    df["abcd"] would be future proof, and df.abcd_ could be
    auto-completed. The tradeoff is apparently the extra _ appearing in
    the code.
   
    My preference is 3 > 1 > 2. Your inputs would be greatly appreciated.
   Thanks!
  
   Best,
   Xiangrui
  
  
  
 
 
 



Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread Punyashloka Biswal
Just curious: will docker allow new capabilities for the Spark build?
(Where can I read more?)

Punya

On Fri, May 8, 2015 at 10:00 AM shane knapp skn...@berkeley.edu wrote:

 this is happening now.

 On Thu, May 7, 2015 at 3:40 PM, shane knapp skn...@berkeley.edu wrote:

  yes, docker.  that wonderful little wrapper for linux containers will be
  installed and ready for play on all of the jenkins workers tomorrow
 morning.
 
  the downtime will be super quick:  i just need to kill the jenkins
 slaves'
  ssh connections and relaunch to add the jenkins user to the docker group.
 
  this will begin at around 7am PDT and shouldn't take long at all.
 
  shane
 



Re: [discuss] ending support for Java 6?

2015-04-30 Thread Punyashloka Biswal
I'm in favor of ending support for Java 6. We should also articulate a
policy on how long we want to support current and future versions of Java
after Oracle declares them EOL (Java 7 will be in that bucket in a matter
of days).

Punya
On Thu, Apr 30, 2015 at 1:18 PM shane knapp skn...@berkeley.edu wrote:

 something to keep in mind:  we can easily support java 6 for the build
 environment, particularly if there's a definite EOL.

 i'd like to fix our java versioning 'problem', and this could be a big
 instigator...  right now we're hackily setting java_home in test invocation
 on jenkins, which really isn't the best.  if i decide, within jenkins, to
 reconfigure every build to 'do the right thing' WRT java version, then i
 will clean up the old mess and pay down on some technical debt.

 or i can just install java 6 and we use that as JAVA_HOME on a
 build-by-build basis.

 this will be a few days of prep and another morning-long downtime if i do
 the right thing (within jenkins), and only a couple of hours the hacky way
 (system level).

 either way, we can test on java 6.  :)

 On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com wrote:

  nicholas started it! :)
 
  for java 6 i would have said the same thing about 1 year ago: it is
 foolish
  to drop it. but i think the time is right about now.
  about half our clients are on java 7 and the other half have active plans
  to migrate to it within 6 months.
 
  On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com
 wrote:
 
   Guys thanks for chiming in, but please focus on Java here. Python is an
   entirely separate issue.
  
  
   On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com
  wrote:
  
   i am not sure eol means much if it is still actively used. we have a
 lot
   of clients with centos 5 (for which we still support python 2.4 in
 some
   form or another, fun!). most of them are on centos 6, which means
 python
   2.6. by cutting out python 2.6 you would cut out the majority of the
  actual
   clusters i am aware of. unless you intention is to truly make
 something
   academic i dont think that is wise.
  
   On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas 
   nicholas.cham...@gmail.com wrote:
  
   (On that note, I think Python 2.6 should be next on the chopping
 block
   sometime later this year, but that’s for another thread.)
  
   (To continue the parenthetical, Python 2.6 was in fact EOL-ed in
  October
   of
    2013. https://www.python.org/download/releases/2.6.9/)
  
   On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas 
   nicholas.cham...@gmail.com
   wrote:
  
I understand the concern about cutting out users who still use Java
  6,
   and
I don't have numbers about how many people are still using Java 6.
   
But I want to say at a high level that I support deprecating older
versions of stuff to reduce our maintenance burden and let us use
  more
modern patterns in our code.
   
Maintenance always costs way more than initial development over the
lifetime of a project, and for that reason anti-support is just
 as
important as support.
   
(On that note, I think Python 2.6 should be next on the chopping
  block
sometime later this year, but that's for another thread.)
   
Nick
   
   
On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com
   wrote:
   
This has been discussed a few times in the past, but now Oracle
 has
   ended
support for Java 6 for over a year, I wonder if we should just
 drop
   Java 6
support.
   
There is one outstanding issue Tom has brought to my attention:
   PySpark on
YARN doesn't work well with Java 7/8, but we have an outstanding
  pull
request to fix that.
   
https://issues.apache.org/jira/browse/SPARK-6869
https://issues.apache.org/jira/browse/SPARK-1920
   
   
  
  
  
  
 



Re: Plans for upgrading Hive dependency?

2015-04-27 Thread Punyashloka Biswal
Thanks Marcelo and Patrick - I don't know how I missed that ticket in my
Jira search earlier. Is anybody working on the sub-issues yet, or is there
a design doc I should look at before taking a stab?

Regards,
Punya

On Mon, Apr 27, 2015 at 3:56 PM Patrick Wendell pwend...@gmail.com wrote:

 Hey Punya,

 There is some ongoing work to help make Hive upgrades more manageable
 and allow us to support multiple versions of Hive. Once we do that, it
 will be much easier for us to upgrade.

 https://issues.apache.org/jira/browse/SPARK-6906

 - Patrick

 On Mon, Apr 27, 2015 at 12:47 PM, Marcelo Vanzin van...@cloudera.com
 wrote:
  That's a lot more complicated than you might think.
 
  We've done some basic work to get HiveContext to compile against Hive
  1.1.0. Here's the code:
 
 https://github.com/cloudera/spark/commit/00e2c7e35d4ac236bcfbcd3d2805b483060255ec
 
  We didn't sent that upstream because that only solves half of the
  problem; the hive-thriftserver is disabled in our CDH build because it
  uses a lot of Hive APIs that have been removed in 1.1.0, so even
  getting it to compile is really complicated.
 
  If there's interest in getting the HiveContext part fixed up I can
  send a PR for that code. But at this time I don't really have plans to
  look at the thrift server.
 
 
  On Mon, Apr 27, 2015 at 11:58 AM, Punyashloka Biswal
  punya.bis...@gmail.com wrote:
  Dear Spark devs,
 
  Is there a plan for staying up-to-date with current (and future)
 versions
  of Hive? Spark currently supports version 0.13 (June 2014), but the
 latest
  version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets
 about
  updating beyond 0.13, so I was wondering if this was intentional or it
 was
  just that nobody had started work on this yet.
 
  I'd be happy to work on a PR for the upgrade if one of the core
 developers
  can tell me what pitfalls to watch out for.
 
  Punya
 
 
 
  --
  Marcelo
 
 



Re: Design docs: consolidation and discoverability

2015-04-27 Thread Punyashloka Biswal
Nick, I like your idea of keeping it in a separate git repository. It seems
to combine the advantages of the present Google Docs approach with the
crisper history, discoverability, and text format simplicity of GitHub
wikis.

Punya
On Mon, Apr 27, 2015 at 1:30 PM Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 I like the idea of having design docs be kept up to date and tracked in
 git.

 If the Apache repo isn't a good fit, perhaps we can have a separate repo
 just for design docs? Maybe something like
 github.com/spark-docs/spark-docs/
 ?

 If there's other stuff we want to track but haven't, perhaps we can
 generalize the purpose of the repo a bit and rename it accordingly (e.g.
 spark-misc/spark-misc).

 Nick

 On Mon, Apr 27, 2015 at 1:21 PM Sandy Ryza sandy.r...@cloudera.com
 wrote:

  My only issue with Google Docs is that they're mutable, so it's difficult
  to follow a design's history through its revisions and link up JIRA
  comments with the relevant version.
 
  -Sandy
 
  On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran ste...@hortonworks.com
  wrote:
 
  
   One thing to consider is that while docs as PDFs in JIRAs do document
 the
   original proposal, that's not the place to keep living specifications.
  That
   stuff needs to live in SCM, in a format which can be easily maintained,
  can
   generate readable documents, and, in an unrealistically ideal world,
 even
   be used by machines to validate compliance with the design. Test suites
   tend to be the implicit machine-readable part of the specification,
  though
   they aren't usually viewed as such.
  
   PDFs of word docs in JIRAs are not the place for ongoing work, even if
  the
   early drafts can contain them. Given it's just as easy to point to
  markdown
   docs in github by commit ID, that could be an alternative way to
 publish
   docs, with the document itself being viewed as one of the deliverables.
   When the time comes to update a document, then its there in the source
  tree
   to edit.
  
    If there's a flaw here, it's that design docs are that: the design. The
   implementation may not match, ongoing work will certainly diverge. If
 the
   design docs aren't kept in sync, then they can mislead people.
  Accordingly,
   once the design docs are incorporated into the source tree, keeping
 them
  in
   sync with changes has to be viewed as essential as keeping tests up to
 date
  
On 26 Apr 2015, at 22:34, Patrick Wendell pwend...@gmail.com
 wrote:
   
I actually don't totally see why we can't use Google Docs provided it
is clearly discoverable from the JIRA. It was my understanding that
many projects do this. Maybe not (?).
   
If it's a matter of maintaining public record on ASF infrastructure,
perhaps we can just automate that if an issue is closed we capture
 the
doc content and attach it to the JIRA as a PDF.
   
My sense is that in general the ASF infrastructure policy is becoming
more and more lenient with regards to using third party services,
 provided they are broadly accessible (such as a public google doc) and
can be definitively archived on ASF controlled storage.
   
- Patrick
   
On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen so...@cloudera.com
 wrote:
I know I recently used Google Docs from a JIRA, so am guilty as
charged. I don't think there are a lot of design docs in general,
 but
the ones I've seen have simply pushed docs to a JIRA. (I did the
 same,
mirroring PDFs of the Google Doc.) I don't think this is hard to
follow.
   
I think you can do what you like: make a JIRA and attach files.
 Make a
WIP PR and attach your notes. Make a Google Doc if you're feeling
transgressive.
   
I don't see much of a problem to solve here. In practice there are
plenty of workable options, all of which are mainstream, and so I do
not see an argument that somehow this is solved by letting people
 make
wikis.
   
On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal
punya.bis...@gmail.com wrote:
Okay, I can understand wanting to keep Git history clean, and avoid
bottlenecking on committers. Is it reasonable to establish a
   convention of
having a label, component or (best of all) an issue type for issues
   that are
associated with design docs? For example, if we used the existing
Brainstorming issue type, and people put their design doc in the
description of the ticket, it would be relatively easy to figure
 out
   what
designs are in progress.
   
Given the push-back against design docs in Git or on the wiki and
 the
   strong
preference for keeping docs on ASF property, I'm a bit surprised
 that
   all
the existing design docs are on Google Docs. Perhaps Apache should
   consider
opening up parts of the wiki to a larger group, to better serve
 this
   use
case.
   
Punya
   
On Fri, Apr 24, 2015 at 5:01 PM Patrick Wendell 
 pwend...@gmail.com
   wrote

Plans for upgrading Hive dependency?

2015-04-27 Thread Punyashloka Biswal
Dear Spark devs,

Is there a plan for staying up-to-date with current (and future) versions
of Hive? Spark currently supports version 0.13 (June 2014), but the latest
version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets about
updating beyond 0.13, so I was wondering if this was intentional or it was
just that nobody had started work on this yet.

I'd be happy to work on a PR for the upgrade if one of the core developers
can tell me what pitfalls to watch out for.

Punya


Re: Design docs: consolidation and discoverability

2015-04-27 Thread Punyashloka Biswal
Github's wiki is just another Git repo. If we use a separate repo, it's
probably easiest to use the wiki git repo rather than the primary git
repo.

Punya

On Mon, Apr 27, 2015 at 1:50 PM Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 Oh, a GitHub wiki (which is separate from having docs in a repo) is yet
 another approach we could take, though if we want to do that on the main
 Spark repo we'd need permission from Apache, which may be tough to get...

 On Mon, Apr 27, 2015 at 1:47 PM Punyashloka Biswal punya.bis...@gmail.com
 wrote:

 Nick, I like your idea of keeping it in a separate git repository. It
 seems to combine the advantages of the present Google Docs approach with
 the crisper history, discoverability, and text format simplicity of GitHub
 wikis.

 Punya
 On Mon, Apr 27, 2015 at 1:30 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I like the idea of having design docs be kept up to date and tracked in
 git.

 If the Apache repo isn't a good fit, perhaps we can have a separate repo
 just for design docs? Maybe something like
 github.com/spark-docs/spark-docs/
 ?

 If there's other stuff we want to track but haven't, perhaps we can
 generalize the purpose of the repo a bit and rename it accordingly (e.g.
 spark-misc/spark-misc).

 Nick

 On Mon, Apr 27, 2015 at 1:21 PM Sandy Ryza sandy.r...@cloudera.com
 wrote:

  My only issue with Google Docs is that they're mutable, so it's
 difficult
  to follow a design's history through its revisions and link up JIRA
  comments with the relevant version.
 
  -Sandy
 
  On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran 
 ste...@hortonworks.com
  wrote:
 
  
   One thing to consider is that while docs as PDFs in JIRAs do
 document the
   original proposal, that's not the place to keep living
 specifications.
  That
   stuff needs to live in SCM, in a format which can be easily
 maintained,
  can
   generate readable documents, and, in an unrealistically ideal world,
 even
   be used by machines to validate compliance with the design. Test
 suites
   tend to be the implicit machine-readable part of the specification,
  though
   they aren't usually viewed as such.
  
   PDFs of word docs in JIRAs are not the place for ongoing work, even
 if
  the
   early drafts can contain them. Given it's just as easy to point to
  markdown
   docs in github by commit ID, that could be an alternative way to
 publish
   docs, with the document itself being viewed as one of the
 deliverables.
   When the time comes to update a document, then its there in the
 source
  tree
   to edit.
  
    If there's a flaw here, it's that design docs are that: the design.
 The
   implementation may not match, ongoing work will certainly diverge.
 If the
   design docs aren't kept in sync, then they can mislead people.
  Accordingly,
   once the design docs are incorporated into the source tree, keeping
 them
  in
   sync with changes has to be viewed as essential as keeping tests up to
 date
  
On 26 Apr 2015, at 22:34, Patrick Wendell pwend...@gmail.com
 wrote:
   
I actually don't totally see why we can't use Google Docs provided
 it
is clearly discoverable from the JIRA. It was my understanding that
many projects do this. Maybe not (?).
   
If it's a matter of maintaining public record on ASF
 infrastructure,
perhaps we can just automate that if an issue is closed we capture
 the
doc content and attach it to the JIRA as a PDF.
   
My sense is that in general the ASF infrastructure policy is
 becoming
more and more lenient with regards to using third party services,
 provided they are broadly accessible (such as a public google doc)
 and
can be definitively archived on ASF controlled storage.
   
- Patrick
   
On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen so...@cloudera.com
 wrote:
I know I recently used Google Docs from a JIRA, so am guilty as
charged. I don't think there are a lot of design docs in general,
 but
the ones I've seen have simply pushed docs to a JIRA. (I did the
 same,
mirroring PDFs of the Google Doc.) I don't think this is hard to
follow.
   
I think you can do what you like: make a JIRA and attach files.
 Make a
WIP PR and attach your notes. Make a Google Doc if you're feeling
transgressive.
   
I don't see much of a problem to solve here. In practice there are
plenty of workable options, all of which are mainstream, and so I
 do
not see an argument that somehow this is solved by letting people
 make
wikis.
   
On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal
punya.bis...@gmail.com wrote:
Okay, I can understand wanting to keep Git history clean, and
 avoid
bottlenecking on committers. Is it reasonable to establish a
   convention of
having a label, component or (best of all) an issue type for
 issues
   that are
associated with design docs? For example, if we used the existing
Brainstorming issue type, and people put their design doc

Re: Design docs: consolidation and discoverability

2015-04-24 Thread Punyashloka Biswal
Okay, I can understand wanting to keep Git history clean, and avoid
bottlenecking on committers. Is it reasonable to establish a convention of
having a label, component or (best of all) an issue type for issues that
are associated with design docs? For example, if we used the existing
Brainstorming issue type, and people put their design doc in the
description of the ticket, it would be relatively easy to figure out what
designs are in progress.

Given the push-back against design docs in Git or on the wiki and the
strong preference for keeping docs on ASF property, I'm a bit surprised
that all the existing design docs are on Google Docs. Perhaps Apache should
consider opening up parts of the wiki to a larger group, to better serve
this use case.

Punya

On Fri, Apr 24, 2015 at 5:01 PM Patrick Wendell pwend...@gmail.com wrote:

 Using our ASF git repository as a working area for design docs seems
 potentially concerning to me. It's difficult process-wise
 because all commits need to go through committers and also, we'd
 pollute our git history a lot with random incremental design updates.

 The git history is used a lot by downstream packagers, us during our
 QA process, etc... we really try to keep it oriented around code
 patches:

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=shortlog

 Committing a polished design doc along with a feature, maybe that's
 something we could consider. But I still think JIRA is the best
 location for these docs, consistent with what most other ASF projects
 do that I know.

 On Fri, Apr 24, 2015 at 1:19 PM, Cody Koeninger c...@koeninger.org
 wrote:
  Why can't pull requests be used for design docs in Git if people who
 aren't
  committers want to contribute changes (as opposed to just comments)?
 
  On Fri, Apr 24, 2015 at 2:57 PM, Sean Owen so...@cloudera.com wrote:
 
  Only catch there is it requires commit access to the repo. We need a
  way for people who aren't committers to write and collaborate (for
  point #1)
 
  On Fri, Apr 24, 2015 at 3:56 PM, Punyashloka Biswal
  punya.bis...@gmail.com wrote:
   Sandy, doesn't keeping (in-progress) design docs in Git satisfy the
  history
   requirement? Referring back to my Gradle example, it seems that
  
 
 https://github.com/gradle/gradle/commits/master/design-docs/build-comparison.md
   is a really good way to see why the design doc evolved the way it did.
  When
   keeping the doc in Jira (presumably as an attachment) it's not easy to
  see
   what changed between successive versions of the doc.
  
   Punya
  
   On Fri, Apr 24, 2015 at 3:53 PM Sandy Ryza sandy.r...@cloudera.com
  wrote:
  
   I think there are maybe two separate things we're talking about?
  
   1. Design discussions and in-progress design docs.
  
   My two cents are that JIRA is the best place for this.  It allows
  tracking
   the progression of a design across multiple PRs and contributors.  A
  piece
   of useful feedback that I've gotten in the past is to make design
 docs
   immutable.  When updating them in response to feedback, post a new
  version
   rather than editing the existing one.  This enables tracking the
  history of
   a design and makes it possible to read comments about previous
 designs
  in
   context.  Otherwise it's really difficult to understand why
 particular
   approaches were chosen or abandoned.
  
   2. Completed design docs for features that we've implemented.
  
   Perhaps less essential to project progress, but it would be really
  lovely
    to have a central repository of all the project's design docs.  If
 anyone
   wants to step up to maintain it, it would be cool to have a wiki page
  with
   links to all the final design docs posted on JIRA.
  
 



Re: Design docs: consolidation and discoverability

2015-04-24 Thread Punyashloka Biswal
The Gradle dev team keep their design documents  *checked into* their Git
repository -- see
https://github.com/gradle/gradle/blob/master/design-docs/build-comparison.md
for example. The advantages I see to their approach are:

   - design docs stay on ASF property (since Github is synced to the
   Apache-run Git repository)
   - design docs have a lifetime across PRs, but can still be modified and
   commented on through the mechanism of PRs
   - keeping a central location helps people to find good role models and
   converge on conventions

Sean, I find it hard to use the central Jira as a jumping-off point for
understanding ongoing design work because a tiny fraction of the tickets
actually relate to design docs, and it's not easy from the outside to
figure out which ones are relevant.

Punya

On Fri, Apr 24, 2015 at 2:49 PM Sean Owen so...@cloudera.com wrote:

 I think it's OK to have design discussions on github, as emails go to
 ASF lists. After all, loads of PR discussions happen there. It's easy
 for anyone to follow.

 I also would rather just discuss on Github, except for all that noise.

 It's not great to put discussions in something like Google Docs
 actually; the resulting doc needs to be pasted back to JIRA promptly
 if so. I suppose it's still better than a private conversation or not
 talking at all, but the principle is that one should be able to access
 any substantive decision or conversation by being tuned in to only the
 project systems of record -- mailing list, JIRA.



 On Fri, Apr 24, 2015 at 2:30 PM, Reynold Xin r...@databricks.com wrote:
  I'd love to see more design discussions consolidated in a single place as
  well. That said, there are many practical challenges to overcome. Some of
  them are out of our control:
 
  1. For large features, it is fairly common to open a PR for discussion,
  close the PR taking some feedback into account, and reopen another one.
 You
  sort of lose the discussions that way.
 
  2. With the way Jenkins is setup currently, Jenkins testing introduces a
 lot
  of noise to GitHub pull requests, making it hard to differentiate
 legitimate
  comments from noise. This is unfortunately due to the fact that ASF won't
  allow our Jenkins bot to have API privilege to post messages.
 
  3. The Apache Way is that all development discussions need to happen on
 ASF
  property, i.e. dev lists and JIRA. As a result, technically we are not
  allowed to have development discussions on GitHub.
 
 
  On Fri, Apr 24, 2015 at 7:09 AM, Cody Koeninger c...@koeninger.org
 wrote:
 
  My 2 cents - I'd rather see design docs in github pull requests (using
  plain text / markdown).  That doesn't require changing access or adding
  people, and github PRs already allow for conversation / email
  notifications.
 
  Conversation is already split between jira and github PRs.  Having a
 third
  stream of conversation in Google Docs just leads to things being
 ignored.
 
  On Fri, Apr 24, 2015 at 7:21 AM, Sean Owen so...@cloudera.com wrote:
 
   That would require giving wiki access to everyone or manually adding
   people
   any time they make a doc.
  
   I don't see how this helps though. They're still docs on the internet
   and
   they're still linked from the central project JIRA, which is what you
   should follow.
On Apr 24, 2015 8:14 AM, Punyashloka Biswal 
 punya.bis...@gmail.com
   wrote:
  
Dear Spark devs,
   
Right now, design docs are stored on Google docs and linked from
tickets.
For someone new to the project, it's hard to figure out what
 subjects
are
being discussed, what organization to follow for new feature
proposals,
etc.
   
Would it make sense to consolidate future design docs in either a
designated area on the Apache Confluence Wiki, or on GitHub's Wiki
pages?
If people have a strong preference to keep the design docs on Google
   Docs,
then could we have a top-level page on the confluence wiki that
 lists
all
active and archived design docs?
   
Punya
   
  
 
 



Re: Design docs: consolidation and discoverability

2015-04-24 Thread Punyashloka Biswal
Sandy, doesn't keeping (in-progress) design docs in Git satisfy the history
requirement? Referring back to my Gradle example, it seems that
https://github.com/gradle/gradle/commits/master/design-docs/build-comparison.md
is a really good way to see why the design doc evolved the way it did. When
keeping the doc in Jira (presumably as an attachment) it's not easy to see
what changed between successive versions of the doc.

Punya

On Fri, Apr 24, 2015 at 3:53 PM Sandy Ryza sandy.r...@cloudera.com wrote:

 I think there are maybe two separate things we're talking about?

 1. Design discussions and in-progress design docs.

 My two cents are that JIRA is the best place for this.  It allows tracking
 the progression of a design across multiple PRs and contributors.  A piece
 of useful feedback that I've gotten in the past is to make design docs
 immutable.  When updating them in response to feedback, post a new version
 rather than editing the existing one.  This enables tracking the history of
 a design and makes it possible to read comments about previous designs in
 context.  Otherwise it's really difficult to understand why particular
 approaches were chosen or abandoned.

 2. Completed design docs for features that we've implemented.

 Perhaps less essential to project progress, but it would be really lovely
  to have a central repository of all the project's design docs.  If anyone
 wants to step up to maintain it, it would be cool to have a wiki page with
 links to all the final design docs posted on JIRA.

 -Sandy

 On Fri, Apr 24, 2015 at 12:01 PM, Punyashloka Biswal 
 punya.bis...@gmail.com wrote:

 The Gradle dev team keep their design documents  *checked into* their Git


 repository -- see

 https://github.com/gradle/gradle/blob/master/design-docs/build-comparison.md
 for example. The advantages I see to their approach are:

- design docs stay on ASF property (since Github is synced to the
Apache-run Git repository)
- design docs have a lifetime across PRs, but can still be modified and


commented on through the mechanism of PRs

- keeping a central location helps people to find good role models and


converge on conventions

 Sean, I find it hard to use the central Jira as a jumping-off point for
 understanding ongoing design work because a tiny fraction of the tickets
 actually relate to design docs, and it's not easy from the outside to
 figure out which ones are relevant.

 Punya

 On Fri, Apr 24, 2015 at 2:49 PM Sean Owen so...@cloudera.com wrote:

  I think it's OK to have design discussions on github, as emails go to
  ASF lists. After all, loads of PR discussions happen there. It's easy
  for anyone to follow.
 
  I also would rather just discuss on Github, except for all that noise.
 
  It's not great to put discussions in something like Google Docs
  actually; the resulting doc needs to be pasted back to JIRA promptly
  if so. I suppose it's still better than a private conversation or not
  talking at all, but the principle is that one should be able to access
  any substantive decision or conversation by being tuned in to only the
  project systems of record -- mailing list, JIRA.
 
 
 
  On Fri, Apr 24, 2015 at 2:30 PM, Reynold Xin r...@databricks.com
 wrote:
   I'd love to see more design discussions consolidated in a single
 place as
   well. That said, there are many practical challenges to overcome.
 Some of
   them are out of our control:
  
   1. For large features, it is fairly common to open a PR for
 discussion,
   close the PR taking some feedback into account, and reopen another
 one.
  You
   sort of lose the discussions that way.
  
   2. With the way Jenkins is setup currently, Jenkins testing
 introduces a
  lot
   of noise to GitHub pull requests, making it hard to differentiate
  legitimate
   comments from noise. This is unfortunately due to the fact that ASF
 won't
   allow our Jenkins bot to have API privilege to post messages.
  
   3. The Apache Way is that all development discussions need to happen
 on
  ASF
   property, i.e. dev lists and JIRA. As a result, technically we are not
   allowed to have development discussions on GitHub.
  
  
   On Fri, Apr 24, 2015 at 7:09 AM, Cody Koeninger c...@koeninger.org
  wrote:
  
   My 2 cents - I'd rather see design docs in github pull requests
 (using
   plain text / markdown).  That doesn't require changing access or
 adding
   people, and github PRs already allow for conversation / email
   notifications.
  
   Conversation is already split between jira and github PRs.  Having a
  third
   stream of conversation in Google Docs just leads to things being
  ignored.
  
   On Fri, Apr 24, 2015 at 7:21 AM, Sean Owen so...@cloudera.com
 wrote:
  
That would require giving wiki access to everyone or manually
 adding
people
any time they make a doc.
   
I don't see how this helps though. They're still docs on the
 internet
and
they're still linked from the central project JIRA

Re: Graphical display of metrics on application UI page

2015-04-22 Thread Punyashloka Biswal
Thanks for the pointers! It looks like others are pretty active on this so
I'll comment on those PRs and try to coordinate before starting any new
work.

Punya
On Wed, Apr 22, 2015 at 2:49 AM Akhil Das ak...@sigmoidanalytics.com
wrote:

  There were some PRs about graphical representation with D3.js; you can
  find them on GitHub. Here are a few of them:
  https://github.com/apache/spark/pulls?utf8=%E2%9C%93&q=d3

 Thanks
 Best Regards

 On Wed, Apr 22, 2015 at 8:08 AM, Punyashloka Biswal 
 punya.bis...@gmail.com wrote:

 Dear Spark devs,

 Would people find it useful to have a graphical display of metrics (such
 as
 duration, GC time, etc) on the application UI page? Has anybody worked on
 this before?

 Punya





Graphical display of metrics on application UI page

2015-04-21 Thread Punyashloka Biswal
Dear Spark devs,

Would people find it useful to have a graphical display of metrics (such as
duration, GC time, etc) on the application UI page? Has anybody worked on
this before?

Punya


Re: [discuss] new Java friendly InputSource API

2015-04-21 Thread Punyashloka Biswal
Reynold, thanks for this! At Palantir we're heavy users of the Java APIs
and appreciate being able to stop hacking around with fake ClassTags :)

Regarding this specific proposal, is the contract of RecordReader#get
intended to be that it returns a fresh object each time? Or is it allowed
to mutate a fixed object and return a pointer to it each time?

Put another way, is a caller supposed to clone the output of get() if they
want to use it later?

Punya
On Tue, Apr 21, 2015 at 4:35 PM Reynold Xin r...@databricks.com wrote:

 I created a pull request last night for a new InputSource API that is
 essentially a stripped down version of the RDD API for providing data into
 Spark. Would be great to hear the community's feedback.

 Spark currently has two de facto input source API:
 1. RDD
 2. Hadoop MapReduce InputFormat

 Neither of the above is ideal:

 1. RDD: It is hard for Java developers to implement RDD, given the implicit
 class tags. In addition, the RDD API depends on Scala's runtime library,
 which does not preserve binary compatibility across Scala versions. If a
 developer chooses Java to implement an input source, it would be great if
 that input source can be binary compatible in years to come.

 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive.
 For example, it forces key-value semantics, and does not support running
 arbitrary code on the driver side (an example of why this is useful is
 broadcast). In addition, it is somewhat awkward to tell developers that in
 order to implement an input source for Spark, they should learn the Hadoop
 MapReduce API first.


 My patch creates a new InputSource interface, described by:

 - an array of InputPartition that specifies the data partitioning
 - a RecordReader that specifies how data on each partition can be read

 This interface is similar to Hadoop's InputFormat, except that there is no
 explicit key/value separation.


 JIRA ticket: https://issues.apache.org/jira/browse/SPARK-7025
 Pull request: https://github.com/apache/spark/pull/5603