Hi Patrick,

I don't mean to be glib, but the fact that it works at all on my cluster (600 
nodes) and my data is a novel experience. This is the first release I haven't 
had to struggle with and then give up on entirely. I can, for example, finally 
use HiveContext from PySpark on CDH, at least to read data. 
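
For reference, the sort of thing that now works for me (the table name is 
made up; sc is the SparkContext the pyspark shell provides):

    from pyspark.sql import HiveContext

    sqlContext = HiveContext(sc)
    # Read an existing Hive table; this is the part that used to fail for me on CDH.
    srdd = sqlContext.sql("SELECT * FROM some_hive_table LIMIT 10")
    rows = srdd.collect()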

That said, there are plenty of opportunities for improvement in 1.3, not that 
you asked. *smile*:

I'm not seeing RDDs or SchemaRDDs cached in the Spark UI. The Storage page 
remains empty despite my calling cache(). 
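
A minimal sketch of what I mean, reusing the sqlContext above (the path and 
table name are made up):

    rdd = sc.textFile("/user/eric/some_big_input")   # made-up path
    rdd.cache()
    rdd.count()    # action, so the cache should be materialized

    srdd = sqlContext.sql("SELECT * FROM some_hive_table")
    srdd.cache()
    srdd.count()
    # At this point I would expect both to appear on the UI's Storage page,
    # but for me it stays empty.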

I think that accessing a directory of Parquet files still requires reading the 
schema from the footer of every file, which is painfully slow for terabytes 
of data. 
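
The access pattern is nothing exotic, roughly the following (the path is 
made up):

    # A directory containing many Parquet part files.
    events = sqlContext.parquetFile("/user/eric/events_parquet")
    events.registerTempTable("events")
    sqlContext.sql("SELECT count(*) FROM events").collect()
    # Just constructing `events` appears to read the footer of every part file.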

Exceptions are still often reflective of a symptom rather than a root cause. 
For example, I had a join that was blowing up, but it was variously reported as 
insufficient Kryo buffers and even as an AST error in the SQL parser. 
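
For anyone who hits the same Kryo buffer message, I believe these are the 
relevant settings in 1.2 (the sizes below are placeholders, not 
recommendations):

    from pyspark import SparkConf, SparkContext

    # Config names as I understand them for 1.2.
    conf = (SparkConf()
            .set("spark.kryoserializer.buffer.mb", "64")         # initial buffer size, MB
            .set("spark.kryoserializer.buffer.max.mb", "512"))   # upper bound, MB
    sc = SparkContext(conf=conf)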

Saving a SchemaRDD to a table in Hive doesn't work. I had to sneak it in by 
saving to a file and then creating an external table. 
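
Roughly what I mean, with made-up names, and assuming a Hive version that 
accepts STORED AS PARQUET:

    # What I had hoped would just work:
    #   srdd.saveAsTable("results")
    # The workaround: write Parquet files, then point an external table at them.
    srdd.saveAsParquetFile("/user/eric/results_parquet")   # made-up path
    sqlContext.sql("""
        CREATE EXTERNAL TABLE results (id INT, name STRING)
        STORED AS PARQUET
        LOCATION '/user/eric/results_parquet'
    """)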

In interactive work, it would be nice if I could interrupt the current job 
without killing the whole session. The lower latency potential of Sparrow is 
also very intriguing. 

Getting GraphX for PySpark would be very welcome. 

It's easy to find fault, of course. I do want to say again how grateful I am to 
have a usable release in 1.2 and look forward to 1.3 and beyond with real 
excitement. 

----
Eric Friedman

> On Dec 28, 2014, at 5:40 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> 
> Hey Eric,
> 
> I'm just curious - which specific features in 1.2 do you find most
> help with usability? This is a theme we're focusing on for 1.3 as
> well, so it's helpful to hear what makes a difference.
> 
> - Patrick
> 
> On Sun, Dec 28, 2014 at 1:36 AM, Eric Friedman
> <eric.d.fried...@gmail.com> wrote:
>> Hi Josh,
>> 
>> Thanks for the informative answer. Sounds like one should await your changes
>> in 1.3. For reference, I found the following set of options for doing this
>> kind of progress visualization in a notebook.
>> 
>> http://nbviewer.ipython.org/github/ipython/ipython/blob/3607712653c66d63e0d7f13f073bde8c0f209ba8/docs/examples/notebooks/Animations_and_Progress.ipynb
>> 
>> 
>> On Dec 27, 2014, at 4:07 PM, Josh Rosen <rosenvi...@gmail.com> wrote:
>> 
>> The console progress bars are implemented on top of a new stable "status
>> API" that was added in Spark 1.2.  It's possible to query job progress using
>> this interface (in older versions of Spark, you could implement a custom
>> SparkListener and maintain the counts of completed / running / failed tasks
>> / stages yourself).
>> 
>> There are actually several subtleties involved in implementing "job-level"
>> progress bars which behave in an intuitive way; there's a pretty extensive
>> discussion of the challenges at https://github.com/apache/spark/pull/3009.
>> Also, check out the pull request for the console progress bars for an
>> interesting design discussion around how they handle parallel stages:
>> https://github.com/apache/spark/pull/3029.
>> 
>> I'm not sure about the plumbing that would be necessary to display live
>> progress updates in the IPython notebook UI, though.  The general pattern
>> would probably involve a mapping to relate notebook cells to Spark jobs (you
>> can do this with job groups, I think), plus some periodic timer that polls
>> the driver for the status of the current job in order to update the progress
>> bar.
>> 
>> For Spark 1.3, I'm working on designing a REST interface to access this
>> type of job / stage / task progress information, as well as expanding the
>> types of information exposed through the stable status API interface.
>> 
>> - Josh
>> 
>> On Thu, Dec 25, 2014 at 10:01 AM, Eric Friedman <eric.d.fried...@gmail.com>
>> wrote:
>>> 
>>> Spark 1.2.0 is SO much more usable than previous releases -- many thanks
>>> to the team for this release.
>>> 
>>> A question about progress of actions.  I can see how things are
>>> progressing using the Spark UI.  I can also see the nice ASCII art animation
>>> on the spark driver console.
>>> 
>>> Has anyone come up with a way to accomplish something similar in an
>>> iPython notebook using pyspark?
>>> 
>>> Thanks
>>> Eric
>> 
>> 
