Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Georg Heiler
Looking forward to the blog post. Thanks for pointing me to some of the simpler classes. Nick Pentreath wrote on Fri, Nov 18, 2016 at 02:53: > @Holden look forward to the blog post - I think a user guide PR based on > it would also be super useful :) > > > On Fri,

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Nick Pentreath
@Holden look forward to the blog post - I think a user guide PR based on it would also be super useful :) On Fri, 18 Nov 2016 at 05:29 Holden Karau wrote: > I've been working on a blog post around this and hope to have it published > early next month  > > On Nov 17,

Re: [build system] massive jenkins infrastructure changes forthcoming

2016-11-17 Thread Reynold Xin
Thanks for the heads-up, Shane. On Thu, Nov 17, 2016 at 2:33 PM, shane knapp wrote: > TL;DR: amplab is becoming riselab, and is much more C++ oriented. > centos 6 is so far behind, and i'm already having to roll C++ > compilers and various libraries by hand. centos 7 is

[build system] massive jenkins infrastructure changes forthcoming

2016-11-17 Thread shane knapp
TL;DR: amplab is becoming riselab, and is much more C++ oriented. centos 6 is so far behind, and i'm already having to roll C++ compilers and various libraries by hand. centos 7 is an absolute no-go, so we'll be moving the jenkins workers over to a recent (TBD) version of ubuntu server. also,

Re: Green dot in web UI DAG visualization

2016-11-17 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK-18495 On Thu, Nov 17, 2016 at 12:23 PM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Nice catch Suhas, and thanks for the reference. Sounds like we need a > tweak to the UI so this little feature is self-documenting. > > Will file a JIRA,

Re: issues with github pull request notification emails missing

2016-11-17 Thread Xiao Li
Just FYI, normally, when we ping someone, GitHub shows the full name after we type the GitHub id. Below is an example: [image: inline image 2] Starting from last week, Reynold's full name is not shown. Did GitHub update their hash functions? [image: inline image 1] Thanks, Xiao Li 2016-11-16

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Holden Karau
I've been working on a blog post around this and hope to have it published early next month  On Nov 17, 2016 10:16 PM, "Joseph Bradley" wrote: Hi Georg, It's true we need better documentation for this. I'd recommend checking out simple algorithms within Spark for

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Joseph Bradley
Hi Georg, It's true we need better documentation for this. I'd recommend checking out simple algorithms within Spark for examples: ml.feature.Tokenizer ml.regression.IsotonicRegression You should not need to put your library in Spark's namespace. The shared Params in SPARK-7146 are not

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-17 Thread Reynold Xin
Adding a new data type is an enormous undertaking and very invasive. I don't think it is worth it in this case given there are clear, simple workarounds. On Thu, Nov 17, 2016 at 12:24 PM, kant kodali wrote: > Can we have a JSONType for Spark SQL? > > On Wed, Nov 16, 2016 at

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-17 Thread kant kodali
Can we have a JSONType for Spark SQL? On Wed, Nov 16, 2016 at 8:41 PM, Nathan Lande wrote: > If you are dealing with a bunch of different schemas in 1 field, figuring > out a strategy to deal with that will depend on your data and does not > really have anything to do

Jackson Spark/app incompatibility and how to resolve it

2016-11-17 Thread Michael Allman
Hello, I'm running into an issue with a Spark app I'm building, which depends on a library which depends on Jackson 2.8, which fails at runtime because Spark brings in Jackson 2.6. I'm looking for a solution. As a workaround, I've patched our build of Spark to use Jackson 2.8. That's working,

Re: Green dot in web UI DAG visualization

2016-11-17 Thread Nicholas Chammas
Nice catch Suhas, and thanks for the reference. Sounds like we need a tweak to the UI so this little feature is self-documenting. Will file a JIRA, unless someone else wants to take this one and file the JIRA themselves. On Thu, Nov 17, 2016 at 12:21 PM Suhas Gaddam

Re: Green dot in web UI DAG visualization

2016-11-17 Thread Suhas Gaddam
"Second, one of the RDDs is cached in the first stage (denoted by the green highlight). Since the enclosing operation involves reading from HDFS, caching this RDD means future computations on this RDD can access at least a subset of the original file from memory instead of from HDFS." from

Re: Green dot in web UI DAG visualization

2016-11-17 Thread Reynold Xin
Ha funny. Never noticed that. On Thursday, November 17, 2016, Nicholas Chammas wrote: > Hmm... somehow the image didn't show up. > > How about now? > > [image: Screen Shot 2016-11-17 at 11.57.14 AM.png] > > On Thu, Nov 17, 2016 at 12:14 PM Herman van Hövell tot

Re: Green dot in web UI DAG visualization

2016-11-17 Thread Nicholas Chammas
Hmm... somehow the image didn't show up. How about now? [image: Screen Shot 2016-11-17 at 11.57.14 AM.png] On Thu, Nov 17, 2016 at 12:14 PM Herman van Hövell tot Westerflier < hvanhov...@databricks.com> wrote: Should I be able to see something? On Thu, Nov 17, 2016 at 9:10 AM, Nicholas

Re: Green dot in web UI DAG visualization

2016-11-17 Thread Herman van Hövell tot Westerflier
Should I be able to see something? On Thu, Nov 17, 2016 at 9:10 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Some questions about this DAG visualization: > > [image: Screen Shot 2016-11-17 at 11.57.14 AM.png] > > 1. What's the meaning of the green dot? > 2. Should this be

Green dot in web UI DAG visualization

2016-11-17 Thread Nicholas Chammas
Some questions about this DAG visualization: [image: Screen Shot 2016-11-17 at 11.57.14 AM.png] 1. What's the meaning of the green dot? 2. Should this be documented anywhere (if it isn't already)? Preferably a tooltip or something directly in the UI would explain the significance. Nick

Re: structured streaming and window functions

2016-11-17 Thread HENSLEE, AUSTIN L
Forgive the slight tangent… For anyone following this thread who may be wondering about a quick, simple solution they can apply (and a walk-through on how) for more straightforward sessionization needs: There’s a nice section on sessionization in “Advanced Analytics with Spark”, by Ryza,

Fwd: SparkILoop doesn't run

2016-11-17 Thread Mohit Jaggi
I am trying to use SparkILoop to write some tests (shown below) but the test hangs with the following stack trace. Any idea what is going on? import org.apache.log4j.{Level, LogManager} import org.apache.spark.repl.SparkILoop import org.scalatest.{BeforeAndAfterAll, FunSuite} class SparkReplSpec

Re: structured streaming and window functions

2016-11-17 Thread Ofir Manor
I agree with you. I think that once we have sessionization, we could aim for richer processing capabilities per session. As far as I imagine it, a session is an ordered sequence of data on which we could apply computation (like CEP). Ofir Manor Co-Founder & CTO | Equalum Mobile:

RE: structured streaming and window functions

2016-11-17 Thread assaf.mendelson
It is true that this is sessionizing, but I brought it up as an example of finding an ordered pattern in the data. In general, using a simple window (e.g. 24 hours) in structured streaming is explained in the grouping by time and is very clear. What I was trying to figure out is how to do streaming of

Re: structured streaming and window functions

2016-11-17 Thread Ofir Manor
Assaf, I think what you are describing is actually sessionizing, by user, where a session is ended by a successful login event. On each session, you want to count the number of failed login events. If so, this is tracked by https://issues.apache.org/jira/browse/SPARK-10816 (not started yet) Ofir

RE: structured streaming and window functions

2016-11-17 Thread assaf.mendelson
Is there a plan to support SQL window functions? I will give an example of use: Let’s say we have login logs. What we want to do is, for each user, attach to each successful login the number of failed logins that preceded it. How would you do it with structured streaming? As this is currently not
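
A minimal sketch of the login example above, in plain Python rather than any Spark API. It assumes each log record is a (user, timestamp, succeeded) tuple and that the failure count resets after a successful login; the function name and record layout are hypothetical and only illustrate the ordered, per-user computation being asked about.

```python
from collections import defaultdict

def failed_before_success(events):
    """Return (user, timestamp, failed_count) for every successful login.

    events: iterable of (user, timestamp, succeeded) tuples, in any order.
    failed_count is the number of failed logins by that user since the
    user's previous successful login (assumed reset rule).
    """
    failures = defaultdict(int)  # per-user count of failures since last success
    out = []
    # Process events in timestamp order so the per-user counters are meaningful.
    for user, ts, succeeded in sorted(events, key=lambda e: e[1]):
        if succeeded:
            out.append((user, ts, failures[user]))
            failures[user] = 0
        else:
            failures[user] += 1
    return out

events = [
    ("alice", 1, False),
    ("alice", 2, False),
    ("bob", 3, True),
    ("alice", 4, True),
    ("bob", 5, False),
    ("alice", 6, True),
]
result = failed_before_success(events)
print(result)
```

The per-user counter is the only state carried between events, which is why the problem resembles sessionization: each successful login closes one "session" of failures.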

Re: structured streaming and window functions

2016-11-17 Thread Herman van Hövell tot Westerflier
What kind of window functions are we talking about? Structured streaming only supports time window aggregates, not the more general sql window function (sum(x) over (partition by ... order by ...)) aggregates. The basic idea is that you use incremental aggregation and store the aggregation buffer
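
To illustrate the incremental aggregation idea mentioned above, here is a rough sketch in plain Python, not Spark's implementation. It assumes tumbling (non-overlapping) time windows, a running sum as the aggregate, and records shaped as (timestamp_seconds, key, value); the class name and record layout are hypothetical.

```python
from collections import defaultdict

class TumblingWindowSum:
    """Keeps one running sum (the aggregation buffer) per (window_start, key)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.state = defaultdict(float)  # (window_start, key) -> running sum

    def update(self, batch):
        """Fold a new batch of (timestamp, key, value) records into the state."""
        for ts, key, value in batch:
            # Align the timestamp to the start of its window.
            window_start = ts - (ts % self.window_seconds)
            self.state[(window_start, key)] += value
        # Return a snapshot of all aggregates seen so far.
        return dict(self.state)

agg = TumblingWindowSum(window_seconds=60)
agg.update([(5, "a", 1.0), (59, "a", 2.0), (61, "a", 4.0)])
# A later batch may still contain records for an earlier window.
snapshot = agg.update([(10, "a", 3.0), (70, "b", 1.0)])
print(snapshot)
```

Only the buffers are stored between batches, never the raw rows, which is why this works for sums and counts but not for general window functions that need the ordered rows themselves.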

Re: Another Interesting Question on SPARK SQL

2016-11-17 Thread Herman van Hövell tot Westerflier
The diagram you have included is a depiction of the steps Catalyst (the Spark optimizer) takes to create an executable plan. Tungsten mainly comes into play during code generation and the actual execution. A datasource is represented by a LogicalRelation during analysis & optimization. The spark

Another Interesting Question on SPARK SQL

2016-11-17 Thread kant kodali
Which parts in the diagram above are executed by DataSource connectors and which parts are executed by Tungsten? Or, to put it another way, in which phase in the diagram above does Tungsten leverage the DataSource connectors (such as, say, the Cassandra connector)? My understanding so far is that

structured streaming and window functions

2016-11-17 Thread assaf.mendelson
Hi, I have been trying to figure out how structured streaming handles window functions efficiently. The portion I understand is that whenever new data arrives, it is grouped by the time and the aggregated data is added to the state. However, unlike operations like sum etc. window functions need