Re: is there a way to persist the lineages generated by spark?

2017-04-03 Thread ayan guha
How about storing the logical plans (or the `toDebugString` output, in the case of an RDD) to an external file on the driver? On Tue, Apr 4, 2017 at 1:19 PM, kant kodali wrote: > Hi All, > > I am wondering if there is a way to persist the lineages generated by spark > underneath? Some of our

is there a way to persist the lineages generated by spark?

2017-04-03 Thread kant kodali
Hi All, I am wondering if there is a way to persist the lineages generated by spark underneath? Some of our clients want us to prove that the result of the computation we are showing on a dashboard is correct, and for that, if we can show the lineage of transformations that were executed to get to

map transform on array in spark sql

2017-04-03 Thread Koert Kuipers
I have a DataFrame where one column has type: ArrayType(StructType(Seq( StructField("a", typeA, nullableA), StructField("b", typeB, nullableB) ))). I would like to map over this array to pick the first element in each struct, so the result should be an ArrayType(typeA, nullableA). I realize I
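
Outside Spark, the intended element-wise transform is easy to see on plain data. This is a minimal pure-Python sketch (the field names "a"/"b" and the sample rows are illustrative, mirroring the struct fields described above, not Spark API):

```python
# Each row holds an array of {"a": ..., "b": ...} structs; we want to keep
# only field "a" from every struct, turning array<struct<a,b>> into array<a>.
rows = [
    {"arr": [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]},
    {"arr": [{"a": 3, "b": "z"}]},
]

def project_a(arr):
    """Map over the array, picking field 'a' from each struct element."""
    return [elem["a"] for elem in arr]

projected = [project_a(row["arr"]) for row in rows]
print(projected)  # [[1, 2], [3]]
```

In Spark 2.x the usual way to express this per-element projection is a UDF that does exactly what `project_a` does here.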

Do we support excluding the CURRENT ROW in PARTITION BY windowing functions?

2017-04-03 Thread mathewwicks
Here is an example to illustrate my question. In this toy example, we are collecting a list of the other products that each user has bought, and appending it as a new column. (Also note that we are filtering on some arbitrary column 'good_bad'.) I would like to know if we support NOT
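
The desired semantics can be sketched without Spark. This pure-Python version (sample users and products are illustrative, and it assumes each user buys each product at most once) collects, for each purchase row, the other products the same user bought:

```python
from collections import defaultdict

# For each (user, product) row, build the per-user product list and then
# exclude the current row's product -- the effect an "exclude current row"
# window frame would have on collect_list.
purchases = [
    ("u1", "apples"), ("u1", "pears"), ("u1", "milk"),
    ("u2", "pears"),  ("u2", "bread"),
]

by_user = defaultdict(list)
for user, product in purchases:
    by_user[user].append(product)

other_products = [
    (user, product, [p for p in by_user[user] if p != product])
    for user, product in purchases
]
print(other_products[0])  # ('u1', 'apples', ['pears', 'milk'])
```

Spark SQL's window frame syntax has no EXCLUDE CURRENT ROW clause (as of 2.x), so a common workaround is to collect the full per-partition list and filter out the current row's value afterwards, which matches this sketch when values are unique within a partition.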

Re: Alternatives for dataframe collectAsList()

2017-04-03 Thread Paul Tremblay
What do you want to do with the results of the query? Henry On Wed, Mar 29, 2017 at 12:00 PM, szep.laszlo.it wrote: > Hi, > > after I created a dataset > > Dataset df = sqlContext.sql("query"); > > I need to get the result values, so I call a method: collectAsList() >

Re: Read file and represent rows as Vectors

2017-04-03 Thread Paul Tremblay
So if I am understanding your problem, you have the data in CSV files, but the CSV files are gzipped? If so, Spark can read a gzipped file directly. Sorry if I didn't understand your question. Henry On Mon, Apr 3, 2017 at 5:05 AM, Old-School wrote: > I have a
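
The point is that a gzipped CSV is ordinary text once decompressed, and Spark's text/CSV readers decompress `*.gz` inputs transparently. A small stand-alone demo of the same round trip (file path and sample rows are illustrative):

```python
import csv
import gzip
import os
import tempfile

# Write a tiny gzipped CSV, then read it back the same way Spark would:
# decompress transparently, then parse the text as CSV lines.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "data.csv.gz")

with gzip.open(path, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["1", "2", "1"], ["1", "39", "1"]])

with gzip.open(path, "rt") as f:
    rows = list(csv.reader(f))
print(rows)  # [['1', '2', '1'], ['1', '39', '1']]
```

With Spark, `spark.read.csv("data.csv.gz")` or `sc.textFile("data.csv.gz")` would handle the decompression step for you; note that a plain `.gz` file is not splittable, so it is read by a single task.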

_SUCCESS file validation on read

2017-04-03 Thread drewrobb
When writing a dataframe, a _SUCCESS file is created to mark that the entire dataframe has been written. However, the existence of this _SUCCESS file does not seem to be validated by default on reads. This could allow partially written dataframes to be read back in some cases. Is this behavior
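
One way to get this validation is a reader-side guard that checks for the marker before loading. A minimal sketch for a local filesystem path (function name is illustrative; on HDFS or S3 you would use the corresponding filesystem API for the existence check):

```python
import os
import tempfile

def assert_committed(output_dir):
    """Refuse to read a directory that lacks the _SUCCESS marker Spark's
    output committer writes when a job finishes successfully."""
    marker = os.path.join(output_dir, "_SUCCESS")
    if not os.path.exists(marker):
        raise IOError("missing _SUCCESS marker; output may be partial: %s"
                      % output_dir)
    return True

d = tempfile.mkdtemp()        # stands in for a dataframe output directory
try:
    assert_committed(d)       # no marker yet -> treated as partial output
    complete = True
except IOError:
    complete = False

open(os.path.join(d, "_SUCCESS"), "w").close()  # simulate a finished write
print(complete, assert_committed(d))  # False True
```

Calling such a guard before `spark.read.parquet(...)` (or similar) gives the read-side check that Spark itself does not perform by default.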

Re: Convert Dataframe to Dataset in pyspark

2017-04-03 Thread Michael Armbrust
You don't need encoders in python since it's all dynamically typed anyway. You can just do the following if you want the data as a string. sqlContext.read.text("/home/spark/1.6/lines").rdd.map(lambda row: row.value) 2017-04-01 5:36 GMT-07:00 Selvam Raman : > In Scala, > val ds

Pyspark - pickle.PicklingError: Can't pickle

2017-04-03 Thread Selvam Raman
I ran the below code in Standalone mode. Python version 2.7.6, Spacy 1.7+, Spark 2.0.1. I'm new to pyspark; please help me understand the two versions of code below: why does the first version run fine, whereas the second throws pickle.PicklingError: Can't pickle <function <lambda> at 0x107e39110>?
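
The error class is reproducible without Spark. Plain `pickle` serializes top-level functions by name, so a module-level `def` round-trips, while a lambda has no importable name and fails (a small sketch; note that PySpark's own serializer, cloudpickle, *can* pickle lambdas, so in PySpark the usual culprit is a non-picklable object captured inside the closure, such as a loaded model):

```python
import pickle

def double(x):
    """A named, module-level function: picklable by reference."""
    return x * 2

# Round-trips fine, because pickle stores it as "look up 'double' on import".
ok = pickle.loads(pickle.dumps(double))(21)  # 42

# A lambda has the name '<lambda>', which cannot be looked up on the module,
# so plain pickle raises an error much like the one quoted above.
try:
    pickle.dumps(lambda x: x * 2)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_picklable = False
print(ok, lambda_picklable)
```

When PySpark reports this error, the practical fix is usually to create the unpicklable resource inside the function that runs on the executor (e.g. per partition with `mapPartitions`) rather than capturing it from the driver.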

Executor unable to pick postgres driver in Spark standalone cluster

2017-04-03 Thread Rishikesh Teke
Hi all, I was submitting a Play application to a Spark 2.1 standalone cluster. The postgres dependency is added to the Play application, and the application works with local Spark libraries. But at run time on the standalone cluster it gives me the error: o.a.s.s.TaskSetManager - Lost task 0.0 in stage 0.0
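
A common cause is that the JDBC jar is on the driver's classpath (via the Play app's dependencies) but never shipped to the executors. One usual fix is to pass the jar explicitly at submit time (jar path, version, and master URL below are illustrative):

```shell
# Ship the PostgreSQL JDBC driver to both the driver and the executors.
# --jars distributes the jar to executors; --driver-class-path makes it
# visible to the driver's DriverManager as well.
spark-submit \
  --master spark://master-host:7077 \
  --jars /path/to/postgresql-9.4.1212.jar \
  --driver-class-path /path/to/postgresql-9.4.1212.jar \
  my-play-app.jar
```

Alternatively, `spark.jars` / `spark.executor.extraClassPath` can be set in `spark-defaults.conf` to the same effect.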

Read file and represent rows as Vectors

2017-04-03 Thread Old-School
I have a dataset that contains DocID, WordID and frequency (count) as shown below. Note that the first three numbers represent 1. the number of documents, 2. the number of words in the vocabulary and 3. the total number of words in the collection. 189 1430 12300 1 2 1 1 39 1 1 42 3 1 77 1 1 95 1
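
Before involving Spark, the file layout itself is easy to parse: a three-line header (number of documents, vocabulary size, total word count) followed by "DocID WordID count" triples. A pure-Python sketch using the numbers quoted above, accumulating one sparse vector (dict) per document:

```python
from collections import defaultdict

# Sample in the layout described above: 3-line header, then triples.
raw = """189
1430
12300
1 2 1
1 39 1
1 42 3
1 77 1
1 95 1
"""

lines = raw.strip().splitlines()
num_docs, vocab_size, total_count = (int(x) for x in lines[:3])

vectors = defaultdict(dict)  # DocID -> {WordID: count}, i.e. a sparse vector
for line in lines[3:]:
    doc_id, word_id, count = (int(x) for x in line.split())
    vectors[doc_id][word_id] = count

print(vocab_size, dict(vectors[1]))  # 1430 {2: 1, 39: 1, 42: 3, 77: 1, 95: 1}
```

In Spark the same grouping (triples keyed by DocID) maps naturally onto `groupByKey`/`reduceByKey`, with each document's dict becoming a `SparseVector` of dimension `vocab_size`.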

Re: Benchmarking streaming frameworks

2017-04-03 Thread Alonso Isidoro Roman
I remember that Yahoo did something similar: https://github.com/yahoo/streaming-benchmarks Alonso Isidoro Roman (about.me/alonso.isidoro.roman)

Re: Do we support excluding the current row in PARTITION BY windowing functions?

2017-04-03 Thread mathewwicks
Here is a Stack Overflow link: https://stackoverflow.com/questions/43180723/spark-sql-excluding-the-current-row-in-partition-by-windowing-functions

Re: Do we support excluding the current row in PARTITION BY windowing functions?

2017-04-03 Thread mathewwicks
I am not sure why, but the mailing list is saying: "This post has NOT been accepted by the mailing list yet". On Mon, 3 Apr 2017 at 20:52 mathewwicks [via Apache Spark User List] < ml-node+s1001560n28558...@n3.nabble.com> wrote: > Here is an example to illustrate my point. > > In this toy

Do we support excluding the current row in PARTITION BY windowing functions?

2017-04-03 Thread mathewwicks
Here is an example to illustrate my point. In this toy example, we are collecting a list of the other products that each user has bought, and appending it as a new column. (Also note that we are filtering on some arbitrary column 'good_bad'.) I would like to know if we support NOT including

Re: Does Apache Spark use any Dependency Injection framework?

2017-04-03 Thread Jacek Laskowski
Hi, Answering your question from the title (that seems different from what's in the email) and leaving the other part of how to do it using a DI framework to others. Spark does not use any DI framework internally and wires components itself. Jacek On 2 Apr 2017 3:29 p.m., "kant kodali"

Benchmarking streaming frameworks

2017-04-03 Thread gvdongen
Dear users of Streaming Technologies, As a PhD student in big data analytics, I am currently in the process of compiling a list of benchmarks (to test multiple streaming frameworks) in order to create an expanded benchmarking suite. The benchmark suite is being developed as a part of my current

Re: Graph Analytics on HBase with HGraphDB and Spark GraphFrames

2017-04-03 Thread Weiqing Yang
Thanks for sharing this. On Sun, Apr 2, 2017 at 7:08 PM, Irving Duran wrote: > Thanks for the share! > > Thank You, > > Irving Duran > > On Sun, Apr 2, 2017 at 7:19 PM, Felix Cheung wrote: >> Interesting!