WrappedArray to row of relational Db

2017-04-25 Thread vaibhavrtk
I have nested structure which i read from an xml using spark-Xml. I want to use spark sql to convert this nested structure to different relational tables (WrappedArray([WrappedArray([[null,592006340,null],null,BA,M,1724]),N,2017-04-05T16:31:03,586257528),659925562) which has a schema: StructType(

[ann] Release of TensorFrames 0.2.8

2017-04-25 Thread Tim Hunter
Hello all, I would like to bring to your attention the (long overdue) release of a new version of TensorFrames. Thank you to everyone who reported packaging and installation issues. This release fixes a large number of performance and stability problems, and brings a few improvements.

weird error message

2017-04-25 Thread Afshin, Bardia
I’m having issues when I fire up pyspark on a fresh install. When I submit the same process via spark-submit it works. Here’s a dump of the trace: at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109) at sun.reflect.NativeMethodAccessorImpl.invoke0(Nati

Re: Spark Testing Library Discussion

2017-04-25 Thread lucas.g...@gmail.com
Hi all, whoever (Sam, I think) was going to do some work on a template testing pipeline: I'd love to be involved. I have a current task in my day job (data engineer) to flesh out our testing how-to / best practices for Spark jobs, and I think I'll be doing something very similar for the next w

Re: Spark Testing Library Discussion

2017-04-25 Thread Holden Karau
Urgh, hangouts did something frustrating; updated link: https://hangouts.google.com/hangouts/_/ha6kusycp5fvzei2trhay4uhhqe On Mon, Apr 24, 2017 at 12:13 AM, Holden Karau wrote: > The (tentative) link for those interested is https://hangouts.google.com/hangouts/_/oyjvcnffejcjhi6qazf3lysypue . >

Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-25 Thread Dominik Safaric
Hi all, because the Spark Streaming direct Kafka consumer maps offsets for a given Kafka topic and partition internally while enable.auto.commit is set to false, how can I retrieve the offsets of each consumer poll call using the offset ranges of an RDD? More precisely, the informat

Re: how to find the nearest holiday

2017-04-25 Thread Zeming Yu
Still not working. Seems like there's some syntax error. from pyspark.sql.functions import udf start_date_test2.withColumn("diff", datediff(start_date_test2.start_date, start_date_test2.holiday.getItem[0])).show() ---TypeErro
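The likely culprit here is that `Column.getItem` is a method, so `getItem[0]` indexes the bound method itself rather than calling it; in PySpark the fix would be `start_date_test2.holiday.getItem(0)` (or `start_date_test2.holiday[0]`, which PySpark's Column also supports). A minimal stand-in class illustrates the difference without a Spark session:

```python
class FakeColumn:
    """Stand-in for a PySpark Column over an array value (illustration only)."""
    def __init__(self, values):
        self._values = values

    def getItem(self, i):
        return self._values[i]

col = FakeColumn(["2017-09-01", "2017-10-01"])
first = col.getItem(0)        # correct: call the method with parentheses
try:
    col.getItem[0]            # wrong: a bound method is not subscriptable
except TypeError:
    failed = True
```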

spark streaming resiliency

2017-04-25 Thread vincent gromakowski
Hi, I have a question regarding Spark Streaming resiliency, and the documentation is ambiguous: it says that the default configuration uses a replication factor of 2 for received data, but the recommendation is to use write-ahead logs to guarantee data resiliency with receivers. "Add
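For reference, enabling the write-ahead log for receivers is a single configuration key; a sketch of the relevant spark-defaults.conf fragment (checkpointing must also be configured in the application via `streamingContext.checkpoint(...)`):

```
# Persist all data received through receivers to the write-ahead log
# in the checkpoint directory before acknowledging it.
spark.streaming.receiver.writeAheadLog.enable  true
```

With the WAL enabled, the streaming guide suggests dropping in-memory replication (e.g. StorageLevel.MEMORY_AND_DISK_SER instead of the default replicated level), since the data is already persisted to fault-tolerant storage.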

Re: how to find the nearest holiday

2017-04-25 Thread Pushkar.Gujar
You can use start_date_test2.holiday.getItem[0]. I would highly suggest you look at the latest documentation: http://spark.apache.org/docs/latest/api/python/index.html Thank you, *Pushkar Gujar* On Tue, Apr 25, 2017 at 8:50 AM, Zeming Yu wrote: > How could I access the first element of

Re: Authorizations in thriftserver

2017-04-25 Thread vincent gromakowski
Does anyone have the answer? 2017-04-24 9:32 GMT+02:00 vincent gromakowski: > Hi, > Can someone confirm that authorizations aren't implemented in the Spark > thriftserver for SQL-standard-based Hive authorizations? > https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorizat

Re: how to find the nearest holiday

2017-04-25 Thread Zeming Yu
How could I access the first element of the holiday column? I tried the following code, but it doesn't work: start_date_test2.withColumn("diff", datediff(start_date_test2.start_date, start_date_test2.holiday*[0]*)).show() On Tue, Apr 25, 2017 at 10:20 PM, Zeming Yu wrote: > Got it working now!

Re: how to find the nearest holiday

2017-04-25 Thread Zeming Yu
Got it working now! Does anyone have a pyspark example of how to calculate the number of days to the nearest holiday based on an array column? I.e. from this table +--+---+ |start_date|holiday| +--+---+ |2017-

Re: pyspark vector

2017-04-25 Thread Nick Pentreath
Well, the 3 in this case is the size of the sparse vector. This equates to the number of features, which for CountVectorizer (I assume that's what you're using) is also the vocab size (number of unique terms). On Tue, 25 Apr 2017 at 04:06 Peyman Mohajerian wrote: > setVocabSize > > > On Mon, Apr 24,
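A plain-Python sketch of the relationship described above: CountVectorizer's output is a sparse vector whose declared size equals the vocabulary size, with non-zero counts only at the indices of terms present in the document. The `(size, indices, values)` triple mirrors Spark ML's SparseVector layout; the vocabulary here is made up for illustration.

```python
from collections import Counter

def count_vectorize(doc_tokens, vocab):
    """Return (size, indices, values) -- a sparse-vector triple
    where size is the vocabulary size, not the document length."""
    index = {term: i for i, term in enumerate(vocab)}
    counts = Counter(t for t in doc_tokens if t in index)
    indices = sorted(index[t] for t in counts)
    values = [counts[vocab[i]] for i in indices]
    return len(vocab), indices, values

vocab = ["spark", "kafka", "sql"]       # 3 unique terms -> vector size 3
size, idx, vals = count_vectorize(["spark", "sql", "spark"], vocab)
# size is 3 even though only 2 distinct terms occur in this document
```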

Re: how to find the nearest holiday

2017-04-25 Thread Wen Pei Yu
TypeError: unorderable types: str() >= datetime.date() You should convert the string to a Date type before comparing. Yu Wenpei. - Original message - From: Zeming Yu To: user Cc: Subject: how to find the nearest holiday Date: Tue, Apr 25, 2017 3:39 PM I have a column of dates (date type), just tryin

how to find the nearest holiday

2017-04-25 Thread Zeming Yu
I have a column of dates (date type) and am just trying to find the nearest holiday to each date. Does anyone have any idea what went wrong below? start_date_test = flight3.select("start_date").distinct() start_date_test.show() holidays = ['2017-09-01', '2017-10-01'] +--+ |start_date| +--+
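A sketch of the "days to nearest holiday" logic this thread is after, in plain Python; in PySpark the same function could be wrapped in a UDF applied to the date column (the function name is made up for illustration).

```python
from datetime import date, datetime

holidays = ["2017-09-01", "2017-10-01"]   # the list from the post

def days_to_nearest_holiday(start, holiday_strings):
    """Smallest absolute day gap between start and any holiday."""
    parsed = [datetime.strptime(h, "%Y-%m-%d").date()
              for h in holiday_strings]
    return min(abs((h - start).days) for h in parsed)

gap = days_to_nearest_holiday(date(2017, 8, 11), holidays)
# 2017-09-01 is 21 days after 2017-08-11, closer than 2017-10-01
```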