Structured Streaming for tweets

2016-05-21 Thread singinpirate
Hi all, The co-founder of Databricks just demoed that we can stream tweets with Structured Streaming: https://youtu.be/9xSz0ppBtFg?t=16m42s but he didn't show how he did it. Does anyone know how to provide credentials to Structured Streaming?

Re: How to set the degree of parallelism in Spark SQL?

2016-05-21 Thread Ted Yu
Looks like an equal sign is missing between partitions and 200. On Sat, May 21, 2016 at 8:31 PM, SRK wrote: > Hi, > > How to set the degree of parallelism in Spark SQL? I am using the following > but it somehow seems to allocate only two executors at a time. > >
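
The corrected statement, as a minimal sketch (note that spark.sql.shuffle.partitions controls the number of shuffle partitions, not the number of executors):

    // Spark 1.x: set the number of shuffle partitions, with an equal sign
    // between the property name and its value.
    sqlContext.sql("SET spark.sql.shuffle.partitions=200")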

How to set the degree of parallelism in Spark SQL?

2016-05-21 Thread SRK
Hi, How to set the degree of parallelism in Spark SQL? I am using the following but it somehow seems to allocate only two executors at a time. sqlContext.sql(" set spark.sql.shuffle.partitions 200 ") Thanks, Swetha

Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-21 Thread Ovidiu-Cristian MARCU
Thank you, Amit! I was looking for this kind of information. I did not fully read your paper; I see in it a TODO with basically the same question(s) [1]. Maybe someone from the Spark team (including Databricks) will be so kind as to send some feedback. Best, Ovidiu [1] Integrate “Structured

Re: Wide Datasets (v1.6.1)

2016-05-21 Thread Don Drake
I was able to verify that similar exceptions occur in Spark 2.0.0-preview. I have created this JIRA: https://issues.apache.org/jira/browse/SPARK-15467 You mentioned using beans instead of case classes; do you have an example (or test case) that I can see? -Don On Fri, May 20, 2016 at 3:49 PM,
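
A minimal sketch of a bean-based Dataset, assuming a hypothetical Person bean rather than Don's actual (wide) schema:

    import org.apache.spark.sql.{Encoders, SQLContext}

    // Hypothetical Java-style bean: mutable fields with getters/setters and a
    // no-arg constructor, so Encoders.bean can derive the schema by reflection.
    class Person extends Serializable {
      @scala.beans.BeanProperty var name: String = _
      @scala.beans.BeanProperty var age: Int = _
    }

    // sqlContext is an existing SQLContext (or HiveContext).
    def toDataset(sqlContext: SQLContext, people: Seq[Person]) = {
      implicit val personEncoder = Encoders.bean(classOf[Person])
      sqlContext.createDataset(people)
    }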

Hive 2.0 on Spark 1.6.1 Engine

2016-05-21 Thread Mich Talebzadeh
Hi, I usually run Hive 2 on the Spark 1.3.1 engine (as opposed to using the default MR or TEZ). I tried to make Hive 2 work with TEZ 0.8.2 but that did not do much. Anyway I will try to make it work. Today I compiled Spark 1.6.1 from source excluding the Hadoop libraries. I did this once before for

Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-21 Thread Sela, Amit
It seems I forgot to add the link to the “Technical Vision” paper, so here it is - https://docs.google.com/document/d/1y4qlQinjjrusGWlgq-mYmbxRW2z7-_X5Xax-GG0YsC0/edit?usp=sharing From: "Sela, Amit" Date: Saturday, May 21, 2016 at 11:52 PM To:

Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-21 Thread Sela, Amit
This is a “Technical Vision” paper for the Spark runner, which provides general guidelines to the future development of Spark’s Beam support as part of the Apache Beam (incubating) project. This is our JIRA -

Does DataFrame have something like set hive.groupby.skewindata=true;

2016-05-21 Thread unk1102
Hi, I have a DataFrame with heavily skewed data on the order of TBs, and I am doing a groupBy on 8 fields which I unfortunately can't avoid. I am looking to optimize this. I have found that Hive has set hive.groupby.skewindata=true; I don't use Hive, I have a Spark DataFrame. Can we achieve the above in Spark? Please guide.
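
There is no direct DataFrame switch equivalent to hive.groupby.skewindata, but a common workaround is a two-stage ("salted") aggregation. A minimal sketch, assuming a single key column k and a numeric column value to sum (the real job groups on 8 fields, so this is only illustrative):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Stage 1: spread each key over N salt buckets and pre-aggregate.
    // Stage 2: drop the salt and combine the partial results per key.
    def skewedSum(df: DataFrame, salts: Int = 32): DataFrame = {
      df.withColumn("salt", (rand() * salts).cast("int"))
        .groupBy(col("k"), col("salt"))
        .agg(sum("value").as("partial_sum"))
        .groupBy(col("k"))
        .agg(sum("partial_sum").as("total"))
    }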

Re: Spark 2.0 - SQL Subqueries.

2016-05-21 Thread Reynold Xin
https://issues.apache.org/jira/browse/SPARK-15078 was just a bunch of test harness changes and added no new functionality. To reduce confusion, I just backported it into branch-2.0, so SPARK-15078 is now in 2.0 too. Can you paste a query you were testing? On Sat, May 21, 2016 at 10:49 AM, Kamalesh Nair

How to carry data streams over multiple batch intervals in Spark Streaming

2016-05-21 Thread Marco1982
Hi experts, I'm using Apache Spark Streaming 1.6.1 to write a Java application that joins two Key/Value data streams and writes the output to HDFS. The two data streams contain K/V strings and are periodically ingested in Spark from HDFS by using textFileStream(). The two data streams aren't
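
One way to carry data across batch intervals is a stateful DStream operation such as updateStateByKey. A minimal Scala sketch (the original application is Java, and the stream, types and checkpoint path here are hypothetical), keeping the latest value per key until its counterpart arrives in a later batch:

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    // Keep the most recent value seen for each key across batches.
    // Stateful operations require a checkpoint directory.
    def keepLatest(ssc: StreamingContext,
                   stream: DStream[(String, String)]): DStream[(String, String)] = {
      ssc.checkpoint("hdfs:///tmp/stream-checkpoint")  // hypothetical path
      stream.updateStateByKey[String] { (newValues: Seq[String], state: Option[String]) =>
        newValues.lastOption.orElse(state)             // carry the old value forward
      }
    }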

Spark 2.0 - SQL Subqueries.

2016-05-21 Thread Kamalesh Nair
Hi, From the Spark 2.0 release webinar, what I understood is that the newer version has significantly expanded the SQL capabilities of Spark, with the introduction of a new ANSI SQL parser and support for subqueries. It also says Spark 2.0 can run all 99 TPC-DS queries, which require many of
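
A minimal sketch of the kind of uncorrelated subquery the new parser handles (table and column names here are made up):

    // "orders" and "banned_customers" are hypothetical registered tables;
    // "spark" is the SparkSession available in the Spark 2.0 shell.
    val result = spark.sql(
      """SELECT o.order_id, o.amount
        |FROM orders o
        |WHERE o.customer_id NOT IN (SELECT customer_id FROM banned_customers)""".stripMargin)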

How to carry data streams over multiple batch intervals in Spark Streaming

2016-05-21 Thread Marco Platania
Hi experts, I'm using Apache Spark Streaming 1.6.1 to write a Java application that joins two Key/Value data streams and writes the output to HDFS. The two data streams contain K/V strings and are periodically ingested in Spark from HDFS by using textFileStream(). The two data streams aren't

Re: Spark Streaming S3 Error

2016-05-21 Thread Benjamin Kim
I got my answer. The way to access S3 has changed. val hadoopConf = sc.hadoopConfiguration hadoopConf.set("fs.s3a.access.key", accessKey) hadoopConf.set("fs.s3a.secret.key", secretKey) val lines = ssc.textFileStream("s3a://amg-events-out/") This worked. Cheers, Ben > On May 21, 2016, at
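
The same snippet laid out as a block for readability (accessKey and secretKey are supplied elsewhere):

    // Configure s3a credentials on the SparkContext's Hadoop configuration,
    // then stream text files from the bucket.
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", accessKey)
    hadoopConf.set("fs.s3a.secret.key", secretKey)
    val lines = ssc.textFileStream("s3a://amg-events-out/")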

Re: set spark 1.6 with Hive 0.14 ?

2016-05-21 Thread Sri
Thanks Ted, I know that works in the spark-shell; can we set the same in the spark-sql shell? If I don't set a Hive context, from my understanding Spark is using its own SQL and date functions, right? Like for example interval? Thanks Sri Sent from my iPhone > On 21 May 2016, at 08:19, Ted Yu

Re: set spark 1.6 with Hive 0.14 ?

2016-05-21 Thread Ted Yu
In spark-shell: scala> import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.hive.HiveContext scala> var hc: HiveContext = new HiveContext(sc) FYI On Sat, May 21, 2016 at 8:11 AM, Sri wrote: > Hi , > > You mean hive-site.xml file right ?,I did
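
The same steps as a block (sc is the SparkContext already available in the spark-shell):

    import org.apache.spark.sql.hive.HiveContext

    // Create a HiveContext on top of the existing SparkContext.
    var hc: HiveContext = new HiveContext(sc)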

Re: set spark 1.6 with Hive 0.14 ?

2016-05-21 Thread Sri
Hi, You mean the hive-site.xml file, right? I did place the hive-site.xml in the Spark conf, but I am not sure how certain Spark date functions like interval are still working. Hive 0.14 doesn't have the interval function, so how is Spark managing to do that? Does Spark have its own date functions? I am using
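
Spark does ship its own SQL parser and built-in date/interval expressions, so these work without Hive's versions. A minimal sketch, with a hypothetical events table and event_time timestamp column:

    // Interval arithmetic evaluated by Spark's own Catalyst expressions, not Hive.
    sqlContext.sql(
      "SELECT event_time, event_time + INTERVAL 1 DAY AS next_day FROM events")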

Re: Unit testing framework for Spark Jobs?

2016-05-21 Thread Lars Albertsson
Not that I can share, unfortunately. It is on my backlog to create a repository with examples, but I am currently a bit overloaded, so don't hold your breath. :-/ If you want to be notified when it happens, please follow me on Twitter or Google+. See web site below for links. Regards, Lars

Re: Spark Streaming S3 Error

2016-05-21 Thread Benjamin Kim
Ted, I only see 1 jets3t-0.9.0 jar in the classpath after running this to list the jars. val cl = ClassLoader.getSystemClassLoader cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println) /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/jets3t-0.9.0.jar I don’t know what else

Re: spark on yarn

2016-05-21 Thread Shushant Arora
3. And does the same behavior apply to streaming applications also? On Sat, May 21, 2016 at 7:44 PM, Shushant Arora wrote: > And will it allocate the rest of the executors when other containers get freed which > were occupied by other hadoop jobs/spark applications? > > And is

Re: spark on yarn

2016-05-21 Thread Shushant Arora
And will it allocate the rest of the executors when other containers get freed, which were occupied by other Hadoop jobs/Spark applications? And is there a minimum (% of executors demanded vs. available) it waits for to be freed, or does it just start with even 1? Thanks! On Thu, Apr 21, 2016 at 8:39 PM,
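
If the question is about picking up more executors later as YARN containers free up, dynamic allocation is the relevant knob. A hedged sketch of the settings (values are placeholders, not recommendations):

    import org.apache.spark.SparkConf

    // With dynamic allocation on YARN, the application can start with fewer
    // executors and request more as containers are released by other jobs.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")   // external shuffle service is required
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "50")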

Re: Spark Streaming S3 Error

2016-05-21 Thread Ted Yu
Maybe more than one version of jets3t-xx.jar was on the classpath. FYI On Fri, May 20, 2016 at 8:31 PM, Benjamin Kim wrote: > I am trying to stream files from an S3 bucket using CDH 5.7.0’s version of > Spark 1.6.0. It seems not to work. I keep getting this error. > >

Re: set spark 1.6 with Hive 0.14 ?

2016-05-21 Thread Jörn Franke
What is the motivation to use such an old version of Hive? This will lead to lower performance and other risks. > On 21 May 2016, at 01:57, "kali.tumm...@gmail.com" wrote: > > Hi All , > > Is there a way to ask spark and spark-sql to use Hive 0.14 version instead >

Re: set spark 1.6 with Hive 0.14 ?

2016-05-21 Thread Mich Talebzadeh
So you want to use Hive version 0.14 when using Spark 1.6? Go to directory $SPARK_HOME/conf and create a softlink to the hive-site.xml file: cd $SPARK_HOME hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6> cd conf hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ls -ltr lrwxrwxrwx 1

How to avoid empty unavoidable group by keys in DataFrame?

2016-05-21 Thread unk1102
Hi, I have a Spark job which does a groupBy, and I can't avoid it because of my use case. I have a large dataset, around 1 TB, which I need to process/update in a DataFrame. Now my job shuffles huge amounts of data and slows down because of the shuffling and groupBy. One reason I see is that my data is skewed; some of my group
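
A quick way to confirm the skew before reworking the job is to count rows per group key. A minimal sketch with a hypothetical single key column k:

    import org.apache.spark.sql.functions._

    // Rows per key, heaviest first; a handful of keys dominating confirms the skew.
    df.groupBy(col("k"))
      .count()
      .orderBy(desc("count"))
      .show(20)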