How to push down a Parquet filter in a Spark 2.0.1 DataFrame

2017-03-30 Thread Rahul Nandi
Hi, I have around 2 million records as a Parquet file in S3. The file structure is somewhat like: id | data, with rows 1 | abc, 2 | cdf, 3 | fas. Now I want to filter and take the records where the id matches my required IDs. val requiredDataId = Array(1,2) // might go up to 100s of IDs.
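
A minimal sketch of one way to express this filter so it can be pushed down (the S3 path is a placeholder, and df is the DataFrame read from it; whether the resulting In filter actually reaches the Parquet reader depends on the Spark version, which explain() will show under PushedFilters):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("parquet-filter").getOrCreate()
    val requiredDataId = Array(1, 2) // might go up to 100s of IDs

    val df = spark.read.parquet("s3a://some-bucket/data.parquet") // placeholder path
    // isin produces a source-level In filter; check the physical plan's
    // PushedFilters section to confirm it reaches the Parquet scan.
    val filtered = df.filter(col("id").isin(requiredDataId: _*))
    filtered.explain()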

Parquet Filter Pushdown

2017-03-30 Thread Rahul Nandi
Hi, I have around 2 million records as a Parquet file in S3. The file structure is somewhat like: id | data, with rows 1 | abc, 2 | cdf, 3 | fas. Now I want to filter and take the records where the id matches my required IDs. val requiredDataId = Array(1,2) // might go up to 100s of IDs.

Predicate not getting pushed down to PrunedFilteredScan

2017-03-30 Thread Hanumath Rao Maduri
Hello All, I am working on creating a new PrunedFilteredScan operator which has the ability to execute the predicates pushed to this operator. However, what I observed is that if a column deep in the hierarchy is used, the predicate is not getting pushed down. SELECT tom._id, tom.address.city from
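
For reference, a minimal sketch of the API in question (Spark 2.x data source API; bodies elided). The filters argument only carries predicates Spark could translate into source Filters on top-level attributes, which matches the observation that a predicate on a nested field such as tom.address.city never arrives here:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.StructType

    class MyRelation(override val sqlContext: SQLContext) extends BaseRelation
        with PrunedFilteredScan {
      override def schema: StructType = ??? // schema of the external source
      // Only translatable predicates on top-level columns show up in `filters`
      // (EqualTo, GreaterThan, ...); nested-field predicates stay in Spark.
      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = ???
    }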

Re: dataframe filter, unable to bind variable

2017-03-30 Thread hosur narahari
Try lit(fromDate) and lit(toDate). You have to import org.apache.spark.sql.functions.lit to use it. On 31 Mar 2017 7:45 a.m., "shyla deshpande" wrote: The following works: df.filter($"createdate".between("2017-03-20", "2017-03-22")). I would like to pass variables
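
A minimal sketch of the suggestion, assuming df is the DataFrame from the question and the two variables hold date strings:

    import org.apache.spark.sql.functions.{col, lit}

    val fromDate = "2017-03-20"
    val toDate = "2017-03-22"
    // lit() lifts a plain Scala value into a Column literal, so variables can
    // be used anywhere the DataFrame API expects a Column.
    val result = df.filter(col("createdate").between(lit(fromDate), lit(toDate)))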

Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread Sathish Kumaran Vairavelu
Also, is it possible to cache the logical plan and parsed query so that they can be reused in subsequent executions? It would improve overall query performance, particularly in streaming jobs. On Thu, Mar 30, 2017 at 10:06 PM Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Hi Ayan, > > I
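
Spark 2.1 exposes no documented option for pinning plans, so any reuse is manual. A hedged sketch of a common workaround (complexSql and some_col are placeholders; assumes a SparkSession named spark): persist an intermediate result so downstream queries plan against a short scan of the in-memory relation rather than the full 2000-line tree.

    import org.apache.spark.sql.functions.col

    val wide = spark.sql(complexSql).persist() // complexSql: the big query text
    wide.count() // pays the large planning cost and materializes the cache once
    wide.filter(col("some_col") > 0).show() // later actions plan only a small tail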

Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread Sathish Kumaran Vairavelu
Hi Ayan, I have searched the Spark configuration options but couldn't find one for pinning execution plans in memory. Can you please help? Thanks, Sathish On Thu, Mar 30, 2017 at 9:30 PM ayan guha wrote: > I think there is an option of pinning execution plans in memory to avoid >

Re: Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread ayan guha
I think there is an option of pinning execution plans in memory to avoid such scenarios On Fri, Mar 31, 2017 at 1:25 PM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Hi Everyone, > > I have complex SQL with approx 2000 lines of code and works with 50+ > tables with 50+

Spark SQL 2.1 Complex SQL - Query Planning Issue

2017-03-30 Thread Sathish Kumaran Vairavelu
Hi Everyone, I have a complex SQL query of approximately 2000 lines that works with 50+ tables and 50+ left joins and transformations. All the tables are fully cached in memory with sufficient storage and working memory. The issue is that after the query is launched for execution, the query

dataframe filter, unable to bind variable

2017-03-30 Thread shyla deshpande
The following works: df.filter($"createdate".between("2017-03-20", "2017-03-22")). I would like to pass variables fromdate and todate to the filter instead of constants, but I am unable to get the syntax right. Please help. Thanks

Re: Will the setting for spark.default.parallelism be used for spark.sql.shuffle.output.partitions?

2017-03-30 Thread shyla deshpande
The spark version I am using is spark 2.1. On Thu, Mar 30, 2017 at 9:58 AM, shyla deshpande wrote: > Thanks >

Looking at EMR Logs

2017-03-30 Thread Paul Tremblay
I am looking for tips on evaluating my Spark job after it has run. I know that right now I can look at the history of jobs through the web UI, and I also know how to look at the resources currently being used through a similar web UI. However, I would like to look at the logs after the job is finished to
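
One way to keep a finished job inspectable is Spark's event log plus the history server; a sketch of the relevant spark-defaults.conf entries (the S3 path is a placeholder). On EMR, aggregated YARN container logs are a separate, complementary source.

    spark.eventLog.enabled           true
    spark.eventLog.dir               s3://my-bucket/spark-events
    spark.history.fs.logDirectory    s3://my-bucket/spark-events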

spark 2 and kafka consumer with ssl/kerberos

2017-03-30 Thread bilsch
Ok, forgive me if this ends up being a duplicate posting; I've emailed it twice and it never shows up! --- I'm working on a POC Spark job to pull data from a Kafka topic with Kerberos-enabled (required) brokers. The code seems to connect to Kafka and enter a polling mode. When I toss something
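
For context, a hedged sketch of the consumer parameters typically needed for Kerberized brokers with the 0.10 integration (broker address and group id are placeholders; the JAAS file is usually shipped with --files and referenced via java.security.auth.login.config in the driver and executor extraJavaOptions):

    import org.apache.kafka.common.serialization.StringDeserializer

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1.example.com:9093", // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "poc-consumer",                      // placeholder
      "security.protocol" -> "SASL_SSL",                 // SASL_PLAINTEXT if no TLS
      "sasl.kerberos.service.name" -> "kafka"
    )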

spark 2 and kafka consumer with ssl/kerberos

2017-03-30 Thread Bill Schwanitz
I'm working on a POC Spark job to pull data from a Kafka topic with Kerberos-enabled (required) brokers. The code seems to connect to Kafka and enter a polling mode. When I toss something onto the topic I get an exception which I just can't seem to figure out. Any ideas? I have a full gist up

Re: Spark streaming + kafka error with json library

2017-03-30 Thread Srikanth
Thanks for the tip. That worked. When would one use the assembly? On Wed, Mar 29, 2017 at 7:13 PM, Tathagata Das wrote: > Try depending on "spark-streaming-kafka-0-10_2.11" (not the assembly) > > On Wed, Mar 29, 2017 at 9:59 AM, Srikanth
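
For reference, the suggested dependency in sbt form (version shown is illustrative). The assembly artifact bundles Kafka and its transitive dependencies into one jar, which suits spark-submit --jars style deployment but invites version clashes when declared as a build dependency:

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"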

Re: httpclient conflict in spark

2017-03-30 Thread Arvind Kandaswamy
Hi Steve, I was indeed using Spark 2.1. I was getting this error while calling Spark via Zeppelin; Zeppelin apparently ships with an older version of httpclient. I copied httpclient 4.5.2 and httpclient 4.2.2 into zeppelin/interpreter/spark and this problem went away. Thank you for your help.

spark kafka consumer with kerberos

2017-03-30 Thread Bill Schwanitz
I'm working on a POC Spark job to pull data from a Kafka topic with Kerberos-enabled (required) brokers. The code seems to connect to Kafka and enter a polling mode. When I toss something onto the topic I get an exception which I just can't seem to figure out. Any ideas? I have a full gist up

Re: Why VectorUDT private?

2017-03-30 Thread Koert Kuipers
sorry meant to say: we know when we upgrade that we might run into minor inconveniences that are completely our own doing/fault. also, with yarn it has become really easy to run against an exact spark version of our choosing, since there is no longer such a thing as a centrally managed spark

Re: Why VectorUDT private?

2017-03-30 Thread Koert Kuipers
i agree with that. we work within that assumption. we compile and run against a single exact spark version. we know when we upgrade that we might run into minor inconveniences that are completely our own doing/fault. the trade off has been totally worth it to me. On Thu, Mar 30, 2017 at 1:20

Re: Why VectorUDT private?

2017-03-30 Thread Michael Armbrust
I think really the right way to think about things that are marked private is, "this may disappear or change in a future minor release". If you are okay with that, working around the visibility restrictions is reasonable. On Thu, Mar 30, 2017 at 5:52 AM, Koert Kuipers wrote:

Will the setting for spark.default.parallelism be used for spark.sql.shuffle.output.partitions?

2017-03-30 Thread shyla deshpande
Thanks

Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Pierce Lamb
SnappyData should work well for what you want: it deeply integrates an in-memory database with Spark, which supports ingesting streaming data and concurrently querying it from a dashboard. SnappyData currently has an integration with Apache Zeppelin (notebook visualization) and soon it will have

Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Alonso Isidoro Roman
You can check this link if you want: elastic, kibana and spark working together. Alonso Isidoro Roman, about.me/alonso.isidoro.roman

Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Miel Hostens
We're doing exactly the same thing over here! Spark + ELK. Kind regards, Miel Hostens, Buitenpraktijk+, Department of Reproduction, Obstetrics and Herd Health, Ambulatory Clinic, Faculty of Veterinary Medicine, Ghent University

Re: httpclient conflict in spark

2017-03-30 Thread Steve Loughran
On 29 Mar 2017, at 14:42, Arvind Kandaswamy wrote: Hello, I am getting the following error when trying to use AWS S3. This appears to be a conflict with httpclient. AWS S3 comes with httpclient-4.5.2.jar. I am not

Re: Need help for RDD/DF transformation.

2017-03-30 Thread Yong Zhang
Unfortunately, I don't think there is any optimized way to do this. Maybe someone else can correct me, but in theory there is no way other than a cartesian product of your 2 sides if you cannot change the data. Think about it: if you want to join between 2 different types (Array and Int in
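
If reshaping the data is acceptable (the case the message above explicitly excludes), one common alternative is to explode the array side into one row per element, turning the Array-vs-Int match into an ordinary equi-join; a sketch with hypothetical frames dfWithArray/dfWithInt and columns arr/id:

    import org.apache.spark.sql.functions.{col, explode}

    // one row per array element, then a plain equi-join instead of a cartesian
    val exploded = dfWithArray.withColumn("elem", explode(col("arr")))
    val joined = exploded.join(dfWithInt, col("elem") === col("id"))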

Re: apache-spark: Converting List of Rows into Dataset Java

2017-03-30 Thread Karin Valisova
Looks like the parallelization into an RDD was the right move I was omitting: JavaRDD<Row> jsonRDD = new JavaSparkContext(sparkSession.sparkContext()).parallelize(results); then I created a schema as List<StructField> fields = new ArrayList<>(); fields.add(DataTypes.createStructField("column_name1",

Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Szuromi Tamás
For us, after some Spark Streaming transformation, Elasticsearch + Kibana is a great combination to store and visualize data. An alternative solution that we use is to have Spark Streaming put some data back into Kafka, which we consume with nodejs. Cheers, Tamas 2017-03-30 9:25 GMT+02:00 Alonso Isidoro
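
A hedged sketch of the Elasticsearch side using the elasticsearch-hadoop connector (artifact org.elasticsearch:elasticsearch-spark-20_2.11; the node address and index/type name are placeholders):

    // requires the elasticsearch-spark connector on the classpath
    resultDF.write
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host:9200") // placeholder
      .mode("append")
      .save("dashboard/metrics")          // index/type, placeholder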

Re: Why VectorUDT private?

2017-03-30 Thread Koert Kuipers
I stopped asking long ago why things are private in Spark... I mean, the conversion between ml and mllib vectors is private... the conversion between a Spark vector and Breeze used to be (or still is?) private. It just goes on. Lots of useful stuff is private[sql]. Luckily there are simple
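
One of the simple workarounds alluded to here is package injection: compiling your own file inside an org.apache.spark package so that private[spark] members become visible. A sketch for VectorUDT, fragile by construction since it leans on non-public API and one exact Spark version:

    // in your own source tree, but declared inside Spark's package
    package org.apache.spark.ml.linalg

    object VectorUDTAccess {
      // VectorUDT is private[spark]; code compiled in this package can see it
      def udt: VectorUDT = new VectorUDT
    }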

Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Alonso Isidoro Roman
Read this first: http://www.oreilly.com/data/free/big-data-analytics-emerging-architecture.csp https://www.ijircce.com/upload/2015/august/97_A%20Study.pdf http://www.pentaho.com/assets/pdf/CqPxTROXtCpfoLrUi4Bj.pdf

Re: Need help for RDD/DF transformation.

2017-03-30 Thread Mungeol Heo
Hello ayan, The same key will not exist in different lists. That means if "1" exists in one list, it will not be present in another list. Thank you. On Thu, Mar 30, 2017 at 3:56 PM, ayan guha wrote: > Is it possible for one key in 2 groups in rdd2? > > [1,2,3] >

Re: Need help for RDD/DF transformation.

2017-03-30 Thread ayan guha
Is it possible for one key to be in 2 groups in rdd2? [1,2,3] [1,4,5] ? On Thu, 30 Mar 2017 at 12:23 pm, Mungeol Heo wrote: > Hello Yong, > > First of all, thank you for your attention. > Note that the values of elements, which have values at RDD/DF1, in the > same list will be

Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Gaurav Pandya
Hi Noorul, Thanks for the reply. But then how do we build the dashboard report? Don't we need to store the data somewhere? Please suggest. Thanks, Gaurav On Thu, Mar 30, 2017 at 10:32 AM, Noorul Islam Kamal Malmiyoda < noo...@noorul.com> wrote: > I think a better place would be an in-memory cache for