Spark Sparser library

2018-08-09 Thread umargeek
Hi Team, Please let me know which Sparser library to include when submitting a Spark application that uses the format below: val df = spark.read.format("edu.stanford.sparser.json") When I used this format, it threw a class not found exception. Thanks, Umar -- Sent from:

Re: How to validate orc vectorization is working within spark application?

2018-07-12 Thread umargeek
Hello Jorn, I am unable to post the entire code due to data-sharing restrictions. Use case: I perform aggregations after reading data from an HDFS file every minute, and I would like to understand how to do this with vectorization enabled and what the prerequisites are to successfully enable

Pyspark Structured Streaming Error

2018-07-12 Thread umargeek
Hi All, I am trying to test Structured Streaming using PySpark with the following command and packages: pyspark2 --master=yarn --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 --packages org.apache.kafka:kafka-clients:0.10.0.1 but I am getting the following error (in bold),
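
One thing that may be worth ruling out here (the error text is cut off in this listing, so this is only a guess): as far as I know, when `--packages` is given twice the launcher keeps only the last value, so the Kafka connector coordinate in the command above could be silently dropped. Passing a single comma-separated `--packages` value avoids that. The helper below is a hypothetical plain-Python sketch that assembles such a command line; `build_pyspark_command` is an illustrative name and not part of Spark.

```python
def build_pyspark_command(packages, master="yarn", executable="pyspark2"):
    """Return an argv list with every Maven coordinate in ONE --packages flag."""
    # Drop empty entries and surrounding whitespace before joining.
    coordinates = [p.strip() for p in packages if p.strip()]
    if not coordinates:
        raise ValueError("at least one package coordinate is required")
    # A single comma-separated value, instead of repeating the flag.
    return [executable, "--master=" + master, "--packages", ",".join(coordinates)]


# Coordinates copied from the post above.
command = build_pyspark_command([
    "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0",
    "org.apache.kafka:kafka-clients:0.10.0.1",
])
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run` or pasted into a shell; whether `kafka-clients` needs to be listed explicitly at all depends on the connector's own dependencies.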

Spark DF to Hive table with both Partition and Bucketing not working

2018-06-19 Thread umargeek
Hi Folks, I am reading a Spark DataFrame from an ORC file, adding two new columns, and then trying to save it to a Hive table with both partitioning and bucketing. I am using Spark 2.3 (as both partitioning and bucketing are available in this version). Looking for

Re: How to validate orc vectorization is working within spark application?

2018-06-19 Thread umargeek
Hi Folks, I just need a few pointers on the above query regarding vectorization. Looking forward to support from the community. Thanks, Umar -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: testing frameworks

2018-05-22 Thread umargeek
Hi Steve, you can try out the pytest-spark plugin if you're writing programs using PySpark. Please find the link below for reference. https://github.com/malexer/pytest-spark Thanks, Umar

Alternative for numpy in Spark MLlib

2018-05-22 Thread umargeek
Hi Folks, I am planning to rewrite one of my Python modules, written for entropy calculation using numpy, in Spark MLlib so that it can be processed in a distributed manner. Can you please advise on the feasibility of this approach or any alternatives? Thanks, Umar
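
For context, Shannon entropy decomposes into a count-per-value step followed by a small final reduction, which is the shape a distributed job would need. The sketch below is plain Python rather than Spark code, and it assumes the module computes Shannon entropy over discrete values (the post does not say which entropy is meant). The function names are hypothetical; the lists stand in for data partitions, and the merge step plays the role a group-and-count aggregation would play on a cluster.

```python
import math
from collections import Counter
from functools import reduce


def partition_counts(partition):
    """Count occurrences of each value within one partition (map side)."""
    return Counter(partition)


def merge_counts(left, right):
    """Combine two partial count tables (reduce side)."""
    return left + right


def entropy_from_counts(counts):
    """Shannon entropy in bits from a table of value counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def distributed_entropy(partitions):
    """Entropy over all partitions, touching each one only to count values."""
    merged = reduce(merge_counts, (partition_counts(p) for p in partitions), Counter())
    return entropy_from_counts(merged)


# Two stand-in "partitions" holding a 50/50 mix of two values.
print(distributed_entropy([["a", "b"], ["a", "b"]]))
```

Only the per-value counts travel between stages, so the final entropy step runs on a table whose size is the number of distinct values, not the number of rows.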

How to validate orc vectorization is working within spark application?

2018-05-22 Thread umargeek
Hi Folks, I have enabled the configurations listed below within my Spark Streaming application, but I did not gain any performance benefit even after setting these parameters. Can you please tell me whether there is a way to validate that vectorization is enabled and working as expected? Note: I am

Streaming Analytics/BI tool to connect Spark SQL

2017-12-07 Thread umargeek
Hi All, We are currently looking for real-time streaming analytics of data stored as Spark SQL tables. Is there any external connectivity available to connect with BI tools (Pentaho/Arcadia)? Currently, we are storing data in Hive tables, but the response on the Arcadia dashboard is slow.

Re: How to write dataframe to kafka topic in spark streaming application using pyspark other than collect?

2017-12-07 Thread umargeek
Hi Team, Can someone please advise me on the above post? Because of this issue, I have been writing the data file to an HDFS location. As of now I am just passing the filename into the Kafka topic and not using Kafka to its full potential. Looking forward to suggestions. Thanks, Umar

Suggestions on using scala/python for Spark Streaming

2017-10-26 Thread umargeek
We are building a Spark Streaming application which is process- and time-intensive. We currently use the Python API, but we are looking for suggestions on whether to use Scala over Python, including pros and cons, as we are planning a production setup as the next step. Thanks, Umar

How to write dataframe to kafka topic in spark streaming application using pyspark?

2017-09-25 Thread umargeek
Can anyone provide me with a code snippet or steps to write a DataFrame to a Kafka topic in a Spark Streaming application using PySpark with Spark 2.1.1 and Kafka 0.8 (Direct Stream approach)? Thanks, Umar
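
One common pattern for this kind of question is to send rows from the executors, one producer per partition, instead of collecting to the driver. The sketch below is a hedged illustration in plain Python, not a tested Spark 2.1.1 recipe. `send_partition` and `make_producer` are hypothetical names, and the producer is assumed to expose `send(topic, value)`, `flush()` and `close()`, which is the shape of kafka-python's `KafkaProducer`. Later Spark releases also ship a built-in Kafka sink for DataFrames, which may be simpler where an upgrade is possible.

```python
import json


def send_partition(rows, make_producer, topic):
    """Send every row of one partition to a Kafka topic; return the row count.

    rows          -- iterable of dicts (for example Row.asDict() results)
    make_producer -- zero-argument callable that builds a producer; it is
                     called here, on the executor, because producer objects
                     generally cannot be pickled and shipped from the driver
    topic         -- name of the target topic
    """
    producer = make_producer()
    sent = 0
    try:
        for row in rows:
            # Serialize each row as UTF-8 JSON; sort_keys keeps output stable.
            payload = json.dumps(row, sort_keys=True).encode("utf-8")
            producer.send(topic, payload)
            sent += 1
        # Block until buffered messages are handed to the broker.
        producer.flush()
    finally:
        producer.close()
    return sent


# Hypothetical wiring inside a Spark job (not executed here):
#   df.rdd.map(lambda r: r.asDict()) \
#         .foreachPartition(lambda rows: send_partition(rows, make_producer, "my-topic"))
```

Creating the producer once per partition, rather than once per row, keeps connection overhead low while still avoiding `collect()`.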