Avro/Parquet GenericFixed decimal is not read into Spark correctly

2017-04-12 Thread Justin Pihony
All, Before creating a JIRA for this I wanted to get a sense as to whether it would be shot down or not: Take the following code:
    spark-shell --packages org.apache.avro:avro:1.8.1
    import org.apache.avro.{Conversions, LogicalTypes, Schema}
    import java.math.BigDecimal
    val dc = new Conversions.Deci

Re: Design patterns involving Spark

2017-04-12 Thread Harish Butani
BTW, we now support OLAP functionality natively in Spark w/o the need for Druid, through our Spark native BI platform (SNAP): https://www.linkedin.com/pulse/integrated-business-intelligence-big-data-stacks-harish-butani - we provide SQL commands to: create star schema, create olap index, and inser

Re: Deploying Spark Applications. Best Practices And Patterns

2017-04-12 Thread Daniel Siegmann
On Wed, Apr 12, 2017 at 4:11 PM, Sam Elamin wrote: > > When it comes to scheduling Spark jobs, you can either submit to an > already running cluster using things like Oozie or bash scripts, or have a > workflow manager like Airflow or Data Pipeline to create new clusters for > you. We went down t

Deploying Spark Applications. Best Practices And Patterns

2017-04-12 Thread Sam Elamin
Hi All, Really useful information on this thread. We moved a bit off topic since the initial question was how to schedule spark jobs in AWS. I do think however that there are loads of great insights here within the community so I have renamed the subject to "Deploying Spark Applications. Best Prac

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread lucas.g...@gmail.com
"Building data products is a very different discipline from that of building software." That is a fundamentally incorrect assumption. There will always be a need for figuring out how to apply said principles, but saying 'we're different' has always turned out to be incorrect and I have seen no re

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Steve Loughran
On 12 Apr 2017, at 17:25, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: Hi, Your answer is like saying, I know how to code in assembly level language and I am going to build the next GUI in assembly level code and I think that there is a genuine functional requirement to see a col

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Gourav Sengupta
Hi, Your answer is like saying, I know how to code in assembly level language and I am going to build the next GUI in assembly level code and I think that there is a genuine functional requirement to see a color of a button in green on the screen. Perhaps it may be pertinent to read the first pre

Hive ::: how to select where conditions dynamically using CASE

2017-04-12 Thread nancy henry
Hi, let's say I have an employee table:
testtab1.empid  testtab1.empname  testtab1.joindate  testtab1.bonus
1               sirisha           15-06-2016         60
2               Arun              15-10-2016         20
3               divya             17-06-2016         80
4               rahul             16-01-2016         30
5               kokila            17-02-2016

Re: Optimisation Tips

2017-04-12 Thread Pushkar.Gujar
Not an expert, but the groupByKey operation is well known to cause a lot of shuffling, and usually the work done with groupByKey can be replaced by reduceByKey. Here is a great article on the groupByKey operation - https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_tran

Re: Optimisation Tips

2017-04-12 Thread KhajaAsmath Mohammed
Hi Steve, I have repartitioned the DataFrame to 1 partition. It helped performance, but not to a great extent. I am also looking for answers from the experts. Thanks, Asmath On Wed, Apr 12, 2017 at 9:45 AM, Steve Robinson <steve.robin...@aquilainsight.com> wrote: > Hi, > > > Does anyone hav

Optimisation Tips

2017-04-12 Thread Steve Robinson
Hi, Does anyone have any optimisation tips or could propose an alternative way to perform the below:
    val groupedUserItems1 = userItems1.groupByKey{_.customer_id}
    val groupedUserItems2 = userItems2.groupByKey{_.customer_id}
    groupedUserItems1.cogroup(groupedUserItems2){ case (_, userItems1, u

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-12 Thread Sam Elamin
Hi, to be honest there are a variety of options, but it all comes down to who will be querying these dashboards. If the end user is an engineer then the ELK stack is fine, and I can attest to the ease of use of Kibana since I used it quite heavily. On the other hand, in my experience it isn't the eng

Re: Any NLP library for sentiment analysis in Spark?

2017-04-12 Thread Alonso Isidoro Roman
I forked a project some time ago; maybe you can use it. https://github.com/alonsoir/SparkTwitterAnalyzer Alonso Isidoro Roman about.me/alonso.isidoro.roman

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Steve Loughran
On 11 Apr 2017, at 20:46, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: And once again JAVA programmers are trying to solve a data analytics and data warehousing problem using programming paradigms. It is genuinely a pain to see this happen. While I'm happy to be faulted for treati

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-12 Thread tencas
Hi Gaurav1809, I was thinking about using Elasticsearch + Kibana too (I actually don't know the differences between ELK and Elasticsearch). I was wondering about the pros and cons of using a document indexer vs a NoSQL database.

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-12 Thread Gaurav1809
Maybe you can ingest your data into ELK and use Kibana for live reporting. Of course there may be a better way of doing this. Waiting for others to share their opinions. Thanks.