Avro/Parquet GenericFixed decimal is not read into Spark correctly

2017-04-12 Thread Justin Pihony
All, before creating a JIRA for this I wanted to get a sense as to whether it would be shot down or not. Take the following code:
    spark-shell --packages org.apache.avro:avro:1.8.1
    import org.apache.avro.{Conversions, LogicalTypes, Schema}
    import java.math.BigDecimal
    val dc = new

Re: Design patterns involving Spark

2017-04-12 Thread Harish Butani
BTW, we now support OLAP functionality natively in Spark, without the need for Druid, through our Spark-native BI platform (SNAP): https://www.linkedin.com/pulse/integrated-business-intelligence-big-data-stacks-harish-butani - we provide SQL commands to: create star schema, create OLAP index, and

Deploying Spark Applications. Best Practices And Patterns

2017-04-12 Thread Sam Elamin
Hi all, really useful information on this thread. We moved a bit off topic, since the initial question was how to schedule Spark jobs in AWS. I do think, however, that there are loads of great insights here within the community, so I have renamed the subject to "Deploying Spark Applications. Best

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread lucas.g...@gmail.com
"Building data products is a very different discipline from that of building software." That is a fundamentally incorrect assumption. There will always be a need for figuring out how to apply said principles, but saying 'we're different' has always turned out to be incorrect and I have seen no

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Steve Loughran
On 12 Apr 2017, at 17:25, Gourav Sengupta wrote: Hi, Your answer is like saying, I know how to code in assembly level language and I am going to build the next GUI in assembly level code and I think that there is a genuine

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Gourav Sengupta
Hi, Your answer is like saying, I know how to code in assembly level language and I am going to build the next GUI in assembly level code and I think that there is a genuine functional requirement to see a color of a button in green on the screen. Perhaps it may be pertinent to read the first

Hive ::: how to select where conditions dynamically using CASE

2017-04-12 Thread nancy henry
Hi, let's say I have an employee table:
    testtab1.empid  testtab1.empname  testtab1.joindate  testtab1.bonus
    1               sirisha           15-06-2016         60
    2               Arun              15-10-2016         20
    3               divya             17-06-2016         80
    4               rahul             16-01-2016         30
    5               kokila            17-02-2016

Re: Optimisation Tips

2017-04-12 Thread Pushkar.Gujar
Not an expert, but the groupByKey operation is well known to cause a lot of shuffling, and the work done with groupByKey can usually be replaced by reduceByKey. Here is a great article on the groupByKey operation -
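
To make the shuffling point concrete, here is a minimal sketch in plain Scala (no Spark dependency). It is only a simulation under stated assumptions, not Spark's actual implementation: the object `ShuffleSketch`, its method names, and the sample data are all hypothetical. It counts how many records would have to leave each input partition when values are grouped first, versus when each partition is pre-aggregated per key before anything is sent.

```scala
// Plain-Scala simulation (assumption: NOT Spark internals) of why a
// reduceByKey-style aggregation usually moves less data than grouping first.
object ShuffleSketch {
  type Record = (String, Int)

  // groupByKey-style: every record crosses the shuffle unchanged.
  def shuffledByGroupByKey(partitions: Seq[Seq[Record]]): Int =
    partitions.map(_.size).sum

  // Pre-aggregate one partition: sum the values of each key locally.
  def mapSideCombine(partition: Seq[Record]): Map[String, Int] =
    partition.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // reduceByKey-style: at most one record per distinct key leaves a partition.
  def shuffledByReduceByKey(partitions: Seq[Seq[Record]]): Int =
    partitions.map(p => mapSideCombine(p).size).sum

  // The final per-key sums are the same whichever route is taken.
  def totals(partitions: Seq[Seq[Record]]): Map[String, Int] =
    mapSideCombine(partitions.flatten)

  // Hypothetical sample data: two input partitions of (key, value) pairs.
  val sample: Seq[Seq[Record]] = Seq(
    Seq("a" -> 1, "a" -> 2, "b" -> 3),
    Seq("a" -> 4, "b" -> 5, "b" -> 6)
  )
}
```

With the sample data, comparing `ShuffleSketch.shuffledByGroupByKey(ShuffleSketch.sample)` against `ShuffleSketch.shuffledByReduceByKey(ShuffleSketch.sample)` shows fewer records leaving the partitions in the pre-aggregated case, while `ShuffleSketch.totals` is unaffected. The saving only applies when the per-key operation is an associative reduction; if every individual value is genuinely needed on the receiving side, pre-aggregation does not help.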

Re: Optimisation Tips

2017-04-12 Thread KhajaAsmath Mohammed
Hi Steve, I have implemented repartitioning on the DataFrame to 1. It helped the performance, but not to a great extent. I am also looking for answers from the experts. Thanks, Asmath On Wed, Apr 12, 2017 at 9:45 AM, Steve Robinson <steve.robin...@aquilainsight.com> wrote: > Hi, > Does anyone

Optimisation Tips

2017-04-12 Thread Steve Robinson
Hi, Does anyone have any optimisation tips or could propose an alternative way to perform the below:
    val groupedUserItems1 = userItems1.groupByKey{_.customer_id}
    val groupedUserItems2 = userItems2.groupByKey{_.customer_id}
    groupedUserItems1.cogroup(groupedUserItems2){ case (_, userItems1,

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-12 Thread Sam Elamin
Hi, to be honest there are a variety of options, but it all comes down to who will be querying these dashboards. If the end user is an engineer then the ELK stack is fine, and I can attest to the ease of use of Kibana since I used it quite heavily. On the other hand, in my experience it isn't the

Re: Any NLP library for sentiment analysis in Spark?

2017-04-12 Thread Alonso Isidoro Roman
I forked a project some time ago; maybe you can use it. https://github.com/alonsoir/SparkTwitterAnalyzer Alonso Isidoro Roman about.me/alonso.isidoro.roman

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Steve Loughran
On 11 Apr 2017, at 20:46, Gourav Sengupta wrote: And once again Java programmers are trying to solve a data analytics and data warehousing problem using programming paradigms. It is genuinely a pain to see this happen. While I'm

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-12 Thread tencas
Hi Gaurav1809, I was thinking about using Elasticsearch + Kibana too (I actually don't know the differences between ELK and Elasticsearch). I was wondering about the pros and cons of using a document indexer vs a NoSQL database.

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-12 Thread Gaurav1809
Maybe you can ingest your data into ELK and use Kibana for live reporting. Of course there may be a better way of doing this. Waiting for others to share their opinion. Thanks.

Re: Any NLP library for sentiment analysis in Spark?

2017-04-12 Thread Jayant Shekhar
Hello Gaurav, Yes, Stanford CoreNLP is of course great to use too! You can find sample code here and pull the UDFs into your project: https://github.com/sparkflows/sparkflows-stanfordcorenlp Thanks, Jayant On Tue, Apr 11, 2017 at 8:44 PM, Gaurav Pandya wrote: >

Re: Any NLP library for sentiment analysis in Spark?

2017-04-12 Thread Georg Heiler
I upgraded some dependencies here https://github.com/geoHeil/spark-corenlp and currently use it for a university project. I would also be interested in better libraries for Spark. Tokenization and lemmatization work fine. Regards Georg hosur narahari wrote on Wed, 12 Apr.