RE: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread hosur narahari
TensorFlow provides NLP implementations that use deep learning, but it's not distributed, so you could try to integrate Spark with TensorFlow. Best Regards, Hari On 11 Apr 2017 11:44 p.m., "Gabriel James" wrote: > Me too. Experiences and recommendations

Re: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Gaurav Pandya
Thanks guys. How about Stanford CoreNLP? Any reviews/feedback? Please share the details if anyone has used it in the past. On Tue, Apr 11, 2017 at 11:46 PM, wrote: > I think team used this awhile ago, but there was some tweak that needed to > be made to get it to

Re: [Spark-SQL] : Incremental load in Pyspark

2017-04-11 Thread Matt Deaver
It's pretty simple, really: run your processing job as often as you want during the week; then, when loading into the base table, apply a window function partitioned by the primary key(s) and ordered by the updated-time column, then delete the existing rows with those PKs and load that data. On Tue,
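
A minimal sketch of that merge step, assuming hypothetical registered tables base_table and staging_table with columns pk (primary key) and updated_at (row update time); not Matt's actual code:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("incremental-merge").getOrCreate()
    import spark.implicits._

    val base = spark.table("base_table")       // existing rows
    val updates = spark.table("staging_table") // freshly processed rows

    // Union old and new, then keep only the latest version of each primary key.
    val w = Window.partitionBy($"pk").orderBy($"updated_at".desc)
    val merged = base.union(updates)
      .withColumn("rn", row_number().over(w))
      .where($"rn" === 1)
      .drop("rn")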

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Sumona Routh
Hi Sam, I would absolutely be interested in reading a blog write-up of how you are doing this. We have pieced together a relatively decent pipeline ourselves in Jenkins, but have many kinks to work out. We also have some new requirements to start running side-by-side comparisons of different

Re: [Spark-SQL] : Incremental load in Pyspark

2017-04-11 Thread Vamsi Makkena
Hi Matt, Thanks for your reply. I will get updates regularly, but I only want to load the updated data once a week. A staging table may solve this issue, but I'm looking for how the row-updated-time column should be included in the query. Thanks On Tue, Apr 11, 2017 at 2:59 PM Matt Deaver
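
One way to include the update time in the query is to push the filter down to Oracle inside the JDBC read, so only rows changed since the last weekly run cross the wire. A sketch, assuming a hypothetical updated_at column and a watermark string kept from the previous run:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("incremental-read").getOrCreate()
    val lastLoad = "2017-04-04 00:00:00" // watermark persisted by the previous weekly run

    // Oracle evaluates the subquery's WHERE clause before Spark sees the data.
    val newRows = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
      .option("dbtable",
        s"(SELECT * FROM src_table WHERE updated_at > " +
        s"TO_TIMESTAMP('$lastLoad', 'YYYY-MM-DD HH24:MI:SS')) t")
      .option("user", "etl").option("password", "...")
      .load()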

Re: [Spark-SQL] : Incremental load in Pyspark

2017-04-11 Thread Matt Deaver
Do you have updates coming in on your data flow? If so, you will need a staging table and a merge process into your Teradata tables. If you do not have updated rows, i.e. your Teradata tables are append-only, you can process the data and insert (bulk load) it into Teradata. I don't have experience doing
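
For the append-only case, the bulk insert can go through Spark's generic JDBC writer. A sketch with placeholder URL and table names (the Teradata JDBC driver must be on the classpath), where processed stands for the DataFrame produced by the weekly job:

    // Append-only load: no merge step, just insert the new rows.
    processed.write
      .format("jdbc")
      .option("url", "jdbc:teradata://td-host/DATABASE=analytics")
      .option("dbtable", "weekly_facts")
      .option("user", "etl").option("password", "...")
      .mode("append")
      .save()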

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Gourav Sengupta
And once again Java programmers are trying to solve a data analytics and data warehousing problem using programming paradigms. It is genuinely a pain to see this happen. Regards, Gourav On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin wrote: > Hi Steve > > > Thanks for the

[Spark-SQL] : Incremental load in Pyspark

2017-04-11 Thread Vamsi Makkena
I am reading the data from Oracle tables and flat files (a new Excel file every week) and writing it to Teradata weekly using PySpark. In the initial run it will load all the data to Teradata. But in the later runs I just want to read the new records from Oracle and the flat files and want to append

Exception on Join with Spark2.1

2017-04-11 Thread Andrés Ivaldi
Hello, I'm using Spark embedded. So far with Spark 2.0.2 everything was OK; after updating Spark to 2.1.0, I'm having problems when joining Datasets. The queries are generated dynamically, but I have two Datasets, one with a window function and the other is the same Dataset before the application of the
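
The actual generated query is cut off above, so this is only a hypothetical reconstruction of the described shape, for anyone trying to reproduce it: a Dataset joined with a windowed derivative of itself (assumes a SparkSession in scope as spark):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import spark.implicits._

    val base = Seq(("a", 1L, 10), ("a", 2L, 20), ("b", 1L, 30)).toDF("key", "ts", "value")

    // The same Dataset, after applying a window function.
    val w = Window.partitionBy($"key").orderBy($"ts")
    val windowed = base.withColumn("rn", row_number().over(w))

    // Self-join of the original with its windowed derivative, the shape that
    // reportedly worked on 2.0.2 and breaks on 2.1.0.
    val joined = base.join(windowed, Seq("key"))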

Re: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Ian.Maloney
I think the team used this a while ago, but there was some tweak that needed to be made to get it to work. https://github.com/databricks/spark-corenlp

RE: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Gabriel James
Me too. Experiences and recommendations please. Gabriel

Re: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Kevin Wang
I am also interested in this topic. Anything else anyone can recommend? Thanks. Best, Kevin On Tue, Apr 11, 2017 at 5:00 AM, Alonso Isidoro Roman wrote: > i did not use it yet, but this library looks promising: > > https://github.com/databricks/spark-corenlp > > > Alonso

Feasability limits of joins in SparkSQL (Why does my driver explode with a large number of joins?)

2017-04-11 Thread Rick Moritz
Hi List, I'm currently trying to naively implement a Data-Vault-type Data-Warehouse using SparkSQL, and was wondering whether there's an inherent practical limit to query complexity, beyond which SparkSQL will stop functioning, even for relatively small amounts of data. I'm currently looking at
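
Not an answer from the thread, but one common mitigation when a long chain of joins overwhelms the driver is to truncate the query plan periodically; planning cost otherwise grows with every join. A sketch, with hypothetical Data-Vault table names:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("data-vault").getOrCreate()
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    val hub: DataFrame = spark.table("hub_customer")
    val satellites: Seq[DataFrame] = Seq("sat_address", "sat_contact").map(spark.table)

    var joined = hub
    for ((sat, i) <- satellites.zipWithIndex) {
      joined = joined.join(sat, Seq("hub_key"), "left_outer")
      // Dataset.checkpoint() (Spark >= 2.1) materializes the data and cuts the
      // logical plan, keeping driver-side analysis/optimization cost bounded.
      if ((i + 1) % 10 == 0) joined = joined.checkpoint()
    }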

Re: optimising storage and ec2 instances

2017-04-11 Thread Sam Elamin
Hi Zeming Yu, Steve Just to add, we are also going down the partitioning route, but you should know that if you are in AWS land you are most likely going to be using EMR at any given time. At the moment EMR does not do recursive search on wildcards; see this

Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-11 Thread Pierce Lamb
Hi, It is possible to use Mongo or Cassandra to persist results from Spark. In fact, a wide variety of data stores can be used with Spark, and many are aimed at serving queries for dashboard visualizations. I cannot comment on which work well with Grafana or Kibana; however, I've listed

Re: Dataframes na fill with empty list

2017-04-11 Thread Sumona Routh
For some reason my pasted screenshots were removed when I sent the email (at least that's how it appeared on my end). Repasting as text below. The sequence you are referring to represents the list of column names to fill. I am asking about filling a column which is of type list with an empty

Re: Dataframes na fill with empty list

2017-04-11 Thread Sumona Routh
The sequence you are referring to represents the list of column names to fill. I am asking about filling a column which is of type list (array) with an empty list. Here is a quick example of what I am doing: [the show and printSchema output for the collectList df was pasted as screenshots that did not come through] So, the last line which
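
For reference, na.fill only covers numeric and string columns, so one workaround (my assumption, not from this thread) is to coalesce the array column with an empty array literal:

    import org.apache.spark.sql.functions._
    import spark.implicits._

    // "tags" is a hypothetical array<string> column with some null rows.
    val df = Seq((1, Seq("a", "b")), (2, null)).toDF("id", "tags")

    // na.fill cannot target array columns; coalesce substitutes the empty list.
    val filled = df.withColumn("tags",
      coalesce($"tags", array().cast("array<string>")))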

Spark Streaming. Real-time save data and visualize on dashboard

2017-04-11 Thread tencas
I've developed an application using Apache Spark Streaming that reads simple info from plane sensors, like acceleration, via TCP sockets in JSON format, and analyses it. I'd like to be able to persist this info from each "flight" in real time, while it is shown on a responsive dashboard. I just
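
A minimal sketch of that shape (host, port, and output path are placeholders): read JSON lines from a TCP socket with Spark Streaming and append each micro-batch to storage a dashboard can query:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("flight-sensors")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    val lines = ssc.socketTextStream("sensor-host", 9999) // one JSON record per line
    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val batch = spark.read.json(rdd)                    // infer schema from JSON
        batch.write.mode("append").parquet("/data/flights") // or a Cassandra/Mongo sink
      }
    }
    ssc.start()
    ssc.awaitTermination()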

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Sam Elamin
Hi Steve Thanks for the detailed response. I think this problem doesn't have an industry-standard solution as of yet, and I am sure a lot of people would benefit from the discussion. I realise now what you are saying, so thanks for clarifying. That said, let me try and explain how we approached the

Re: optimising storage and ec2 instances

2017-04-11 Thread Zeming Yu
Regarding "everything works best if your sources are a few tens to hundreds of MB or more": are you referring to the size of the zip file or the individual unzipped files? Any issues with storing a 60 MB zipped file containing heaps of text files inside? On 11 Apr. 2017 9:09 pm, "Steve Loughran"

Re: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Alonso Isidoro Roman
I did not use it yet, but this library looks promising: https://github.com/databricks/spark-corenlp Alonso Isidoro Roman about.me/alonso.isidoro.roman 2017-04-11
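
A short usage sketch based on that project's README (the sample text is made up; it assumes a SparkSession in scope as spark and the CoreNLP English models jar on the classpath):

    import com.databricks.spark.corenlp.functions._
    import org.apache.spark.sql.functions.explode
    import spark.implicits._

    val docs = Seq((1, "Spark is great. The weather was terrible.")).toDF("id", "text")

    // Split each document into sentences, then score each sentence:
    // 0 (very negative) .. 4 (very positive).
    val scored = docs
      .select($"id", explode(ssplit($"text")).as("sentence"))
      .select($"id", $"sentence", sentiment($"sentence").as("sentiment"))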

Re: unit testing in spark

2017-04-11 Thread Elliot West
Jörn, I'm interested in your point on coverage. Coverage has been a useful tool for highlighting areas in the codebase that pose a source of potential risk. However, generally speaking, I've found that traditional coverage tools do not provide useful information when applied to distributed data

Re: optimising storage and ec2 instances

2017-04-11 Thread Steve Loughran
> On 11 Apr 2017, at 11:07, Zeming Yu wrote: > > Hi all, > > I'm a beginner with spark, and I'm wondering if someone could provide > guidance on the following 2 questions I have. > > Background: I have a data set growing by 6 TB p.a. I plan to use spark to > read in all

Re: unit testing in spark

2017-04-11 Thread Steve Loughran
(sorry, sent an empty reply by accident) Unit testing is one of the easiest ways to isolate problems in an internal class, things you can get wrong. But: time spent writing unit tests is time *not* spent writing integration tests, which biases me towards the integration. What I do find is
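
As a concrete anchor for the unit-test side, a minimal sketch (ScalaTest with a local-mode SparkSession; the transformation under test is illustrative, not from the thread):

    import org.apache.spark.sql.SparkSession
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    class TransformSuite extends FunSuite with BeforeAndAfterAll {
      @transient private var spark: SparkSession = _

      override def beforeAll(): Unit = {
        spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
      }
      override def afterAll(): Unit = spark.stop()

      test("keeps the latest row per key") {
        import spark.implicits._
        val input = Seq((1, "old", 1L), (1, "new", 2L)).toDF("pk", "value", "updated_at")
        // call the (hypothetical) transformation under test, then assert:
        // assert(dedupeLatest(input).count() === 1)
      }
    }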

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Steve Loughran
On 7 Apr 2017, at 18:40, Sam Elamin wrote: Definitely agree with Gourav there. I wouldn't want Jenkins to run my workflow. Seems to me that you would only be using Jenkins for its scheduling capabilities. Maybe I was just looking at

optimising storage and ec2 instances

2017-04-11 Thread Zeming Yu
Hi all, I'm a beginner with Spark, and I'm wondering if someone could provide guidance on the following two questions I have. Background: I have a data set growing by 6 TB p.a. I plan to use Spark to read in all the data, manipulate it and build a predictive model on it (say GBM). I plan to store
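
For the storage side of the question, a sketch of the partitioned-Parquet layout discussed later in this thread (the table name, date column, and bucket path are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("storage").getOrCreate()
    import spark.implicits._

    // Columnar, date-partitioned storage produces reasonably sized files and
    // lets later jobs prune to one slice instead of scanning the full 6 TB.
    val flights = spark.table("staged_flights") // hypothetical staged input
    flights.write
      .mode("append")
      .partitionBy("flight_date")
      .parquet("s3a://my-bucket/flights/")

    val oneDay = spark.read.parquet("s3a://my-bucket/flights/")
      .where($"flight_date" === "2017-04-11")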

Spark (SQL / Structured Streaming) Cassandra - PreparedStatement

2017-04-11 Thread Bastien DINE
Hi everyone, I'm using Spark Structured Streaming for machine learning purposes in real time, and I want to store predictions in my Cassandra cluster. Since I am in a streaming context, executing the same request multiple times per second, one mandatory optimization is to use
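
The subject line points at prepared statements; a sketch of one way to do that from Structured Streaming with a ForeachWriter and the DataStax Java driver (host, keyspace, table, and columns are placeholders, and in practice you would share or cache the session rather than build one per partition):

    import com.datastax.driver.core.{Cluster, PreparedStatement, Session}
    import org.apache.spark.sql.{ForeachWriter, Row}

    // Prepare the CQL once per partition and rebind it for every row, instead
    // of re-parsing the statement on each insert.
    class PredictionWriter extends ForeachWriter[Row] {
      @transient private var session: Session = _
      @transient private var stmt: PreparedStatement = _

      override def open(partitionId: Long, version: Long): Boolean = {
        session = Cluster.builder().addContactPoint("cassandra-host").build().connect("ml")
        stmt = session.prepare("INSERT INTO predictions (id, score) VALUES (?, ?)")
        true
      }
      override def process(row: Row): Unit =
        session.execute(stmt.bind(row.getAs[String]("id"), Double.box(row.getAs[Double]("score"))))
      override def close(errorOrNull: Throwable): Unit =
        if (session != null) session.getCluster.close()
    }

    // predictions is the streaming DataFrame of model outputs.
    predictions.writeStream.foreach(new PredictionWriter).start()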

Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Gaurav1809
Hi All, I need to determine the sentiment of a given document (statement, paragraph, etc.). Is there any NLP library available with Apache Spark that I can use here? Any other pointers towards this would be highly appreciated. Thanks in advance. Gaurav Pandya

Re: Dataframes na fill with empty list

2017-04-11 Thread Didac Gil
It does support it, at least in 2.0.2, which I am running. Here is one example:

    val parsedLines = stream_of_logs
      .map(line => p.parseRecord_viaCSVParser(line))
      .join(appsCateg, $"Application" === $"name", "left_outer")
      .drop("id")
      .na.fill(0, Seq("numeric_field1", "numeric_field2"))
      .na.fill("",

Re: Is checkpointing in Spark Streaming Synchronous or Asynchronous ?

2017-04-11 Thread kant kodali
Thank you so much! On Mon, Apr 10, 2017 at 2:47 PM, Tathagata Das wrote: > As of now (Spark 2.2), Structured Streaming checkpoints the state > data synchronously in every trigger. But the checkpointing is incremental, > so it won't be writing all your state