Re: Spark Testing Library Discussion

2017-04-28 Thread lucas.g...@gmail.com
Awesome, thanks. Just reading your post. A few observations: 1) You're giving out Marius's email: "I have been lucky enough to build this pipeline with the amazing Marius Feteanu". A LinkedIn or GitHub link might be more helpful. 2) "If you are in Pyspark world sadly Holden's test base won't work

Could anyone please tell me why this takes forever to finish?

2017-04-28 Thread Yuan Fang
object SparkPi {
  private val logger = Logger(this.getClass)
  val sparkConf = new SparkConf()
    .setAppName("Spark Pi")
    .setMaster("spark://10.100.103.192:7077")
  lazy val sc = new SparkContext(sparkConf)
  sc.addJar("/Users/yfang/workspace/mcs/target/scala-2.11/root-assembly-0.1.0.jar")

Re: "java.lang.IllegalStateException: There is no space for new record" in GraphFrames

2017-04-28 Thread Felix Cheung
Can you allocate more memory to the executor? Also, please open an issue with GraphFrames on its GitHub. From: rok Sent: Friday, April 28, 2017 1:42:33 AM To: user@spark.apache.org Subject: "java.lang.IllegalStateException: There is no space for new
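Felix's first suggestion can be tried at submit time. A minimal sketch, assuming a standalone submit; the memory sizes and job file name below are placeholders, not recommendations:

```
# Placeholder values — tune to your dataset and cluster.
spark-submit \
  --executor-memory 8g \
  --driver-memory 4g \
  my_graphframes_job.py
```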

Re: how to create List in pyspark

2017-04-28 Thread Felix Cheung
Why not use the SQL functions explode and split? They would perform better and be more stable than a UDF. From: Yanbo Liang Sent: Thursday, April 27, 2017 7:34:54 AM To: Selvam Raman Cc: user Subject: Re: how to create List in pyspark You can try with UDF, like
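For readers unfamiliar with the suggestion, here is a pure-Python sketch (no Spark needed) of what `explode(split(col, ","))` does to a column; the sample rows are invented for illustration:

```python
# Each row holds a key and a comma-separated string, as in
# F.explode(F.split(df.value, ",")) from pyspark.sql.functions.
rows = [("a", "x,y"), ("b", "z")]

exploded = [
    (key, item)
    for key, value in rows
    for item in value.split(",")  # split -> array, explode -> one row per element
]

print(exploded)  # [('a', 'x'), ('a', 'y'), ('b', 'z')]
```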

Re: Securing Spark Job on Cluster

2017-04-28 Thread Mark Hamstra
spark.local.dir http://spark.apache.org/docs/latest/configuration.html On Fri, Apr 28, 2017 at 8:51 AM, Shashi Vishwakarma < shashi.vish...@gmail.com> wrote: > Yes, I am using HDFS. Just trying to understand a couple of points. > > There would be two kinds of encryption which would be required. > >
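Mark's pointer maps to a one-line setting. A sketch, assuming the cluster provides an encrypted volume; the mount path below is a placeholder:

```
# spark-defaults.conf — point scratch/spill space at an encrypted mount
spark.local.dir              /encrypted/mnt/spark-tmp
# Spark 2.1+ can additionally encrypt shuffle and spill files itself:
spark.io.encryption.enabled  true
```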

Re: Exactly-once semantics with Kafka CanCommitOffsets.commitAsync?

2017-04-28 Thread Cody Koeninger
It's asynchronous. If your job stopped before the commit happened, then of course it's not guaranteed to succeed. But even if those commits were somehow guaranteed to succeed even if your job stopped... you still need idempotent output operations. The point of transactionality isn't that it's
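Cody's point about idempotent output operations can be shown without Spark at all. A plain-Python sketch: with at-least-once delivery, a batch replayed after a failed offset commit leaves an idempotent (keyed-upsert) store unchanged; the record names are invented for illustration:

```python
# A keyed store where writing the same record twice has no extra effect.
store = {}

def write_idempotent(records):
    for key, value in records:
        store[key] = value  # upsert by key, not append

batch = [("order-1", 100), ("order-2", 250)]
write_idempotent(batch)  # first attempt; suppose the offset commit then fails
write_idempotent(batch)  # replay of the same batch after restart
print(store)  # {'order-1': 100, 'order-2': 250}
```

Had the write been an append instead of an upsert, the replay would have doubled the output, which is exactly why commitAsync alone cannot give exactly-once.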

Re: Securing Spark Job on Cluster

2017-04-28 Thread Shashi Vishwakarma
Yes, I am using HDFS. Just trying to understand a couple of points. There are two kinds of encryption which would be required. 1. Data in Motion - This could be achieved by enabling SSL -

Re: Exactly-once semantics with Kafka CanCommitOffsets.commitAsync?

2017-04-28 Thread David Rosenstrauch
Yes, I saw that sentence too. But it's rather short and not very explanatory, and there doesn't seem to be any further info available anywhere that expands on it. When I parse out that sentence: 1) "Kafka is not transactional" - i.e., the commits are done asynchronously, not synchronously. 2)

Re: Securing Spark Job on Cluster

2017-04-28 Thread Jörn Franke
Why don't you use whole disk encryption? Are you using HDFS? > On 28. Apr 2017, at 16:57, Shashi Vishwakarma > wrote: > > Agreed Jorn. Disk encryption is one option that will help to secure data but > how do I know at which location Spark is spilling temp file,

Re: Exactly-once semantics with Kafka CanCommitOffsets.commitAsync?

2017-04-28 Thread Cody Koeninger
From that doc: "However, Kafka is not transactional, so your outputs must still be idempotent." On Fri, Apr 28, 2017 at 10:29 AM, David Rosenstrauch wrote: > I'm doing a POC to test recovery with spark streaming from Kafka. I'm using > the technique for storing the

Spark user list seems to be rejecting/ignoring my emails from other subscribed address

2017-04-28 Thread David Rosenstrauch
I've been subscribed to the user@spark.apache.org list at another email address since 2014. That address receives all emails sent to the list without a problem. But for some reason any emails that *I* send to the list from that address get ignored or rejected. (Similarly, any emails I send to

Exactly-once semantics with Kafka CanCommitOffsets.commitAsync?

2017-04-28 Thread David Rosenstrauch
I'm doing a POC to test recovery with spark streaming from Kafka. I'm using the technique for storing the offsets in Kafka, as described at: https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html#kafka-itself I.e., grabbing the list of offsets before I start processing a

Re: removing columns from file

2017-04-28 Thread Anubhav Agarwal
Are you using Spark's textFiles method? If so, go through this blog: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 Anubhav On Mon, Apr 24, 2017 at 12:48 PM, Afshin, Bardia < bardia.afs...@capitalone.com> wrote: > Hi there, > > > > I have a process that downloads

Re: Securing Spark Job on Cluster

2017-04-28 Thread Shashi Vishwakarma
Agreed, Jörn. Disk encryption is one option that will help to secure data, but how do I know at which locations Spark is spilling temp files, shuffle data and application data? Thanks Shashi On Fri, Apr 28, 2017 at 3:54 PM, Jörn Franke wrote: > You can use disk encryption as

Re: Securing Spark Job on Cluster

2017-04-28 Thread Jörn Franke
You can use disk encryption as provided by the operating system. Additionally, you may think about shredding disks after they are not used anymore. > On 28. Apr 2017, at 14:45, Shashi Vishwakarma > wrote: > > Hi All > > I was dealing with one the spark requirement

Re: Securing Spark Job on Cluster

2017-04-28 Thread Shashi Vishwakarma
Kerberos is not an Apache project. Kerberos provides a way to do authentication but does not provide data security. On Fri, Apr 28, 2017 at 3:24 PM, veera satya nv Dantuluri < dvsnva...@gmail.com> wrote: > Hi Shashi, > > Based on your requirement for securing data, we can use Apache Kerberos, or >

Re: Securing Spark Job on Cluster

2017-04-28 Thread veera satya nv Dantuluri
Hi Shashi, Based on your requirement for securing data, we can use Apache Kerberos, or we could use the security feature in Spark. > On Apr 28, 2017, at 8:45 AM, Shashi Vishwakarma > wrote: > > Hi All > > I was dealing with one the spark requirement here where

Securing Spark Job on Cluster

2017-04-28 Thread Shashi Vishwakarma
Hi All I was dealing with one Spark requirement here where a client (like a banking client, where security is a major concern) needs all Spark processing to happen securely. For example, all communication happening between Spark client and server (driver & executor communication) should be on

Re: Has anyone used CoreNLP from stanford for sentiment analysis in Spark? It does not work as desired for me.

2017-04-28 Thread u...@moosheimer.com
A really good one is vaderSentiment (https://github.com/cjhutto/vaderSentiment). Regards, Kay-Uwe Moosheimer. On 28.04.2017 at 12:24, Alonso Isidoro Roman wrote: > I forked some time ago a twitter analyzer, but i think the best is to > provide the original link >

Re: Has anyone used CoreNLP from stanford for sentiment analysis in Spark? It does not work as desired for me.

2017-04-28 Thread Alonso Isidoro Roman
Some time ago I forked a Twitter analyzer, but I think the best is to provide the original link. If you want, you can take a look at my fork. Regards, Alonso Isidoro Roman [image:

"java.lang.IllegalStateException: There is no space for new record" in GraphFrames

2017-04-28 Thread rok
When running the connectedComponents algorithm in GraphFrames on a sufficiently large dataset, I get the following error I have not encountered before: 17/04/20 20:35:26 WARN TaskSetManager: Lost task 3.0 in stage 101.0 (TID 53644, 172.19.1.206, executor 40): java.lang.IllegalStateException:

Has anyone used CoreNLP from stanford for sentiment analysis in Spark? It does not work as desired for me.

2017-04-28 Thread Gaurav1809
Has anyone used CoreNLP from Stanford for sentiment analysis in Spark? It is not working as desired, or maybe I need to do some work which I am not aware of. Following is the example. 1) I look forward to interacting with kids of states governed by the congress. - POSITIVE 2) I look forward to

Re: How to create SparkSession using SparkConf?

2017-04-28 Thread madhu phatak
SparkSession.builder.config() takes a SparkConf as a parameter. You can use that to pass your SparkConf as it is. https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/SparkSession.Builder.html#config(org.apache.spark.SparkConf) On Fri, Apr 28, 2017 at 11:40 AM, Yanbo Liang

Synonym handling replacement issue with UDF in Apache Spark

2017-04-28 Thread Nishanth
I am facing a major issue with replacement of synonyms in my DataSet. I am trying to replace the synonyms of the brand names with their equivalent names. I have tried 2 methods to solve this issue. Method 1 (regexp_replace): Here I am using the regexp_replace method. Hashtable manufacturerNames = new
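The mapping logic behind the regexp_replace approach can be sketched in plain Python (not Spark); the brand names and synonym table below are invented for illustration:

```python
import re

# Hypothetical synonym map: variant brand name -> canonical name.
synonyms = {"Panasonic Corp": "Panasonic", "Sony Inc": "Sony"}

# Longest keys first, so a longer variant wins over any shorter overlap.
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(synonyms, key=len, reverse=True))
)

def canonicalize(text):
    # Replace every matched variant with its canonical name in one pass.
    return pattern.sub(lambda m: synonyms[m.group(0)], text)

print(canonicalize("Bought a Panasonic Corp TV and a Sony Inc camera"))
# Bought a Panasonic TV and a Sony camera
```

In Spark this single-pass, map-driven replacement avoids chaining one regexp_replace call per synonym.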

Re: How to create SparkSession using SparkConf?

2017-04-28 Thread Yanbo Liang
StreamingContext is an old API; if you want to process streaming data, you can use SparkSession directly. FYI: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Thanks Yanbo On Fri, Apr 28, 2017 at 12:12 AM, kant kodali wrote: > Actually one