Impersonate users using the same SparkContext

2016-09-15 Thread gsvigruha
Hi, is there a way to impersonate multiple users using the same SparkContext (e.g. like this https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/Superusers.html) when going through the Spark API? What I'd like to do is 1) submit a long-running Spark yarn-client
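A minimal sketch of the Hadoop proxy-user pattern the linked page describes (the user name "alice" is hypothetical, and the superuser must be allowed to impersonate her via hadoop.proxyuser.* settings; this only covers Hadoop-side calls, not per-user isolation inside a shared SparkContext):

```
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

// Run an action as "alice" on behalf of the already-authenticated superuser.
val proxyUgi = UserGroupInformation.createProxyUser(
  "alice", UserGroupInformation.getCurrentUser)

proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // Hadoop filesystem calls issued here are performed as "alice"
    val fs = FileSystem.get(new Configuration())
    fs.listStatus(new Path("/user/alice")).foreach(println)
  }
})
```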

Re: Re: Re: Re: t it does not stop at breakpoints which is in an anonymous function

2016-09-15 Thread chen yong
Dear Dirceu, Thanks for your kind help. I cannot see any code line corresponding to "retrieve the data from your DataFrame/RDDs", which you suggested in the previous replies. Later, I guessed that the line val test = count is the key point; without it, it would not stop at the
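For context, a minimal sketch of why that line matters (the surrounding code is assumed, not taken from the thread): Spark transformations are lazy, so a breakpoint inside an anonymous function passed to map only fires once an action such as count forces the job to run.

```
val rdd = sc.parallelize(1 to 100)

val mapped = rdd.map { x =>
  x * 2          // a breakpoint here is not hit yet: map is lazy
}

val test = mapped.count()   // the action triggers the job, and the breakpoint fires
```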

Re: Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Peyman Mohajerian
You can listen to files in a specific directory using streamingContext.fileStream. Take a look at: http://spark.apache.org/docs/latest/streaming-programming-guide.html On Thu, Sep 15, 2016 at 10:31 AM, Jörn Franke wrote: > Hi, > I recommend that the third party
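A minimal sketch using the simpler textFileStream variant (the directory and batch interval are hypothetical):

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("hdfs-dir-watcher")
val ssc  = new StreamingContext(conf, Seconds(60))

// Each batch picks up files newly created in the monitored directory
val lines = ssc.textFileStream("hdfs:///data/incoming")
lines.foreachRDD { rdd => println(s"new records this batch: ${rdd.count()}") }

ssc.start()
ssc.awaitTermination()
```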

Re: Guide Step by step Spark streaming

2016-09-15 Thread tosaigan...@gmail.com
Hi, You can refer to the HDInsight link. It has a working example of a cloud-based eventhub-sparkstreaming application: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-eventhub-streaming/ Regards, Sai On Thu, Sep 15, 2016 at 10:44 AM, kiaia [via Apache Spark User List]

Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-15 Thread Michael Armbrust
Is what you are looking for a withColumn that supports in-place modification of nested columns? Or is it some other problem? On Wed, Sep 14, 2016 at 11:07 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > I tried to use the RowEncoder but got stuck along the way : > The main issue

RE: Spark processing Multiple Streams from a single stream

2016-09-15 Thread ayan guha
You may consider writing back to Kafka from the main stream and then having downstream consumers. This will keep things modular and independent. On 15 Sep 2016 23:29, "Udbhav Agarwal" wrote: > Thank you Ayan for a reply. > > Source is kafka but I am reading from this source
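A sketch of the write-back idea (the broker address, topic name, and the `processed` DStream are assumptions, not taken from the thread; the Kafka client jar is assumed to be on the classpath):

```
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// `processed` is the DStream[String] produced by the main stream's operations
processed.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    records.foreach(r => producer.send(new ProducerRecord[String, String]("enriched-events", r)))
    producer.close()
  }
}
// Each of the four downstream jobs then consumes the intermediate topic independently.
```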

Issues while running MLlib matrix factorization ALS algorithm

2016-09-15 Thread Roshani Nagmote
Hi, I need help running the matrix factorization ALS algorithm in Spark MLlib. I am using a dataset (1.5 GB) with 480,189 users and 17,770 items, formatted in the same way as the MovieLens dataset. I am trying to run the MovieLensALS example jar on this dataset on an AWS Spark EMR cluster having 14 M4.2xlarge

Re: Guide Step by step Spark streaming

2016-09-15 Thread rahulkumar-aws
Your project is really nice, and you can simulate various domain use cases with it. As this is your college project I can't help you with the coding, but I can share a presentation https://goo.gl/XUJd3b of mine that will give you a graphical diagram of a real-life system

Missing output partition file in S3

2016-09-15 Thread Chen, Kevin
Hi, Has anyone encountered an issue of a missing output partition file in S3? My Spark job writes output to an S3 location. Occasionally, I noticed one partition file is missing. As a result, one chunk of data was lost. If I rerun the same job, the problem usually goes away. This has been

Re: Streaming - lookup against reference data

2016-09-15 Thread Tom Davis
Thanks Jörn, sounds like there's nothing obvious I'm missing, which is encouraging. I've not used Redis, but it does seem that for most of my current and likely future use-cases it would be the best fit (nice compromise of scale and easy setup / access). Thanks, Tom On Wed, Sep 14, 2016 at

Re: Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Jörn Franke
Hi, I recommend that the third-party application puts an empty file with the same filename as the original file, but with the extension ".uploaded". This is an indicator that the file has been fully (!) written to the fs. Otherwise you risk only reading parts of the file. Then, you can have a file
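One way to wire that marker-file check into a file stream (a sketch; the landing directory is hypothetical, and the filter is evaluated on the driver each batch):

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Accept a data file only once its empty "<name>.uploaded" companion marker exists.
def isFullyWritten(p: Path): Boolean = {
  if (p.getName.endsWith(".uploaded")) false
  else FileSystem.get(new Configuration())
    .exists(new Path(p.getParent, p.getName + ".uploaded"))
}

val files = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///landing", isFullyWritten _, newFilesOnly = true).map(_._2.toString)
```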

countApprox

2016-09-15 Thread Stefano Lodi
I am experimenting with countApprox. I created an RDD of 10^8 numbers and ran countApprox with different parameters, but I failed to generate any approximate output. In all runs it returns the exact number of elements. What is the effect of approximation in countApprox supposed to be, and for
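For reference, a small sketch of how the parameters interact; exact answers are expected whenever every partition finishes before the timeout, since the approximation only kicks in when the timeout expires first:

```
val rdd = sc.parallelize(1L to 100000000L, numSlices = 1000)

// Ask for a result within 100 ms at 95% confidence. If the whole job finishes
// inside the timeout, the returned interval collapses to the exact count.
val partial = rdd.countApprox(timeout = 100, confidence = 0.95)
println(partial.initialValue)    // BoundedDouble: estimate with [low, high] bounds
```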

Re: Write to Cassandra table from pyspark fails with scala reflect error

2016-09-15 Thread neil90
You need to use 2.0.0-M2-s_2.11 since Spark 2.0 is compiled with Scala 2.11 by default. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Write-to-Cassandra-table-from-pyspark-fails-with-scala-reflect-error-tp27723p27729.html Sent from the Apache Spark User

Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Kappaganthu, Sivaram (ES)
Hello, I am a newbie to Spark and I have the below requirement. Problem statement: a third-party application is continuously dumping files onto a server. Typically the count is 100 files per hour, and each file is less than 50 MB in size. My application has to process those files. Here

Re: Re: Re: Re: t it does not stop at breakpoints which is in an anonymous function

2016-09-15 Thread Dirceu Semighini Filho
Hi Felix, Are you sure your n is greater than 0? Here it stops first at breakpoint 1; image attached. Have you run the count to see if it's also greater than 0? 2016-09-15 11:41 GMT-03:00 chen yong : > Dear Dirceu > > > Thank you for your help. > > > Actually, I use IntelliJ IDEA

Re: Write to Cassandra table from pyspark fails with scala reflect error

2016-09-15 Thread Russell Spitzer
If the download fails you have to start figuring out if you have network issues or if your local cache is messed up :( I would see if you can manually pull that artifact or try running through just spark-shell first to see if that gives any more verbose output. On Thu, Sep 15, 2016 at 6:48 AM

Re: Spark Streaming - dividing DStream into mini batches

2016-09-15 Thread Cody Koeninger
Yeah. If you're looking to reduce over more than one micro-batch/RDD, there's also reduceByKeyAndWindow. On Thu, Sep 15, 2016 at 4:27 AM, Daan Debie wrote: > I have another (semi-related) question: I see in the documentation that > DStream has a transformation reduceByKey.
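A minimal sketch (the pair DStream `pairs` and the durations are hypothetical):

```
import org.apache.spark.streaming.Seconds

// Sum per key over the last 60 seconds of micro-batches, recomputed every 10 seconds,
// so the reduction spans several RDDs rather than a single micro-batch.
val windowed = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
```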

Re: LIVY VS Spark Job Server

2016-09-15 Thread Vadim Semenov
I have experience with both Livy & spark-jobserver. spark-jobserver gives you a better API, particularly if you want to work within a single Spark context. Livy supports submitting Python & R code, while spark-jobserver doesn't. The spark-jobserver code is more complex; it actively uses

Re: Write to Cassandra table from pyspark fails with scala reflect error [RESOLVED]

2016-09-15 Thread Trivedi Amit
There was some environment issue. I basically removed all environment variables and, with 2.11, it worked. Thanks for the help. From: Trivedi Amit To: Russell Spitzer ; "user@spark.apache.org" Sent: Thursday,

Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Jeff Nadler
Yes we do something very similar and it's working well: Kafka -> Spark Streaming (write temp files, serialized RDDs) -> Spark Batch Application (build partitioned Parquet files on HDFS; this is needed because building Parquet files of a reasonable size is too slow for streaming) -> query with
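The compaction step of such a pipeline could look roughly like the sketch below (the paths, input format, and partition column are assumptions, not the actual code from this setup):

```
// Read the staged micro-batch output and rewrite it as reasonably sized,
// partitioned Parquet files suitable for interactive querying.
val staged = spark.read.json("hdfs:///staging/events/*")
staged.write
  .mode("append")
  .partitionBy("event_date")
  .parquet("hdfs:///warehouse/events")
```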

Re: Re: Re: t it does not stop at breakpoints which is in an anonymous function

2016-09-15 Thread chen yong
Dear Dirceu, Thank you for your help. Actually, I use IntelliJ IDEA to debug the Spark code. Let me use the following code snippet to illustrate my problem. In the code lines below, I've set two breakpoints, breakpoint-1 and breakpoint-2. When I debugged the code, it did not stop at

Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Sean Owen
If your core requirement is ad-hoc real-time queries over the data, then the standard Hadoop-centric answer would be: Ingest via Kafka, maybe using Flume, or possibly Spark Streaming, to read and land the data, in... Parquet on HDFS or possibly Kudu, and Impala to query >> On 15 September 2016

Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Sachin Janani
Hi Mich, I agree that the technology stack you describe is more difficult to manage due to the different components involved (HDFS, Flume, Kafka, etc.). The solution to this problem could be to have some DB that can support mixed workloads (OLTP, OLAP, streaming, etc.), and I think

Partition RDD based on K-Means Clusters

2016-09-15 Thread Punit Naik
Hi guys, I have run the k-means algorithm on my data and it has classified it into the number of clusters that I defined. Suppose my data is sc.parallelize(Array(1,2,3,4)), and [1,2] belong to cluster '1' while [3,4] belong to cluster '2'; how can I define a custom partitioner so that the number
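A sketch of one way to do this (assumes a trained MLlib KMeansModel named `model`): key each point by its predicted cluster id, then partition by that key so each partition holds a single cluster.

```
import org.apache.spark.Partitioner
import org.apache.spark.mllib.linalg.Vectors

// One partition per cluster; the key is the cluster id returned by model.predict.
class ClusterPartitioner(numClusters: Int) extends Partitioner {
  override def numPartitions: Int = numClusters
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}

val data      = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0)).map(x => Vectors.dense(x))
val byCluster = data.map(v => (model.predict(v), v))
                    .partitionBy(new ClusterPartitioner(model.k))
```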

Re: Write to Cassandra table from pyspark fails with scala reflect error

2016-09-15 Thread Trivedi Amit
Thanks Russell. I didn't build this myself. I tried with Scala 2.11 com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M(1-3) and I am getting ``` Exception in thread "main" java.lang.RuntimeException: [download failed: org.scala-lang#scala-reflect;2.11.8!scala-reflect.jar]     at

Re: Best way to present data collected by Flume through Spark

2016-09-15 Thread Mich Talebzadeh
Any ideas on this?

RE: Spark processing Multiple Streams from a single stream

2016-09-15 Thread Udbhav Agarwal
Thank you, Ayan, for the reply. The source is Kafka, but I am reading from this source in my main stream. I will perform some operations there. Then I want to send the output of these operations to 4 parallel tasks. For these 4 parallel tasks I want 4 new streams. Is such an implementation possible here?

Re: Spark processing Multiple Streams from a single stream

2016-09-15 Thread ayan guha
It depends on the source. For example, if the source is Kafka then you can write 4 streaming consumers. On 15 Sep 2016 20:11, "Udbhav Agarwal" wrote: > Hi All, > > I have a scenario where I want to process a message in various ways in > parallel. For instance a message is

Re: Spark job within Web application

2016-09-15 Thread rahulkumar-aws
Hi, From your code it looks like you are trying to call Spark code inside your servlet; my perspective is to make a separate system and communicate using Thrift or Protobuf libraries. I have built various distributed web apps using Play Framework, Akka, Spray and Jersey, and it is working

Total Shuffle Read and Write Size of Spark workload

2016-09-15 Thread Cristina Rozee
Hello, I am running a Spark application and I would like to know the total amount of shuffle data (read + write), so could anyone let me know how to get this information? Thank you, Cristina.

Spark processing Multiple Streams from a single stream

2016-09-15 Thread Udbhav Agarwal
Hi All, I have a scenario where I want to process a message in various ways in parallel. For instance, a message arrives in a Spark stream (DStream) and I want to send this message to 4 different tasks in parallel. I want these 4 different tasks to be separate streams in the original Spark

Re: Spark Streaming - dividing DStream into mini batches

2016-09-15 Thread Daan Debie
I have another (semi-related) question: I see in the documentation that DStream has a transformation reduceByKey. Does this work on _all_ elements in the stream, as they're coming in, or is this a transformation per RDD/micro batch? I assume the latter, otherwise it would be more akin to

Re: Using Zeppelin with Spark FP

2016-09-15 Thread Mich Talebzadeh
Where are you reading the data from, Chanh?

Best way to present data collected by Flume through Spark

2016-09-15 Thread Mich Talebzadeh
Hi, This is fishing for some ideas. In the design we get prices directly through Kafka into Flume and store them on HDFS as text files. We can then use Spark with Zeppelin to present the data to users. This works. However, I am aware that once the volume of flat files rises, one needs to do

Re: Using Zeppelin with Spark FP

2016-09-15 Thread Chanh Le
Yes, I used to experience that issue before; even when I set a 1-hour cron for a job to update the Spark cache, it crashed. I haven't checked the recent release, but the Zeppelin cron is not stable, I think. > On Sep 15, 2016, at 2:47 PM, Mich Talebzadeh > wrote: > > Thanks Chanh, > > I

Spark job failing with Adjusted frame length exceeds 2147483647: 2222367317 - discarded

2016-09-15 Thread Trinadh Kaja
Hi Team, I am encountering the below error while running a Spark job; the spark-submit command is included here for reference. Please suggest how to avoid the error and let the program run without errors. CDH version: 5.6.0 Spark version: 1.5.0 spark-submit --deploy-mode
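One common mitigation, offered as an assumption rather than a confirmed diagnosis: this error usually means a single shuffle block has grown past the 2 GB frame limit, so increasing parallelism keeps individual blocks smaller.

```
// Raise parallelism so no single shuffle block approaches the 2 GB limit.
// The partition count below is a hypothetical starting point, not a tuned value.
val repartitioned = skewedRdd.repartition(2000)

// For DataFrame/SQL shuffles on Spark 1.5:
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
```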

Re: Reading the most recent text files created by Spark streaming

2016-09-15 Thread Mich Talebzadeh
Yes, thanks. I already had Flume for Twitter, so I configured it to get data from the Kafka source and post it to HDFS. Cheers

Re: Using Zeppelin with Spark FP

2016-09-15 Thread Mich Talebzadeh
Thanks Chanh, I noticed one thing. If you put on a cron refresh, say every 30 seconds, after a while the job crashes with an OOM error. Then I stop and restart the Zeppelin daemon and it works again! Have you come across it? Cheers

Re: Using Zeppelin with Spark FP

2016-09-15 Thread Chanh Le
Hi, I am using the Zeppelin 0.7 snapshot and it works well with both Spark 2.0 and the STS of Spark 2.0. > On Sep 12, 2016, at 4:38 PM, Mich Talebzadeh > wrote: > > Hi Sachin, > > Downloaded Zeppelin 0.6.1 > > Now I can see the plot in a tabular format and graph. it looks

Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-15 Thread Olivier Girardot
I tried to use the RowEncoder but got stuck along the way: the main issue really is that even if it's possible (however tedious) to generically pattern match Row(s) and target the nested field that you need to modify, Rows are an immutable data structure without a method like a case class's copy or