Re: restart from last successful stage

2015-07-29 Thread Alex Nastetsky
I meant a restart by the user, as ayan said. I was thinking of a case where e.g. a Spark conf setting is wrong and the job failed in Stage 1 (in my example), and we want to rerun the job with the right conf without rerunning Stage 0. Having this restart capability may cause some chaos if it would

Re: Spark - Eclipse IDE - Maven

2015-07-29 Thread Venkat Reddy
Thanks Petar, I will purchase. Thanks for the input. Thanks Siva On Tue, Jul 28, 2015 at 4:39 PM, Carol McDonald cmcdon...@maprtech.com wrote: I agree, I found this book very useful for getting started with spark and eclipse On Tue, Jul 28, 2015 at 11:10 AM, Petar Zecevic

streamingContext.stop(true,true) doesn't end the job

2015-07-29 Thread mike
Hi all, I have a streaming job which reads messages from Kafka via directStream, transforms them and writes them out to Cassandra. I also use a large broadcast variable (~300MB). I tried to implement a stopping mechanism for this job, something like this: When I test it in local mode, the job

HiveQL to SparkSQL

2015-07-29 Thread Bigdata techguy
Hi All, I have a fairly complex HiveQL data processing job which I am trying to convert to SparkSQL to improve performance. Below is what it does: select around 100 columns (including aggregates) from a FACT_TABLE, joined to a summary of the same FACT_TABLE, joined to 2 smaller DIMENSION tables. The

Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Peyman Mohajerian
This question was answered with sample code a couple of days ago, please look back. On Sat, Jul 25, 2015 at 11:43 PM, Zoran Jeremic zoran.jere...@gmail.com wrote: Hi, I discovered what is the problem here. Twitter public stream is limited to 1% of overall tweets (https://goo.gl/kDwnyS), so

sc.parallelize(512k items) doesn't always use 64 executors

2015-07-29 Thread Kostas Kougios
Hi, I do an sc.parallelize with a list of 512k items. But sometimes not all executors are used, i.e. they don't have work to do and nothing is logged after: 15/07/29 16:35:22 WARN internal.ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy?

Re: Too many open files

2015-07-29 Thread Igor Berman
You probably should increase the file handle limit for the user that all the processes (Spark master, workers) are running with, e.g. http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/ On 29 July 2015 at 18:39, saif.a.ell...@wellsfargo.com wrote: Hello, I’ve seen a couple
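
For reference, a minimal sketch of that change on Linux; the user name and limit below are placeholders, and the limits.conf edit takes effect on the next login:

    # check the current per-process open-file limit for the user running Spark
    ulimit -n
    # raise it persistently in /etc/security/limits.conf
    sparkuser  soft  nofile  65536
    sparkuser  hard  nofile  65536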

RE: Too many open files

2015-07-29 Thread Saif.A.Ellafi
Thank you both, I will take a look, but 1. For high-shuffle tasks, is it right for the system to have the sizes and thresholds high? I hope there are no bad consequences. 2. I will try to get around admin access and see if I can get anything with only user rights. From: Ted Yu

stopped SparkContext remaining active

2015-07-29 Thread Andres Perez
Hi everyone. I'm running into an issue with SparkContexts when running on Yarn. The issue is observable when I reproduce these steps in the spark-shell (version 1.4.1): scala> sc res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@7b965dee *Note the pointer address of sc. (Then

Re: Spark - Eclipse IDE - Maven

2015-07-29 Thread Dean Wampler
If you don't mind using SBT with your Scala instead of Maven, you can see the example I created here: https://github.com/deanwampler/spark-workshop It can be loaded into Eclipse or IntelliJ Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: How to read a Json file with a specific format?

2015-07-29 Thread mélanie gallois
Can you give an example with my extract? Mélanie Gallois 2015-07-29 16:55 GMT+02:00 Young, Matthew T matthew.t.yo...@intel.com: The built-in Spark JSON functionality cannot read normal JSON arrays. The format it expects is a bunch of individual JSON objects without any outer array syntax,

RE: How to read a Json file with a specific format?

2015-07-29 Thread Young, Matthew T
{"IFAM":"EQR","KTM":143000640,"COL":21,"DATA":[{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2}]}

Re: Too many open files

2015-07-29 Thread Ted Yu
Please increase the limit for open files: http://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux On Jul 29, 2015, at 8:39 AM, saif.a.ell...@wellsfargo.com wrote: Hello, I’ve seen a couple emails on this issue but

Error in starting Spark Streaming Context

2015-07-29 Thread Sadaf
Hi, I am new to Spark Streaming and writing code for a Twitter connector. When I run this code more than once, it gives the following exception. I have to create a new HDFS directory for checkpointing each time to make it run successfully, and moreover it doesn't get stopped. ERROR

Re: stopped SparkContext remaining active

2015-07-29 Thread Ted Yu
bq. it seems like we never get to the clearActiveContext() call by the end Looking at the stop() method, there is only one early return, after the stopped.compareAndSet() call. Is there any clue from the driver log? Cheers On Wed, Jul 29, 2015 at 9:38 AM, Andres Perez and...@tresata.com wrote: Hi

Re: Count of distinct values in each column

2015-07-29 Thread Alexander Krasheninnikov
I made a naive implementation like this: SparkConf conf = new SparkConf(); conf.setMaster("local[4]").setAppName("Stub"); final JavaSparkContext ctx = new JavaSparkContext(conf); JavaRDD<String> input = ctx.textFile(path_to_file); // explode each line into list of column values JavaRDD<List<String>> rowValues =

Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Zoran Jeremic
Can you send me the subject of that email? I can't find any email suggesting a solution to that problem. There is an email *Twitter4j streaming question*, but it doesn't have any sample code. It just confirms what I explained earlier: that without filtering, Twitter will limit you to 1% of tweets, and if you

Re: Spark Interview Questions

2015-07-29 Thread Pedro Rodriguez
You might look at the edX course on Apache Spark or ML with Spark. There are probably some homework problems or quiz questions that might be relevant. I haven't looked at the course myself, but that's where I would go first.

Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Enno Shioji
If you start parallel Twitter streams, you will be in breach of their TOS. They allow a small number of parallel streams in practice, but if you do it on a massive scale they'll ban you (I'm speaking from experience ;) ). If you really need that level of data, you need to talk to a company called

broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Shushant Arora
Hi, I am using Spark Streaming 1.3 with checkpointing. But the job is failing to recover from the checkpoint on restart. For the broadcast variable it says: 1. WARN TaskSetManager: Lost task 15.0 in stage 7.0 (TID 1269, hostIP): java.lang.ClassCastException: [B cannot be cast to

Re: How to read a Json file with a specific format?

2015-07-29 Thread Michael Armbrust
This isn't totally correct. Spark SQL does support JSON arrays and will implicitly flatten them. However, complete objects or arrays must exist one per line and cannot be split with newlines. On Wed, Jul 29, 2015 at 7:55 AM, Young, Matthew T matthew.t.yo...@intel.com wrote: The built-in

Re: broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Tathagata Das
Rather than using the accumulator directly, what you can do is something like this to lazily create an accumulator and use it (it will get lazily recreated if the driver restarts from the checkpoint): dstream.transform { rdd => val accum = SingletonObject.getOrCreateAccumulator() // single object method to
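
A minimal Scala sketch of that singleton pattern; the name AccumulatorHolder is illustrative, not from the thread, and a Java version would use a static field the same way:

    import org.apache.spark.{Accumulator, SparkContext}

    // Holds the accumulator outside the checkpointed DStream graph, so it is
    // rebuilt on first use after a driver restart instead of being deserialized.
    object AccumulatorHolder {
      @volatile private var instance: Accumulator[Long] = null
      def getOrCreate(sc: SparkContext): Accumulator[Long] = {
        if (instance == null) synchronized {
          if (instance == null) instance = sc.accumulator(0L, "recordCounter")
        }
        instance
      }
    }

    dstream.transform { rdd =>
      val accum = AccumulatorHolder.getOrCreate(rdd.sparkContext)
      rdd.map { x => accum += 1L; x }
    }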

Re: HiveQL to SparkSQL

2015-07-29 Thread Jörn Franke
What Hive version are you using? Do you run it on Tez? Are you using the ORC format? Do you use compression? Snappy? Do you use Bloom filters? Do you insert the data sorted on the right columns? Do you use partitioning? Did you increase the replication factor for often-used tables or

Passing SPARK_CONF_DIR to slaves in standalone mode under Grid Engine job

2015-07-29 Thread David Chin
Hi, all, I am just setting up to run Spark in standalone mode, as a (Univa) Grid Engine job. I have been able to set up the appropriate environment variables such that the master launches correctly, etc. In my setup, I generate GE job-specific conf and log dirs. However, I am finding that the

Difference between RandomForestModel and RandomForestClassificationModel

2015-07-29 Thread praveen S
Hi, I wanted to know what the difference is between RandomForestModel and RandomForestClassificationModel in MLlib. Will they yield the same results for a given dataset?

Re: Spark WebUI link problem in Mesos Master

2015-07-29 Thread Anton Kirillov
Hi Tim, thanks for clarifying this! Regarding the dispatcher and cluster mode, it seems that there's still no way to get to the exact Spark executor UI, only driver configuration and execution stats. -- Anton Kirillov Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday,

RE: Too many open files

2015-07-29 Thread Paul Röwer
Maybe you forgot to close a reader or writer object. On 29 July 2015 18:04:59 MESZ, saif.a.ell...@wellsfargo.com wrote: Thank you both, I will take a look, but 1. For high-shuffle tasks, is it right for the system to have the sizes and thresholds high? I hope there are no bad

Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread unk1102
Hi, I have Spark Streaming code which streams from a Kafka topic. It used to work fine, but suddenly it started throwing the following exception: Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at

Re: Spark WebUI link problem in Mesos Master

2015-07-29 Thread Tim Chen
Hi Anton, In client mode we haven't populated the web UI link; we only did so for cluster mode. If you like you can open a JIRA, and it should be an easy ticket for anyone to work on. Tim On Wed, Jul 29, 2015 at 4:27 AM, Anton Kirillov antonv.kiril...@gmail.com wrote: Hi everyone, I’m trying to

RE: IP2Location within spark jobs

2015-07-29 Thread Young, Matthew T
You can put the database files in a central location accessible to all the workers and build the GeoIP object once per partition when you go to do a mapPartitions across your dataset, loading from the central location. ___ From: Filli Alem [alem.fi...@ti8m.ch] Sent: Wednesday, July 29, 2015
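
A sketch of that pattern with the MaxMind GeoIP2 Java API; the RDD name, database path, and looked-up field are assumptions for illustration:

    import java.io.File
    import java.net.InetAddress
    import com.maxmind.geoip2.DatabaseReader

    val dbPath = "/shared/GeoLite2-City.mmdb"  // assumed central location visible to all workers
    // ips: RDD[String] of IP addresses (assumed)
    val countries = ips.mapPartitions { iter =>
      // build the (non-serializable) reader once per partition, not once per record
      val reader = new DatabaseReader.Builder(new File(dbPath)).build()
      iter.map { ip =>
        (ip, reader.city(InetAddress.getByName(ip)).getCountry.getIsoCode)
      }
    }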

IP2Location within spark jobs

2015-07-29 Thread Filli Alem
Hi, I would like to use the ip2Location databases (MaxMind) during my Spark jobs. So far I haven't found a way to properly serialize the database object offered by the Java API. The CSV version isn't easy to handle as it consists of multiple files. Any recommendations on how to do this?

Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Peyman Mohajerian
'How to restart Twitter spark stream'. It may not be exactly what you are looking for, but I thought it did touch on some aspect of your question. On Wed, Jul 29, 2015 at 10:26 AM, Zoran Jeremic zoran.jere...@gmail.com wrote: Can you send me the subject of that email? I can't find any email

Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Zoran Jeremic
Actually, I posted that question :) I already implemented the solution that Akhil suggested there, and that solution uses the sample tweets API, which returns only 1% of the tweets. It would not work in my scenario of use. For the hashtags I'm interested in, I need to catch every single tweet, not

Re: stopped SparkContext remaining active

2015-07-29 Thread Andres Perez
Hi Ted. Taking a look at the logs, I get the feeling like there may be an uncaught exception blowing up the SparkContext.stop method, causing it to not reach the line where it gets set as inactive. The line referenced below in SparkContext (SparkContext.scala:1644) is the call:

Re: NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2015-07-29 Thread Ted Yu
You can generate a dependency tree using mvn dependency:tree and grep for 'org.scala-lang' in the output to see if there is any clue. Cheers On Wed, Jul 29, 2015 at 5:14 PM, Benjamin Ross br...@lattice-engines.com wrote: Hello all, I’m new to both spark and scala, and am running into an

Re: Authentication Support with spark-submit cluster mode

2015-07-29 Thread Zhan Zhang
If you run it on YARN with a Kerberos setup, you authenticate yourself with kinit before launching the job. Thanks. Zhan Zhang On Jul 28, 2015, at 8:51 PM, Anh Hong hongnhat...@yahoo.com.INVALID wrote: Hi, I'd like to remotely run spark-submit from a local

Re: restart from last successful stage

2015-07-29 Thread Tathagata Das
If you are changing the SparkConf, that means you have to recreate the SparkContext, doesn't it? So you have to stop the previous SparkContext, which deletes all the information about stages that have been run. So the better approach is indeed to save the data of the last stage explicitly and then try

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Tathagata Das
There is a known issue where Kafka cannot return a leader if there is no data in the topic. I think it was raised in another thread in this forum. Is that the issue? On Wed, Jul 29, 2015 at 10:38 AM, unk1102 umesh.ka...@gmail.com wrote: Hi I have Spark Streaming code which streams from Kafka

NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2015-07-29 Thread Benjamin Ross
Hello all, I'm new to both spark and scala, and am running into an annoying error attempting to prototype some spark functionality. From forums I've read online, this error should only present itself if there's a version mismatch between the version of scala used to compile spark and the scala

RE: NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2015-07-29 Thread Benjamin Ross
Hey Ted, Thanks for the quick response. Sadly, all of those are 2.10.x: $ mvn dependency:tree | grep -A 2 -B 2 org.scala-lang [INFO] | | \- org.tukaani:xz:jar:1.0:compile [INFO] | \-

How to set log level in spark-submit ?

2015-07-29 Thread canan chen
Anyone know how to set log level in spark-submit ? Thanks

Re: help plz! how to use zipWithIndex to each subset of a RDD

2015-07-29 Thread Jeff Zhang
This may be what you want: val conf = new SparkConf().setMaster("local").setAppName("test") val sc = new SparkContext(conf) val inputRdd = sc.parallelize(Array(("key_1", "a"), ("key_1", "b"), ("key_2", "c"), ("key_2", "d"))) val result = inputRdd.groupByKey().flatMap(e => { val key = e._1 val valuesWithIndex =
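
A complete runnable sketch of the same approach; the snippet above is truncated, so the continuation here is an assumption about where it was going:

    val inputRdd = sc.parallelize(Array(("key-1", "a"), ("key-1", "b"), ("key-2", "a"), ("key-2", "c")))
    // group values by key, then zip each group's values with a 1-based index
    val result = inputRdd.groupByKey().flatMap { case (key, values) =>
      values.zipWithIndex.map { case (value, i) => (key, i + 1, value) }
    }
    result.collect()  // e.g. Array((key-1,1,a), (key-1,2,b), (key-2,1,a), (key-2,2,c))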

Re: broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Shushant Arora
1. How to do it in Java? 2. Can broadcast objects also be created the same way after checkpointing? 3. Is it safe if I disable checkpointing, write offsets at the end of each batch to HDFS in my code, and somehow specify in my job to use those offsets when creating the kafkastream the first time? How can I

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Umesh Kacha
Hi, thanks for the response. As I already mentioned in the question, the Kafka topic is valid and it has data; I can see the data using another Kafka consumer. On Jul 30, 2015 7:31 AM, Cody Koeninger c...@koeninger.org wrote: The last time someone brought this up on the mailing list, the issue

Re: How to set log level in spark-submit ?

2015-07-29 Thread Jonathan Coveney
Put a log4j.properties file in conf/. You can copy log4j.properties.template as a good base. On Wednesday, July 29, 2015, canan chen ccn...@gmail.com wrote: Anyone know how to set log level in spark-submit ? Thanks
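
For reference, a minimal conf/log4j.properties along the lines of the shipped template; the WARN level is just an example:

    # Set everything to be logged to the console at WARN
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n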

Re: Authentication Support with spark-submit cluster mode

2015-07-29 Thread Anh Hong
Hi Zhan, I'm running a standalone Spark cluster and execute spark-submit from a local host outside the cluster. Besides Kerberos, do you know of any other existing method? Is there any JIRA open on this enhancement request? Regards, Anh. On Wednesday, July 29, 2015 4:15 PM, Zhan Zhang

Re: Graceful shutdown for Spark Streaming

2015-07-29 Thread anshu shukla
If we want to stop the application after a fixed time period, how will that work? (How to give the duration in the logic? In my case sleep(t.s.) is not working.) So I have been killing the CoarseGrained job at each slave by script. Please suggest something. On Thu, Jul 30, 2015 at 5:14 AM, Tathagata Das

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Cody Koeninger
The last time someone brought this up on the mailing list, the issue actually was that the topic(s) didn't exist in Kafka at the time the spark job was running. On Wed, Jul 29, 2015 at 6:17 PM, Tathagata Das t...@databricks.com wrote: There is a known issue that Kafka cannot return leader

help plz! how to use zipWithIndex to each subset of a RDD

2015-07-29 Thread askformore
I have some data like this: RDD[(String, String)] = ((key-1, a), (key-1, b), (key-2, a), (key-2, c), (key-3, b), (key-4, d)) and I want to group the data by key, and for each group add an index field to the group members, so that at last I can transform the data to the form below: RDD[(String, Int, String)] =

Re: How to set log level in spark-submit ?

2015-07-29 Thread canan chen
Yes, that should work. What I mean is: is there any option in the spark-submit command where I can specify the log level? On Thu, Jul 30, 2015 at 10:05 AM, Jonathan Coveney jcove...@gmail.com wrote: Put a log4j.properties file in conf/. You can copy log4j.properties.template as a good base. On
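
As far as I know there is no dedicated log-level flag in spark-submit, but you can point both the driver and the executors at a custom log4j config; the path and main class below are placeholders:

    spark-submit \
      --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j.properties" \
      --class com.example.Main myapp.jar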

Re: help plz! how to use zipWithIndex to each subset of a RDD

2015-07-29 Thread ayan guha
Is there a relationship between data and index? I.e with a,b,c to 1,2,3? On 30 Jul 2015 12:13, askformore askf0rm...@163.com wrote: I have some data like this: RDD[(String, String)] = ((*key-1*, a), ( *key-1*,b), (*key-2*,a), (*key-2*,c),(*key-3*,b),(*key-4*,d)) and I want to group the data by

Re: Job hang when running random forest

2015-07-29 Thread Asher Krim
Did you get a thread dump? We have experienced similar problems during shuffle operations due to a deadlock in InetAddress. Specifically, look for a runnable thread at something like java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method). Our solution has been to put a timeout around the code

Re: Spark Streaming

2015-07-29 Thread Gerard Maas
A side question: Any reason why you're using window(Seconds(10), Seconds(10)) instead of new StreamingContext(conf, Seconds(10)) ? Making the micro-batch interval 10 seconds instead of 1 will provide you the same 10-second window with less complexity. Of course, this might just be a test for the
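
A sketch of the two alternatives Gerard describes, assuming conf and a source stream are already defined:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // original approach: 1-second micro-batches, re-windowed up to 10 seconds
    // val ssc = new StreamingContext(conf, Seconds(1))
    // val windowed = stream.window(Seconds(10), Seconds(10))

    // simpler equivalent when nothing is computed at the 1-second granularity:
    // make the micro-batch interval itself 10 seconds
    val ssc = new StreamingContext(conf, Seconds(10))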

Re: RDDs join problem: incorrect result

2015-07-29 Thread ๏̯͡๏
What is the size of each RDD? The size of your cluster? The Spark configurations that you tried out? On Tue, Jul 28, 2015 at 9:54 PM, ponkin alexey.pon...@ya.ru wrote: Hi, Alice Did you find a solution? I have exactly the same problem. -- View this message in context:

R: Is SPARK is the right choice for traditional OLAP query processing?

2015-07-29 Thread Paolo Platter
Try to give a look at Zoomdata. They are Spark-based and they offer BI features with good performance. Paolo Sent from my Windows Phone From: Ruslan Dautkhanov dautkha...@gmail.com Sent: 29/07/2015 06:18 To: renga.kannan renga.kan...@gmail.com

Re: error in twitter streaming

2015-07-29 Thread Sadaf
Thanks for the suggestion, Akashsihag. I've tried this solution and unfortunately it gives the same error. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-in-twitter-streaming-tp24030p24056.html Sent from the Apache Spark User List mailing

Spark Streaming

2015-07-29 Thread Sadaf
Hi, I am new to Spark Streaming and writing code for a Twitter connector. I am facing the following exception: ERROR StreamingContext: Error starting the context, marking it as stopped org.apache.spark.SparkException: org.apache.spark.streaming.dstream.WindowedDStream@532d0784 has not been

PermGen Space Error

2015-07-29 Thread Sarath Chandra
Dear All, I'm using Spark 1.2.0, Hive 0.13.1, Mesos 0.18.1, Spring, and JDK 1.7. I've written a Scala program which instantiates a Spark and Hive context, parses an XML file which provides the where clauses for queries, and generates full-fledged Hive queries to be run on Hive

Re: PermGen Space Error

2015-07-29 Thread fightf...@163.com
Hi, Sarath Did you try to use and increase spark.executor.extraJavaOptions -XX:PermSize= -XX:MaxPermSize= fightf...@163.com From: Sarath Chandra Date: 2015-07-29 17:39 To: user@spark.apache.org Subject: PermGen Space Error Dear All, I'm using Spark 1.2.0, Hive 0.13.1, Mesos
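
An example of passing those flags via spark-submit; the sizes, class, and jar name are placeholders, not recommendations from the thread:

    spark-submit \
      --conf "spark.driver.extraJavaOptions=-XX:PermSize=128m -XX:MaxPermSize=512m" \
      --conf "spark.executor.extraJavaOptions=-XX:PermSize=128m -XX:MaxPermSize=512m" \
      --class com.example.Main myapp.jar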

Re: Spark Number of Partitions Recommendations

2015-07-29 Thread ponkin
Hi Rahul, Where did you see such a recommendation? I personally define partitions with the following formula: partitions = nextPrimeNumberAbove(K * (--num-executors * --executor-cores)), where nextPrimeNumberAbove(x) is the prime number greater than x, and K is a multiplier to calculate the start
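
A runnable Scala sketch of that formula; the executor, core, and K values are example numbers:

    // first prime strictly greater than x
    def nextPrimeNumberAbove(x: Int): Int = {
      def isPrime(n: Int) = n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)
      Iterator.from(x + 1).find(isPrime).get
    }
    val numExecutors  = 8  // --num-executors
    val executorCores = 4  // --executor-cores
    val k = 2              // multiplier K
    val partitions = nextPrimeNumberAbove(k * numExecutors * executorCores)  // 67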

Spark Interview Questions

2015-07-29 Thread Mishra, Abhishek
Hello, Please help me with links or some documents for Apache Spark interview questions and answers, and also for the tools related to it, for which questions could be asked. Thanking you all. Sincerely, Abhishek

Lambda serialization

2015-07-29 Thread Subshiri S
Hi, I have tried to use a lambda expression in a Spark task, and it throws a java.lang.IllegalArgumentException: Invalid lambda deserialization exception. The exception is thrown when I use code like transform(pRDD -> pRDD.map(t -> t._2)). The code snippet is below. JavaPairDStream<String, Integer>

Re: PermGen Space Error

2015-07-29 Thread Sarath Chandra
Yes. As mentioned at the end of my mail, I tried with both the 256 and 512 options, but the issue persists. I'm giving the following parameters to the Spark configuration: spark.core.connection.ack.wait.timeout=600 spark.akka.timeout=1000 spark.akka.framesize=50 spark.executor.memory=2g spark.task.cpus=2

Re: Spark Number of Partitions Recommendations

2015-07-29 Thread Igor Berman
IMHO, you need to take the size of your data into account too. If your cluster is relatively small, you may cause memory pressure on your executors by trying to repartition to some number of partitions tied to #cores; better to take some max between the initial number of partitions (assuming your data is

Spark WebUI link problem in Mesos Master

2015-07-29 Thread Anton Kirillov
Hi everyone, I’m trying to get access to the Spark web UI from the Mesos Master but with no success: the host name is displayed properly, but the link is not active, just text. Maybe it’s a well-known issue or I misconfigured something, but this problem is really annoying. Jobs are executed in client

Re: PermGen Space Error

2015-07-29 Thread Sean Owen
Yes, I think this was asked because you didn't say what flags you set before, and it's worth verifying they're the correct ones. Although I'd be kind of surprised if 512m isn't enough, did you try more? You could also try -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled Also verify

Re: PermGen Space Error

2015-07-29 Thread Sarath Chandra
Hi Sean, Thanks for your response. A few lame questions: Where do I check whether the driver and executor have started with the provided options? As said, I'm using a Mesos cluster, so it should appear in the task logs under frameworks, correct? What if the provided options are not appearing in those

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-29 Thread Nicolas Phung
Hello, I'm using 4GB for the driver memory. The checkpoint is between 1 GB and 10 GB, depending on whether I'm reprocessing all the data from the beginning or just getting the latest offset from the real-time processing. Is there any best practice to be aware of for driver memory relative to checkpoint size?

FW: Executing spark code in Zeppelin

2015-07-29 Thread Stefan Panayotov
Stefan Panayotov Sent from my Windows Phone From: Stefan Panayotov spanayo...@msn.com Sent: 7/29/2015 8:20 AM To: user-subscr...@spark.apache.org Subject: Executing spark code in Zeppelin Hi, I faced a problem with

Graceful shutdown for Spark Streaming

2015-07-29 Thread Michal Čizmazia
How to initiate graceful shutdown from outside of the Spark Streaming driver process? Both for the local and cluster mode of Spark Standalone as well as EMR. Does sbin/stop-all.sh stop the context gracefully? How is it done? Is there a signal sent to the driver process? For EMR, is there a way

Simple Map Reduce taking lot of time

2015-07-29 Thread Varadharajan Mukundan
Hi All, I'm running Spark 1.4.1 on an 8-core machine with 16 GB RAM. I have a 500MB CSV file with 10 columns, and I need to separate it into multiple CSV/Parquet files based on one of the fields in the CSV file. I've loaded the CSV file using spark-csv and applied the below transformations. It

How to read a Json file with a specific format?

2015-07-29 Thread SparknewUser
I'm trying to read a Json file which is like: [ {"IFAM":"EQR","KTM":143000640,"COL":21,"DATA":[{"MLrate":30,"Nrout":0,"up":null,"Crate":2}, {"MLrate":30,"Nrout":0,"up":null,"Crate":2}, {"MLrate":30,"Nrout":0,"up":null,"Crate":2}, {"MLrate":30,"Nrout":0,"up":null,"Crate":2}, {"MLrate":30,"Nrout":0,"up":null,"Crate":2}

Count of distinct values in each column

2015-07-29 Thread Devi P.V
Hi All, I have a 5GB CSV dataset with 69 columns. I need to find the count of distinct values in each column. What is the optimized way to find this using Spark Scala? Example CSV format: a,b,c,d a,c,b,a b,b,c,d b,b,c,a c,b,b,a Expected output: (a,2),(b,2),(c,1) # for the first column
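
A sketch of one way to get the per-column value counts in a single pass, assuming the rows are already loaded as an RDD[String] named lines with no header:

    // split rows, tag each value with its column index, and count
    val counts = lines
      .map(_.split(","))
      .flatMap(_.zipWithIndex.map { case (value, col) => ((col, value), 1L) })
      .reduceByKey(_ + _)
    // yields ((columnIndex, value), count), e.g. ((0,"a"),2), ((0,"b"),2), ((0,"c"),1)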

Re: Graceful shutdown for Spark Streaming

2015-07-29 Thread Tathagata Das
StreamingContext.stop(stopGracefully = true) stops the streaming context gracefully. Then you can safely terminate the Spark cluster. They are two different steps and need to be done separately, ensuring that the driver process has been completely terminated before the Spark cluster is the
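
One way to trigger that graceful stop from outside the driver is to have the driver poll for an external marker file; this marker-file pattern is an assumption for illustration, not something prescribed in the thread:

    import org.apache.hadoop.fs.Path
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(conf, Seconds(10))
    // ... build the streams ...
    ssc.start()

    val marker = new Path("/tmp/stop-my-streaming-job")  // hypothetical path
    val fs = marker.getFileSystem(ssc.sparkContext.hadoopConfiguration)
    var stopped = false
    while (!stopped) {
      // returns true once the context has terminated
      stopped = ssc.awaitTerminationOrTimeout(10000L)
      if (!stopped && fs.exists(marker)) {
        // an operator creates the marker file to request shutdown
        ssc.stop(stopSparkContext = true, stopGracefully = true)
        stopped = true
      }
    }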

Re: broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Tathagata Das
1. The same way, using static fields in a class. 2. Yes, the same way. 3. Yes, you can do that. To differentiate first-time vs. continue, you have to build your own semantics. For example, if the location in HDFS where you are supposed to store the offsets does not have any data, that means it's probably

Re: Spark Interview Questions

2015-07-29 Thread vaquar khan
Hi Abhishek, Please learn Spark; there are no shortcuts to success. Regards, Vaquar khan On 29 Jul 2015 11:32, Mishra, Abhishek abhishek.mis...@xerox.com wrote: Hello, Please help me with links or some document for Apache Spark interview questions and answers. Also for the tools related to

RE: How to read a Json file with a specific format?

2015-07-29 Thread Young, Matthew T
The built-in Spark JSON functionality cannot read normal JSON arrays. The format it expects is a bunch of individual JSON objects without any outer array syntax, with one complete JSON object per line of the input file. AFAIK your options are to read the JSON in the driver and parallelize it
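
For reference, the line-delimited shape that format implies, using the fields from this thread; the values are illustrative:

    {"IFAM":"EQR","KTM":143000640,"COL":21,"DATA":[{"MLrate":30,"Nrout":0,"up":null,"Crate":2}]}
    {"IFAM":"EQR","KTM":143000641,"COL":22,"DATA":[{"MLrate":31,"Nrout":1,"up":null,"Crate":2}]}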

RE: Spark Interview Questions

2015-07-29 Thread Mishra, Abhishek
Hello Vaquar, I have working knowledge and experience in Spark. I just wanted to test or do a mock round to evaluate myself. Thank you for the reply, Please share something if you have for the same. Sincerely, Abhishek From: vaquar khan [mailto:vaquar.k...@gmail.com] Sent: Wednesday, July 29,

Anyone using Intel's spark-streamingsql project to execute SQL queries over Spark streaming ?

2015-07-29 Thread Sela, Amit
Is there an available release? Anyone using in production? Is the project being actively developed and maintained? Thanks!

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-29 Thread Cody Koeninger
@Ashwin you don't need to append the topic to your data if you're using the direct stream. You can get the topic from the offset range, see http://spark.apache.org/docs/latest/streaming-kafka-integration.html (search for offsetRange) If you're using the receiver based stream, you'll need to

RE: Executing spark code in Zeppelin

2015-07-29 Thread Stefan Panayotov
Hi Silvio, There are no errors reported; the Zeppelin paragraph just refuses to do anything. I tested this by adding comment characters to the code snippet. Also, I discovered the limit is 3400. Once I go over that, it stops reacting. BTW, thanks for the advice - I sent email to the