Spark job ends up with NPE

2015-08-14 Thread hide
Hello, I'm using Spark on a YARN cluster and using mongo-hadoop-connector to pull data into Spark to do some processing. The job has the following stages: (flatMap - flatMap - reduceByKey - sortByKey). The data in MongoDB is tweets from Twitter. First, connect to MongoDB and make an RDD as follows: val mongoRDD

Re: Driver staggering task launch times

2015-08-14 Thread Philip Weaver
Ah, nevermind, I don't know anything about scheduling tasks in YARN. On Thu, Aug 13, 2015 at 11:03 PM, Ara Vartanian arav...@cs.wisc.edu wrote: I’m running on Yarn. On Aug 13, 2015, at 10:58 PM, Philip Weaver philip.wea...@gmail.com wrote: Are you running on mesos, yarn or standalone? If

Re: Streaming on Exponential Data

2015-08-14 Thread Hemant Bhanawat
What does exponential data mean? Does this mean that the amount of data being received from the stream in a batch interval is increasing exponentially as time progresses? Does your process have enough memory to handle the data for a batch interval? You may want to share Spark

Exception in spark

2015-08-14 Thread Ravisankar Mani
Hi all, I got an exception like “org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object” when using some WHERE-condition queries. I am using Spark version 1.4.0, but it works perfectly in Hive. Please refer to the following query. I have

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-14 Thread Akhil Das
Yep, and it works fine for operations which do not involve any shuffle (like foreach, count etc.), while those which involve shuffle operations end up in an infinite loop. Spark should somehow indicate this instead of going into an infinite loop. Thanks Best Regards On Thu, Aug 13, 2015 at 11:37

Fwd: Using unserializable classes in tasks

2015-08-14 Thread Dawid Wysakowicz
-- Forwarded message -- From: Dawid Wysakowicz wysakowicz.da...@gmail.com Date: 2015-08-14 9:32 GMT+02:00 Subject: Re: Using unserializable classes in tasks To: mark manwoodv...@googlemail.com I am not an expert, but first of all check whether there is a ready-made connector (you mentioned

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-14 Thread Mridul Muralidharan
From what I understood from Imran's mail (and what was referenced in it), the RDD mentioned seems to be violating some basic contracts on how partitions are used in Spark [1]. They cannot be arbitrarily numbered, have duplicates, etc. Extending RDD to add functionality is typically for niche

Re: Always two tasks slower than others, and then job fails

2015-08-14 Thread Jeff Zhang
Data skew? Maybe your partition key has some special value like null or an empty string. On Fri, Aug 14, 2015 at 11:01 AM, randylu randyl...@gmail.com wrote: It is strange that there are always two tasks slower than others, and the corresponding partitions' data is larger, no matter how many

Re: Driver staggering task launch times

2015-08-14 Thread Ara Vartanian
I’m running on Yarn. On Aug 13, 2015, at 10:58 PM, Philip Weaver philip.wea...@gmail.com wrote: Are you running on mesos, yarn or standalone? If you're on mesos, are you using coarse grain or fine grained mode? On Thu, Aug 13, 2015 at 10:13 PM, Ara Vartanian arav...@cs.wisc.edu

Re: Reduce number of partitions before saving to file. coalesce or repartition?

2015-08-14 Thread Anish Haldiya
Hi, If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle. However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in
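A minimal sketch of the trade-off, assuming an existing RDD named logs that currently has many partitions (names and paths are illustrative):

    val merged = logs.coalesce(1)        // no shuffle, but the final stage may run with very little parallelism
    val spread = logs.repartition(10)    // full shuffle, keeps the upstream work parallel
    // coalesce(1, shuffle = true) behaves like repartition(1)
    merged.saveAsTextFile("hdfs:///out/logs")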

Re: Re: Materials for deep insight into Spark SQL

2015-08-14 Thread Todd
Thanks Ted for the help! At 2015-08-14 12:02:14, Ted Yu yuzhih...@gmail.com wrote: You can look under Developer Track: https://spark-summit.org/2015/#day-1 http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014?related=1 (slightly old) Catalyst design:

Using unserializable classes in tasks

2015-08-14 Thread mark
I have a Spark job that computes some values and needs to write those values to a data store. The classes that write to the data store are not serializable (eg, Cassandra session objects etc). I don't want to collect all the results at the driver, I want each worker to write the data - what is

Re: Using unserializable classes in tasks

2015-08-14 Thread Dawid Wysakowicz
No, the connector does not need to be serializable because it is constructed on the worker. Only objects shuffled across partitions need to be serializable. 2015-08-14 9:40 GMT+02:00 mark manwoodv...@googlemail.com: I guess I'm looking for a more general way to use complex graphs of objects that
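A hedged sketch of that pattern: build the non-serializable writer inside foreachPartition on the worker, so it never has to be shipped from the driver. StoreClient is a stand-in for a Cassandra session or similar, and results is assumed to be an RDD[String]:

    class StoreClient {                       // placeholder for a non-serializable session object
      def write(record: String): Unit = println(record)
      def close(): Unit = ()
    }

    results.foreachPartition { partition =>
      val client = new StoreClient()          // constructed on the worker, one per partition
      partition.foreach(client.write)
      client.close()
    }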

Re: matrix inverse and multiplication

2015-08-14 Thread go canal
Correction: I am not able to convert the Scala statement to Java.

Re: Error: Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-14 Thread Stephen Boesch
NoClassDefFoundError differs from ClassNotFoundException: it indicates an error while initializing that class, but the class is found on the classpath. Please provide the full stack trace. 2015-08-14 4:59 GMT-07:00 stelsavva stel...@avocarrot.com: Hello, I am just starting out with

Left outer joining big data set with small lookups

2015-08-14 Thread VIJAYAKUMAR JAWAHARLAL
Hi, I am facing a huge performance problem when I try to left outer join a very big data set (~140GB) with a bunch of small lookups [star schema type]. I am using data frames in Spark SQL. It looks like data is shuffled and skewed when that join happens. Is there any way to improve performance

Re: spark.files.userClassPathFirst=true Return Error - Please help

2015-08-14 Thread Akhil Das
Which version of spark are you using? Did you try with --driver-class-path configuration? Thanks Best Regards On Fri, Aug 14, 2015 at 2:05 PM, Kyle Lin kylelin2...@gmail.com wrote: Hi all I had similar usage and also got the same problem. I guess Spark use some class in my user jars but

Re: RDD.join vs spark SQL join

2015-08-14 Thread Akhil Das
Both work the same way, but with Spark SQL you will get the optimizations etc. done by Catalyst. One important thing to consider is the number of partitions and the key distribution (when you are doing RDD.join); if the keys are not evenly distributed across machines then you can see the process
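For reference, the two forms being compared, assuming pair RDDs a and b and equivalent DataFrames dfA and dfB (all names illustrative):

    // RDD join: performance depends heavily on the number of partitions and the key distribution
    val rddJoined = a.join(b)

    // DataFrame join: Catalyst plans the join and can pick optimizations such as broadcasting small inputs
    val dfJoined = dfA.join(dfB, dfA("key") === dfB("key"))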

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-14 Thread Akhil Das
Thanks for the clarifications Mridul. Thanks Best Regards On Fri, Aug 14, 2015 at 1:04 PM, Mridul Muralidharan mri...@gmail.com wrote: What I understood from Imran's mail (and what was referenced in his mail) the RDD mentioned seems to be violating some basic contracts on how partitions are

Re: spark.files.userClassPathFirst=true Return Error - Please help

2015-08-14 Thread Kyle Lin
Hi all, I had a similar usage and also got the same problem. I guess Spark uses some class in my user jars but actually it should use the class in spark-assembly-xxx.jar, but I don't know how to fix it. Kyle 2015-07-22 23:03 GMT+08:00 Ashish Soni asoni.le...@gmail.com: Hi All , I am getting

Re: saveToCassandra not working in Spark Job but works in Spark Shell

2015-08-14 Thread Akhil Das
Looks like a jar version conflict to me. Thanks Best Regards On Thu, Aug 13, 2015 at 7:59 PM, satish chandra j jsatishchan...@gmail.com wrote: HI, Please let me know if I am missing anything in the below mail, to get the issue fixed Regards, Satish Chandra On Wed, Aug 12, 2015 at 6:59

Re: Left outer joining big data set with small lookups

2015-08-14 Thread Silvio Fiorito
You could cache the lookup DataFrames, it’ll then do a broadcast join. On 8/14/15, 9:39 AM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote: Hi I am facing a huge performance problem when I try to left outer join a very big data set (~140GB) with a bunch of small lookups [star schema
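A small sketch of that suggestion, assuming a large DataFrame named facts and a small lookup DataFrame named dims (names illustrative; whether the broadcast actually happens also depends on the size threshold discussed later in this thread):

    dims.cache()
    dims.count()   // materialize the cache so the planner knows how small the lookup is
    val joined = facts.join(dims, facts("dim_id") === dims("id"), "left_outer")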

Re: Spark job ends up with NPE

2015-08-14 Thread Akhil Das
You can put a try..catch around all the transformations that you are doing and catch such exceptions instead of crashing your entire job. Thanks Best Regards On Fri, Aug 14, 2015 at 4:35 PM, hide x22t33...@gmail.com wrote: Hello, I'm using spark on yarn cluster and using
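A minimal sketch of that idea, assuming a hypothetical transformation parseTweet that may throw an NPE on malformed documents:

    import scala.util.Try

    val parsed = mongoRDD.flatMap { doc =>
      Try(parseTweet(doc)).toOption   // failed records become None and are silently dropped
    }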

Re: How to specify column type when saving DataFrame as parquet file?

2015-08-14 Thread Raghavendra Pandey
I think you can try the DataFrame create API that takes an RDD[Row] and a StructType... On Aug 11, 2015 4:28 PM, Jyun-Fan Tsai jft...@appier.com wrote: Hi all, I'm using Spark 1.4.1. I create a DataFrame from a json file. There is a column C whose values are all null in the json file. I found that
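A minimal Scala sketch of that approach; the column names and types are only examples, and the existing Row values must already line up with the declared schema:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("C", StringType, nullable = true)))   // pin the type of the all-null column

    val typed = sqlContext.createDataFrame(df.rdd, schema)
    typed.write.parquet("hdfs:///out/data")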

Re: Spark Streaming: Change Kafka topics on runtime

2015-08-14 Thread Cody Koeninger
There's a long recent thread on this list about stopping apps; the subject was "stopping spark stream app at 1 second". I wouldn't run repeated rdds, no. I'd take a look at subclassing, personally (you'll have to rebuild the streaming kafka project since a lot is private), but if topic changes don't

Cannot cast to Tuple when running in cluster mode

2015-08-14 Thread Saif.A.Ellafi
Hi All, I have a working program in which I create two big Tuple2s out of the data. This seems to work in local mode, but when I switch over to standalone cluster mode, I get this error at the very beginning: 15/08/14 10:22:25 WARN TaskSetManager: Lost task 4.0 in stage 1.0 (TID 10, 162.101.194.44):

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread dutrow
How do I get beyond the "This post has NOT been accepted by the mailing list yet" message? This message was posted through the nabble interface; one would think that would be enough to get the message accepted. -- View this message in context:

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread Cody Koeninger
Use your email client to send a message to the mailing list from the email address you used to subscribe? The message you just sent reached the list On Fri, Aug 14, 2015 at 9:36 AM, dutrow dan.dut...@gmail.com wrote: How do I get beyond the This post has NOT been accepted by the mailing list

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread dutrow
For those who find this post and may be interested, the most thorough documentation on the subject may be found here: https://github.com/koeninger/kafka-exactly-once -- View this message in context:

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread Dan Dutrow
Thanks. Looking at the KafkaCluster.scala code, ( https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaCluster.scala#L253), it seems a little hacky for me to alter and recompile spark to expose those methods, so I'll use the receiver API

Re: Another issue with using lag and lead with data frames

2015-08-14 Thread Salih Oztop
Hi Jerry, this blog post is perfect for window functions in Spark: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html and a generic SQL usage from the oracle-base blog: https://oracle-base.com/articles/misc/lag-lead-analytic-functions It seems you are not using
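For reference, the DataFrame form of lag from that blog post looks roughly like this (a sketch assuming Spark 1.4+ with a HiveContext, and a DataFrame df with a column named A):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    val w = Window.orderBy("A")
    val withLag = df.withColumn("LA", lag(df("A"), 1, "X").over(w))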

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
Our additional question on checkpointing is basically the logistics of it -- At which point does the data get written into checkpointing? Is it written as soon as the driver program retrieves an RDD from Kafka (or another source)? Or, is it written after that RDD has been processed and we're

QueueStream Does Not Support Checkpointing

2015-08-14 Thread Asim Jalis
I want to test some Spark Streaming code that is using reduceByKeyAndWindow. If I do not enable checkpointing, I get the error: java.lang.IllegalArgumentException: requirement failed: The checkpoint directory has not been set. Please set it by StreamingContext.checkpoint(). But if I enable

RE: Spark Job Hangs on our production cluster

2015-08-14 Thread java8964
I still want to check if anyone can provide any help related to why Spark 1.2.2 hangs on our production cluster when reading big HDFS data (7800 avro blocks), while it looks fine for small data (769 avro blocks). I enabled the debug level in the Spark log4j, and attached the log file if it

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread Cody Koeninger
I don't entirely agree with that assessment. Not paying for extra cores to run receivers was about as important as delivery semantics, as far as motivations for the api. As I said in the jira tickets on the topic, if you want to use the direct api and save offsets to ZK, you can. The right way
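The offsets for each batch of a direct stream are exposed on the RDD itself; a sketch of reading them (persisting them to ZK, a database, or elsewhere is then up to the application):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    directStream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // process rdd here, then store the ranges wherever you keep offsets
      ranges.foreach(r => println(s"${r.topic}/${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
    }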

Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
So it seems like dataframes aren't going to give me a break and just work. Now it evaluates but goes nuts if it runs into a null case OR doesn't know how to get the correct data type when I specify the default value as a string expression. Let me know if anyone has a work around for this. PLEASE HELP

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread dutrow
In summary, it appears that the use of the DirectAPI was intended specifically to enable exactly-once semantics. This can be achieved for idempotent transformations and with transactional processing using the database to guarantee an onto mapping of results based on inputs. For the latter, you

Fwd: Graphx - how to add vertices to a HashSet of vertices ?

2015-08-14 Thread Ranjana Rajendran
-- Forwarded message -- From: Ranjana Rajendran ranjana.rajend...@gmail.com Date: Thu, Aug 13, 2015 at 7:37 AM Subject: Graphx - how to add vertices to a HashSet of vertices ? To: d...@spark.apache.org Hi, sampledVertices is a HashSet of vertices var sampledVertices:

Help with persist: Data is requested again

2015-08-14 Thread Saif.A.Ellafi
Hello all, I am writing a program which reads from a database. I run a couple of computations, but at the end I have a while loop in which I make a modification to the persisted data, e.g.: val data = PairRDD... persist() var i = 0 while (i < 10) { val data_mod = data.map(_._1 + 1, _._2)

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Varadhan, Jawahar
Thanks Marcelo. But our problem is a little more complicated. We have 10+ FTP sites that we will be transferring data from. The FTP server info, filename, and credentials all arrive via a Kafka message. So, I want to read those Kafka messages and dynamically connect to the FTP site and download those

Re: QueueStream Does Not Support Checkpointing

2015-08-14 Thread Holden Karau
I just pushed some code that does this for spark-testing-base ( https://github.com/holdenk/spark-testing-base ) (its in master) and will publish an updated artifact with it for tonight. On Fri, Aug 14, 2015 at 3:35 PM, Tathagata Das t...@databricks.com wrote: A hacky workaround is to create a

Re: Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
Hi Salih, Normally I do sort before performing that operation, but since I've been trying to get this working for a week, I'm just loading something simple to test if lag works. Earlier I was having DB issues so it's been a long run of solving one runtime exception after another. Hopefully

Re: distributing large matrices

2015-08-14 Thread Rob Sargent
@Koen, if you meant to reply to my question on distributing matrices, could you re-send, as there was no content in your post? Thanks, On 08/07/2015 10:02 AM, Koen Vantomme wrote: Sent from my Sony Xperia™ smartphone iceback wrote Is this the sort of problem spark

Too many files/dirs in hdfs

2015-08-14 Thread Mohit Anchlia
Spark Streaming seems to be creating 0-byte files even when there is no data. Also, I have 2 concerns here: 1) Extra unnecessary files are being created as output 2) Hadoop doesn't work really well with too many files, and I see that it is creating a directory with a timestamp every 1 second.
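One hedged way to avoid the 0-byte output is to skip empty micro-batches before saving, e.g. (paths are illustrative):

    dstream.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {   // available since Spark 1.3; skips writing a directory for an empty batch
        rdd.saveAsTextFile(s"hdfs:///out/batch-${time.milliseconds}")
      }
    }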

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Ted Yu
First you create the file: final File outputFile = new File(outputPath); Then you write to it: Files.append(counts + "\n", outputFile, Charset.defaultCharset()); Cheers On Fri, Aug 14, 2015 at 4:38 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I thought prefix meant the output

Re: QueueStream Does Not Support Checkpointing

2015-08-14 Thread Tathagata Das
A hacky workaround is to create a custom InputDStream that creates the right RDDs based on a function. The TestInputDStream https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala#L61 does something similar for Spark Streaming unit

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Ted Yu
Please take a look at JavaPairDStream.scala: def saveAsHadoopFiles[F <: OutputFormat[_, _]]( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F]) { Did you intend to use outputPath as prefix ? Cheers On Fri, Aug
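For context, prefix is not a complete file name: each batch is written under a path built as "prefix-TIME_IN_MS.suffix". A Scala sketch under that assumption, for a DStream[(String, Int)] named wordCounts:

    import org.apache.hadoop.mapred.TextOutputFormat

    // produces one output directory per batch, e.g. hdfs:///out/wordcounts-1439600000000.txt
    wordCounts.saveAsHadoopFiles[TextOutputFormat[String, Int]]("hdfs:///out/wordcounts", "txt")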

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Mohit Anchlia
Spark 1.3 Code: wordCounts.foreachRDD(new Function2<JavaPairRDD<String, Integer>, Time, Void>() { @Override public Void call(JavaPairRDD<String, Integer> rdd, Time time) throws IOException { String counts = "Counts at time " + time + " " + rdd.collect(); System.out.println(counts);

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Marcelo Vanzin
On Fri, Aug 14, 2015 at 2:11 PM, Varadhan, Jawahar varad...@yahoo.com.invalid wrote: And hence, I was planning to use Spark Streaming with Kafka or Flume with Kafka. But flume runs on a JVM and may not be the best option as the huge file will create memory issues. Please suggest someway to

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Cody Koeninger
You'll resume and re-process the RDD that didn't finish. On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Our additional question on checkpointing is basically the logistics of it -- At which point does the data get written into checkpointing? Is it written

Re: Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
Still not cooperating... lag(A,1,'X') OVER (ORDER BY A) as LA ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45) at

Re: worker and executor memory

2015-08-14 Thread James Pirz
Additional Comment: I checked the disk usage on the 3 nodes (using iostat) and it seems that reading from HDFS partitions happen in a node-by-node basis. Only one of the nodes shows active IO (as read) at any given time while the other two nodes are idle IO-wise. I am not sure why the tasks are

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Mohit Anchlia
I thought prefix meant the output path? What's the purpose of prefix and where do I specify the path if not in prefix? On Fri, Aug 14, 2015 at 4:36 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at JavaPairDStream.scala: def saveAsHadoopFiles[F <: OutputFormat[_, _]]( prefix:

Executors on multiple nodes

2015-08-14 Thread Mohit Anchlia
I am running on Yarn and do have a question on how spark runs executors on different data nodes. Is that primarily decided based on number of receivers? What do I need to do to ensure that multiple nodes are being used for data processing?

Re: QueueStream Does Not Support Checkpointing

2015-08-14 Thread Asim Jalis
I feel the real fix here is to remove the exception from QueueInputDStream class by reverting the fix of https://issues.apache.org/jira/browse/SPARK-8630 I can write another class that is identical to the QueueInputDStream class except it does not throw the exception. But this feels like a

Re: How to save a string to a text file ?

2015-08-14 Thread Brandon White
Convert it to an RDD, then save the RDD to a file: val str = "dank memes" sc.parallelize(List(str)).saveAsTextFile("str.txt") On Fri, Aug 14, 2015 at 7:50 PM, go canal goca...@yahoo.com.invalid wrote: Hello again, online resources have sample code for writing an RDD to a file, but I have a simple

Re: QueueStream Does Not Support Checkpointing

2015-08-14 Thread Asim Jalis
Another fix might be to remove the exception that is thrown when windowing and other stateful operations are used without checkpointing. On Fri, Aug 14, 2015 at 5:43 PM, Asim Jalis asimja...@gmail.com wrote: I feel the real fix here is to remove the exception from QueueInputDStream class by

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
Thanks, Cody. It sounds like Spark Streaming has enough state info to know how many batches have been processed and if not all of them then the RDD is 'unfinished'. I wonder if it would know whether the last micro-batch has been fully processed successfully. Hypothetically, the driver program

Re: How to save a string to a text file ?

2015-08-14 Thread go canal
Thank you very much. Just a quick question - I tried to save the string this way but the file is always empty: val file = Path("sample data/ZN_SPARK.OUT").createFile(true) file.bufferedWriter().write(im.toString()) file.bufferedWriter().flush() file.bufferedWriter().close() anything

Re: Left outer joining big data set with small lookups

2015-08-14 Thread Raghavendra Pandey
In Spark 1.4 there is a parameter to control that. Its default value is 10 MB. So you need to cache your dataframe to hint at its size. On Aug 14, 2015 7:09 PM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote: Hi I am facing a huge performance problem when I try to left outer join a very big
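The parameter being referred to is presumably spark.sql.autoBroadcastJoinThreshold (default of roughly 10 MB); a sketch of raising it so a somewhat larger lookup table is still considered for broadcast:

    // allow tables up to ~50 MB to be considered for broadcast joins
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)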

How to save a string to a text file ?

2015-08-14 Thread go canal
Hello again, online resources have sample code for writing an RDD to a file, but I have a simple string; how do I save it to a text file? (my data is actually a DenseMatrix) Appreciate any help! thanks, canal

Re: How to specify column type when saving DataFrame as parquet file?

2015-08-14 Thread Francis Lau
Jyun-Fan, here is how I have been doing it. I found that I needed to define the schema when loading the JSON file first. Francis import datetime from pyspark.sql.types import * # Define schema upSchema = StructType([ StructField("field 1", StringType(), True), StructField("field 2", LongType(),

Error: Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-14 Thread stelsavva
Hello, I am just starting out with Spark Streaming and HBase/Hadoop. I'm writing a simple app to read from Kafka and store to HBase, and I am having trouble submitting my job to Spark. I've downloaded Apache Spark 1.4.1 pre-built for Hadoop 2.6. I am building the project with mvn package and

Re: Spark Streaming: Change Kafka topics on runtime

2015-08-14 Thread Nisrina Luthfiyati
Hi Cody, by start/stopping, do you mean the streaming context or the app entirely? From what I understand once a streaming context has been stopped it cannot be restarted, but I also haven't found a way to stop the app programmatically. The batch duration will probably be around 1-10 seconds. I

Re: spark.files.userClassPathFirst=true Return Error - Please help

2015-08-14 Thread Kyle Lin
Hello Akhil I use Spark 1.4.2 on HDP 2.1(Hadoop 2.4) I didn't use --driver-class-path. I only use spark.executor.userClassPathFirst=true Kyle 2015-08-14 17:11 GMT+08:00 Akhil Das ak...@sigmoidanalytics.com: Which version of spark are you using? Did you try with --driver-class-path

Re: saveToCassandra not working in Spark Job but works in Spark Shell

2015-08-14 Thread satish chandra j
Hi Akhil, Which jar version is conflicting and what needs to be done for the fix Regards, Satish Chandra On Fri, Aug 14, 2015 at 2:44 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Looks like a jar version conflict to me. Thanks Best Regards On Thu, Aug 13, 2015 at 7:59 PM, satish

Re: Always two tasks slower than others, and then job fails

2015-08-14 Thread Zoltán Zvara
Data skew is still a problem with Spark. - If you use groupByKey, try to express your logic by not using groupByKey. - If you need to use groupByKey, all you can do is to scale vertically. - If you can, repartition with a finer HashPartitioner. You will have many tasks for each stage, but tasks
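A small sketch of the first point, assuming the per-key work is a count or sum over an RDD[(String, Int)] named pairs:

    // groupByKey ships every value for a key to a single task, so a hot key overloads that one task
    val slow = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines map-side first, so far less data is shuffled for skewed keys
    val fast = pairs.reduceByKey(_ + _)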

Re: Sorted Multiple Outputs

2015-08-14 Thread Yiannis Gkoufas
Hi Eugene, in my case the list of values that I want to sort and write to a separate file is fairly small, so the way I solved it is the following: .groupByKey().foreach(e => { val hadoopConfig = new Configuration() val hdfs = FileSystem.get(hadoopConfig); val newPath = rootPath + "/" + e._1;

Re: spark streaming map use external variable occur a problem

2015-08-14 Thread Shixiong Zhu
Looks like you compiled the code with one Scala version but ran your app with a different, incompatible version. BTW, you should not use PrintWriter like this to save your results. There may be multiple tasks running on the same host, and your job will fail because you are trying to write to the same

Re: Spark RuntimeException hadoop output format

2015-08-14 Thread Ted Yu
Which Spark release are you using? Can you show us a snippet of your code? Have you checked the namenode log? Thanks On Aug 13, 2015, at 10:21 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I was able to get this working by using an alternative method however I only see 0 bytes files in