Re: Setup Remote HDFS for Spark

2014-11-22 Thread Akhil Das
Yes, it is possible to have both Spark and HDFS running on the same cluster. We have a lot of clusters running without any issues. And yes, it is possible to hook Spark up with a remote HDFS. You might notice a bit of lag if they are on different networks. Thanks Best Regards On Fri, Nov 21, 2014 at
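A minimal sketch of pointing Spark at a remote HDFS, assuming a reachable NameNode (host, port and path below are hypothetical):

    // fully qualified HDFS URI naming the remote NameNode
    val logs = sc.textFile("hdfs://remote-namenode:8020/data/logs/part-*")
    println(logs.count())  // forces the read over the network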

Re: Error: Unrecognized option '--conf' (trying to set auto.offset.reset)

2014-11-22 Thread Akhil Das
If it is not picking it up from the command line, you could try adding that entry inside the conf/spark-defaults.conf file and then try submitting the job. (Of course, you might want to restart the cluster.) Isn't auto.offset.reset=largest a Kafka conf? You might want to set it in the Kafka conf.
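For reference, conf/spark-defaults.conf takes whitespace-separated key/value pairs, one per line; a sketch with illustrative values (note the point above: auto.offset.reset itself is a Kafka consumer setting, not a Spark property):

    spark.master             spark://master:7077
    spark.executor.memory    2g
    spark.serializer         org.apache.spark.serializer.KryoSerializer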

Re: Lots of small input files

2014-11-22 Thread Akhil Das
What is your cluster setup? Are you running a worker on the master node also? 1. Spark usually assigns the task to the worker that has the data locally available. If one worker has enough tasks, then I believe it will start assigning to others as well. You could control it with the level of

Re: SparkSQL Timestamp query failure

2014-11-22 Thread Akhil Das
What about sqlContext.sql("SELECT * FROM Logs as l where l.timestamp='2012-10-08 16:10:36.0'").collect? It looks like you might need to quote the timestamp. Thanks Best Regards On Sat, Nov 22, 2014 at 12:09 AM, whitebread ale.panebia...@me.com wrote: Hi all, I put some log files into sql
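A sketch of the suggested query in Scala, with the literal quoted, plus a variant using an explicit cast in case the quoted literal ends up compared as a string (table and column names follow the thread; whether CAST is accepted depends on the SQL dialect in use):

    val rows = sqlContext.sql(
      "SELECT * FROM Logs AS l WHERE l.timestamp = '2012-10-08 16:10:36.0'").collect()

    // if the literal is treated as a plain string, an explicit cast may help
    val rows2 = sqlContext.sql(
      "SELECT * FROM Logs AS l WHERE l.timestamp = CAST('2012-10-08 16:10:36.0' AS timestamp)").collect()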

Re: Spark streaming job failing after some time.

2014-11-22 Thread Akhil Das
For Spark streaming, you must always set --executor-cores to a value which is >= 2. Or else it will not do any processing. Thanks Best Regards On Sat, Nov 22, 2014 at 8:39 AM, pankaj channe pankajc...@gmail.com wrote: I have seen similar posts on this issue but could not find a solution.

Re: spark-sql broken

2014-11-22 Thread Cheng Lian
You're probably hitting this issue https://issues.apache.org/jira/browse/SPARK-4532 Patrick made a fix for this https://github.com/apache/spark/pull/3398 On 11/22/14 10:39 AM, tridib wrote: After taking today's build from the master branch I started getting this error when running spark-sql: Class

Re: Debug Sql execution

2014-11-22 Thread Cheng Lian
You may try EXPLAIN EXTENDED <sql> to see the logical plan, analyzed logical plan, optimized logical plan and physical plan. Also, SchemaRDD.toDebugString shows storage-related debugging information. On 11/21/14 4:11 AM, Gordon Benjamin wrote: hey, Can anyone tell me how to debug a sql
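A sketch of both debugging hooks (the query text is hypothetical):

    // prints the parsed, analyzed, optimized and physical plans
    sqlContext.sql("EXPLAIN EXTENDED SELECT * FROM logs WHERE level = 'ERROR'")
      .collect().foreach(println)

    // lineage / storage debugging information for the underlying RDD
    val schemaRdd = sqlContext.sql("SELECT * FROM logs WHERE level = 'ERROR'")
    schemaRdd.cache()
    println(schemaRdd.toDebugString)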

Re: querying data from Cassandra through the Spark SQL Thrift JDBC server

2014-11-22 Thread Cheng Lian
This thread might be helpful: http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282.html On 11/20/14 4:11 AM, Mohammed Guller wrote: Hi – I was curious if anyone is using the Spark SQL Thrift JDBC server with Cassandra. It would be great if you could share

Re: Spark S3 Performance

2014-11-22 Thread Nitay Joffe
Anyone have any thoughts on this? Trying to understand, especially for #2, whether it's a legit bug or something I'm doing wrong. - Nitay Founder CTO On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe ni...@actioniq.co wrote: I have a simple S3 job to read a text file and do a line count. Specifically I'm

Re: Spark S3 Performance

2014-11-22 Thread Nitay Joffe
Err, I meant #1 :) - Nitay Founder CTO On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe ni...@actioniq.co wrote: Anyone have any thoughts on this? Trying to understand, especially for #2, whether it's a legit bug or something I'm doing wrong. - Nitay Founder CTO On Thu, Nov 20, 2014 at 11:54 AM,

Spark or MR, Scala or Java?

2014-11-22 Thread Guillermo Ortiz
Hello, I'm a newbie with Spark but I've been working with Hadoop for a while. I have two questions. Is there any case where MR is better than Spark? I don't know in which cases I should use Spark rather than MR. When is MR faster than Spark? The other question is: I know Java; is it worth it to learn

RE: Spark or MR, Scala or Java?

2014-11-22 Thread Ashic Mahtab
Spark can do Map Reduce and more, and faster. One area where using MR would make sense is if you're using something (maybe like Mahout) that doesn't understand Spark yet (Mahout may be Spark compatible now...just pulled that name out of thin air!). You *can* use Spark from Java, but you'd have a

Re: SparkSQL Timestamp query failure

2014-11-22 Thread whitebread
Thanks for your answer Akhil, I have already tried that and the query no longer fails, but it doesn't return anything either, even though it should. Using single quotes, I think it reads it as a string and not as a timestamp. I don't know how to solve this. Any other hint by any chance? Thanks,

Re: Spark or MR, Scala or Java?

2014-11-22 Thread Denny Lee
Just to add some more stuff - there are various scenarios where traditional Hadoop makes more sense than Spark. For example, if you have a long running processing job in which you do not want to utilize too many resources of the cluster. Another example could be that you want to run a distributed

Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-22 Thread Alaa Ali
Thanks Alex! I'm actually working with views from HBase because I will never edit the HBase table from Phoenix and I'd hate to accidentally drop it. I'll have to work out how to create the view with the additional ID column. Regards, Alaa Ali On Fri, Nov 21, 2014 at 5:26 PM, Alex Kamil

Re: Spark or MR, Scala or Java?

2014-11-22 Thread Sean Owen
MapReduce is simpler and narrower, which also means it is generally lighter weight, with less to know and configure, and runs more predictably. If you have a job that is truly just a few maps, with maybe one reduce, MR will likely be more efficient. Until recently its shuffle has been more

Getting exception on JavaSchemaRDD; org.apache.spark.SparkException: Task not serializable

2014-11-22 Thread vdiwakar.malladi
Hi, I'm trying to load a parquet file for querying purposes from my web application. I am able to load it as a JavaSchemaRDD. But when using the map function on the JavaSchemaRDD, I'm getting the following exception. The class in which I'm using this code implements Serializable.

RDD with object shared across elements within a partition. Magic number 200?

2014-11-22 Thread insperatum
Hi all, I am trying to persist a Spark RDD in which the elements of each partition all share access to a single, large object. However, this object seems to get stored in memory several times. Reducing my problem down to the toy case of just a single partition with only 200 elements: val

Re: Spark streaming job failing after some time.

2014-11-22 Thread Sean Owen
That doesn't seem to be the problem though. It processes but then stops. Presumably there are many executors. On Nov 22, 2014 9:40 AM, Akhil Das ak...@sigmoidanalytics.com wrote: For Spark streaming, you must always set --executor-cores to a value which is >= 2. Or else it will not do any

Re: Missing parents for stage (Spark Streaming)

2014-11-22 Thread Sean Owen
This message appears in normal operation. I do not think it refers to anything in your code. On Nov 21, 2014 11:57 PM, YaoPau jonrgr...@gmail.com wrote: When I submit a Spark Streaming job, I see these INFO logs printing frequently: 14/11/21 18:53:17 INFO DAGScheduler: waiting: Set(Stage 216)

Re: Spark streaming job failing after some time.

2014-11-22 Thread pankaj channe
Thanks Akhil for your input. I have already tried with 3 executors and it still results in the same problem. So as Sean mentioned, the problem does not seem to be related to that. On Sat, Nov 22, 2014 at 11:00 AM, Sean Owen so...@cloudera.com wrote: That doesn't seem to be the problem

Re: Getting exception on JavaSchemaRDD; org.apache.spark.SparkException: Task not serializable

2014-11-22 Thread Akhil Das
You could be referencing/sending the StandardSessionFacade inside your map function. You could bring the StandardSessionFacade class in locally and make it Serializable to get it fixed quickly. Thanks Best Regards On Sat, Nov 22, 2014 at 10:02 PM, vdiwakar.malladi vdiwakar.mall...@gmail.com wrote: Hi

Re: Getting exception on JavaSchemaRDD; org.apache.spark.SparkException: Task not serializable

2014-11-22 Thread vdiwakar.malladi
Thanks for your prompt response. I'm not using anything in my map function. Please see the code below. For sample purposes, I am just using 'select * from '. This code worked for me in standalone mode. But when I integrated it with my web application, it started throwing the specified exception.

Re: Getting exception on JavaSchemaRDD; org.apache.spark.SparkException: Task not serializable

2014-11-22 Thread Sean Owen
You are declaring an anonymous inner class here. It has a reference to the containing class even if you don't use it. If the closure cleaner can't determine it isn't used, this reference will cause everything in the outer class to serialize. Try rewriting this as a named static inner class. On

Re: Spark S3 Performance

2014-11-22 Thread Andrei
Not that I'm a professional user of Amazon services, but I have a guess about your performance issues. From [1], there are two different filesystems over S3: native, which behaves just like regular files (scheme: s3n), and block-based, which looks more like HDFS (scheme: s3). Since you use s3n in your
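A sketch of the two URI schemes (bucket and key names are hypothetical):

    val native = sc.textFile("s3n://my-bucket/input/part-00000")  // S3 native: plain files, readable by other tools
    val block  = sc.textFile("s3://my-bucket/block-data")         // S3 block filesystem: HDFS-like block layout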

Re: Bug in Accumulators...

2014-11-22 Thread octavian.ganea
One month later, the same problem. I think that someone (e.g. the inventors of Spark) should show us a big example of how to use accumulators. I can start by saying that we need to see an example of the following form: val accum = sc.accumulator(0) sc.parallelize(Array(1, 2, 3, 4)).map(x =>

Re: Getting exception on JavaSchemaRDD; org.apache.spark.SparkException: Task not serializable

2014-11-22 Thread vdiwakar.malladi
Thanks. After writing it as a static inner class, that exception is gone. But now I'm getting a snappy-related exception. I can see the corresponding dependency in the spark assembly jar, yet I'm still getting the exception. Any quick suggestion on this? Here is the stack trace.

Error when Spark streaming consumes from Kafka

2014-11-22 Thread Bill Jay
Hi all, I am using Spark to consume from Kafka. However, after the job has run for several hours, I saw the following failure of an executor: kafka.common.ConsumerRebalanceFailedException: group-1416624735998_ip-172-31-5-242.ec2.internal-1416648124230-547d2c31 can't rebalance after 4 retries

SparkBigData.com: The Apache Spark Knowledge Base

2014-11-22 Thread Slim Baltagi
Hello all, I'm very pleased to announce the launch of http://www.SparkBigData.com: The Apache Spark Knowledge Base, your one-stop information resource dedicated to Apache Spark. SparkBigData.com provides free, easy and fast access to hundreds of Apache Spark resources organized in several

Re: Extracting values from a Collecion

2014-11-22 Thread Sanjay Subramanian
Thanks Jey. Regards, Sanjay. From: Jey Kottalam j...@cs.berkeley.edu To: Sanjay Subramanian sanjaysubraman...@yahoo.com Cc: Arun Ahuja aahuj...@gmail.com; Andrew Ash and...@andrewash.com; user user@spark.apache.org Sent: Friday, November 21, 2014 10:07 PM Subject: Extracting values from a

Merging Parquet Files

2014-11-22 Thread Daniel Haviv
Hi, I'm ingesting a lot of small JSON files and converting them to unified parquet files, but even the unified files are fairly small (~10MB). I want to run a merge operation every hour on the existing files, but it takes a lot of time for such a small amount of data: about 3 GB spread over 3000

Re: Bug in Accumulators...

2014-11-22 Thread Sean Owen
That seems to work fine. Add to your example def foo(i: Int, a: Accumulator[Int]) = a += i and add an action at the end to get the expression to evaluate: sc.parallelize(Array(1, 2, 3, 4)).map(x => foo(x, accum)).foreach(println) and it works, and you have accum with value 10 at the end. The
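Putting Sean's pieces together, a self-contained sketch (the app name is arbitrary):

    import org.apache.spark.{Accumulator, SparkConf, SparkContext}

    object AccumulatorExample {
      // helper that adds its argument to the accumulator
      def foo(i: Int, a: Accumulator[Int]): Unit = a += i

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("accumulator-example"))
        val accum = sc.accumulator(0)
        // the foreach action forces the lazy map to actually run
        sc.parallelize(Array(1, 2, 3, 4)).map(x => foo(x, accum)).foreach(println)
        println(accum.value) // 10
        sc.stop()
      }
    }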

Re: Error: Unrecognized option '--conf' (trying to set auto.offset.reset)

2014-11-22 Thread Sean Owen
First, the --conf error: What version of Spark? I don't think some of these options existed before 1.1, so that may be the issue. This is all on one line, I assume. Quoting is not an issue here. The real issue is that auto.offset.reset is indeed a Kafka option. It's not a system property; if it were, you
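A sketch of passing the Kafka consumer setting through kafkaParams rather than a Spark conf (a Spark 1.1-era streaming API is assumed; host, group and topic names are hypothetical):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-offsets"), Seconds(10))
    val kafkaParams = Map(
      "zookeeper.connect" -> "zk-host:2181",
      "group.id"          -> "my-group",
      "auto.offset.reset" -> "largest")   // Kafka consumer setting, not a Spark property

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)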

Re: Bug in Accumulators...

2014-11-22 Thread octavian.ganea
Hi Sowen, You're right, that example works, but look at this example, which does not work for me: object Main { def main(args: Array[String]) { val conf = new SparkConf().setAppName(name) val sc = new SparkContext(conf) val accum = sc.accumulator(0) for (i <- 1 to 10) { val

Re: Extracting values from a Collecion

2014-11-22 Thread Sanjay Subramanian
I could not iterate through the set but changed the code to get what I was looking for (not elegant, but it gets me going) package org.medicalsidefx.common.utils import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.SparkContext._ import scala.collection.mutable.ArrayBuffer /** *

Re: RDD with object shared across elements within a partition. Magic number 200?

2014-11-22 Thread insperatum
Some more details: Adding a println to the function reveals that it is indeed called only once. Furthermore, running: rdd.map(_.s.hashCode).min == rdd.map(_.s.hashCode).max // returns true ...reveals that all 1000 elements do indeed point to the same object, and so the data structure

Re: Bug in Accumulators...

2014-11-22 Thread lordjoe
I posted several examples in Java at http://lordjoesoftware.blogspot.com/ Generally code like this works and I show how to accumulate more complex values. // Make two accumulators using Statistics final Accumulator<Integer> totalLetters = ctx.accumulator(0L, "ttl");

Re: Spark or MR, Scala or Java?

2014-11-22 Thread Krishna Sankar
Adding to the already interesting answers: - Is there any case where MR is better than Spark? I don't know in which cases I should use Spark rather than MR. When is MR faster than Spark? - Many. MR would be better (I am not saying faster ;o)) for: very large datasets, multistage

Re: Bug in Accumulators...

2014-11-22 Thread Mohit Jaggi
Perhaps the closure ends up including the Main object, which is not defined as serializable... try making it a case object, or 'object Main extends Serializable'. On Sat, Nov 22, 2014 at 4:16 PM, lordjoe lordjoe2...@gmail.com wrote: I posted several examples in java at

Re: Spark S3 Performance

2014-11-22 Thread Andrei
Concerning your second question, I believe you are trying to set the number of partitions with something like this: rdd = sc.textFile(..., 8) but things like `textFile()` don't actually take a fixed number of partitions. Instead, they expect a *minimum* number of partitions. Since in your file you have 21
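A sketch illustrating the difference (the path is hypothetical):

    val rdd = sc.textFile("s3n://my-bucket/input.txt", 8)  // at least 8 partitions, possibly more
    println(rdd.partitions.length)

    val exactly8 = rdd.repartition(8)                       // forces exactly 8 partitions (incurs a shuffle)
    println(exactly8.partitions.length)                     // 8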

Re: Spark or MR, Scala or Java?

2014-11-22 Thread Soumya Simanta
Thanks Sean. Adding user@spark.apache.org again. On Sat, Nov 22, 2014 at 9:35 PM, Sean Owen so...@cloudera.com wrote: On Sun, Nov 23, 2014 at 2:20 AM, Soumya Simanta soumya.sima...@gmail.com wrote: Is the MapReduce API simpler or the implementation? Almost every Spark presentation has a

Re: Error when Spark streaming consumes from Kafka

2014-11-22 Thread Dibyendu Bhattacharya
I believe this is something to do with how the Kafka High Level API manages consumers within a consumer group and how it re-balances during failures. You can find some mention in this Kafka wiki: https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design Due to various issues in Kafka
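If the rebalance retries are the immediate problem, the Kafka 0.8 high-level consumer exposes settings that can be raised through the kafkaParams map passed to the stream; a sketch with illustrative values, not recommendations:

    val kafkaParams = Map(
      "zookeeper.connect"     -> "zk-host:2181",
      "group.id"              -> "group-1",
      "rebalance.max.retries" -> "10",    // default is 4, matching the error above
      "rebalance.backoff.ms"  -> "5000")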

Re: Persist kafka streams to text file, tachyon error?

2014-11-22 Thread Haoyuan Li
StorageLevel.OFF_HEAP requires running Tachyon: http://spark.apache.org/docs/latest/programming-guide.html If you don't know whether you have Tachyon or not, you probably don't :) http://tachyon-project.org/ For local testing, you can use other persist() options without running Tachyon. Best,
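A sketch of the distinction (the RDD here is just a placeholder):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000)
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)  // works without Tachyon
    // rdd.persist(StorageLevel.OFF_HEAP)          // requires a running Tachyon deployment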