Re: Bug in Accumulators...
As Mohit said, making Main extend Serializable should fix this example. In general, it's not a bad idea to also mark the fields you don't want to serialize (e.g., sc and conf in this case) as @transient, though that is not the issue here. Note that this problem would not have arisen in your very specific example if you had used a while loop instead of a for-each loop, but that's really more of a happy coincidence than something you should rely on, as nested lambdas are virtually unavoidable in Scala.

On Sat, Nov 22, 2014 at 5:16 PM, Mohit Jaggi mohitja...@gmail.com wrote: Perhaps the closure ends up including the main object, which is not defined as serializable... try making it a case object, or "object Main extends Serializable".

On Sat, Nov 22, 2014 at 4:16 PM, lordjoe lordjoe2...@gmail.com wrote: I posted several examples in Java at http://lordjoesoftware.blogspot.com/. Generally code like this works, and I show how to accumulate more complex values.

// Make an accumulator for counting letters
final Accumulator<Integer> totalLetters = ctx.accumulator(0, "ttl");
JavaRDD<String> lines = ...
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(final String s) throws Exception {
        totalLetters.add(s.length()); // count all letters
        return Arrays.asList(s.split(" ")); // split the line into words
    }
});
Integer numberCalls = totalLetters.value();

I believe the mistake is to pass the accumulator to the function rather than letting the function find the accumulator - I do this in this case by using a final local variable.
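A minimal sketch of the fix being described (names and paths are illustrative, not from the original report):

import org.apache.spark.{SparkConf, SparkContext}

object Main extends Serializable {
  // Not needed on the executors; these live on the driver only.
  @transient lazy val conf = new SparkConf().setAppName("AccumulatorExample")
  @transient lazy val sc = new SparkContext(conf)

  // A field reference from a lambda can drag Main into the closure,
  // which is why Main itself is made serializable here.
  val totalLetters = sc.accumulator(0L, "total letters")

  def main(args: Array[String]): Unit = {
    for (path <- Seq("a.txt", "b.txt")) { // the for-each loop from the report
      sc.textFile(path).foreach(line => totalLetters += line.length)
    }
    println(totalLetters.value)
  }
}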
Re: Bug in Accumulators...
Here, the Main object is not meant to be serialized. @transient ought to be for fields within an object that legitimately is supposed to be serialized, but whose value can be recreated on deserialization. I feel like marking objects that aren't logically Serializable as such is a hack, @transient extends that hack, and it will cause surprises later. Hack away for toy examples, but ideally the closure cleaner would snip whatever phantom reference is at work here. I usually try to rewrite the Scala, as you say, to avoid the issue rather than make things Serializable ad hoc.
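The rewrite being alluded to usually amounts to keeping everything the lambda needs in local vals (a sketch, not code from this thread):

import org.apache.spark.SparkContext

object Main { // no Serializable needed with this style
  def run(sc: SparkContext): Unit = {
    // Everything the lambda touches is a local val, so the closure has
    // no reason to reference (and therefore serialize) the enclosing object.
    val totalLetters = sc.accumulator(0L, "total letters")
    sc.textFile("input.txt").foreach(line => totalLetters += line.length)
    println(totalLetters.value)
  }
}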
Re: Spark serialization issues with third-party libraries
Thanks Sean. I was actually using instances created elsewhere inside my RDD transformations, which, as I understand it, is against the Spark programming model. I was referred to a talk about UIMA and Spark integration from this year's Spark Summit, which had a workaround for this problem: I just had to make some class members transient. http://spark-summit.org/2014/talk/leveraging-uima-in-spark Thanks - Novice Big Data Programmer
Re: Spark or MR, Scala or Java?
I am a newbie to Spark as well. I have been doing Hadoop/Hive/Oozie programming extensively before this, and I use Hadoop (Java MR code)/Hive/Impala/Presto on a daily basis. To jumpstart myself into Spark I started a GitHub repo with IntelliJ-ready-to-run code (simple examples of join, SparkSQL, etc.) and I will keep adding to it. I don't know Scala, and I am learning that too, to help me use Spark better. https://github.com/sanjaysubramanian/msfx_scala.git

Philosophically speaking, it's possibly not a good idea to take an either/or approach to technology. It's never going to be either RDBMS or NoSQL: if the Cassandra behind FB shows 100 fewer likes instead of 1000 on your photo one day for some reason, you may not be as upset, but if the Oracle/DB2 systems behind Wells Fargo show $100 LESS in your account due to a database error, you will be PANIC-ing. It's the same with Spark or Hadoop.

I can speak for myself: I have a use case for processing old logs that are multi-line (i.e., they have a [begin_timestamp_logid] and [end_timestamp_logid] and many lines in between). In Java Hadoop I created custom RecordReaders to solve this. I still don't know how to do this in Spark, so until then I will probably run the Hadoop code within Oozie in production. Also, my current task is evangelizing Big Data at my company. The tech people I can educate with Hadoop and Spark, and they would learn that, but not the business intelligence analysts. They love SQL, so I have to educate them using Hive, Presto, Impala... so the question is: what is your task or tasks? Sorry, a long non-technical answer to your question... Make sense? sanjay

From: Krishna Sankar ksanka...@gmail.com
Sent: Saturday, November 22, 2014 4:53 PM
Subject: Re: Spark or MR, Scala or Java?

Adding to already interesting answers:

- Is there any case where MR is better than Spark? Many. MR would be better (I am not saying faster ;o)) for very large datasets, multistage map-reduce flows, and complex map-reduce semantics. Spark is definitely better for the classic iterative/interactive workloads, and it is very effective for implementing in-memory datasets and real-time analytics. Take a look at the Lambda architecture, and also check out how Ooyala is using Spark in multiple layers of its configuration (they also have MR in many places). In our case, we found Spark very effective for ELT, where we would have used MR earlier.

- I know Java; is it worth learning Scala for Spark? Java will work fine, especially when Java 8 becomes the norm and we get back some of the elegance. I personally like Scala and Python a lot better than Java. Scala is a lot more elegant, but compilation, IDE integration et al. are still clunky. One word of caution: stick with one language as much as possible; shuffling between Java and Scala is not fun.

Cheers, HTH, k/

On Sat, Nov 22, 2014 at 8:26 AM, Sean Owen so...@cloudera.com wrote: MapReduce is simpler and narrower, which also means it is generally lighter weight, with less to know and configure, and it runs more predictably. If you have a job that is truly just a few maps, with maybe one reduce, MR will likely be more efficient. Until recently its shuffle was more developed and offered some semantics the Spark shuffle does not. I suppose it also integrates with tools like Oozie that Spark does not. I suggest learning enough Scala to use Spark in Scala; the amount you need to know is not large. (Mahout's MR-based implementations do not run on Spark and will not. They have been removed instead.)

On Nov 22, 2014 3:36 PM, Guillermo Ortiz konstt2...@gmail.com wrote: Hello, I'm a newbie with Spark, but I've been working with Hadoop for a while. I have two questions. Is there any case where MR is better than Spark? I don't know in which cases I should use Spark rather than MR, or when MR is faster than Spark. The other question is: I know Java; is it worth learning Scala to program Spark, or is it okay with just Java? I have written a little code in Java because I feel more confident with it, but it seems that I'm missing something.
Spark SQL Programming Guide - registerTempTable Error
Hi guys, I'm trying to follow the Spark SQL Programming Guide, but after:

case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))

when I issue:

people.registerTempTable("people")

I get:

<console>:20: error: value registerTempTable is not a member of org.apache.spark.rdd.RDD[Person]
       people.registerTempTable("people")

Why is that? What am I doing wrong?
Re: Spark or MR, Scala or Java?
This being a very broad topic, a discussion can quickly get subjective. I'll try not to deviate from my experiences and observations, to keep this thread useful to those looking for answers. I have used Hadoop MR (with Hive, the MR Java APIs, Cascading and Scalding) as well as Spark (since v0.6) in Scala. I learnt Scala for using Spark. My observations are below.

Spark and Hadoop MR:

1. There doesn't have to be a dichotomy between the Hadoop ecosystem and Spark, since Spark is a part of it.

2. Spark or Hadoop MR, there is no getting away from learning how partitioning, input splits, and the shuffle process work. In order to optimize performance, troubleshoot, and design software, one must know these. I recommend reading the first 6-7 chapters of the Hadoop: The Definitive Guide book to develop an initial understanding. Indeed, knowing a couple of divide-and-conquer algorithms is a prerequisite, and I assume everyone on this mailing list is very familiar :)

3. Having used a lot of different APIs and layers of abstraction for Hadoop MR, my experience progressing from the MR Java API -> Cascading -> Scalding is that each new API looks simpler than the previous one. However, the Spark API and abstraction have been the simplest, not only for me but for those who I have seen start with either Hadoop MR or Spark first. It is easiest to get started and become productive with Spark, with the exception of Hive for those who are already familiar with SQL. Spark's ease of use is critical for teams starting out with Big Data.

4. It is also extremely simple to chain multi-stage jobs in Spark; you do it without even realizing it by operating over RDDs. In Hadoop MR, one has to handle it explicitly.

5. Spark has built-in support for graph algorithms (including Bulk Synchronous Parallel (BSP) algorithms, e.g. Pregel), machine learning, and stream processing. In Hadoop MR you need a separate library/framework for each, and it is non-trivial to combine multiple of these in the same application. This is huge!

6. In Spark one does have to learn how to configure the memory and other parameters of one's cluster. Just to be clear, similar parameters exist in MR as well (e.g. shuffle memory parameters), but you don't *have* to learn about tuning them until you have jobs with larger data sizes. In Spark you learn this by reading the configuration and tuning documentation, followed by experimentation. This is an area of Spark where things can be better.

Java or Scala: I knew Java already, yet I learnt Scala when I came across Spark. As others have said, you can get started with a little bit of Scala and learn more as you progress. Once you have been using Scala for a few weeks you will want to stay with it instead of going back to Java. Scala is arguably more elegant and less verbose than Java, which translates into higher developer productivity and more maintainable code.

Myth: Spark is for in-memory processing *only*. This is a common beginner misunderstanding.

Sanjay: Spark uses the Hadoop API for performing I/O from file systems (local, HDFS, S3, etc.). Therefore you can use the same Hadoop InputFormat and RecordReader with Spark that you use with Hadoop for your multi-line record format; see the SparkContext APIs. Just like with Hadoop, you will need to make sure that your files are split at record boundaries. Hope this is helpful.
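Reading such records from Spark could look roughly like this (a sketch; MultiLineLogInputFormat is a hypothetical stand-in for your own new-API Hadoop InputFormat, not an existing class):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkContext

def readMultiLineLogs(sc: SparkContext) = {
  // MultiLineLogInputFormat (hypothetical) would split records at the
  // [begin_timestamp_logid] / [end_timestamp_logid] boundaries.
  sc.newAPIHadoopFile(
      "hdfs:///logs/",
      classOf[MultiLineLogInputFormat],
      classOf[LongWritable],
      classOf[Text])
    .map { case (_, value) => value.toString }
}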
Spark Streaming with Python
I am trying to run the network_wordcount.py example from https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py on the CDH 5.2 Quickstart VM and get the error below.

Traceback (most recent call last):
  File "/usr/lib/spark/examples/lib/network_wordcount.py", line 4, in <module>
    from pyspark.streaming import StreamingContext
ImportError: No module named streaming

How do I resolve this? Regards, Venkat
Re: Spark SQL Programming Guide - registerTempTable Error
By any chance are you using Spark 1.0.2? registerTempTable was introduced in Spark 1.1+; in Spark 1.0.2 it was registerAsTable.
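For concreteness, a sketch of the two calls against the guide's example (assuming the usual SQLContext setup from the 1.x docs):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD // implicitly converts RDD[Person] to a SchemaRDD

people.registerTempTable("people") // Spark 1.1+
// people.registerAsTable("people") // the Spark 1.0.x name for the same thing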
Re: Spark SQL Programming Guide - registerTempTable Error
That was the problem! Thank you Denny for your fast response! Another quick question: is there a fast way to update Spark to 1.1.0?
Python Logistic Regression error
Can you please suggest sample data for running logistic_regression.py? I am trying to use the sample data file at https://github.com/apache/spark/blob/master/data/mllib/sample_linear_regression_data.txt, running on the CDH 5.2 Quickstart VM:

[cloudera@quickstart mllib]$ spark-submit logistic_regression.py lr.txt 3

but I get the error below.

14/11/23 11:23:55 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
14/11/23 11:23:55 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/23 11:23:55 INFO TaskSchedulerImpl: Cancelling stage 0
14/11/23 11:23:55 INFO DAGScheduler: Failed to run runJob at PythonRDD.scala:296

Traceback (most recent call last):
  File "/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 50, in <module>
    model = LogisticRegressionWithSGD.train(points, iterations)
  File "/usr/lib/spark/python/pyspark/mllib/classification.py", line 110, in train
    initialWeights)
  File "/usr/lib/spark/python/pyspark/mllib/_common.py", line 430, in _regression_train_wrapper
    initial_weights = _get_initial_weights(initial_weights, data)
  File "/usr/lib/spark/python/pyspark/mllib/_common.py", line 415, in _get_initial_weights
    initial_weights = _convert_vector(data.first().features)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1127, in first
    rs = self.take(1)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1109, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/usr/lib/spark/python/pyspark/context.py", line 770, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.139.145): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/usr/lib/spark/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1105, in takeUpToNumLeft
    yield next(iterator)
  File "/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 37, in parsePoint
    values = [float(s) for s in line.split(' ')]
ValueError: invalid literal for float(): 1:0.4551273600657362

Regards, Venkat
Re: Spark or MR, Scala or Java?
On Sun, Nov 23, 2014 at 1:03 PM, Ashish Rangole arang...@gmail.com wrote: Scala is arguably more elegant and less verbose than Java which translates into higher developer productivity and more maintainable code.

Scala is arguably more elegant and less verbose than Java. However, Scala is also a complex language with a lot of details, tidbits, and one-offs that you just have to remember. It is sometimes difficult to decide whether what you wrote is using the language features most effectively, or whether you missed an available feature that could have made the code better or more concise. For Spark you really do not need to know that much Scala, but you do need to understand the essence of it. Thanks for the good discussion! :-) Ognen
Re: Spark SQL Programming Guide - registerTempTable Error
It sort of depends on your environment. If you are running in a local environment, I would just download the latest Spark 1.1 binaries and you'll be good to go. If it's a production environment, it depends on how you are set up (e.g., AWS, Cloudera, etc.).
Converting a column to a map
Hi, I have a column in my SchemaRDD that is a map, but I'm unable to convert it to one. I've tried converting it to a Tuple2[String, String]:

val converted = jsonFiles.map(line => { line(10).asInstanceOf[Tuple2[String, String]] })

but I get a ClassCastException:

14/11/23 11:51:30 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, localhost): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to scala.Tuple2

And if I convert it to Iterable[String], I can only get the values without the keys. What is the correct data type I should convert it to? Thanks, Daniel
Creating a front-end for output from Spark/PySpark
Hello. Okay, so I'm working on a project to run analytic processing using Spark or PySpark. Right now, I connect to the shell and execute my commands. The very first part of my commands is: create a JDBC SQL connection and cursor to pull from Apache Phoenix, do some processing on the returned data, and spit out some output. I want to create a web GUI tool where I can play around with which SQL query is executed for my analysis. I know that I can write my whole Spark program, use spark-submit, and have it accept an argument that is the SQL query I want to execute, but this means that on every submit an SQL connection is created, the query runs, processing is done, output is printed, the program closes along with the SQL connection, and then the whole thing repeats if I want to run another query right away. That will probably be very slow. Is there a way I can somehow have the SQL connection working in the backend, for example, so that all I have to do is supply a query from my GUI tool, which then takes it, runs it, and displays the output? I just want a big-picture, broad overview of how I would go about doing this and what additional technology to use, and I'll dig up the rest. Regards, Alaa Ali
Re: Error when Spark streaming consumes from Kafka
Hi Dibyendu, Thank you for the answer. I will try the Spark-Kafka consumer. Bill

On Sat, Nov 22, 2014 at 9:15 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: I believe this is something to do with how the Kafka High Level API manages consumers within a consumer group and how it re-balances during failure. You can find some mention in this Kafka wiki: https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design Due to various issues in the Kafka High Level API, Kafka is moving the High Level Consumer API to a completely new set of APIs in Kafka 0.9. Other than this coordination issue, the High Level consumer also has data-loss issues. You can probably try this Spark-Kafka consumer, which uses the Low Level Simple Consumer API, is more performant, and has no data-loss scenarios: https://github.com/dibbhatt/kafka-spark-consumer Regards, Dibyendu

On Sun, Nov 23, 2014 at 2:13 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am using Spark to consume from Kafka. However, after the job had run for several hours, I saw the following failure of an executor:

kafka.common.ConsumerRebalanceFailedException: group-1416624735998_ip-172-31-5-242.ec2.internal-1416648124230-547d2c31 can't rebalance after 4 retries
  at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
  at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:722)
  at kafka.consumer.ZookeeperConsumerConnector.consume(ZookeeperConsumerConnector.scala:212)
  at kafka.consumer.ZookeeperConsumerConnector.createMessageStreams(ZookeeperConsumerConnector.scala:138)
  at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:114)
  at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
  at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
  at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
  at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
  at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
  at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
  at org.apache.spark.scheduler.Task.run(Task.scala:54)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

Does anyone know the reason for this exception? Thanks! Bill
Re: Creating a front-end for output from Spark/PySpark
Alaa, one option is to use Spark as a cache, importing the subset of data from HBase/Phoenix that fits in memory, and using JdbcRDD to fetch more data on a cache miss. The front end can be created with PySpark and Flask, either as a REST API translating JSON requests into the Spark SQL dialect, or simply by allowing the user to post Spark SQL queries directly.
How to insert complex types like map<string,map<string,int>> in Spark SQL
Hi, I am trying to insert a particular set of data from an RDD into a Hive table. I have a Map[String,Map[String,Int]] in Scala which I want to insert into a table of type map<string,map<string,int>>. I was able to create the table, but on insert it says:

scala.MatchError: MapType(StringType,MapType(StringType,IntegerType,true),true) (of class org.apache.spark.sql.catalyst.types.MapType)

Can anyone help me with this? Thanks in advance.
How to keep a local variable in each cluster?
Hi, I am new to Spark; this is the first time I am posting here. Currently I am trying to implement ADMM optimization algorithms for Lasso/SVM, and I have come across a problem.

Since the training data (label, feature) is large, I created an RDD and cached the training data in memory. For ADMM, I need to keep local parameters (u, v), which are different for each partition. In each iteration, I need to use the training data (only the part on that partition) plus u and v to calculate the new values of u and v.

Question 1: One way is to zip (training data, u, v) into one RDD and update it in each iteration. But as we can see, the training data is large and does not change for the whole run; only u and v (which are small) change each iteration. If I zip these three, I cannot cache that RDD (since it changes every iteration), but if I don't cache it, I need to recompute the training data every iteration. How can I avoid that?

Question 2: Related to question 1: the online documentation says that if we don't cache an RDD, it will not be kept in memory, and RDDs use delayed (lazy) evaluation. I am confused about when a previous RDD is in memory:

Case 1:
B = A.map(function1)
B.collect()  # this forces B to be calculated? After that, the node just releases B since it is not cached?
D = B.map(function3)
D.collect()

Case 2:
B = A.map(function1)
D = B.map(function3)
D.collect()

Case 3:
B = A.map(function1)
C = A.map(function2)
D = B.map(function3)
D.collect()

In which case is B in memory on each node when I calculate D?

Question 3: Can I use a function to do operations on two RDDs? E.g.:

newfun(rdd1, rdd2)
# rdd1 is large and does not change for the whole run (training data), so I can use cache
# rdd2 is small and changes in each iteration (u, v)

Question 4: Are there other ways to solve this kind of problem? I think this is a common problem, but I could not find any good solutions. Thanks a lot, Han
Re: Execute Spark programs from local machine on Yarn-hadoop cluster
I think this IS possible. You must set the HADOOP_CONF_DIR variable on the machine where you run the Java process that creates the SparkContext. The Hadoop configuration specifies the YARN ResourceManager IPs, and Spark will use that configuration. mn

On Nov 21, 2014, at 8:10 AM, Prannoy pran...@sigmoidanalytics.com wrote: Hi Naveen, I don't think this is possible. If you are setting the master with your cluster details, you cannot execute any job from your local machine. You have to execute the jobs inside your YARN machine so that SparkConf is able to connect with all the provided details. If this is not the case, give a detailed explanation of what exactly you are trying to do :) Thanks.

On Fri, Nov 21, 2014 at 8:11 PM, Naveen Kumar Pokala wrote: Hi, I am executing my Spark jobs on a YARN cluster by forming the conf object in the following way:

SparkConf conf = new SparkConf().setAppName("NewJob").setMaster("yarn-cluster");

Now I want to execute Spark jobs from my local machine. How do I do that? What I mean is: is there a way to give the IP address, port, and all the other details needed to connect to a master (YARN) on some other network from my local Spark program? -Naveen
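A sketch of that setup (paths are illustrative; yarn-client mode is used here so the driver can run on the local machine):

// Before starting this JVM, point it at a local copy of the cluster's
// Hadoop config, which contains the ResourceManager address, e.g.:
//   export HADOOP_CONF_DIR=/path/to/copy-of-cluster-conf   (illustrative path)
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("NewJob").setMaster("yarn-client")
val sc = new SparkContext(conf)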
wholeTextFiles on 20 nodes
I have 20 nodes via EC2 and an application that reads the data via wholeTextFiles. I've tried to copy the data into Hadoop via copyFromLocal, and I get

14/11/24 02:00:07 INFO hdfs.DFSClient: Exception in createBlockOutputStream 172.31.2.209:50010 java.io.IOException: Bad connect ack with firstBadLink as X:50010
14/11/24 02:00:07 INFO hdfs.DFSClient: Abandoning block blk_-8725559184260876712_2627
14/11/24 02:00:07 INFO hdfs.DFSClient: Excluding datanode X:50010

a lot. Then I went the file-system route via copy-dir, which worked well. Now everything is under /root/txt on all nodes. I submitted the job with the file:///root/txt/ directory for wholeTextFiles() and I get

Exception in thread "main" java.io.FileNotFoundException: File does not exist: /root/txt/3521.txt

The file exists on the root node and should be everywhere according to copy-dir. The Hadoop variant worked fine with 3 nodes, but it starts misbehaving with 20. I added

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>

to hdfs-site.xml and core-site.xml, but it didn't help.
RE: SparkSQL Timestamp query failure
Hi, I think you can try cast(l.timestamp as string)='2012-10-08 16:10:36.0' Thanks, Daoyuan

-----Original Message-----
From: whitebread [mailto:ale.panebia...@me.com]
Sent: Sunday, November 23, 2014 12:11 AM
To: u...@spark.incubator.apache.org
Subject: Re: SparkSQL Timestamp query failure

Thanks for your answer Akhil, I have already tried that, and the query doesn't fail, but it doesn't return anything either, even though it should. Using single quotes, I think it reads the value as a string and not as a timestamp. I don't know how to solve this. Any other hint, by any chance? Thanks, Alessandro
Re: Lots of small input files
We encountered a similar problem. If all partitions are located on the same node and all of the tasks run in less than 3 seconds (set by spark.locality.wait, whose default value is 3000 ms), the tasks will run on that single node. Our solution is to use org.apache.hadoop.mapred.lib.CombineTextInputFormat to create tasks that are big enough. Of course, you can reduce spark.locality.wait, but that may not be efficient because it still creates many tiny tasks. Best Regards, Shixiong Zhu

2014-11-22 17:17 GMT+08:00 Akhil Das ak...@sigmoidanalytics.com: What is your cluster setup? Are you running a worker on the master node also? 1. Spark usually assigns the task to the worker that has the data locally available. If one worker has enough tasks, then I believe it will start assigning to others as well. You can control this with the level of parallelism and related settings. 2. If you coalesce it into one partition, then I believe only one of the workers will execute the single task. Thanks, Best Regards

On Fri, Nov 21, 2014 at 9:49 PM, Pat Ferrel p...@occamsmachete.com wrote: I have a job that searches for input recursively and creates a string of pathnames to treat as one input. The files are part-x files and they are fairly small. The job seems to take a long time to complete considering the size of the total data (150m), and it only runs on the master machine. The job only does rdd.map type operations. 1) Why doesn't it use the other workers in the cluster? 2) Is there a downside to using a lot of small part files? Should I coalesce them into one input file?
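Roughly, the CombineTextInputFormat route looks like this from Spark (a sketch; the split-size key below is the Hadoop 2.x name and may differ on your version):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.lib.CombineTextInputFormat

// Pack many small files into ~64 MB input splits, i.e. fewer, bigger tasks.
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)

val lines = sc.hadoopFile(
    "hdfs:///data/input/*",
    classOf[CombineTextInputFormat],
    classOf[LongWritable],
    classOf[Text])
  .map { case (_, text) => text.toString }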
Question about resource sharing in Spark Standalone
Dear all, Currently I am running a Spark standalone cluster with ~100 nodes. Multiple users can connect to the cluster via spark-shell or the PySpark shell. However, I can't find an efficient way to control the resources among multiple users. I can set spark.deploy.defaultCores on the server side to limit the CPUs for each application, but I cannot limit the memory usage of each application. Users can set

spark.executor.memory 10g
spark.python.worker.memory 10g

in ./conf/spark-defaults.conf on the client side, which means they can control how many resources they get. How can I control resources on the server side? Meanwhile, the fair scheduler requires the user to set the pool explicitly. What if the user doesn't set it? Can I maintain the user-to-pool mapping on the server side? Thanks!
Re: Spark or MR, Scala or Java?
Good point. On the positive side, whether we choose the most efficient mechanism in Scala might not be as important, as the Spark framework mediates the distributed computation. Even with the declarative parts of Spark, we can still choose an inefficient computation path that is not apparent to the framework. Cheers, k/ P.S: Now Reply to ALL
Re: Spark or MR, Scala or Java?
A very timely article: http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/ Cheers, k/ P.S: Now reply to ALL.
Re: SparkSQL Timestamp query failure
Hey Daoyuan, following your suggestion I obtain the same result as when I do:

where l.timestamp = '2012-10-08 16:10:36.0'

What happens, using either your suggestion or simply single quotes as in the example above, is that the query does not fail, but it doesn't return anything either, even though it should. If I do a simple:

sql("SELECT timestamp FROM Logs LIMIT 5").collect.foreach(println)

I get:

[2012-10-08 16:10:36.0]
[2012-10-08 16:10:36.0]
[2012-10-08 16:10:36.0]
[2012-10-08 16:10:41.0]
[2012-10-08 16:10:41.0]

which is why I am sure that querying for one of those timestamps should not return an empty array. I'd really love to find a solution to this problem. Since Spark supports Timestamp, it should provide simple comparison operations on them, in my opinion. Any other help would be greatly appreciated. Alessandro
RE: SparkSQL Timestamp query failure
Can you try a query like "SELECT timestamp, CAST(timestamp AS STRING) FROM logs LIMIT 5"? I guess you probably ran into the timestamp precision or the timezone shifting problem. (And it's not mandatory, but you'd better change the field name from "timestamp" to something else, as "timestamp" is the keyword for the data type in Hive/Spark SQL.)
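That diagnostic can be run directly from the shell, e.g. (a sketch; assumes the table was registered as "logs" on a SQLContext named sqlContext):

sqlContext.sql("SELECT timestamp, CAST(timestamp AS STRING) FROM logs LIMIT 5").collect().foreach(println)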
2 spark streaming questions
Hi, dear Spark Streaming developers and users, We are prototyping with Spark Streaming and hit the following 2 issues on which I would like to seek your expertise.

1) We have a Spark Streaming application in Scala that reads data from Kafka into a DStream, does some processing, and outputs a transformed DStream. If for some reason the Kafka connection is unavailable or times out, the Spark Streaming job starts to emit empty RDDs afterwards. The log is clean, without any ERROR indicator. I googled around and this seems to be a known issue. We believe the Spark Streaming infrastructure should either retry or return an error/exception. Can you share how you handle this case?

2) We would like to implement a Spark Streaming job that joins a 1-minute-duration DStream of real-time events with a metadata RDD that was read from a database. The metadata changes only slightly each day in the database. So what is the best practice for refreshing the RDD daily while keeping the streaming join job running? Is this doable as of Spark 1.1.0?

Thanks. Tian
Re: Spark or MR, Scala or Java?
Thanks a ton, Ashish. sanjay

From: Ashish Rangole arang...@gmail.com
Sent: Sunday, November 23, 2014 11:03 AM
Subject: Re: Spark or MR, Scala or Java?
Re: SparkSQL Timestamp query failure
Cheng, thanks. Thanks to you I found out that the problem, as you guessed, was a precision one: 2012-10-08 16:10:36 instead of 2012-10-08 16:10:36.0. Thanks again. Alessandro