Re: spark-sql use case beginner question

2017-03-08 Thread nancy henry
Okay, what is the difference between setting hive.execution.engine=spark and running the script through hivecontext.sql? On Mar 9, 2017 8:52 AM, "ayan guha" wrote: > Hi > > Subject to your version of Hive & Spark, you may want to set >
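As a rough illustration of the distinction being asked about (a sketch only, with a hypothetical table and query): a statement submitted through HiveContext.sql is planned and executed by Spark SQL itself, so hive.execution.engine=spark is never consulted; that setting only matters when the script is executed by Hive itself, e.g. via beeline.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hivecontext-demo")
    sqlContext = HiveContext(sc)

    # Executed by Spark SQL against the Hive metastore; no Hive execution
    # engine (MR, Tez or Spark) is involved in running this query.
    df = sqlContext.sql("SELECT dept, count(*) AS cnt FROM employees GROUP BY dept")
    df.show()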

[no subject]

2017-03-08 Thread sathyanarayanan mudhaliyar
code: directKafkaStream.foreachRDD(rdd -> { rdd.foreach(record -> { messages1.add(record._2); }); JavaRDD lines = sc.parallelize(messages1);
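A hedged PySpark rendering of what the Java snippet above appears to be attempting (broker and topic names are made up, and the spark-streaming-kafka-0-8 package is assumed to be on the classpath). Note that rdd.foreach runs on the executors, so appending to a driver-side list there never updates the driver's copy; collecting the values on the driver is the closest working equivalent for small batches.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="direct-kafka")
    ssc = StreamingContext(sc, 10)

    directKafkaStream = KafkaUtils.createDirectStream(
        ssc, ["some-topic"], {"metadata.broker.list": "broker1:9092"})

    def process(rdd):
        # Each record is a (key, value) pair; bring the values to the driver
        # (only sensible for small batches) and re-parallelize them.
        values = rdd.map(lambda record: record[1]).collect()
        lines = rdd.context.parallelize(values)
        # ... further processing on `lines` ...

    directKafkaStream.foreachRDD(process)

    ssc.start()
    ssc.awaitTermination()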

Re: spark-sql use case beginner question

2017-03-08 Thread ayan guha
Hi, Subject to your version of Hive & Spark, you may want to set hive.execution.engine=spark as a beeline command-line parameter, assuming you are running Hive scripts using the beeline command line (which is the suggested practice for security purposes). On Thu, Mar 9, 2017 at 2:09 PM, nancy henry

spark-sql use case beginner question

2017-03-08 Thread nancy henry
Hi Team, basically we have all our data as Hive tables, and until now we have been processing it in Hive on MR. Now that we have HiveContext, which can run Hive queries on Spark, we are making all these complex Hive scripts run using a hivecontext.sql(sc.textfile(hivescript)) kind of approach, i.e. basically running
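A minimal sketch of the approach described above, assuming a hypothetical script path: sc.textFile returns an RDD of lines, so the script text has to be collected and split into individual statements before each one is passed to hivecontext.sql (which executes a single statement at a time).

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="run-hive-script")
    hivecontext = HiveContext(sc)

    # Pull the script text back to the driver and run it statement by statement.
    script_text = "\n".join(sc.textFile("hdfs:///scripts/report.hql").collect())
    for statement in script_text.split(";"):
        if statement.strip():
            hivecontext.sql(statement)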

Re: spark-sql use case beginner question

2017-03-08 Thread nancy henry
Hi Team, basically we have all our data as Hive tables, and until now we have been processing it in Hive on MR. Now that we have HiveContext, which can run Hive queries on Spark, we are making all these complex Hive scripts run using a hivecontext.sql(sc.textfile(hivescript)) kind of approach, i.e. basically running

Re: question on Write Ahead Log (Spark Streaming )

2017-03-08 Thread Saisai Shao
IIUC, your scenario is quite like what ReliableKafkaReceiver currently does. You can only send an ack to the upstream source after the WAL is persisted; otherwise, because data receiving and data processing are asynchronous, there's still a chance data could be lost if you send out the ack before the WAL is written.

Re: Spark Beginner: Correct approach for use case

2017-03-08 Thread Allan Richards
Thanks for the feedback, everyone. We've had a look at different SQL-based solutions and have got good performance out of them, but some of the reports we make can't be generated with a single bit of SQL. This is just an investigation to see if Spark is a viable alternative. I've got another

Re: Huge partitioning job takes longer to close after all tasks finished

2017-03-08 Thread Swapnil Shinde
Thank you, Liu. Can you please explain what you mean by enabling Spark's fault-tolerance mechanism? I observed that after all tasks finish, Spark works on concatenating the same partitions from all tasks on the file system, e.g. task1 - partition1, partition2, partition3; task2 - partition1,

Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-03-08 Thread Muhammad Haseeb Javed
I was talking about the Kafka binary I am using to run the Kafka server (broker). The version of that binary is kafka_2.10-0.8.2.1, while Spark 2.0.2 is built with Scala 2.11. So the Kafka connector that Spark uses internally to communicate with the broker is also built with Scala

Apparent memory leak involving count

2017-03-08 Thread Facundo Domínguez
Hello, I'm running JavaRDD.count() repeatedly on a small RDD, and it seems to increase the size of the Java heap over time until the default limit is reached and an OutOfMemoryException is thrown. I'd expect this program to run in constant space, and the problem carries over to some more
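A minimal PySpark sketch of the scenario described (the thread itself concerns the Java API): counting the same small cached RDD in a loop, which one would expect to run in roughly constant driver memory.

    from pyspark import SparkContext

    sc = SparkContext(appName="repeated-count")
    rdd = sc.parallelize(range(1000)).cache()

    # Each iteration launches a fresh job over the same cached data.
    for _ in range(100000):
        rdd.count()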

question on Write Ahead Log (Spark Streaming )

2017-03-08 Thread kant kodali
Hi All, I am using a receiver-based approach, and I understand that the Spark Streaming APIs will convert the data received from the receiver into blocks, and these in-memory blocks are also stored in the WAL if one enables it. My upstream source, which is not Kafka, can also replay, by which I mean if
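For reference, a sketch of how the receiver WAL mentioned here is usually switched on (the configuration key is the standard Spark Streaming one; the checkpoint path is hypothetical, and checkpointing is needed so the WAL has somewhere durable to write):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("receiver-with-wal")
            .set("spark.streaming.receiver.writeAheadLog.enable", "true"))
    sc = SparkContext(conf=conf)

    ssc = StreamingContext(sc, 10)
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")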

Re: spark executor memory, jvm config

2017-03-08 Thread TheGeorge1918 .
OK, I found the problem. There is a typo in my configuration, and as a result executor dynamic allocation was not disabled, so executors were getting killed and requested again from time to time. All good now. On Wed, Mar 8, 2017 at 2:45 PM, TheGeorge1918 . wrote: > Hello all, >
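For anyone hitting the same thing, a sketch of the setting involved (values are illustrative): a typo in the key means Spark silently keeps dynamic allocation on, so executors are released and requested again as load changes.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("static-executors")
            .set("spark.dynamicAllocation.enabled", "false")  # correct key spelling
            .set("spark.executor.instances", "10"))           # fixed number of executors
    sc = SparkContext(conf=conf)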

spark executor memory, jvm config

2017-03-08 Thread TheGeorge1918 .
Hello all, I was running a Spark job and some executors failed without any error info. The executors were dead and new executors were requested, but on the Spark web UI no failure was found. Normally, if it's a memory issue, I can find an OOM there, but not this time. Configuration: 1. each executor has

Spark is inventing its own AWS secret key

2017-03-08 Thread Jonhy Stack
Hi, I'm trying to read an S3 bucket from Spark, and up until today Spark always complained that the request returned 403: hadoopConf = spark_context._jsc.hadoopConfiguration() hadoopConf.set("fs.s3a.access.key", "ACCESSKEY") hadoopConf.set("fs.s3a.secret.key", "SECRETKEY")
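A slightly fuller, self-contained version of the snippet quoted above (credentials and bucket name are placeholders, and the hadoop-aws/s3a libraries are assumed to be on the classpath); fs.s3a.access.key and fs.s3a.secret.key are the standard s3a Hadoop settings.

    from pyspark import SparkContext

    sc = SparkContext(appName="read-s3")
    hadoopConf = sc._jsc.hadoopConfiguration()
    hadoopConf.set("fs.s3a.access.key", "ACCESSKEY")   # placeholder
    hadoopConf.set("fs.s3a.secret.key", "SECRETKEY")   # placeholder

    rdd = sc.textFile("s3a://some-bucket/some-prefix/")
    print(rdd.take(5))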

Re: PySpark Serialization/Deserialization (Pickling) Overhead

2017-03-08 Thread rok
My guess is that the UI serialization times show the Java side only. To get a feeling for the Python pickling/unpickling, use the show_profiles() method of the SparkContext instance: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.show_profiles That will show you
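A short sketch of the suggestion above: the Python profiler has to be enabled with spark.python.profile for show_profiles() to have anything to report.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("pickle-profiling")
            .set("spark.python.profile", "true"))
    sc = SparkContext(conf=conf)

    # Run something so that per-stage Python-side profiles are collected.
    sc.parallelize(range(100000)).map(lambda x: x * 2).count()

    sc.show_profiles()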