Re: Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Chandraprakash Bhagtani
Thanks, It worked !!! On Tue, May 24, 2016 at 1:14 AM, Marcelo Vanzin wrote: > On Mon, May 23, 2016 at 4:41 AM, Chandraprakash Bhagtani > wrote: > > I am passing hive-site.xml through --files option. > > You need hive-site.xml in Spark's classpath

Re: why spark 1.6 use Netty instead of Akka?

2016-05-23 Thread Chaoqiang
Spark actually used to depend on Akka. Unfortunately this brought in all of Akka's dependencies (in addition to Spark's already quite complex dependency graph) and, as Todd mentioned, led to conflicts with projects using both Spark and Akka. It would probably be possible to use Akka and shade it

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread ayan guha
Hi Thanks for very useful stats. Did you have any benchmark for using Spark as backend engine for Hive vs using Spark thrift server (and run spark code for hive queries)? We are using later but it will be very useful to remove thriftserver, if we can. On Tue, May 24, 2016 at 9:51 AM, Jörn

Re: how to config spark thrift jdbc server high available

2016-05-23 Thread 7??
Dear Todd, thank you for your reply. I don't know how to reply to your message from Nabble.com. Can you tell me which jira works on spark thrift server HA, and how to reply from Nabble.com? Thanks a lot -- Original -- From: "Todd";; Date:

Re: how to config spark thrift jdbc server high available

2016-05-23 Thread 7??
I already found jira SPARK-11100. Can you tell me how to reply from Nabble.com? Thanks a lot -- Original -- From: "7??";<578967...@qq.com>; Date: Tue, May 24, 2016 09:59 AM To: "Todd"; Cc: "user"; Subject: Re: how

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Jörn Franke
Hi Mich, I think these comparisons are useful. One interesting aspect could be hardware scalability in this context. Additionally different type of computations. Furthermore, one could compare Spark and Tez+llap as execution engines. I have the gut feeling that each one can be justified by

RE: Timed aggregation in Spark

2016-05-23 Thread Ewan Leith
Rather than open a connection per record, if you do a DStream foreachRDD at the end of a 5 minute batch window http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams then you can do a rdd.foreachPartition to get the RDD partitions. Open a connection
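For illustration, a minimal Scala sketch of that pattern; createConnection and upsert are hypothetical stand-ins for a real Vertica/JDBC sink:

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical sink helpers -- substitute a real JDBC/Vertica client.
    def createConnection(): java.sql.Connection = ???
    def upsert(conn: java.sql.Connection, record: (String, Long)): Unit = ???

    def writeToSink(stream: DStream[(String, Long)]): Unit =
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val conn = createConnection()  // one connection per partition, not per record
          try records.foreach(upsert(conn, _))
          finally conn.close()
        }
      }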

Re: Hive_context

2016-05-23 Thread Arun Natva
Can you try a Hive JDBC Java client from eclipse and query a Hive table successfully? This way we can narrow down where the issue is. Sent from my iPhone > On May 23, 2016, at 5:26 PM, Ajay Chander wrote: > > I downloaded the spark 1.5 utilities and exported

Re: Hive_context

2016-05-23 Thread Ajay Chander
I downloaded the spark 1.5 utilities and exported SPARK_HOME pointing to it. I copied all the cluster configuration files (hive-site.xml, hdfs-site.xml, etc.) inside ${SPARK_HOME}/conf/. My application looks like below: public class SparkSqlTest { public static void main(String[] args)
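For comparison, a minimal HiveContext setup (a Scala sketch of the same idea; the database and table names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object SparkSqlTest {
      def main(args: Array[String]): Unit = {
        // hive-site.xml / hdfs-site.xml must be visible on the classpath,
        // e.g. via ${SPARK_HOME}/conf/ as described above.
        val sc = new SparkContext(new SparkConf().setAppName("SparkSqlTest"))
        val hiveContext = new HiveContext(sc)
        hiveContext.sql("SELECT * FROM some_db.some_table LIMIT 10").show()
        sc.stop()
      }
    }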

Re: Timed aggregation in Spark

2016-05-23 Thread Nikhil Goyal
I don't think this is solving the problem. So here are the issues: 1) How do we push entire data to vertica. Opening a connection per record will be too costly 2) If a key doesn't come again, how do we push this to vertica 3) How do we schedule the dumping of data to avoid loading too much data in

Re: Timed aggregation in Spark

2016-05-23 Thread Ofir Kerker
Yes, check out mapWithState: https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html From: Nikhil Goyal Sent: Monday, May 23, 2016 23:28 Subject: Timed aggregation in Spark To:
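A short Scala sketch of the mapWithState API referenced in that post (Spark 1.6+), keeping a running sum per key and timing out idle keys; the input stream type and the 10-minute timeout are assumptions:

    import org.apache.spark.streaming.{Seconds, State, StateSpec}
    import org.apache.spark.streaming.dstream.DStream

    // Running sum per key; idle keys expire after the timeout, which
    // gives a natural point to flush them to an external store.
    def aggregate(stream: DStream[(String, Long)]) = {
      val spec = StateSpec.function(
        (key: String, value: Option[Long], state: State[Long]) => {
          if (state.isTimingOut()) {
            (key, state.getOption.getOrElse(0L))  // final value for an expired key
          } else {
            val sum = value.getOrElse(0L) + state.getOption.getOrElse(0L)
            state.update(sum)
            (key, sum)
          }
        }
      ).timeout(Seconds(600))  // 10 minutes, illustrative
      stream.mapWithState(spec)
    }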

Timed aggregation in Spark

2016-05-23 Thread Nikhil Goyal
Hi all, I want to aggregate my data for 5-10 min and then flush the aggregated data to some database like vertica. updateStateByKey is not exactly helpful in this scenario as I can't flush all the records at once, neither can I clear the state. I wanted to know if anyone else has faced a similar

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Ashok Kumar
Hi Dr Mich, This is very good news. I will be interested to know how Hive engages with Spark as an engine. What Spark processes are used to make this work?  Thanking you On Monday, 23 May 2016, 19:01, Mich Talebzadeh wrote: Have a look at this thread Dr Mich

Hive_context

2016-05-23 Thread Ajay Chander
Hi Everyone, I am building a Java Spark application in eclipse IDE. From my application I want to use hiveContext to read tables from the remote Hive(Hadoop cluster). On my machine I have exported $HADOOP_CONF_DIR = {$HOME}/hadoop/conf/. This path has all the remote cluster conf details like

Re: Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Amit Sela
See SPARK-15489. I'll try to figure this one out as well; any leads? "Immediate suspects"? Thanks, Amit On Mon, May 23, 2016 at 10:27 PM Michael Armbrust wrote: > Can you open a JIRA? > > On Sun, May 22, 2016 at 2:50

Re: Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Marcelo Vanzin
On Mon, May 23, 2016 at 4:41 AM, Chandraprakash Bhagtani wrote: > I am passing hive-site.xml through --files option. You need hive-site.xml in Spark's classpath too. Easiest way is to copy / symlink hive-site.xml in your Spark's conf directory. -- Marcelo
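For example (paths illustrative):

    cp /etc/hive/conf/hive-site.xml $SPARK_HOME/conf/
    # or: ln -s /etc/hive/conf/hive-site.xml $SPARK_HOME/conf/hive-site.xml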

Re: Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Chandraprakash Bhagtani
Thanks Doug, I have all the 4 configs (mentioned by you) already in my hive-site.xml. Do I need to create a hive-site.xml in spark conf directory (it is not there by default in 1.6.1)? Please suggest. On Mon, May 23, 2016 at 9:53 PM, Doug Balog wrote: > I have a

Re: Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Michael Armbrust
Can you open a JIRA? On Sun, May 22, 2016 at 2:50 PM, Amit Sela wrote: > I've been using Encoders with Kryo to support encoding of generically > typed Java classes, mostly with success, in the following manner: > > public static Encoder encoder() { > return

Re: Dataset API and avro type

2016-05-23 Thread Michael Armbrust
If you are using the kryo encoder, you can only use it to map to/from kryo-encoded binary data. This is because Spark does not understand kryo's encoding; it's just using it as an opaque blob of bytes. On Mon, May 23, 2016 at 1:28 AM, Han JU wrote: > Just one more
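A small Scala sketch of that constraint; the Wrapper class is a hypothetical generically-typed class with no case-class structure:

    import org.apache.spark.sql.{Dataset, Encoder, Encoders, SQLContext}

    // Hypothetical class with no case-class/Product structure.
    class Wrapper(val name: String, val count: Long) extends Serializable

    def example(sqlContext: SQLContext): Unit = {
      implicit val enc: Encoder[Wrapper] = Encoders.kryo[Wrapper]
      val ds: Dataset[Wrapper] = sqlContext.createDataset(Seq(new Wrapper("a", 1L)))
      // Fine: mapping to/from the same kryo-encoded type.
      ds.map(w => new Wrapper(w.name, w.count + 1))
      // Not possible: asking Spark to reinterpret the stored bytes as an
      // unrelated type (e.g. an Avro record) -- Spark only sees an opaque blob.
    }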

Re: why spark 1.6 use Netty instead of Akka?

2016-05-23 Thread Jakob Odersky
Spark actually used to depend on Akka. Unfortunately this brought in all of Akka's dependencies (in addition to Spark's already quite complex dependency graph) and, as Todd mentioned, led to conflicts with projects using both Spark and Akka. It would probably be possible to use Akka and shade it

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Mich Talebzadeh
Have a look at this thread Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 23 May 2016 at 09:10, Mich

Re: How to set the degree of parallelism in Spark SQL?

2016-05-23 Thread Xinh Huynh
To the original question of parallelism and executors: you can have a parallelism of 200, even with 2 executors. In the Spark UI, you should see that the number of _tasks_ is 200 when your job involves shuffling. Executors vs. tasks: http://spark.apache.org/docs/latest/cluster-overview.html Xinh
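For Spark SQL the knob behind that 200 is spark.sql.shuffle.partitions; a quick spark-shell check (the value 64 is illustrative):

    // Number of post-shuffle partitions, i.e. tasks per shuffle stage.
    sqlContext.getConf("spark.sql.shuffle.partitions")   // "200" by default

    // Tune it independently of the executor count; executors only
    // determine how many of those tasks run concurrently.
    sqlContext.setConf("spark.sql.shuffle.partitions", "64")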

Re: Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Doug Balog
I have a custom hive-site.xml for Spark in Spark's conf directory. These properties are the minimal ones that you need for Spark, I believe. hive.metastore.kerberos.principal = copy from your hive-site.xml, i.e. "hive/_h...@foo.com" hive.metastore.uris = copy from your hive-site.xml, i.e.
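In hive-site.xml form (values illustrative; the archive truncates the remaining properties Doug lists):

    <property>
      <name>hive.metastore.kerberos.principal</name>
      <value>hive/_HOST@FOO.COM</value>
    </property>
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastore-host.foo.com:9083</value>
    </property>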

Re: What is the minimum value allowed for StreamingContext's Seconds parameter?

2016-05-23 Thread Mich Talebzadeh
Depends on what you are using it for. Three parameters are important: 1. Batch interval 2. Window duration 3. Slide duration. Batch interval is the basic interval at which the system will receive the data in batches. This is the interval set when creating a StreamingContext. For example,
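In Scala (a standalone sketch; durations and the socket source are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("windowing")
    // Batch interval: data is received in 2-second batches.
    val ssc = new StreamingContext(conf, Seconds(2))

    val lines = ssc.socketTextStream("localhost", 9999)  // example source
    // Window duration 10s (how much history each computation sees),
    // slide duration 4s (how often it fires); both must be multiples
    // of the batch interval.
    val windowed = lines.window(Seconds(10), Seconds(4))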

Re: What is the minimum value allowed for StreamingContext's Seconds parameter?

2016-05-23 Thread nsalian
Thanks for the question. What kind of data rate are you expecting to receive? - Neelesh S. Salian Cloudera -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-minimum-value-allowed-for-StreamingContext-s-Seconds-parameter-tp27007p27008.html

What is the minimum value allowed for StreamingContext's Seconds parameter?

2016-05-23 Thread YaoPau
Just wondering how small the microbatches can be, and any best practices on the smallest value that should be used in production. For example, any issue with running it at 0.01 seconds? -- View this message in context:

How to map values read from test file to 2 different RDDs

2016-05-23 Thread Deepak Sharma
Hi I am reading a text file with 16 fields. All the placeholders for the values of this text file have been defined in, say, 2 different case classes: Case1 and Case2. How do I map values read from the text file, so my function in Scala should be able to return 2 different RDDs, with each RDD of
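One way to do it, as a Scala sketch; comma-delimited input and an 8/8 field split are assumptions, and only a few fields are shown:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    case class Case1(f1: String, f2: String)   // ... first group of fields
    case class Case2(f9: String, f10: String)  // ... remaining fields

    def splitRdds(sc: SparkContext, path: String): (RDD[Case1], RDD[Case2]) = {
      // Cache the parsed lines so the file is read once but mapped twice.
      val fields = sc.textFile(path).map(_.split(",", -1)).cache()
      val rdd1 = fields.map(a => Case1(a(0), a(1)))
      val rdd2 = fields.map(a => Case2(a(8), a(9)))
      (rdd1, rdd2)
    }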

TFIDF question

2016-05-23 Thread Pasquinell Urbani
Hi all, I'm following a TF-IDF example but I'm having some issues that I'm not sure how to fix. The input is the following: val test = sc.textFile("s3n://.../test_tfidf_products.txt") test.collect.mkString("\n") which prints test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[370] at
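For context, the standard MLlib (RDD-based) TF-IDF recipe such examples follow, starting from the test RDD above; whitespace tokenization is an assumption:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val documents: RDD[Seq[String]] = test.map(_.split(" ").toSeq)

    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents).cache()  // cached: used twice below
    val idf = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)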

sqlContext.read.format("libsvm") not working with spark 1.6+

2016-05-23 Thread dbspace
I have downloaded the precompiled version of Apache Spark and tried to use sqlContext.read.format("libsvm") but I get java.lang.ClassNotFoundException: Failed to load class for data source: libsvm. I have posted this also in the Stack Overflow forum libsvm stackoverflow

Error making REST call from streaming app

2016-05-23 Thread Afshartous, Nick
Hi, We got the following exception trying to initiate a REST call from the Spark app. This is running Spark 1.5.2 in AWS / Yarn. It's only happened once during the course of a streaming app that has been running for months. Just curious if anyone could shed some more light on root

Re: Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Ted Yu
Can you describe the kerberos issues in more detail? Which release of YARN are you using? Cheers On Mon, May 23, 2016 at 4:41 AM, Chandraprakash Bhagtani < cpbhagt...@gmail.com> wrote: > Hi, > > My Spark job is failing with kerberos issues while creating hive context > in yarn-cluster mode.

Re: how to config spark thrift jdbc server high available

2016-05-23 Thread Todd
There is a jira that works on spark thrift server HA; the patch works, but still hasn't been merged into the master branch. At 2016-05-23 20:10:26, "qmzhang" <578967...@qq.com> wrote: >Dear guys, please help... > >In hive, we can enable hiveserver2 high availability by using dynamic service

Re: why spark 1.6 use Netty instead of Akka?

2016-05-23 Thread Todd
As far as I know, there would be an Akka version conflict issue when using Akka as a spark streaming source. At 2016-05-23 21:19:08, "Chaoqiang" wrote: >I want to know why spark 1.6 uses Netty instead of Akka. Are there some >difficult problems which Akka cannot

why spark 1.6 use Netty instead of Akka?

2016-05-23 Thread Chaoqiang
I want to know why spark 1.6 uses Netty instead of Akka. Are there some difficult problems which Akka cannot solve, but which Netty can solve easily? If not, can you give me some references about this change? Thank you -- View this message in context:

Re: How to set the degree of parallelism in Spark SQL?

2016-05-23 Thread Mathieu Longtin
Since the default is 200, I would guess you're only running 2 executors. Try to verify how many executors you are actually running with the web interface (port 8080, where the master is running). On Sat, May 21, 2016 at 11:42 PM Ted Yu wrote: > Looks like an equal sign is

how to config spark thrift jdbc server high available

2016-05-23 Thread qmzhang
Dear guys, please help... In hive, we can enable hiveserver2 high availability by using dynamic service discovery for HiveServer2. But how to enable spark thriftserver high availability? Thank you for your help -- View this message in context:

odd python.PythonRunner Times values?

2016-05-23 Thread Adrian Bridgett
I'm seeing output like this on our mesos spark slaves: 16/05/23 11:44:04 INFO python.PythonRunner: Times: total = 1137, boot = -590, init = 593, finish = 1134 16/05/23 11:44:04 INFO python.PythonRunner: Times: total = 1652, boot = -446, init = 481, finish = 1617 This seems to be coming from

spark streaming: issue with logging with separate log4j properties files for driver and executor

2016-05-23 Thread chandan prakash
Hi, I am able to do logging for driver but not for executor. I am running spark streaming under mesos. Want to do log4j logging separately for driver and executor. Used the below option in spark-submit command : --driver-java-options
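The usual shape of such a submit command (the property file names are illustrative):

    spark-submit \
      --files log4j-driver.properties,log4j-executor.properties \
      --driver-java-options "-Dlog4j.configuration=file:log4j-driver.properties" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j-executor.properties" \
      ...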

Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Chandraprakash Bhagtani
Hi, My Spark job is failing with kerberos issues while creating hive context in yarn-cluster mode. However it is running with yarn-client mode. My spark version is 1.6.1 I am passing hive-site.xml through --files option. I tried searching online and found that the same issue is fixed with the

Re: How spark depends on Guava

2016-05-23 Thread Jacek Laskowski
Hi Todd, It's used heavily for thread pool executors for one. Don't know about other uses. Jacek On 23 May 2016 5:49 a.m., "Todd" wrote: > Hi, > In the spark code, guava maven dependency scope is provided, my question > is, how spark depends on guava during runtime? I looked

Re: Re: How spark depends on Guava

2016-05-23 Thread Todd
Thanks Mat. When you look at the spark assembly jar (such as spark-assembly-1.6.0-hadoop2.6.1.jar), you will find that there are very few classes belonging to the guava library. So I am wondering where the guava library comes into play during run-time. At 2016-05-23 15:42:51, "Mat Schaffer"

Re: Spark for offline log processing/querying

2016-05-23 Thread Renato Marroquín Mogrovejo
We also did some benchmarking using analytical queries similar to TPC-H, both with Spark and Presto, and our conclusion was that Spark is a great general solution but for analytical SQL queries it is still not there yet. I mean for 10 or 100GB of data you will get your results back, but with Presto

Re: How to integrate Spark with OpenCV?

2016-05-23 Thread Jishnu Prathap
Hi Purbanir, Integrating Spark with OpenCV was pretty straightforward. Only 2 things to keep in mind: OpenCV should be installed on each worker, and System.loadLibrary() should be written in the program such that it is invoked for each worker once. Thanks & Regards Jishnu Prathap
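One common way to meet the "once per worker" requirement is a lazily-initialized singleton touched inside each task; a Scala sketch, assuming the OpenCV Java bindings are installed on every worker (processWithOpenCV is a hypothetical per-record function):

    import org.opencv.core.Core

    // A lazy val in a singleton object initializes once per JVM, i.e.
    // once per executor, the first time a task on that executor uses it.
    object OpenCVLoader {
      lazy val loaded: Boolean = {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
        true
      }
    }

    // rdd.mapPartitions { rows =>
    //   OpenCVLoader.loaded           // forces the native load on this worker
    //   rows.map(processWithOpenCV)   // hypothetical per-record function
    // }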

Spark Streaming - Exception thrown while writing record: BlockAdditionEvent

2016-05-23 Thread Ewan Leith
As we increase the throughput on our Spark streaming application, we're finding we hit errors with the WriteAheadLog, with errors like this: 16/05/21 20:42:21 WARN scheduler.ReceivedBlockTracker: Exception thrown while writing record:

Re: Dataset API and avro type

2016-05-23 Thread Han JU
Just one more question: is Dataset supposed to be able to cast data to an avro type? For a very simple format (a string and a long), I can cast it to a tuple or case class, but not an avro type (which also contains only a string and a long). The error is like this for this very simple type: ===

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Mich Talebzadeh
Hi Timur and everyone. I will answer your first question as it is very relevant: 1) How to make 2 versions of Spark live together on the same cluster (libraries clash, paths, etc.)? Most Spark users perform ETL and ML operations on Spark as well. So, we may have 3 Spark installations

Re: Spark for offline log processing/querying

2016-05-23 Thread Mat Schaffer
It's only really mildly interactive. When I used presto+hive in the past (just a consumer not an admin) it seemed to be able to provide answers within ~2m even for fairly large data sets. Hoping I can get a similar level of responsiveness with spark. Thanks, Sonal! I'll take a look at the example

Re: Does DataFrame has something like set hive.groupby.skewindata=true;

2016-05-23 Thread Virgil Palanciuc
It doesn't. However, if you have a very large number of keys, with a small number of very large keys, you can do one of the following: A. Use a custom partitioner that counts the number of items in a key and avoids putting large keys together; alternatively, if feasible (and needed), include part
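A Scala sketch of the key-salting approach (one common form of the alternative mentioned), assuming pairs: RDD[(String, Long)] and an illustrative salt of 10:

    import scala.util.Random

    // Spread each (possibly hot) key over 10 sub-keys, then aggregate twice.
    val salted  = pairs.map { case (k, v) => ((k, Random.nextInt(10)), v) }
    val partial = salted.reduceByKey(_ + _)                  // pass 1: salted keys
    val result  = partial.map { case ((k, _), v) => (k, v) }
                         .reduceByKey(_ + _)                 // pass 2: original keys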

Re: Spark for offline log processing/querying

2016-05-23 Thread Jörn Franke
Do you want to replace ELK by Spark? Depending on your queries you could do as you proposed. However, many of the text analytics queries will probably be much faster on ELK. If your queries are more interactive and not about batch processing then it does not make so much sense. I am not sure