Re: NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable

2015-12-07 Thread Fengdong Yu
Can you try like this in your sbt: val spark_version = "1.5.2" val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api") val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty") libraryDependencies ++= Seq( "org.apache.spark" %%
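A minimal build.sbt sketch of that exclusion approach (the module list and the "provided" scope are illustrative, not from the original message):

    val sparkVersion = "1.5.2"
    val excludeServletApi   = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api")
    val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty")

    libraryDependencies ++= Seq(
      // exclude the conflicting jars so the Jackson version Spark expects wins
      "org.apache.spark" %% "spark-core"      % sparkVersion % "provided" excludeAll(excludeServletApi, excludeEclipseJetty),
      "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided" excludeAll(excludeServletApi, excludeEclipseJetty)
    )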

HiveContext creation failed with Kerberos

2015-12-07 Thread Neal Yin
Hi, I am using Spark 1.5.1 with CDH 5.4.2. My cluster is Kerberos protected. Here is pseudocode for what I am trying to do. ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI("foo", "…") ugi.doAs( new PrivilegedExceptionAction() { val sparkConf: SparkConf = createSparkConf(…)
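A rough sketch of that doAs pattern (the principal, keytab path, and the work inside run() are placeholders):

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "foo@EXAMPLE.COM", "/path/to/foo.keytab")   // placeholders

    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kerberized-hive"))
        val hiveContext = new HiveContext(sc)     // the step that fails in the report above
        hiveContext.sql("show databases").show()
      }
    })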

Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Ruslan Dautkhanov
Try Phoenix from Cloudera parcel distribution https://blog.cloudera.com/blog/2015/11/new-apache-phoenix-4-5-2-package-from-cloudera-labs/ They may have better Kerberos support .. On Tue, Dec 8, 2015 at 12:01 AM Akhilesh Pathodia < pathodia.akhil...@gmail.com> wrote: > Yes, its a kerberized

NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable

2015-12-07 Thread Sunil Tripathy
I am getting the following exception when I use spark-submit to submit a spark streaming job. Exception in thread "main" java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;

Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Akhilesh Pathodia
Yes, its a kerberized cluster and ticket was generated using kinit command before running spark job. That's why Spark on hbase worked but when phoenix is used to get the connection to hbase, it does not pass the authentication to all nodes. Probably it is not handled in Phoenix version 4.3 or

Spark with MapDB

2015-12-07 Thread Ramkumar V
Hi, I'm running Java over Spark in cluster mode. I want to apply a filter on a JavaRDD based on some previous batch values. If I store those values in MapDB, is it possible to apply the filter during the current batch? *Thanks*,

Unable to access hive table (created through hive context) in hive console

2015-12-07 Thread Divya Gehlot
Hi, I am new to Spark and am using HDP 2.2, which comes with Spark 1.3.1. I tried the following code example > import org.apache.spark.sql.SQLContext > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.implicits._ > > val personFile = "/user/hdfs/TestSpark/Person.csv" >

RE: parquet file doubts

2015-12-07 Thread Singh, Abhijeet
Yes, Parquet has min/max. From: Cheng Lian [mailto:l...@databricks.com] Sent: Monday, December 07, 2015 11:21 AM To: Ted Yu Cc: Shushant Arora; user@spark.apache.org Subject: Re: parquet file doubts Oh sorry... At first I meant to cc spark-user list since Shushant and I had been discussing some

Re: Not able to receive data in spark from rsyslog

2015-12-07 Thread Akhil Das
Just make sure you are binding on the correct interface. - java.lang.ConnectException: Connection refused means Spark was not able to connect to that host/port. You can validate it by telnetting to that host/port. Thanks Best Regards On Fri, Dec 4, 2015 at 1:00 PM, masoom alam

[SPARK] Obtaining metrics of an individual Spark job

2015-12-07 Thread diplomatic Guru
Hello team, I need to present the Spark job performance to my management. I could get the execution time by measuring the starting and finishing time of the job (includes overhead). However, I am not sure how to get the other metrics, e.g. CPU, I/O, memory, etc. I want to measure the individual job,

spark sql current time stamp function ?

2015-12-07 Thread kali.tumm...@gmail.com
Hi All, Is there a spark sql function which returns current time stamp Example:- In Impala:- select NOW(); SQL Server:- select GETDATE(); Netezza:- select NOW(); Thanks Sri -- View this message in context:

Re: How the cores are used in Directstream approach

2015-12-07 Thread Akhil Das
You will have to do a repartition after creating the dstream to utilize all cores. directStream keeps exactly the same partitions as in kafka for spark. Thanks Best Regards On Thu, Dec 3, 2015 at 9:42 AM, Charan Ganga Phani Adabala < char...@eiqnetworks.com> wrote: > Hi, > > We have* 1 kafka
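A sketch of that suggestion, assuming the Spark 1.x Kafka direct API (broker, topic, and batch interval are illustrative):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // illustrative
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    // the direct stream has exactly one Spark partition per Kafka partition;
    // repartition to spread the work across all available cores
    val spread = stream.repartition(ssc.sparkContext.defaultParallelism)
    spread.foreachRDD(rdd => println(rdd.count()))

    ssc.start()
    ssc.awaitTermination()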

Re: Spark Streaming Shuffle to Disk

2015-12-07 Thread Akhil Das
UpdateStateByKey and your batch data could be filling up your executor memory and hence it might be hitting the disk; you can verify it by looking at the memory footprint while your job is running. Looking at the executor logs will also give you a better understanding of what's going on. Thanks

Re: Predictive Modeling

2015-12-07 Thread Akhil Das
You can write a simple python script to process the 1.5GB dataset, use the pandas library for building your predictive model. Thanks Best Regards On Fri, Dec 4, 2015 at 3:02 PM, Chintan Bhatt < chintanbhatt...@charusat.ac.in> wrote: > Hi, > I'm very much interested to make a predictive model

Re: How to access a RDD (that has been broadcasted) inside the filter method of another RDD?

2015-12-07 Thread Sean Owen
You can't broadcast an RDD to begin with, and can't use RDDs inside RDDs. They are really driver-side concepts. Yes that's how you'd use a broadcast of anything else though, though you need to reference ".value" on the broadcast. The 'if' is redundant in that example, and if it's a map- or

Re: Spark applications metrics

2015-12-07 Thread Akhil Das
Usually your application is composed of jobs and jobs are composed of tasks; on the task level you can see how much read/write happened from the Stages tab of your driver UI. Thanks Best Regards On Fri, Dec 4, 2015 at 6:20 PM, patcharee wrote: > Hi > > How can I

Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-07 Thread Akhil Das
What's in your SparkIsAwesome class? Just make sure that you are giving enough partitions to Spark to evenly distribute the job throughout the cluster. Try submitting the job this way: ~/spark/bin/spark-submit --executor-cores 10 --executor-memory 5G --driver-memory 5G --class

FW: Managed to make Hive run on Spark engine

2015-12-07 Thread Mich Talebzadeh
For those interested From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: 06 December 2015 20:33 To: u...@hive.apache.org Subject: Managed to make Hive run on Spark engine Thanks all, especially to Xuefu, for contributions. Finally it works, which means don’t give up until it works :)

Obtaining metrics of an individual Spark job

2015-12-07 Thread diplomatic Guru
Hello team, I need to present the Spark job performance to my management. I could get the execution time by measuring the starting and finishing time of the job (includes overhead). However, I am not sure how to get the other metrics, e.g. CPU, I/O, memory, etc. I want to measure the individual job,

Re: spark sql current time stamp function ?

2015-12-07 Thread kali.tumm...@gmail.com
I found a way out. import java.text.SimpleDateFormat import java.util.Date; val format = new SimpleDateFormat("-M-dd hh:mm:ss") val testsql=sqlContext.sql("select column1,column2,column3,column4,column5 ,'%s' as TIME_STAMP from TestTable limit 10".format(format.format(new Date(
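A cleaner sketch of that workaround (the date pattern is an assumption, and an existing sqlContext with a registered TestTable is assumed; note the timestamp is computed once on the driver and baked into the query as a string literal):

    import java.text.SimpleDateFormat
    import java.util.Date

    val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")   // assumed pattern
    val testsql = sqlContext.sql(
      "select column1, column2, '%s' as TIME_STAMP from TestTable limit 10"
        .format(format.format(new Date())))
    testsql.show()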

Re: spark sql current time stamp function ?

2015-12-07 Thread Ted Yu
Does unix_timestamp() satisfy your needs ? See sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala On Mon, Dec 7, 2015 at 6:54 AM, kali.tumm...@gmail.com < kali.tumm...@gmail.com> wrote: > I found a way out. > > import java.text.SimpleDateFormat > import java.util.Date; > > val

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-07 Thread Ewan Higgs
Jonathan, Did you ever get to the bottom of this? I have some users working with Spark in a classroom setting and our example notebooks run into problems where there is so much spilled to disk that they run out of quota. A 1.5G input set becomes >30G of spilled data on disk. I looked into how

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-07 Thread Sean Owen
I'm not sure if this is available in Python but from 1.3 on you should be able to call ALS.setFinalRDDStorageLevel with level "none" to ask it to unpersist when it is done. On Mon, Dec 7, 2015 at 1:42 PM, Ewan Higgs wrote: > Jonathan, > Did you ever get to the bottom of

Re: How to access a RDD (that has been broadcasted) inside the filter method of another RDD?

2015-12-07 Thread Akhil Das
Something like this? val broadcasted = sc.broadcast(...) RDD2.filter(value => { //simply use *broadcasted* if(broadcasted.contains(value)) true }) Thanks Best Regards On Fri, Dec 4, 2015 at 10:43 PM, Abhishek Shivkumar < abhishek.shivku...@bigdatapartnership.com> wrote: > Hi, > > I have
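A fuller sketch of this pattern, including the .value access Sean points out elsewhere in the thread (the data is illustrative; the key point is to broadcast a plain collection, not an RDD):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-filter").setMaster("local[*]"))

    // collect the small dataset to the driver, then broadcast it as a plain Set (not an RDD)
    val smallRdd = sc.parallelize(Seq("a", "b", "c"))
    val lookup   = sc.broadcast(smallRdd.collect().toSet)

    val rdd2     = sc.parallelize(Seq("a", "x", "b", "y"))
    val filtered = rdd2.filter(value => lookup.value.contains(value))   // note .value
    filtered.collect().foreach(println)                                 // prints a, b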

Spark sql random number or sequence numbers ?

2015-12-07 Thread kali.tumm...@gmail.com
Hi All, I implemented random_numbers using Scala Spark; is there a function to get a row_number equivalent in Spark SQL? example:- sql server:- row_number() Netezza:- sequence number mysql:- sequence number Example:- val testsql=sqlContext.sql("select

Re: AWS CLI --jars comma problem

2015-12-07 Thread Akhil Das
Not a direct answer but you can create a big fat jar combining all the classes in the three jars and pass it. Thanks Best Regards On Thu, Dec 3, 2015 at 10:21 PM, Yusuf Can Gürkan wrote: > Hello > > I have a question about AWS CLI for people who use it. > > I create a

Re: Getting error when trying to start master node after building spark 1.3

2015-12-07 Thread Akhil Das
Did you read http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support Thanks Best Regards On Fri, Dec 4, 2015 at 4:12 PM, Mich Talebzadeh wrote: > Hi, > > > > > > I am trying to make Hive work with Spark. > > > > I have been told that I

Re: Where to implement synchronization is GraphX Pregel API

2015-12-07 Thread Robineast
Not sure exactly what you're asking, but: 1) if you are asking do you need to implement synchronisation code - no, that is built into the call to Pregel 2) if you are asking how is synchronisation implemented in GraphX - the superstep starts and ends with the beginning and end of a while loop in the
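For reference, a minimal Pregel invocation (the classic single-source shortest paths example from the GraphX programming guide); the per-superstep synchronisation happens inside pregel itself, not in user code:

    import org.apache.spark.graphx._
    import org.apache.spark.graphx.util.GraphGenerators

    // sc is an existing SparkContext; a small random graph just for illustration
    val graph: Graph[Long, Double] =
      GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
    val sourceId: VertexId = 42L

    val initialGraph = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val sssp = initialGraph.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),            // vertex program
      triplet =>                                                 // send messages
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b)                                   // merge messages
    )
    println(sssp.vertices.collect.mkString("\n"))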

Re: parquet file doubts

2015-12-07 Thread Shushant Arora
How to read it using parquet tools? When I did hadoop parquet.tools.Main meta parquetfilename I didn't get any info on min and max values. How can I see the parquet version of my file? Is min/max specific to some parquet version or available since the beginning? On Mon, Dec 7, 2015 at 6:51 PM, Singh,

Re: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-12-07 Thread Nisrina Luthfiyati
Hi Jacek, thank you for your answer. I looked at TaskSchedulerImpl and TaskSetManager and it does look like tasks are directly sent to executors. Also, I would love to be corrected if mistaken, as I have little knowledge about Spark internals and am very new to Scala. On Tue, Dec 1, 2015 at 1:16 AM,

RE: Broadcasting a parquet file using spark and python

2015-12-07 Thread Shuai Zheng
Hi Michael, Thanks for the feedback. I am using version 1.5.2 now. Can you tell me how to enforce the broadcast join? I don’t want to let the engine decide the execution path of the join. I want to use a hint or parameter to enforce broadcast join (because I also have some cases are inner

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Hi, Robin, Thanks for your reply and thanks for copying my question to user mailing list. Yes, we have a distributed C++ application, that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that’s

Getting error when trying to start master node after building spark 1.3

2015-12-07 Thread Mich Talebzadeh
Thanks, sorted. Actually I used version 1.3.1 and now I managed to make it work as the Hive execution engine. Cheers, Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15

Re: spark.authenticate=true YARN mode doesn't work

2015-12-07 Thread Marcelo Vanzin
Prasad, As I mentioned in my first reply, you need to enable spark.authenticate in the shuffle service's configuration too for this to work. It doesn't seem like you have done that. On Sun, Dec 6, 2015 at 5:09 PM, Prasad Reddy wrote: > Hi Marcelo, > > I am attaching all

How to create dataframe from SQL Server SQL query

2015-12-07 Thread Wang, Ningjun (LNG-NPV)
How can I create an RDD from a SQL query against a SQL Server database? Here is the example of dataframe http://spark.apache.org/docs/latest/sql-programming-guide.html#overview val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:postgresql:dbserver", "dbtable" ->

Re: How to create dataframe from SQL Server SQL query

2015-12-07 Thread Sujit Pal
Hi Ningjun, Haven't done this myself; saw your question, was curious about the answer, and found this article which you might find useful: http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/ According to this article, you can pass in your SQL statement
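A sketch of that approach: the SQL statement goes in as a derived table via the dbtable option (the connection details and query are placeholders, and an existing sqlContext is assumed):

    val pushdownQuery = "(SELECT id, name FROM dbo.Person WHERE state = 'NY') AS person_subset"

    val jdbcDF = sqlContext.read.format("jdbc").options(Map(
      "url"      -> "jdbc:sqlserver://dbserver;databaseName=mydb",      // placeholder
      "driver"   -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
      "dbtable"  -> pushdownQuery,
      "user"     -> "username",
      "password" -> "password"
    )).load()

    jdbcDF.show()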

Re: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-12-07 Thread Jacek Laskowski
Hi, That's my understanding, too. Just spent an entire morning today to check it out and would be surprised to hear otherwise. Pozdrawiam, Jacek -- Jacek Laskowski | https://medium.com/@jaceklaskowski/ | http://blog.jaceklaskowski.pl Mastering Spark

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
-dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list) First up let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering but it’s not clear what you are trying to

RE: spark sql current time stamp function ?

2015-12-07 Thread Mich Talebzadeh
Or try this cast(from_unixtime(unix_timestamp()) AS timestamp) HTH Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf Author of
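Putting the suggestions in this thread together, a short sketch (assumes Spark 1.5+ or a HiveContext so these functions are registered, and an illustrative table called TestTable):

    // unix_timestamp() returns seconds since epoch; wrap it to get a timestamp column
    sqlContext.sql(
      "SELECT column1, cast(from_unixtime(unix_timestamp()) AS timestamp) AS load_ts FROM TestTable").show()

    // Spark 1.5+ also has a dedicated function
    sqlContext.sql("SELECT column1, current_timestamp() AS load_ts FROM TestTable").show()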

How to change StreamingContext batch duration after loading from checkpoint

2015-12-07 Thread yam
Is there a way to change the streaming context batch interval after reloading from checkpoint? I would like to be able to change the batch interval after restarting the application without losing the checkpoint, of course. Thanks! -- View this message in context:

Re: spark sql current time stamp function ?

2015-12-07 Thread Ted Yu
Have you tried using monotonicallyIncreasingId ? Cheers On Mon, Dec 7, 2015 at 7:56 AM, Sri wrote: > Thanks , I found the right function current_timestamp(). > > different Question:- > Is there a row_number() function in spark SQL ? Not in Data frame just > spark SQL? >

Re: spark sql current time stamp function ?

2015-12-07 Thread Sri
Thanks , I found the right function current_timestamp(). different Question:- Is there a row_number() function in spark SQL ? Not in Data frame just spark SQL? Thanks Sri Sent from my iPhone > On 7 Dec 2015, at 15:49, Ted Yu wrote: > > Does unix_timestamp() satisfy

Task hung on SocketInputStream.socketRead0 when reading a large amount of data from AWS S3

2015-12-07 Thread Sa Xiao
Hi, We encounter a problem very similar to this one: https://www.mail-archive.com/search?l=user@spark.apache.org=subject:%22Spark+task+hangs+infinitely+when+accessing+S3+from+AWS%22=newest=1 When reading a large amount of data from S3, one or several tasks hang. It doesn't happen every time,

persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Divya Gehlot
Hi, I am new to Spark. Could somebody guide me how I can persist my spark RDD results in Hive using the SaveAsTable API. Would appreciate if you could provide the example for a hive external table. Thanks in advance.

Re: mllib.recommendations.als recommendForAll not ported to ml?

2015-12-07 Thread Nick Pentreath
I can't find a JIRA for this, though there are some related to the existing MLlib implementation (https://issues.apache.org/jira/browse/SPARK-10802 and https://issues.apache.org/jira/browse/SPARK-11968) - would be good to port it over, and in the process also speed it up as per SPARK-11968, and

Available options for Spark REST API

2015-12-07 Thread sunil m
Dear Spark experts! I would like to know the best practices used for invoking spark jobs via REST API. We tried out the hidden REST API mentioned here: http://arturmkrtchyan.com/apache-spark-hidden-rest-api It works fine for spark standalone mode but does not seem to be working when I specify

Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread UMESH CHAUDHARY
Currently saveAsTable will create a Hive internal table by default (see here). If you want to save it as an external table, use saveAsParquetFile and create an external Hive table on that parquet file. On Mon,

How to config the log in Spark

2015-12-07 Thread Guillermo Ortiz
I can't manage to activate the logs for my classes. I'm using CDH 5.4 with Spark 1.3.0. I have a class in Scala with some log.debug, and I created a class to log: package example.spark import org.apache.log4j.Logger object Holder extends Serializable { @transient lazy val log =

Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Fengdong Yu
If your RDD is JSON format, that’s easy. val df = sqlContext.read.json(rdd) df.saveAsTable("your_table_name") > On Dec 7, 2015, at 5:28 PM, Divya Gehlot wrote: > > Hi, > I am new to Spark. > Could somebody guide me how can I persist my spark RDD results in Hive

Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Fengdong Yu
I suppose your output data is "ORC", and you want to save to hive database: test, external table name is: testTable import scala.collection.immutable sqlContext.createExternalTable("test.testTable", "org.apache.spark.sql.hive.orc", Map("path" -> "/data/test/mydata")) > On Dec 7, 2015, at 5:28
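A slightly fuller sketch of that write-then-register flow (paths, database, and table names are illustrative; an existing SparkContext sc and a DataFrame df built with the same HiveContext are assumed):

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)

    // write the DataFrame out as ORC files
    df.write.format("orc").save("/data/test/mydata")

    // register an external Hive table over the files just written
    sqlContext.createExternalTable(
      "test.testTable",
      "org.apache.spark.sql.hive.orc",
      Map("path" -> "/data/test/mydata"))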

Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Divya Gehlot
My input format is CSV and I am using Spark 1.3 (HDP 2.2 comes with Spark 1.3, so ...) I am using Spark-csv to read my CSV file and using the dataframe API to process ... I followed these steps and was successfully able to read

How to use all available memory per worker?

2015-12-07 Thread George Sigletos
Hello, In a 2-worker cluster: 6 cores/30 GB RAM, 24 cores/60 GB RAM, how can I tell my executors to use all 90 GB of available memory? In the configuration you set e.g. "spark.cores.max" to 30 (24+6), but cannot set "spark.executor.memory" to 90g (30+60). Kind regards, George

Re: how create hbase connect?

2015-12-07 Thread censj
ok! I'll try it. > On 7 Dec 2015, at 20:11, ayan guha wrote: > > Kindly take a look https://github.com/nerdammer/spark-hbase-connector > > > On Mon, Dec 7, 2015 at 10:56 PM, censj

Spark and Kafka Integration

2015-12-07 Thread Prashant Bhardwaj
Hi Some Background: We have a Kafka cluster with ~45 topics. Some of the topics contain logs in JSON format and some in PSV (pipe-separated value) format. Now I want to consume these logs using Spark streaming and store them in Parquet format in HDFS. Now my question is: 1. Can we create a

Re: Scala 2.11 and Akka 2.4.0

2015-12-07 Thread RodrigoB
Hi Manas, Thanks for the reply. I've done that. The problem lies with Spark + akka 2.4.0 build. Seems the maven shader plugin is altering some class files and breaking the Akka runtime. Seems the Spark build on Scala 2.11 using SBT is broken. I'm getting build errors using sbt due to the issues

how create hbase connect?

2015-12-07 Thread censj
hi all, I want to update a row on HBase. How to create an HBase connection from an RDD? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: how create hbase connect?

2015-12-07 Thread ayan guha
Kindly take a look https://github.com/nerdammer/spark-hbase-connector On Mon, Dec 7, 2015 at 10:56 PM, censj wrote: > hi all, > I want to update a row on HBase. How to create an HBase connection from an RDD? > > -

Re: Available options for Spark REST API

2015-12-07 Thread Василец Дмитрий
hello, if I understand correctly - the Spark UI with its REST API is for monitoring, and spark-jobserver is for submitting jobs. On Mon, Dec 7, 2015 at 9:42 AM, sunil m <260885smanik...@gmail.com> wrote: > Dear Spark experts! > > I would like to know the best practices used for invoking spark jobs via > REST API. > >

RE: Spark and Kafka Integration

2015-12-07 Thread Singh, Abhijeet
For Q2. The order of the logs in each partition is guaranteed but there cannot be any such thing as global order. From: Prashant Bhardwaj [mailto:prashant2006s...@gmail.com] Sent: Monday, December 07, 2015 5:46 PM To: user@spark.apache.org Subject: Spark and Kafka Integration Hi Some

Re: python rdd.partionBy(): any examples of a custom partitioner?

2015-12-07 Thread Fengdong Yu
refer here: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html of section: Example 4-27. Python custom partitioner > On Dec 8, 2015, at 10:07 AM, Keith Freeman <8fo...@gmail.com> wrote: > > I'm not a python expert, so I'm wondering if anybody has a

Best way to save key-value pair rdd ?

2015-12-07 Thread Anup Sawant
Hello, what would be the best way to save a key-value pair RDD so that I don't have to convert the saved records into tuples while reading the RDD back into Spark? -- Best, Anup

Kryo Serialization in Spark

2015-12-07 Thread prasad223
Hi All, I'm unable to use Kryo serializer in my Spark program. I'm loading a graph from an edgelist file using GraphLoader and performing a BFS using pregel API. But I get the below mentioned error while I'm running. Can anybody tell me what is the right way to serialize a class in Spark and what
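For reference, the usual way to turn on Kryo and register the classes that get shipped to executors (the attribute/message classes below are placeholders standing in for whatever the BFS code actually uses):

    import org.apache.spark.{SparkConf, SparkContext}

    // example classes standing in for the real vertex/message types
    case class VertexAttr(distance: Int)
    case class Message(distance: Int)

    val conf = new SparkConf()
      .setAppName("pregel-bfs")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")   // optional: fail fast on unregistered classes
      .registerKryoClasses(Array(classOf[VertexAttr], classOf[Message]))

    val sc = new SparkContext(conf)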

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
Robin, Maybe you didn't read my post in which I stated that Spark works on top of HDFS. What Jia wants is to have Spark interacts with a C++ process to read and write data. I've never heard about Jia's use case in Spark. If you know one, please share that with me. Thanks On Monday,

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
Hi Annabel I certainly did read your post. My point was that Spark can read from HDFS but is in no way tied to that storage layer . A very interesting use case that sounds very similar to Jia's (as mentioned by another poster) is contained in https://issues.apache.org/jira/browse/SPARK-10399.

Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Akhilesh Pathodia
Hi, I am running a spark job on yarn in cluster mode in a secured cluster. I am trying to run Spark on Hbase using Phoenix, but Spark executors are unable to get an hbase connection using phoenix. I am running the kinit command to get the ticket before starting the job and also the keytab file and principal are

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
Robin, To prove my point, this is an unresolved issue still in the implementation stage. On Monday, December 7, 2015 2:49 PM, Robin East wrote: Hi Annabel I certainly did read your post. My point was that Spark can read from HDFS but is in no way tied to that

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Nick Pentreath
SparkNet may have some interesting ideas - https://github.com/amplab/SparkNet. Haven't had a deep look at it yet but it seems to have some functionality allowing caffe to read data from RDDs, though I'm not certain the memory is shared. — Sent from Mailbox On Mon, Dec 7, 2015 at 9:55 PM,

Re: Dataset and lambas

2015-12-07 Thread Michael Armbrust
These specific JIRAs don't exist yet, but watch SPARK- as we'll make sure everything shows up there. On Sun, Dec 6, 2015 at 10:06 AM, Koert Kuipers wrote: > that's good news about plans to avoid unnecessary conversions, and allow > access to more efficient internal types.

Re: Dataset and lambas

2015-12-07 Thread Michael Armbrust
On Sat, Dec 5, 2015 at 3:27 PM, Deenar Toraskar wrote: > > On a similar note, what is involved in getting native support for some > user defined functions, so that they are as efficient as native Spark SQL > expressions? I had one particular one - an arraySum (element

Re: Dataset and lambas

2015-12-07 Thread Koert Kuipers
great thanks On Mon, Dec 7, 2015 at 3:02 PM, Michael Armbrust wrote: > These specific JIRAs don't exist yet, but watch SPARK- as we'll make > sure everything shows up there. > > On Sun, Dec 6, 2015 at 10:06 AM, Koert Kuipers wrote: > >> that's

Re: Dataset and lambas

2015-12-07 Thread Deenar Toraskar
Michael Having VectorUnionSumUDAF implemented would be great. This is quite generic, it does element-wise sum of arrays and maps https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/timeseries/VectorUnionSumUDAF.java and would be a massive benefit for a lot of risk

Re: Implementing fail-fast upon critical spark streaming tasks errors

2015-12-07 Thread Cody Koeninger
Personally, for jobs that I care about I store offsets in transactional storage rather than checkpoints, which eliminates that problem (just enforce whatever constraints you want when storing offsets). Regarding the question of communication of errors back to the streamingListener, there is an

Re: [streaming] KafkaUtils.createDirectStream - how to start streming from checkpoints?

2015-12-07 Thread Cody Koeninger
Just to be clear, spark checkpoints have nothing to do with zookeeper, they're stored in the filesystem you specify. On Sun, Dec 6, 2015 at 1:25 AM, manasdebashiskar wrote: > When you enable check pointing your offsets get written in zookeeper. If > your program dies or

RE: How to create dataframe from SQL Server SQL query

2015-12-07 Thread ayan guha
One more thing I feel would improve maintainability is to create a DB view and then use the view in spark. This will avoid burying complicated SQL queries within application code. On 8 Dec 2015 05:55, "Wang, Ningjun (LNG-NPV)" wrote: > This is a very helpful article.

Re: Spark SQL 1.3 not finding attribute in DF

2015-12-07 Thread Davies Liu
Could you reproduce this problem in 1.5 or 1.6? On Sun, Dec 6, 2015 at 12:29 AM, YaoPau wrote: > If anyone runs into the same issue, I found a workaround: > df.where('state_code = "NY"') > > works for me. > df.where(df.state_code == "NY").collect() > > fails with

Re: spark sql current time stamp function ?

2015-12-07 Thread sri hari kali charan Tummala
Hi Ted, it gave an exception; am I following the right approach? val test=sqlContext.sql("select *, monotonicallyIncreasingId() from kali") On Mon, Dec 7, 2015 at 4:52 PM, Ted Yu wrote: > Have you tried using monotonicallyIncreasingId ? > > Cheers > > On Mon, Dec 7, 2015 at

Re: spark sql current time stamp function ?

2015-12-07 Thread Ted Yu
scala> val test=sqlContext.sql("select monotonically_increasing_id() from t").show +---+ |_c0| +---+ | 0| | 1| | 2| +---+ Cheers On Mon, Dec 7, 2015 at 12:48 PM, sri hari kali charan Tummala < kali.tumm...@gmail.com> wrote: > Hi Ted, > > Gave and exception am I following right approach ? > >
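On the row_number() part of the question: with a HiveContext (Spark 1.4+), the window-function form should also work directly in SQL. A sketch (the table and columns are illustrative):

    // hiveContext is an existing org.apache.spark.sql.hive.HiveContext
    hiveContext.sql(
      """SELECT column1, column2,
        |       row_number() OVER (ORDER BY column1) AS rn
        |FROM TestTable""".stripMargin).show()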

Example of a Trivial Custom PySpark Transformer

2015-12-07 Thread Andy Davidson
FYI Hopefully others will find this example helpful. Andy Example of a Trivial Custom PySpark Transformer ref: * NLTKWordPunctTokenizer example * pyspark.sql.functions.udf

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jian Feng
The only way I can think of is through some kind of wrapper. For java/scala, use JNI. For Python, use extensions. There should not be a lot of work if you know these tools. From: Robin East To: Annabel Melongo Cc: Jia

Re: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-12-07 Thread Ali Tajeldin EDU
Checkout the Sameer Farooqui video on youtube for spark internals (https://www.youtube.com/watch?v=7ooZ4S7Ay6Y=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX) Starting at 2:15:00, he describes YARN mode. btw, highly recommend the entire video. Very detailed and concise. -- Ali On Dec 7, 2015, at 8:38

Re: SparkSQL AVRO

2015-12-07 Thread Ruslan Dautkhanov
How many reducers did you have that created those avro files? Each reducer very likely creates its own avro part-file. We normally use Parquet, but it should be the same for Avro, so this might be relevant

Re: spark sql current time stamp function ?

2015-12-07 Thread Ted Yu
BTW I forgot to mention that this was added through SPARK-11736 which went into the upcoming 1.6.0 release FYI On Mon, Dec 7, 2015 at 12:53 PM, Ted Yu wrote: > scala> val test=sqlContext.sql("select monotonically_increasing_id() from > t").show > +---+ > |_c0| > +---+ > |

Re: Spark on hbase using Phoenix in secure cluster

2015-12-07 Thread Ruslan Dautkhanov
That error is not directly related to spark nor hbase javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] Is this a kerberized cluster? You likely don't have a good (non-expired)

Re: Spark SQL 1.3 not finding attribute in DF

2015-12-07 Thread Jon Gregg
I'm working with a Hadoop distribution that doesn't support 1.5 yet, we'll be able to upgrade in probably two months. For now I'm seeing the same issue with spark not recognizing an existing column name in many hive-table-to-dataframe situations: Py4JJavaError: An error occurred while calling

Re: SparkSQL AVRO

2015-12-07 Thread Deenar Toraskar
By default Spark will create one file per partition. Spark SQL defaults to using 200 partitions. If you want to reduce the number of files written out, repartition your dataframe using repartition and give it the desired number of partitions.
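A sketch of what that looks like with the spark-avro package (df is an existing DataFrame; the partition count and output path are illustrative):

    // fewer partitions at write time -> fewer output part-files
    df.repartition(10)
      .write
      .format("com.databricks.spark.avro")
      .save("/output/path/avro")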

issue creating pyspark Transformer UDF that creates a LabeledPoint: AttributeError: 'DataFrame' object has no attribute '_get_object_id'

2015-12-07 Thread Andy Davidson
Hi I am running into a strange error. I am trying to write a transformer that takes in two columns and creates a LabeledPoint. I can not figure out why I am getting AttributeError: 'DataFrame' object has no attribute '_get_object_id' I am using spark-1.5.1-bin-hadoop2.6 Any idea what I am

Re: Managed to make Hive run on Spark engine

2015-12-07 Thread Ashok Kumar
This is great news sir. It shows perseverance pays at last. Can you inform us when the write-up is ready so I can set it up as well please. I know a bit about the advantages of having Hive using Spark engine. However, the general question I have is when one should use Hive on spark as opposed to

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Okay maybe these errors are more helpful - WARN server.TransportChannelHandler: Exception in connection from ip-10-0-0-138.ec2.internal/10.0.0.138:39723 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at

Local Mode: Executor thread leak?

2015-12-07 Thread Richard Marscher
Hi, I've been running benchmarks against Spark in local mode in a long running process. I'm seeing threads leaking each time it runs a job. It doesn't matter if I recycle SparkContext constantly or have 1 context stay alive for the entire application lifetime. I see a huge accumulation ongoing

Spark SQL - saving to multiple partitions in parallel - FileNotFoundException on _temporary directory possible bug?

2015-12-07 Thread Deenar Toraskar
Hi I have a process that writes to multiple partitions of the same table in parallel using multiple threads sharing the same SQL context df.write.partitionedBy("partCol").insertInto("tableName") . I am getting FileNotFoundException on _temporary directory. Each write only goes to a single

Re: Local Mode: Executor thread leak?

2015-12-07 Thread Shixiong Zhu
Which version are you using? Could you post these thread names here? Best Regards, Shixiong Zhu 2015-12-07 14:30 GMT-08:00 Richard Marscher : > Hi, > > I've been running benchmarks against Spark in local mode in a long running > process. I'm seeing threads leaking each

Re: Local Mode: Executor thread leak?

2015-12-07 Thread Richard Marscher
Thanks for the response. The version is Spark 1.5.2. Some examples of the thread names: pool-1061-thread-1 pool-1059-thread-1 pool-1638-thread-1 There are hundreds, then thousands, of these stranded in WAITING. I added logging to try to track the lifecycle of the thread pool in Executor as

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have looked through the logs and do not see any WARNING or ERRORs - the executors just seem to stop logging. I am running Spark 1.5.2 on YARN. On Dec 7, 2015, at 1:20 PM, Ted Yu wrote: bq. complete a shuffle stage due to lost executors Have

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Annabel, but I may need to clarify that I have no intention to write and run Spark UDF in C++, I'm just wondering whether Spark can read and write data to a C++ process with zero copy. Best Regards, Jia On Dec 7, 2015, at 12:26 PM, Annabel Melongo wrote:

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Here is the trace I get from the command line: [Stage 4:> (60 + 60) / 200]15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.0.0.138:33822 15/12/07 18:59:40 WARN

Re: How to build Spark with Ganglia to enable monitoring using Ganglia

2015-12-07 Thread swetha kasireddy
OK. I think the following can be used. mvn -Pspark-ganglia-lgpl -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package On Mon, Dec 7, 2015 at 10:13 AM, SRK wrote: > Hi, > > How to do a maven build to enable monitoring using Ganglia? What is the >

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Robin, you have a very good point! We feel that the data copy and allocation overhead may become a performance bottleneck, and we are evaluating it right now. We will do the shared memory stuff only if we’re sure about the potential performance gain and sure that there is no existing stuff

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Hi, Kazuaki, It’s very similar to my requirement, thanks! It seems they want to write to a C++ process with zero copy, and I want to do both read/write with zero copy. Does anyone know how to obtain more information, like the current status of this JIRA entry? Best Regards, Jia On Dec 7, 2015,

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Jia
Thanks, Dewful! My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages. However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient in performance. But definitely I need to look at Tachyon

SparkSQL AVRO

2015-12-07 Thread Test One
I'm using spark-avro with SparkSQL to process and output avro files. My data has the following schema: root |-- memberUuid: string (nullable = true) |-- communityUuid: string (nullable = true) |-- email: string (nullable = true) |-- firstName: string (nullable = true) |-- lastName: string
