Re: Writing data from Spark streaming to AWS Redshift?

2016-12-09 Thread ayan guha
Ideally, saving data to external sources should not be any different. Give the write options stated in the blog a shot, but change the mode to append. On Sat, Dec 10, 2016 at 8:25 AM, shyla deshpande wrote: > Hello all, > > Is it possible to write data from Spark
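A minimal sketch of what that could look like from inside a streaming job, assuming the spark-redshift data source from the blog post and an RDD of case classes; the JDBC URL, table name, and S3 tempdir are placeholders:

  import org.apache.spark.sql.SaveMode

  stream.foreachRDD { rdd =>
    val df = rdd.toDF()  // assumes sqlContext.implicits._ is in scope
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/mydb?user=user&password=pass")
      .option("dbtable", "my_table")
      .option("tempdir", "s3n://my-bucket/tmp")
      .mode(SaveMode.Append)  // append per batch, rather than overwrite as in the blog
      .save()
  }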

Re: Random Forest hangs without trace of error

2016-12-09 Thread Md. Rezaul Karim
I had a similar experience last week; I too could not find any error trace. Later on, I did the following to get rid of the problem: i) I downgraded to Spark 2.0.0, and ii) I decreased the values of maxBins and maxDepth. Additionally, make sure that you set the featureSubsetStrategy to "auto" to let the
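For reference, a sketch of those settings on the spark.ml API; the concrete values are illustrative, not recommendations:

  import org.apache.spark.ml.classification.RandomForestClassifier

  val rf = new RandomForestClassifier()
    .setNumTrees(20)
    .setMaxDepth(5)                    // shallower trees finish faster
    .setMaxBins(32)                    // fewer bins cuts memory during split search
    .setFeatureSubsetStrategy("auto")  // let Spark choose features per node
    .setSubsamplingRate(0.05)          // train each tree on a fraction of the rows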

[Spark log4j] Turning off log4j while scala program runs on spark-submit

2016-12-09 Thread Irving Duran
Hi - I have a question about log4j while running on spark-submit. I would like to have Spark show only errors when I am running spark-submit. I would like to accomplish this without having to edit the log4j config file in $SPARK_HOME; is there a way to do this? I found this and it only works on
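One programmatic option, assuming Spark 2.x's SparkSession (with a plain SparkContext, sc.setLogLevel works the same way); note it only takes effect once the context is up, so startup messages still follow the log4j config:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("myApp").getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")  // no log4j.properties edits required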

Random Forest hangs without trace of error

2016-12-09 Thread mhornbech
Hi, I have spent quite some time trying to debug an issue with the Random Forest algorithm on Spark 2.0.2. The input dataset is relatively large at around 600k rows and 200MB, but I use subsampling to make each tree manageable. However, even with only 1 tree and a low sample rate of 0.05, the job

Re: Spark job server pros and cons

2016-12-09 Thread Shak S
Spark Job Server (SJS) gives you the ability to run your Spark job as a service. It has features like RDD caching, REST APIs to submit your job, and named RDDs. For more info, refer to https://github.com/spark-jobserver/spark-jobserver. Internally, SJS too uses the same Spark job submit, so it

Spark job server pros and cons

2016-12-09 Thread Cassa L
Hi, So far I have run Spark jobs directly using spark-submit options. I have a use case for using Spark Job Server to run the job. I wanted to find out the pros and cons of using this job server; if anyone can share them, that would be great. My jobs usually connect to multiple data sources like Kafka, Custom

Writing data from Spark streaming to AWS Redshift?

2016-12-09 Thread shyla deshpande
Hello all, Is it possible to write data from Spark streaming to AWS Redshift? I came across the following article, so it looks like it works from a Spark batch program. https://databricks.com/blog/2015/10/19/introducing-redshift-data-source-for-spark.html I want to write to AWS Redshift from

Document Similarity -Spark Mllib

2016-12-09 Thread satyajit vegesna
Hi all, I am trying to implement an MLlib Spark job to find the similarity between documents (which in my case are basically home addresses). I believe I cannot use DIMSUM for my use case, as DIMSUM works well only with matrices that have thin columns and many rows. Matrix example format, for my
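One common alternative is to vectorize the addresses with TF-IDF and L2-normalize them, so that cosine similarity between two documents reduces to a dot product. A sketch with spark.ml, assuming a DataFrame docs with an "address" column; the feature count is arbitrary:

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.feature.{Tokenizer, HashingTF, IDF, Normalizer}

  val pipeline = new Pipeline().setStages(Array(
    new Tokenizer().setInputCol("address").setOutputCol("words"),
    new HashingTF().setInputCol("words").setOutputCol("tf").setNumFeatures(1 << 18),
    new IDF().setInputCol("tf").setOutputCol("tfidf"),
    new Normalizer().setInputCol("tfidf").setOutputCol("features")  // L2 norm by default
  ))
  val vectors = pipeline.fit(docs).transform(docs)
  // the cosine similarity of two rows is now the dot product of their "features" vectors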

SparkSQL

2016-12-09 Thread Niraj Kumar
Hi, I am working on Spark SQL using HiveContext (version 1.6.2). Can someone help me convert the following queries to Spark SQL? update calls set sample = 'Y' where accnt_call_id in (select accnt_call_id from samples); insert into details (accnt_call_id, prdct_cd, prdct_id, dtl_pstn) select
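Spark SQL 1.6 supports neither UPDATE nor IN (subquery), so both statements have to be rewritten on the DataFrame side. A sketch using the table and column names from the question, untested against the actual schema:

  import org.apache.spark.sql.functions.{col, lit, when}

  val calls   = sqlContext.table("calls")
  val samples = sqlContext.table("samples")
    .select("accnt_call_id").distinct()
    .withColumnRenamed("accnt_call_id", "s_id")

  // UPDATE ... WHERE accnt_call_id IN (...) becomes a left join plus a conditional column
  val updated = calls
    .join(samples, calls("accnt_call_id") === samples("s_id"), "left_outer")
    .withColumn("sample", when(col("s_id").isNotNull, lit("Y")).otherwise(col("sample")))
    .drop("s_id")

  // the INSERT INTO ... SELECT maps to an append-mode insertInto, e.g.:
  // selectedDF.write.mode("append").insertInto("details")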

Information required

2016-12-09 Thread Rishabh Wadhawan
Does anyone know the repository link for the src of GroupID: org.spark-project.hive, Artifact: 1.2.1.spark? I was able to find https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2, which is artifact 1.2.1.spark2, not 1.2.1.spark.

Re: problem with kafka createDirectStream ..

2016-12-09 Thread Cody Koeninger
I'd say unzip your actual assembly jar and verify whether the kafka consumer classes are 0.10.1 or 0.10.0. We've seen reports of odd behavior with 0.10.1 classes. Possibly unrelated, but good to eliminate. On Fri, Dec 9, 2016 at 10:38 AM, Debasish Ghosh wrote: > oops

Re: how can I set the log configuration file for spark history server ?

2016-12-09 Thread Marcelo Vanzin
(-dev) Just configure your log4j.properties in $SPARK_HOME/conf (or set a custom $SPARK_CONF_DIR for the history server). On Thu, Dec 8, 2016 at 7:20 PM, John Fang wrote: > ./start-history-server.sh > starting org.apache.spark.deploy.history.HistoryServer, logging

Re: problem with kafka createDirectStream ..

2016-12-09 Thread Debasish Ghosh
oops .. it's 0.10.0 .. sorry for the confusion .. On Fri, Dec 9, 2016 at 10:07 PM, Debasish Ghosh wrote: > My assembly contains the 0.10.1 classes .. Here are the dependencies > related to kafka & spark that my assembly has .. > > libraryDependencies ++= Seq( >

Re: problem with kafka createDirectStream ..

2016-12-09 Thread Debasish Ghosh
My assembly contains the 0.10.1 classes. Here are the dependencies related to Kafka & Spark that my assembly has: libraryDependencies ++= Seq( "org.apache.kafka" % "kafka-streams" % "0.10.0.0", "org.apache.spark" %% "spark-streaming-kafka-0-10" % spark,
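If the assembly really does contain mismatched classes, one way to pin the consumer version in build.sbt is an explicit kafka-clients dependency or an sbt dependencyOverrides entry; the version below is illustrative:

  // keep the consumer classes in the assembly at 0.10.0.x, matching
  // what spark-streaming-kafka-0-10 was built against
  libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.0.1"
  // or force the version even when it is pulled in transitively:
  dependencyOverrides += "org.apache.kafka" % "kafka-clients" % "0.10.0.1"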

Re: problem with kafka createDirectStream ..

2016-12-09 Thread Cody Koeninger
When you say 0.10.1 do you mean broker version only, or does your assembly contain classes from the 0.10.1 kafka consumer? On Fri, Dec 9, 2016 at 10:19 AM, debasishg wrote: > Hello - > > I am facing some issues with the following snippet of code that reads from > Kafka

problem with kafka createDirectStream ..

2016-12-09 Thread debasishg
Hello - I am facing some issues with the following snippet of code that reads from Kafka and creates DStream. I am using KafkaUtils.createDirectStream(..) with Kafka 0.10.1 and Spark 2.0.1. // get the data from kafka val stream: DStream[ConsumerRecord[Array[Byte], (String, String)]] =
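For comparison, a minimal complete version of that snippet against the kafka010 API; the broker address, group id, and topic are placeholders, ssc is the StreamingContext, and the value type is left as raw bytes (the (String, String) pair in the original would need a custom deserializer):

  import org.apache.kafka.clients.consumer.ConsumerRecord
  import org.apache.kafka.common.serialization.ByteArrayDeserializer
  import org.apache.spark.streaming.dstream.DStream
  import org.apache.spark.streaming.kafka010.KafkaUtils
  import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
  import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

  val kafkaParams = Map[String, Object](
    "bootstrap.servers"  -> "localhost:9092",
    "key.deserializer"   -> classOf[ByteArrayDeserializer],
    "value.deserializer" -> classOf[ByteArrayDeserializer],
    "group.id"           -> "my-consumer-group",
    "auto.offset.reset"  -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  val stream: DStream[ConsumerRecord[Array[Byte], Array[Byte]]] =
    KafkaUtils.createDirectStream[Array[Byte], Array[Byte]](
      ssc, PreferConsistent,
      Subscribe[Array[Byte], Array[Byte]](Seq("my-topic"), kafkaParams))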

Re: unit testing in spark

2016-12-09 Thread Michael Stratton
That sounds great, please include me so I can get involved. On Fri, Dec 9, 2016 at 7:39 AM, Marco Mistroni wrote: > Me too as I spent most of my time writing unit/integ tests pls advise > on where I can start > Kr > > On 9 Dec 2016 12:15 am, "Miguel Morales"

Re: When will Structured Streaming support stream-to-stream joins?

2016-12-09 Thread ljwagerfield
Michael Armbrust's reply: I would guess Spark 2.3, but maybe sooner, maybe later, depending on demand. I created https://issues.apache.org/jira/browse/SPARK-18791 so people can describe their requirements / stay informed. --- Lawrence's reply: Please vote on the issue, people! Would be awesome

RE: About transformations

2016-12-09 Thread Mendelson, Assaf
This is a guess, but I would bet that most of the time went into the loading of the data. The second time, there are many places this could be cached (either by Spark, or even by the OS if you are reading from a file). -Original Message- From: brccosta [mailto:brunocosta@gmail.com]

Re: unit testing in spark

2016-12-09 Thread Marco Mistroni
Me too, as I spend most of my time writing unit/integration tests; please advise on where I can start. Kr On 9 Dec 2016 12:15 am, "Miguel Morales" wrote: > I would be interested in contributing. I've created my own library for > this as well. In my blog post I talk about

groupByKey vs reduceByKey

2016-12-09 Thread Appu K
Hi, I read somewhere that groupByKey() on an RDD disables map-side aggregation, as the aggregation function (appending to a list) does not save any space. However, from my understanding, using something like reduceByKey (or combineByKey plus a combiner function) we could reduce the data shuffled
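That understanding is right whenever the combine function actually shrinks the data; a small sketch of the distinction (sums shrink, list-building does not, which is why groupByKey skips map-side combine entirely):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  // partial sums are combined on each partition before the shuffle,
  // so at most one record per key per partition crosses the network
  val sums = pairs.reduceByKey(_ + _)

  // building per-key lists shuffles roughly the same bytes as groupByKey,
  // since appending to a list never reduces the data volume
  val lists = pairs.aggregateByKey(List.empty[Int])((l, v) => v :: l, _ ++ _)

  // ships every (key, value) pair across the network
  val groups = pairs.groupByKey()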

About transformations

2016-12-09 Thread brccosta
Dear guys, We're performing some tests to evaluate the behavior of transformations and actions in Spark with Spark SQL. In our tests, first we conceive a simple dataflow with 2 transformations and 1 action: LOAD (result: df_1) > SELECT ALL FROM df_1 (result: df_2) > COUNT(df_2) The execution
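One way to separate Spark's behavior from OS-level file caching is to make the caching explicit and compare the two counts; a sketch with a placeholder path:

  val df_1 = sqlContext.read.json("/data/input.json")
  val df_2 = df_1.select("*")
  df_2.cache()
  df_2.count()  // first action: reads the data and populates the cache
  df_2.count()  // second action: should be served from the cached partitions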

Few questions on reliability of accumulators value.

2016-12-09 Thread Sudev A C
Hi, Can anyone please help clarify how accumulators can be used reliably to measure error/success/analytical metrics? Given below is the use case / code snippet that I have. val amtZero = sc.accumulator(0) > val amtLarge = sc.accumulator(0) > val amtNormal = sc.accumulator(0) > val getAmount =
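The usual caveat: accumulator updates made inside transformations can be applied more than once if a task is retried or a stage is recomputed, while updates made inside actions are guaranteed to be applied exactly once per task. A sketch of the difference, where amounts stands in for the input RDD[Int]:

  val amtZero = sc.accumulator(0)

  // risky: map is a transformation, so a retry or recomputation
  // of this stage can bump the accumulator a second time
  val tagged = amounts.map { amt => if (amt == 0) amtZero += 1; amt }

  // safer for metrics: update inside an action, applied once per task
  amounts.foreach { amt => if (amt == 0) amtZero += 1 }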

Re: reading data from s3

2016-12-09 Thread Sudev A C
Hi Hitesh, The schema of the table is inferred automatically if you are reading from a JSON file, whereas when you are reading from a text file you will have to provide a schema for the table you want to create (JSON carries its schema within it). You can create data frames and register them as tables. 1.
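A sketch of both paths on Spark 1.x; the bucket, paths, and fields are placeholders, and the CSV read assumes the spark-csv package:

  // JSON carries its structure, so the schema is inferred on read
  val jsonDF = sqlContext.read.json("s3n://my-bucket/data/input.json")

  // a plain text/CSV file needs an explicit schema
  import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
  val schema = StructType(Seq(
    StructField("id",   IntegerType, nullable = true),
    StructField("name", StringType,  nullable = true)
  ))
  val csvDF = sqlContext.read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .load("s3n://my-bucket/data/input.csv")

  jsonDF.registerTempTable("json_table")  // now queryable via sqlContext.sql(...)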