Re: Integration tests for Spark Streaming

2016-06-28 Thread Luciano Resende
This thread might be useful for what you want: https://www.mail-archive.com/user%40spark.apache.org/msg34673.html On Tue, Jun 28, 2016 at 1:25 PM, SRK wrote: > Hi, > > I need to write some integration tests for my Spark Streaming app. Any > example on how to do this
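
Not from the linked thread, but a minimal self-contained sketch of one common pattern for such tests: feed known RDDs through queueStream, collect the output on the driver, and assert on it (names and the toy transformation are illustrative).

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingIntegrationSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setMaster("local[2]").setAppName("streaming-test"), Seconds(1))

        // Each queued RDD becomes one micro-batch of the input DStream.
        val input = mutable.Queue(ssc.sparkContext.parallelize(Seq(1, 2, 3)))
        val results = mutable.ListBuffer.empty[Int]

        // The transformation under test would normally come from the application code.
        ssc.queueStream(input).map(_ * 2).foreachRDD(rdd => results ++= rdd.collect())

        ssc.start()
        ssc.awaitTerminationOrTimeout(3000)  // let a couple of 1-second batches run
        ssc.stop(stopSparkContext = true)

        assert(results.toList.sorted == List(2, 4, 6))
      }
    }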

Re: Random Forest Classification

2016-06-28 Thread Bryan Cutler
Are you fitting the VectorIndexer to the entire data set and not just training or test data? If you are able to post your code and some data to reproduce, that would help in troubleshooting. On Tue, Jun 28, 2016 at 4:40 PM, Rich Tarro wrote: > Thanks for the response, but
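
(For context, a hedged sketch of what that suggestion looks like in code; "data" and the column names are placeholders, not taken from the poster's pipeline: fit the feature indexer on the full dataset so every categorical value is seen, and only split afterwards.)

    import org.apache.spark.ml.feature.VectorIndexer

    // "data" is assumed to be the full DataFrame with a "features" vector column.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)            // fit on the whole dataset, not only on the training split

    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))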

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
Actually I was writing code for the Connected Components algorithm. In that I have to keep track of a variable called vertex number, which keeps getting incremented based on the number of triples it encounters in a line. This variable should be available at all the nodes and all the

Re: Logging trait in Spark 2.0

2016-06-28 Thread Stephen Boesch
I also did not understand why the Logging class was made private in Spark 2.0. In a couple of projects including CaffeOnSpark the Logging class was simply copied to the new project to allow for backwards compatibility. 2016-06-28 18:10 GMT-07:00 Michael Armbrust : > I'd

Re: Joining a compressed ORC table with a non compressed text table

2016-06-28 Thread Timur Shenkao
Hi, guys! As far as I remember, Spark does not use all the peculiarities and optimizations of ORC. Moreover, the ability to read ORC files appeared in Spark only fairly recently. So, despite the "victorious" results announced in http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/ ,

Re: Logging trait in Spark 2.0

2016-06-28 Thread Michael Armbrust
I'd suggest using the slf4j APIs directly. They provide a nice stable API that works with a variety of logging backends. This is what Spark does internally. On Sun, Jun 26, 2016 at 4:02 AM, Paolo Patierno wrote: > Yes ... the same here ... I'd like to know the best way for
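
A minimal sketch of that suggestion (slf4j is already on the classpath of a Spark application; the class and message here are illustrative):

    import org.slf4j.LoggerFactory

    class MyJob {
      // One logger per class, instead of mixing in Spark's now-private Logging trait.
      private val log = LoggerFactory.getLogger(getClass)

      def run(): Unit = {
        log.info("starting job")
      }
    }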

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Ted Yu
Since the data.length is variable, I am not sure whether mixing data.length and the index makes sense. Can you describe your use case in a bit more detail? Thanks On Tue, Jun 28, 2016 at 11:34 AM, Punit Naik wrote: > Hi Ted > > So would the tuple look like: (x._1,

Re: Random Forest Classification

2016-06-28 Thread Rich Tarro
Thanks for the response, but in my case I reversed the meaning of "prediction" and "predictedLabel". It seemed to make more sense to me that way, but in retrospect, it probably only causes confusion to anyone else looking at this. I reran the code with all the pipeline stage inputs and outputs

Re: Random Forest Classification

2016-06-28 Thread Bryan Cutler
The problem might be that you are evaluating with "predictionLabel" instead of "prediction", where predictionLabel is the prediction index mapped to the original label strings - at least according to the RandomForestClassifierExample, not sure if your code is exactly the same. On Tue, Jun 28,
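
For reference, a hedged sketch of the column wiring being discussed, following the names used in the RandomForestClassifierExample (the poster's names may differ): the classifier writes its numeric output to "prediction", IndexToString maps it back to "predictedLabel", and the evaluator should compare the numeric columns.

    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

    // "predictions" is assumed to be the DataFrame returned by pipelineModel.transform(testData).
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")      // numeric label produced by StringIndexer
      .setPredictionCol("prediction")   // numeric classifier output, not "predictedLabel"
      .setMetricName("f1")
    val f1 = evaluator.evaluate(predictions)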

Re: Joining a compressed ORC table with a non compressed text table

2016-06-28 Thread Mich Talebzadeh
This is what I am getting in the container log for mr 2016-06-28 23:25:53,808 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: Writing to temp file: FS

Spark SQL concurrent runs fails with java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]

2016-06-28 Thread Jesse F Chen
With the Spark 2.0 build from 0615, when running 4-user concurrent SQL tests against Spark SQL on 1TB TPCDS, we are seeing consistently the following exceptions: 10:35:33 AM: 16/06/27 23:40:37 INFO scheduler.TaskSetManager: Finished task 412.0 in stage 819.0 (TID 270396) in 8468 ms on
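
(Not a diagnosis from the thread, only a note: the 300-second figure matches the default of spark.sql.broadcastTimeout, which bounds how long a broadcast-exchange future may wait. If the timeouts turn out to come from broadcast joins under concurrent load, a common mitigation is to raise that limit or disable automatic broadcast joins, sketched below with "spark" being a SparkSession.)

    spark.conf.set("spark.sql.broadcastTimeout", "1200")          // seconds, default is 300
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // disable automatic broadcast joins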

Re: Joining a compressed ORC table with a non compressed text table

2016-06-28 Thread Mich Talebzadeh
That is a good point. The ORC table properties are as follows: TBLPROPERTIES ( "orc.compress"="SNAPPY", "orc.stripe.size"="268435456", "orc.row.index.stride"="1"), which puts each stripe at 256MB. Just to clarify, this is Spark running on Hive tables. I don't think the use of TEZ, MR or Spark as

Re: Joining a compressed ORC table with a non compressed text table

2016-06-28 Thread Jörn Franke
Bzip2 is splittable for text files. Btw in ORC the question of splittability does not matter because each stripe is compressed individually. Have you tried Tez? As far as I recall (at least it was the case in the first version of Hive), MR uses a single reducer for ORDER BY, which is a bottleneck. Do you

Joining a compressed ORC table with a non compressed text table

2016-06-28 Thread Mich Talebzadeh
Hi, I have a simple join between table sales2, a compressed (snappy) ORC table with 22 million rows, and another small table sales_staging, under a million rows, stored as a text file with no compression. The join is very simple val s2 = HiveContext.table("sales2").select("PROD_ID") val s =

Integration tests for Spark Streaming

2016-06-28 Thread SRK
Hi, I need to write some integration tests for my Spark Streaming app. Any example on how to do this would be of great help. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Integration-tests-for-Spark-Streaming-tp27246.html Sent from the

Random Forest Classification

2016-06-28 Thread Rich Tarro
I created an ML pipeline using the Random Forest Classifier - similar to what is described here, except in my case the source data is in CSV format rather than libsvm. https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier I am able to successfully train

Re: Best practice for handing tables between pipeline components

2016-06-28 Thread Everett Anderson
Thanks! Alluxio looks quite promising, but also quite new. What did people do before? On Mon, Jun 27, 2016 at 12:33 PM, Gene Pang wrote: > Yes, Alluxio (http://www.alluxio.org/) can be used to store data > in-memory between stages in a pipeline. > > Here is more

Re: Best way to tranform string label into long label for classification problem

2016-06-28 Thread Jaonary Rabarisoa
Thank you Xinh. That's what I need. On Tue, Jun 28, 2016 at 5:43 PM, Xinh Huynh wrote: > Hi Jao, > > Here's one option: > http://spark.apache.org/docs/latest/ml-features.html#stringindexer > "StringIndexer encodes a string column of labels to a column of label > indices.

Need help with spark GraphiteSink

2016-06-28 Thread Vijay Vangapandu
Hi, I need help resolving an issue with the Spark GraphiteSink. I am trying to use the Graphite sink, but I have no luck. Here are the details. The Spark version is 1.4 and I am passing the 2 arguments below to the spark-submit job in yarn cluster mode. --files=/data/svc/metrics/metrics.properties --conf
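
For comparison, a sketch of what a working metrics.properties for the Graphite sink typically contains (host and port are placeholders; the poster's actual file is not shown). With spark-submit it usually also needs --conf spark.metrics.conf=metrics.properties in addition to --files so executors can locate the file.

    # Send all instances' metrics to Graphite (placeholder host/port).
    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds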

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
Hi Ted So would the tuple look like: (x._1, split.startIndex + x._2 + x._1.length) ? On Tue, Jun 28, 2016 at 11:09 PM, Ted Yu wrote: > Please take a look at: > core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala > > In compute() method: > val split =

Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Ted Yu
Please take a look at: core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala In compute() method: val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition] firstParent[T].iterator(split.prev, context).zipWithIndex.map { x => (x._1, split.startIndex + x._2) You can

Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Punit Naik
Hi, I wanted to change the functioning of the "zipWithIndex" function for Spark RDDs so that the output of the function is, just for an example, "(data, prev_index+data.length)" instead of "(data, prev_index+1)". How can I do this? -- Thank You Regards Punit Naik
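
A hedged sketch of one way to get that behaviour without patching Spark, mirroring what ZippedWithIndexRDD does internally: one pass to compute per-partition length sums, then a running offset inside each partition (function and variable names are mine, not from the thread).

    import org.apache.spark.rdd.RDD

    def zipWithLengthOffset(rdd: RDD[String]): RDD[(String, Long)] = {
      // First pass: total length contributed by each partition.
      val partSums = rdd.mapPartitions(it => Iterator(it.map(_.length.toLong).sum)).collect()
      // Start offset of a partition = sum of the lengths of all earlier partitions.
      val starts = partSums.scanLeft(0L)(_ + _)
      // Second pass: running offset within each partition.
      rdd.mapPartitionsWithIndex { (pi, it) =>
        var offset = starts(pi)
        it.map { s => offset += s.length; (s, offset) }
      }
    }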

Re: Set the node the spark driver will be started

2016-06-28 Thread Mich Talebzadeh
Hi Felix, In YARN cluster mode the resource manager (YARN) is expected to take care of that. Are you getting a skewed distribution of drivers created through spark-submit across the different nodes? HTH Dr Mich Talebzadeh LinkedIn *

Re: Best way to tranform string label into long label for classification problem

2016-06-28 Thread Xinh Huynh
Hi Jao, Here's one option: http://spark.apache.org/docs/latest/ml-features.html#stringindexer "StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies." Xinh On Tue, Jun 28, 2016 at 12:29 AM, Jaonary Rabarisoa
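
A minimal sketch of that approach (column names are illustrative and "spark" is an existing SparkSession):

    import org.apache.spark.ml.feature.StringIndexer

    val df = spark.createDataFrame(Seq(
      (0, "cat"), (1, "dog"), (2, "cat"), (3, "fish")
    )).toDF("id", "label")

    val indexed = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndex")
      .fit(df)
      .transform(df)
    // labelIndex is a Double in [0, numLabels), with the most frequent label mapped to 0.0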

Re: Set the node the spark driver will be started

2016-06-28 Thread Felix Massem
Hey Mich, thanks for the fast reply. We are using it in cluster mode and Spark version 1.5.2. Greets Felix Felix Massem | IT-Consultant | Karlsruhe mobile: +49 (0) 172.2919848 www.codecentric.de | blog.codecentric.de |

Re: Set the node the spark driver will be started

2016-06-28 Thread Mich Talebzadeh
Hi Felix, what version of Spark? Are you using yarn client mode or cluster mode? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Set the node the spark driver will be started

2016-06-28 Thread adaman79
Hey guys, I have a problem with memory because over 90% of my Spark drivers get started on the same one of my nine Spark nodes. So now I am looking for a way to define the node on which the Spark driver will be started when using spark-submit, or to set it somewhere in the code. Is this possible?

Re: Spark master shuts down when one of zookeeper dies

2016-06-28 Thread Ted Yu
Please see these blog posts w.r.t. the number of nodes in the quorum: http://stackoverflow.com/questions/13022244/zookeeper-reliability-three-versus-five-nodes http://www.ibm.com/developerworks/library/bd-zookeeper/ the paragraph starting with 'A quorum is represented by a strict majority of nodes'

Re: How to write the DataFrame results back to HDFS with other then \n as record separator

2016-06-28 Thread Dhaval Patel
Did you try implementing MultipleTextOutputFormat and using saveAsHadoopFile with keyClass, valueClass and OutputFormat instead of the default parameters? You need to implement toString for your keyClass and valueClass in order to get a field separator other than the defaults. Regards Dhaval On Tue, Jun

Issue with Spark on 25 nodes cluster

2016-06-28 Thread ANDREA SPINA
Hello everyone, I am running some experiments with Spark 1.4.0 on a ~80GiB dataset located on hdfs-2.7.1. The environment is a 25 nodes cluster, 16 cores per node. I set the following params: spark.master = "spark://"${runtime.hostname}":7077" # 28 GiB of memory spark.executor.memory = "28672m"

Re: Restart App and consume from checkpoint using direct kafka API

2016-06-28 Thread vimal dinakaran
I have implemented the above approach with cassandra db. Thank you all. On Thu, Mar 31, 2016 at 8:26 PM, Cody Koeninger wrote: > Long story short, no. Don't rely on checkpoints if you cant handle > reprocessing some of your data. > > On Thu, Mar 31, 2016 at 3:02 AM, Imre
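
A hedged sketch of the approach Cody describes for the 0.8 direct API: read the offset ranges off each batch and persist them yourself, together with your output (the poster used Cassandra; here the ranges are only printed, and broker/topic names are placeholders).

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    object OffsetTrackingSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("offsets"), Seconds(10))
        val kafkaParams = Map("metadata.broker.list" -> "broker:9092")   // placeholder broker
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("events"))                               // placeholder topic

        stream.foreachRDD { rdd =>
          val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          // ... process rdd here, then store the ranges alongside the results ...
          ranges.foreach(r => println(s"${r.topic} ${r.partition} ${r.fromOffset} -> ${r.untilOffset}"))
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }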

Spark master shuts down when one of zookeeper dies

2016-06-28 Thread vimal dinakaran
I am using ZooKeeper to provide HA for the Spark cluster. We have a two-node ZooKeeper cluster. When one of the ZooKeeper nodes dies, the entire Spark cluster goes down. Is this expected behaviour? Am I missing something in the config? Spark version - 1.6.1. ZooKeeper version - 3.4.6 //

Re: DataFrame versus Dataset creation and usage

2016-06-28 Thread Martin Serrano
Xinh, Thanks for the clarification. I'm new to Spark and trying to navigate the different APIs. I was just following some examples and retrofitting them, but I see now I should stick with plain RDDs until my schema is known (at the end of the data pipeline). Thanks again! On 06/24/2016

How to write the DataFrame results back to HDFS with other then \n as record separator

2016-06-28 Thread Radha krishna
Hi, I have some files in HDFS with FS as the field separator and RS as the record separator. I am able to read the files and process them successfully. How can I write the Spark DataFrame result into an HDFS file with the same delimiters (FS as field separator and RS as record separator instead of \n)?
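
Not the MultipleTextOutputFormat route suggested in the reply, just a simpler workaround sketch: render each record yourself, joining fields with FS and records with RS, and write one string per partition (separators and the output path are placeholders; note that saveAsTextFile still appends a single \n at the end of each partition file).

    // ASCII FS/RS control characters, as an example; substitute whatever the files actually use.
    val FS = "\u001C"
    val RS = "\u001E"

    // "df" is assumed to be the DataFrame holding the results.
    val rendered = df.rdd
      .map(_.mkString(FS))                              // fields joined by FS (Row.mkString)
      .mapPartitions(it => Iterator(it.mkString(RS)))   // records joined by RS
    rendered.saveAsTextFile("/user/output/path")        // placeholder path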

Create JavaRDD from list in Spark 2.0

2016-06-28 Thread Rafael Caballero
Hi, The usual way of creating a JavaRDD from a list is to use JavaSparkContext.parallelize(List). However, in Spark 2.0 SparkSession is used as the entry point and I don't know how to create a JavaRDD from a List. Is this possible? Thanks and best regards, Rafael Caballero -- View this message
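
One route that should still work (a sketch, written in Scala against the Java API; the Java calls are the same): SparkSession still carries the underlying SparkContext, which a JavaSparkContext can wrap, and parallelize(List) is then available as before.

    import java.util.Arrays
    import org.apache.spark.api.java.JavaSparkContext
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()

    // In Java: JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
    val jsc = JavaSparkContext.fromSparkContext(spark.sparkContext)
    val javaRdd = jsc.parallelize(Arrays.asList(1, 2, 3))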

Re: Substract two DStreams

2016-06-28 Thread Marius Soutier
Sure, no problem. > On 28.06.2016, at 08:57, Matthias Niehoff > wrote: > > ah, didn't know about this. That might actually work. I solved it by > implementing the leftJoinWithCassandraTable by myself which is nearly as fast > as the normal join. This should

Best way to tranform string label into long label for classification problem

2016-06-28 Thread Jaonary Rabarisoa
Dear all, I'm trying to find a way to transform a DataFrame into data that is more suitable for a third-party classification algorithm. The DataFrame has two columns: "feature", represented by a vector, and "label", represented by a string. I want the "label" to be a number between [0, number of