Spark K means number of Iterations?

2015-09-12 Thread ashensw
Hi all, I want to know whether the K-means algorithm stops if the data set converges to stable clusters before reaching the number of iterations that we defined. As an example, if I give 100 as the number of iterations and 5 as the number of clusters, but the data set converges to 5 stable clusters

Re: SIGTERM 15 Issue : Spark Streaming for ingesting huge text files using custom Receiver

2015-09-12 Thread Jörn Franke
I am not sure what you are trying to achieve here. Have you thought about using Flume? Additionally, maybe something like rsync? On Sat, Sep 12, 2015 at 0:02, Varadhan, Jawahar wrote: > Hi all, > I have coded a custom receiver which receives Kafka messages. These > Kafka messages have FTP

Re: Spark K means number of Iterations?

2015-09-12 Thread Robineast
Yes, it does stop if the algorithm converges to within the specified tolerance. There is a parameter to the KMeans constructor called epsilon, which defaults to 1e-4. If you have logging set to INFO you will get a log message telling you how many iterations were run
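
A minimal sketch of those two knobs against the 1.5 MLlib API (assuming an existing SparkContext sc; the toy vectors and k = 2 are illustrative, the thread's example used k = 5):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // toy input; in practice this is your real RDD[Vector]
    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    val model = new KMeans()
      .setK(2)                // the thread's example used 5
      .setMaxIterations(100)  // an upper bound, not a fixed count
      .setEpsilon(1e-4)       // the default tolerance; iteration stops early
                              // once no center moves farther than this
      .run(data)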

Re: change the spark version

2015-09-12 Thread Sean Owen
This is a question for the CDH list. CDH 5.4 has Spark 1.3, and 5.5 has 1.5. The best thing is to update CDH as a whole if you can. However, it's pretty simple to just run a newer Spark assembly as a YARN app. Don't remove anything in the CDH installation. Try downloading the assembly and prefixing

Re: about mr-style merge sort

2015-09-12 Thread 周千昊
Hi Shao & Pendey, after repartitioning and sorting within partitions, the application running on Spark is now faster than on MR. I will try to run it on a much larger dataset as a benchmark. Thanks again for the guidance. 周千昊 wrote on Fri, Sep 11, 2015 at 1:35 PM: > Hi, Shao & Pendey > Thanks for

Re: Implement "LIKE" in SparkSQL

2015-09-12 Thread liam
OK, I got another way; it looks silly and inefficient, but it works. tradeDF.registerTempTable(tradeTab); orderDF.registerTempTable(orderTab); //orderId = tid + "_x" String sql1 = "select * from " + tradeTab + " a, " + orderTab + " b where substr(b.orderId,1,15) = substr(a.tid,1) "; String sq
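
For comparison, a sketch of the same prefix equi-join written with the 1.5 DataFrame API, reusing the tradeDF/orderDF DataFrames from the message above and assuming tid is 15 characters long, which is what the substr(...,1,15) in the SQL implies:

    import org.apache.spark.sql.functions.substring

    // orderId = tid + "_x", so the first 15 characters of orderId recover the
    // tid; phrasing the condition as an equality lets Spark plan an equi-join
    // instead of falling back to a cartesian product
    val joined = orderDF.join(tradeDF,
      substring(orderDF("orderId"), 1, 15) === tradeDF("tid"))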

Re: Help with collect() in Spark Streaming

2015-09-12 Thread Luca
I am trying to implement an application that requires the output to be aggregated and stored as a single txt file in HDFS (instead of, for instance, having 4 different txt files coming from my 4 workers). The solution I used does the trick, but I can't tell if it's OK to regularly stress one of the

How to create broadcast variable from Java String array?

2015-09-12 Thread unk1102
Hi, I have a Java String array which contains 45 strings, which are basically a schema: String[] fieldNames = {"col1","col2",...}; Currently I am storing the above String array in a driver-side static field. My job is running slow, so I am trying to refactor the code. I am using the String array when creating a DataFrame DataFr
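
A minimal sketch of the broadcast approach (in Scala rather than Java, same idea; sc is an existing SparkContext and the three-element array stands in for the 45-column schema):

    // the schema array lives on the driver once...
    val fieldNames = Array("col1", "col2", "col3")

    // ...and is shipped to each executor once, instead of being serialized
    // into every task closure or read from a driver-side static field
    val fieldNamesB = sc.broadcast(fieldNames)

    // tasks access it through .value
    val widths = sc.parallelize(1 to 10).map(_ => fieldNamesB.value.length).collect()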

RE: Training the MultilayerPerceptronClassifier

2015-09-12 Thread Rory Waite
Thanks Feynman, that is useful. I am interested in comparing the Spark MLP with Caffe. If I understand it correctly, the changes to the Spark MLP API now restrict the functionality such that: Spark restricts the top layer to be a softmax, and you can only use LBFGS to train the network. I think this be
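
For readers following the comparison, the public surface of the 1.5 estimator looks roughly like this; the layer sizes and the trainingData DataFrame (with the usual "features"/"label" columns) are placeholders:

    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

    // layers: input size, one hidden layer, output size (= number of classes);
    // as discussed above, the output layer is softmax and the optimizer is not
    // user-selectable through this API
    val trainer = new MultilayerPerceptronClassifier()
      .setLayers(Array(4, 8, 3))
      .setBlockSize(128)
      .setMaxIter(100)
      .setSeed(1234L)

    val model = trainer.fit(trainingData)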

How to debug YARN client OOM issue?

2015-09-12 Thread unk1102
Hi, my Spark job runs fine for an hour or so, then it starts losing executors because of OOM. I want to debug where the OOM is occurring. I heard we can use VisualVM, JConsole, etc. How do we use them with Spark? I am using YARN client mode to submit my Spark job. How do I set the JMX params? I mean it should p
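
One way to get JConsole/VisualVM attached to executors in YARN client mode is to pass JMX flags through spark.executor.extraJavaOptions; a hedged sketch (these are standard JVM flags; the fixed port is illustrative, and disabling auth/SSL only makes sense on a locked-down cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    // port 9999 is illustrative and will collide if two executor JVMs share a
    // host; the heap-dump flag helps post-mortem analysis of the OOM
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-Dcom.sun.management.jmxremote " +
        "-Dcom.sun.management.jmxremote.port=9999 " +
        "-Dcom.sun.management.jmxremote.authenticate=false " +
        "-Dcom.sun.management.jmxremote.ssl=false " +
        "-XX:+HeapDumpOnOutOfMemoryError")
    val sc = new SparkContext(conf)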

[Question] ORC - EMRFS Problem

2015-09-12 Thread Cazen Lee
Good Day! I think there are some problems between ORC and AWS EMRFS. When I was trying to read "upper 150M" ORC files from S3, an ArrayIndexOutOfBounds exception occurred. I'm sure that it's an AWS-side issue because there was no exception when trying from HDFS or S3NativeFileSystem. Parquet runs ordinarily

Re: Spark K means number of Iterations?

2015-09-12 Thread ashensw
Okay. Thanks for the help. On Sat, Sep 12, 2015 at 1:16 PM, Robineast [via Apache Spark User List] <ml-node+s1001560n24665...@n3.nabble.com> wrote: > Yes, it does stop if the algorithm converges to within the specified > tolerance. There is a parameter to the KMeans constructor called epsilon

Spark Streaming..Exception

2015-09-12 Thread Priya Ch
Hello all, when I push messages into Kafka and read them into the streaming application, I see the following exception. I am running the application on YARN and am nowhere broadcasting the message within the application. Just simply reading a message, parsing it, populating fields in a class, and then printing

What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Rex X
Hi everyone, what is the best way to migrate existing scikit-learn code to a PySpark cluster? Then we can bring together the full power of both scikit-learn and Spark to do scalable machine learning. (I know we have MLlib. But the existing code base is big, and some functions are not fully supported

UDAF and UDT with SparkSQL 1.5.0

2015-09-12 Thread jussipekkap
Hi, issue #1: I'm using the new UDAF interface (UserDefinedAggregateFunction) in the Spark 1.5.0 release. Is it possible to aggregate all values in the MutableAggregationBuffer into an array in a robust manner? I'm creating an aggregation function that collects values into an array from all input rows
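
Collecting into an array does appear possible with the 1.5.0 interface; a sketch under those assumptions (string input for simplicity; the buffer column is re-assigned as an immutable Seq on every row, which is robust but O(n) per update for very large groups):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class CollectToArray extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("items", ArrayType(StringType)) :: Nil)
      def dataType: DataType = ArrayType(StringType)
      def deterministic: Boolean = true
      def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer(0) = Seq.empty[String]
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getSeq[String](0) :+ input.getString(0)
      // partial aggregates from different partitions are concatenated here
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0)
      def evaluate(buffer: Row): Any = buffer.getSeq[String](0)
    }

Registering it via sqlContext.udf.register("collectToArray", new CollectToArray) should then make it usable from SQL and from DataFrame agg().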

Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Jörn Franke
I fear you have to do the plumbing all yourself. This is the same for all commercial and non-commercial libraries/analytics packages. It often also depends on the functional requirements and on how you distribute. On Sat, Sep 12, 2015 at 20:18, Rex X wrote: > Hi everyone, > > What is the best way

Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Nick Pentreath
You might want to check out https://github.com/lensacom/sparkit-learn. Though it's true that for random forests / trees you will need to use MLlib. — Sent from Mailbox. On Sat, Sep 12, 2015 at 9:00 PM, Jörn Franke wrote: > I fear you have to do the plumbing all yourself. This is the same for all

Why is my Spark job slow and throwing OOM which leads YARN to kill executors?

2015-09-12 Thread unk1102
HiveContext hiveContext = createHiveContext(sc); declareHiveUDFs(hiveContext); DateTimeFormatter df = DateTimeFormat.forPattern("yyyyMMdd"); String yestday = "20150912"; hiveContext.sql(" use xyz "); create

Re: Why is my Spark job slow and throwing OOM which leads YARN to kill executors?

2015-09-12 Thread Richard W. Eggert II
…hadoopConfiguration()); HiveContext hiveContext = createHiveContext(sc); declareHiveUDFs(hiveContext); DateTimeFormatter df = DateTimeFormat.forPattern("yyyyMMdd"); String yestday = "20150912"; hiveContext.sql(" use xyz "); createT

Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Rex X
Jorn and Nick, Thanks for answering. Nick, the sparkit-learn project looks interesting. Thanks for mentioning it. Rex On Sat, Sep 12, 2015 at 12:05 PM, Nick Pentreath wrote: > You might want to check out https://github.com/lensacom/sparkit-learn >

Re: Why is my Spark job slow and throwing OOM which leads YARN to kill executors?

2015-09-12 Thread Umesh Kacha
setSparkConfProperties(sparkConf); > SparkContext sc = new SparkContext(sparkConf); > final FileSystem fs = FileSystem.get(sc.hadoopConfiguration()); > HiveContext hiveContext = createHiveContext(sc); > > declareHiveUDFs(hiveContex

RDD transformation and action running out of memory

2015-09-12 Thread Utkarsh Sengar
I am trying to run this: a basic mapToPair and then count() to trigger an action. 4 executors are launched, but I don't see any relevant logs on those executors. It looks like the driver is pulling all the data and it runs out of memory; the dataset is big, so it won't fit on 1 machine. So wha

Re: RDD transformation and action running out of memory

2015-09-12 Thread Richard Eggert
Hmm... The count() method invokes this: def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = { runJob(rdd, func, 0 until rdd.partitions.length) } It appears that you're running out of memory while trying to compute (within the driver) the number of partitions that will be

Re: Spark 1.5.0 java.lang.OutOfMemoryError: PermGen space

2015-09-12 Thread Jagat Singh
Hi Davies, this was the first query on the new version. The one which ran successfully was the Spark Pi example: ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master yarn-client \ --num-executors 3 \ --driver-memory 4g \ --executor-memory 2g \ --executor-cores 1 \

Re: Spark 1.5.0 java.lang.OutOfMemoryError: PermGen space

2015-09-12 Thread Jagat Singh
Sorry, to answer your question fully: the job starts tasks and a few of them fail while some succeed. The failed ones have that PermGen error in their logs. But ultimately the full job is marked as failed and the session quits. On Sun, Sep 13, 2015 at 10:48 AM, Jagat Singh wrote: > Hi Davies, > > This was first
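
If this is the class-heavy 1.5 code generation overflowing the Java 7 default PermGen, raising MaxPermSize is the usual first step; a sketch, with sizes that are guesses rather than recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    // executors pick up the flag from the conf; since the failing tasks are
    // the ones showing the PermGen error, this is the side to try first
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=256m")
    val sc = new SparkContext(conf)

    // the driver JVM is already running by the time this code executes, so
    // its equivalent must be passed at launch, e.g.
    //   spark-submit --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m ...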

What happens when cache is full?

2015-09-12 Thread Hemminger Jeff
I am trying to understand the process of caching, and specifically what the behavior is when the cache is full. Please excuse me if this question is a little vague; I am trying to build my understanding of this process. I have an RDD that I perform several computations with; I persist it with IN_ME
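
Roughly: with MEMORY_ONLY, cached partitions that don't fit (or are evicted, least-recently-used first) are dropped and recomputed from lineage the next time they are needed; MEMORY_AND_DISK spills them to local disk instead. A small sketch of choosing between the two (sc assumed):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)

    // MEMORY_ONLY: anything that doesn't fit is simply recomputed on demand
    rdd.persist(StorageLevel.MEMORY_ONLY)

    // MEMORY_AND_DISK: evicted partitions spill to local disk and are read
    // back, trading disk I/O for recomputation (pick one level; it can't be
    // changed without unpersisting first)
    // rdd.persist(StorageLevel.MEMORY_AND_DISK)

    rdd.count()  // the first action materializes the cache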

Cogrouping data in dataframes - PairRDD cogroup vs. join - best practices

2015-09-12 Thread Matthew Denny
I was wondering what the best practices are regarding cogrouping data in DataFrames. While joins are obviously a powerful tool, it seems that there are still some cases where using cogroup (which is only supported by PairRDDs) is the better choice. Consider 2 case classes with the following
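
A minimal sketch of the drop-to-PairRDD route being discussed; the case classes here are placeholders, not the ones from the original message:

    import org.apache.spark.rdd.RDD

    case class Trade(id: String, amount: Double)
    case class Order(id: String, item: String)

    val trades = sc.parallelize(Seq(Trade("a", 1.0), Trade("b", 2.0))).keyBy(_.id)
    val orders = sc.parallelize(Seq(Order("a", "x"), Order("a", "y"))).keyBy(_.id)

    // cogroup hands you both complete groups per key in one pass; a join
    // would only give the flattened per-key cross-product
    val grouped: RDD[(String, (Iterable[Trade], Iterable[Order]))] =
      trades.cogroup(orders)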

Limiting number of cores per job in multi-threaded driver.

2015-09-12 Thread Philip Weaver
I'm playing around with dynamic allocation in spark-1.5.0, with the FAIR scheduler, so I can define a long-running application capable of executing multiple simultaneous Spark jobs. The kind of jobs that I'm running do not benefit from more than 4 cores, but I want my application to be able to take
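
For context, the pool mechanics in 1.5 look like the sketch below; as far as I can tell the fair scheduler gives pools weights and minimum shares rather than hard per-job core caps, so this shapes priority rather than strictly limiting cores (the pool name and file path are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.scheduler.mode", "FAIR")
      // pools (weight, minShare, schedulingMode) are declared in an XML file
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // each driver thread tags the jobs it submits with a pool
    sc.setLocalProperty("spark.scheduler.pool", "small-jobs")
    sc.parallelize(1 to 100).count()  // scheduled in the "small-jobs" pool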

Re: [Question] ORC - EMRFS Problem

2015-09-12 Thread Owen O'Malley
Do you have a stack trace of the array-out-of-bounds exception? I don't remember an array-out-of-bounds problem off the top of my head. A stack trace will tell me a lot, obviously. If you are using Spark 1.4, that implies Hive 0.13, which is pretty old. It may be a problem that we fixed a while ago