Re: Spark 1.6.2 version displayed as 1.6.1

2016-07-24 Thread Sean Owen
Are you certain? It looks like it was correct in the release: https://github.com/apache/spark/blob/v1.6.2/core/src/main/scala/org/apache/spark/package.scala On Mon, Jul 25, 2016 at 12:33 AM, Ascot Moss wrote: > Hi, > > I am trying to upgrade spark from 1.6.1 to 1.6.2, from


Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Nick Pentreath
Good suggestion, Krishna. One issue is that this doesn't work with TrainValidationSplit or CrossValidator for parameter tuning. Hence my solution in the PR, which makes it work with the cross-validators. On Mon, 25 Jul 2016 at 00:42, Krishna Sankar wrote: > Thanks Nick. I

Re: K-means Evaluation metrics

2016-07-24 Thread Yanbo Liang
Spark MLlib KMeansModel provides a "computeCost" function which returns the sum of squared distances of points to their nearest center as the k-means cost on the given dataset. Thanks Yanbo 2016-07-24 17:30 GMT-07:00 janardhan shetty : > Hi, > > I was trying to evaluate
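
A minimal sketch of that usage, assuming the spark.mllib KMeans API and a toy RDD[Vector]:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
      Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))

    val model = KMeans.train(data, 2, 20)   // k = 2, 20 iterations
    val wssse = model.computeCost(data)     // sum of squared distances to nearest center
    println(s"WSSSE = $wssse")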

Re: Frequent Item Pattern Spark ML Dataframes

2016-07-24 Thread Yanbo Liang
You can refer to this JIRA (https://issues.apache.org/jira/browse/SPARK-14501) for porting spark.mllib.fpm to spark.ml. Thanks Yanbo 2016-07-24 11:18 GMT-07:00 janardhan shetty : > Is there any implementation of FPGrowth and Association rules in Spark > Dataframes ? > We
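
For reference, a sketch of the existing RDD-based spark.mllib API that the JIRA proposes to port, assuming transactions of string items:

    import org.apache.spark.mllib.fpm.FPGrowth

    val transactions = sc.parallelize(Seq(
      Array("a", "b", "c"),
      Array("a", "b"),
      Array("b", "c")))

    val model = new FPGrowth().setMinSupport(0.5).run(transactions)

    model.freqItemsets.collect().foreach { is =>
      println(is.items.mkString("[", ",", "]") + " -> " + is.freq)   // frequent itemsets with counts
    }
    model.generateAssociationRules(0.8).collect().foreach(println)   // rules at 80% confidence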

where I can find spark-streaming-kafka for spark2.0

2016-07-24 Thread kevin
Hi all, I tried to run the example org.apache.spark.examples.streaming.KafkaWordCount and got this error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$ at org.apache.spark.examples.streaming.KafkaWordCount$.main(KafkaWordCount.scala:57) at
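
For what it's worth, in Spark 2.0 the Kafka integration ships as a separate artifact rather than on the default classpath; assuming an sbt build, pulling in the 0.8 connector would look roughly like this:

    // build.sbt -- assumed coordinates for the Spark 2.0 Kafka 0.8 connector
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
    )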

[Error] : Save dataframe to csv using Spark-csv in Spark 1.6

2016-07-24 Thread Divya Gehlot
Hi, I am getting the error below when I am trying to save a dataframe using Spark-CSV:

    final_result_df.write.format("com.databricks.spark.csv").option("header","true").save(output_path)

    java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
    at
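
A NoSuchMethodError on scala.Predef$.$conforms typically indicates a Scala version mismatch, e.g. a spark-csv artifact built for Scala 2.11 running on a Scala 2.10 Spark build. A hedged sbt sketch of keeping the suffixes aligned (versions are illustrative):

    // build.sbt -- the spark-csv Scala suffix must match the Scala version of the Spark build
    scalaVersion := "2.10.5"
    libraryDependencies += "com.databricks" % "spark-csv_2.10" % "1.4.0"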

Re: Bzip2 to Parquet format

2016-07-24 Thread Andrew Ehrlich
You can load the text with sc.textFile() into an RDD[String], then use .map() to convert it into an RDD[Row]. At this point you are ready to apply a schema: use sqlContext.createDataFrame(rddOfRow, structType). Here is an example of how to define the StructType (schema) that you will combine with
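
A minimal sketch of the whole pipeline, assuming tab-delimited text with two fields; paths are placeholders, and sc.textFile reads .bz2 files transparently:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

    val lines = sc.textFile("/data/input.bz2")          // RDD[String]

    val rows = lines.map { line =>
      val fields = line.split("\t")
      Row(fields(0), fields(1).toInt)                   // RDD[Row]
    }

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("count", IntegerType, nullable = true)))

    val df = sqlContext.createDataFrame(rows, schema)
    df.write.parquet("/data/output.parquet")            // write out as Parquet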

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Rohit Chaddha
Great, thanks both of you. I was struggling with this issue as well. -Rohit On Mon, Jul 25, 2016 at 4:12 AM, Krishna Sankar wrote: > Thanks Nick. I also ran into this issue. > VG, One workaround is to drop the NaN from predictions (df.na.drop()) and > then use the dataset

Re: Size exceeds Integer.MAX_VALUE

2016-07-24 Thread Andrew Ehrlich
You can use the .repartition() function on the rdd or dataframe to set the number of partitions higher. Use .partitions.length to get the current number of partitions (Scala API). Andrew > On Jul 24, 2016, at 4:30 PM, Ascot Moss wrote: > > the data set is the training
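
A quick sketch, assuming an existing rdd:

    println(rdd.partitions.length)             // current number of partitions
    val repartitioned = rdd.repartition(2000)  // shuffle into more, smaller partitions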

Bzip2 to Parquet format

2016-07-24 Thread janardhan shetty
We have data in Bz2 compression format. Are there any links on converting it to Parquet in Spark, and also performance benchmarks or study materials?

K-means Evaluation metrics

2016-07-24 Thread janardhan shetty
Hi, I was trying to evaluate k-means clustering predictions, since the exact cluster numbers were provided beforehand for each data point. I just tried Error = Predicted cluster number - Given number as a brute-force method. What are the evaluation metrics available in Spark for K-means

Re: Maintaining order of pair rdd

2016-07-24 Thread janardhan shetty
Thanks Marco. This solved the order problem. I had another question which precedes this one. As you can see below, ID2, ID1 and ID3 are in order and I need to maintain this index order as well. But when we do a groupByKey operation (rdd.distinct.groupByKey().mapValues(v => v.toArray)), everything is

Spark 1.6.2 version displayed as 1.6.1

2016-07-24 Thread Ascot Moss
Hi, I am trying to upgrade spark from 1.6.1 to 1.6.2. From the 1.6.2 spark-shell, I found the version is still displayed as 1.6.1. Is this a minor typo/bug? Regards ### [spark-shell welcome banner follows, still showing version 1.6.1]

Re: Size exceeds Integer.MAX_VALUE

2016-07-24 Thread Ascot Moss
the data set is the training data set for random forest training, about 36,500 records. Any idea how to further partition it? On Sun, Jul 24, 2016 at 12:31 PM, Andrew Ehrlich wrote: > It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235 which > limits the

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Krishna Sankar
Thanks Nick. I also ran into this issue. VG, One workaround is to drop the NaN from predictions (df.na.drop()) and then use the dataset for the evaluator. In real life, probably detect the NaN and recommend most popular on some window. HTH. Cheers On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath
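
A sketch of that workaround, assuming a fitted ALS model, a test DataFrame, and rating/prediction columns:

    import org.apache.spark.ml.evaluation.RegressionEvaluator

    val predictions = model.transform(test)
    val cleaned = predictions.na.drop()   // drop rows where ALS produced NaN for unseen users/items

    val evaluator = new RegressionEvaluator()
      .setMetricName("rmse")
      .setLabelCol("rating")
      .setPredictionCol("prediction")

    val rmse = evaluator.evaluate(cleaned)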

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Nick Pentreath
It seems likely that you're running into https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the test dataset in the train/test split contains users or items that were not in the training set. Hence the model doesn't have computed factors for those ids, and ALS 'transform'

Re: Maintaining order of pair rdd

2016-07-24 Thread Marco Mistroni
Hello. Uhm, you have an array containing 3 tuples? If all the arrays have the same length, you can just zip all of them, creating a list of tuples, then you can scan the list 5 by 5...? So something like (Array(0)._2, Array(1)._2, Array(2)._2).zipped.toList; this will give you a list of tuples of 3
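
A toy sketch of the zipped idea, with three equal-length arrays standing in for the tuple values:

    val a = Array(1, 2, 3)
    val b = Array(10, 20, 30)
    val c = Array(100, 200, 300)

    val triples = (a, b, c).zipped.toList
    // List((1,10,100), (2,20,200), (3,30,300)) -- positions stay aligned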

Outer Explode needed

2016-07-24 Thread Don Drake
I have a nested data structure (array of structures) that I'm flattening with the DSL df.explode() API. However, when the array is empty, I'm not getting the rest of the row in my output, as it is skipped. This is the intended behavior, and Hive supports a SQL "OUTER explode()" to

Re: Spark driver getting out of memory

2016-07-24 Thread Raghava Mutharaju
Saurav, We have the same issue. Our application runs fine on 32 nodes with 4 cores each and 256 partitions but gives an OOM on the driver when run on 64 nodes with 512 partitions. Did you get to know the reason behind this behavior or the relation between number of partitions and driver RAM

Re: UDF to build a Vector?

2016-07-24 Thread Marco Mistroni
Hi, what is your source data? I am guessing a DataFrame of Integers, as you are using a UDF. So your DataFrame is then a bunch of Row[Integer]? Below is a sample from one of my codebases to predict Eurocup winners, going from a DataFrame of Row[Double] to an RDD of LabeledPoint. I'm not using a UDF to
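
A hedged sketch of that DataFrame-to-LabeledPoint route, assuming the first column is the label and the remaining columns are all doubles:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val labeled = df.rdd.map { row =>
      val label = row.getDouble(0)
      val features = (1 until row.length).map(i => row.getDouble(i)).toArray
      LabeledPoint(label, Vectors.dense(features))   // ready for spark.mllib trainers
    }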

Frequent Item Pattern Spark ML Dataframes

2016-07-24 Thread janardhan shetty
Is there any implementation of FPGrowth and Association rules in Spark Dataframes? We have them for RDDs, but any pointers for Dataframes?

Re: spark context stop vs close

2016-07-24 Thread Sean Owen
I think this is about JavaSparkContext which implements the standard Closeable interface for convenience. Both do exactly the same thing. On Sun, Jul 24, 2016 at 6:27 PM, Jacek Laskowski wrote: > Hi, > > I can only find stop. Where did you find close? > > Pozdrawiam, > Jacek

How to read content of hdfs files

2016-07-24 Thread Bhupendra Mishra
I have HDFS data in zip format which includes data, name, and namesecondary folders. The structure is pretty much like datanode, namenode, and secondary namenode. How do I read the content of the data? It would be great if someone could suggest tips/steps. Thanks

Re: spark context stop vs close

2016-07-24 Thread Jacek Laskowski
Hi, I can only find stop. Where did you find close? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sat, Jul 23, 2016 at 3:11 PM, Mail.com

Spark 2.0.0 RC 5 -- java.lang.AssertionError: assertion failed: Block rdd_[*] is not locked for reading

2016-07-24 Thread Ameen Akel
Hello, I'm working with Spark 2.0.0-rc5 on Mesos (v0.28.2) on a job with ~600 cores. Every so often, depending on the task that I've run, I'll lose an executor to an assertion. Here's an example error: java.lang.AssertionError: assertion failed: Block rdd_2659_0 is not locked for reading I've

Re: Maintaining order of pair rdd

2016-07-24 Thread janardhan shetty
Array( (ID1,Array(18159, 308703, 72636, 64544, 39244, 107937, 54477, 145272, 100079, 36318, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076, 45431, 100136)), (ID3,Array(100079, 19622, 18159, 212064, 107937, 44683, 150022, 39244, 100136, 58866, 72636, 145272, 817, 89366, 54477, 36318,

java.lang.RuntimeException: Unsupported type: vector

2016-07-24 Thread Jean Georges Perrin
I'm trying to build a simple DataFrame that can be used for ML:

    SparkConf conf = new SparkConf().setAppName("Simple prediction from Text File").setMaster("local");
    SparkContext sc = new SparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

UDF to build a Vector?

2016-07-24 Thread Jean Georges Perrin
Hi, here is my UDF that should build a VectorUDT. How do I actually make sure that the value ends up in the vector?

    package net.jgp.labs.spark.udf;

    import org.apache.spark.mllib.linalg.VectorUDT;
    import org.apache.spark.sql.api.java.UDF1;

    public class VectorBuilder implements UDF1
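
For comparison, a hedged Scala sketch of the same idea: returning an mllib Vector from a UDF works because Vector carries a registered VectorUDT (assuming Spark 1.6 and a hypothetical numeric column named "value"):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.functions.udf

    // wrap a double column into a one-element Vector; VectorUDT handles the SQL type mapping
    val toVector = udf((x: Double) => Vectors.dense(x))

    val withVec = df.withColumn("features", toVector(df("value")))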

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread VG
Ping. Does anyone have some suggestions/advice for me? It would be really helpful. VG On Sun, Jul 24, 2016 at 12:19 AM, VG wrote: > Sean, > > I did this just to test the model. When I do a split of my data as > training to 80% and test to be 20% > > I get a Root-mean-square error

Restarting Spark Streaming Job periodically

2016-07-24 Thread Prashant verma
Hi All, I want to restart my spark streaming job periodically, after every 15 minutes, using Java. Is it possible, and if yes, how should I proceed? Thanks, Prashant

Re: Maintaining order of pair rdd

2016-07-24 Thread Marco Mistroni
Apologies, I misinterpreted. Could you post two use cases? Kr On 24 Jul 2016 3:41 pm, "janardhan shetty" wrote: > Marco, > > Thanks for the response. It is indexed order and not ascending or > descending order. > On Jul 24, 2016 7:37 AM, "Marco Mistroni"

Re: Maintaining order of pair rdd

2016-07-24 Thread janardhan shetty
Marco, Thanks for the response. It is indexed order and not ascending or descending order. On Jul 24, 2016 7:37 AM, "Marco Mistroni" wrote: > Use map values to transform to an rdd where values are sorted? > Hth > > On 24 Jul 2016 6:23 am, "janardhan shetty"

Re: How to generate a sequential key in rdd across executors

2016-07-24 Thread Marco Mistroni
Hi, how about creating an auto-increment column in HBase? Hth On 24 Jul 2016 3:53 am, "yeshwanth kumar" wrote: > Hi, > > i am doing bulk load to hbase using spark, > in which i need to generate a sequential key for each record, > the key should be sequential across all the

Re: Locality sensitive hashing

2016-07-24 Thread Yanbo Liang
Hi Janardhan, Please refer to the JIRA (https://issues.apache.org/jira/browse/SPARK-5992) for the discussion about LSH. Regards Yanbo 2016-07-24 7:13 GMT-07:00 Karl Higley : > Hi Janardhan, > > I collected some LSH papers while working on an RDD-based implementation. > Links

Re: Locality sensitive hashing

2016-07-24 Thread Karl Higley
Hi Janardhan, I collected some LSH papers while working on an RDD-based implementation. Links at the end of the README here: https://github.com/karlhigley/spark-neighbors Keep me posted on what you come up with! Best, Karl On Sun, Jul 24, 2016 at 9:54 AM janardhan shetty

Locality sensitive hashing

2016-07-24 Thread janardhan shetty
I was looking into implementing locality sensitive hashing on dataframes. Any pointers for reference?

Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Sorry for the wrong link; what you should refer to is jpmml-sparkml (https://github.com/jpmml/jpmml-sparkml). Thanks Yanbo 2016-07-24 4:46 GMT-07:00 Yanbo Liang : > Spark does not support exporting ML models to PMML currently. You can try > the third party jpmml-spark

Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Spark does not support exporting ML models to PMML currently. You can try the third-party jpmml-spark (https://github.com/jpmml/jpmml-spark) package, which supports a subset of ML models. Thanks Yanbo 2016-07-20 11:14 GMT-07:00 Ajinkya Kale : > Just found Google dataproc has

Re: Using flatMap on Dataframes with Spark 2.0

2016-07-24 Thread Julien Nauroy
Hi again, just another strange behavior I stumbled upon. Can anybody reproduce it? Here's the code snippet in scala:

    var df1 = spark.read.parquet(fileName)
    df1 = df1.withColumn("newCol", df1.col("anyExistingCol"))
    df1.printSchema() // here newCol exists
    df1 = df1.flatMap(x => List(x))

Re: How to generate a sequential key in rdd across executors

2016-07-24 Thread Pedro Rodriguez
If you can use a dataframe then you could use rank + a window function, at the expense of an extra sort. Do you have an example of zipWithIndex not working? That seems surprising. On Jul 23, 2016 10:24 PM, "Andrew Ehrlich" wrote: > It’s hard to do in a distributed system.
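
Sketches of both routes, assuming an existing rdd and df; the ordering column is a placeholder:

    // RDD route: a stable index across partitions, no shuffle needed
    val indexed = rdd.zipWithIndex()          // RDD[(T, Long)]

    // DataFrame route: a window function, at the cost of a global sort
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val w = Window.orderBy("someColumn")
    val withId = df.withColumn("id", row_number().over(w))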

Re: Distributed Matrices - spark mllib

2016-07-24 Thread Yanbo Liang
Hi Gourav, I cannot reproduce your problem. The following code snippet works well on my local machine; you can try to verify it in your environment. Or could you provide more information so that others can reproduce your problem? from pyspark.mllib.linalg.distributed import CoordinateMatrix,
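
The thread is pyspark, but for reference here is a Scala sketch of the same construction, with made-up entries:

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    val entries = sc.parallelize(Seq(
      MatrixEntry(0, 0, 1.2),
      MatrixEntry(1, 0, 2.1),
      MatrixEntry(2, 1, 3.7)))

    val mat = new CoordinateMatrix(entries)
    println((mat.numRows(), mat.numCols()))   // inferred dimensions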

Re: NoClassDefFoundError with ZonedDateTime

2016-07-24 Thread Timur Shenkao
Which version of Java 8 do you use? AFAIK, it's recommended to use Java 1.8.0_66 or later. On Fri, Jul 22, 2016 at 8:49 PM, Jacek Laskowski wrote: > On Fri, Jul 22, 2016 at 6:43 AM, Ted Yu wrote: > > You can use this command (assuming log aggregation is turned

Re: Spark, Scala, and DNA sequencing

2016-07-24 Thread Sean Owen
Also, you may be interested in GATK, built on Spark, for genomics: https://github.com/broadinstitute/gatk On Sun, Jul 24, 2016 at 7:56 AM, Ofir Manor wrote: > Hi James, > BTW - if you are into analyzing DNA with Spark, you may also be interested > in ADAM: >

Re: Spark, Scala, and DNA sequencing

2016-07-24 Thread Ofir Manor
Hi James, BTW - if you are into analyzing DNA with Spark, you may also be interested in ADAM: https://github.com/bigdatagenomics/adam http://bdgenomics.org/ Ofir Manor Co-Founder & CTO | Equalum Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io On Fri, Jul 22, 2016 at 10:31 PM,