[jira] [Commented] (SPARK-27842) Inconsistent results of Statistics.corr() and PearsonCorrelation.computeCorrelationMatrix()
[ https://issues.apache.org/jira/browse/SPARK-27842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960562#comment-16960562 ]

Peter Nijem commented on SPARK-27842:
-------------------------------------

Hi Sean,

Thanks for your reply. I have attached a Main class and an input file so you can reproduce the issue and see the differences, along with some guidelines on how to test it. Do you want me to reproduce it and share the results with you?

I am working with the Spark Java API in local mode (1 node, 8 cores). Spark versions as declared in my pom.xml:

*MLLib*: _spark-mllib_2.11_ _2.3.1_
*Core*: _spark-core_2.11_ _2.3.1_

> Inconsistent results of Statistics.corr() and
> PearsonCorrelation.computeCorrelationMatrix()
> ---------------------------------------------
>
> Key: SPARK-27842
> URL: https://issues.apache.org/jira/browse/SPARK-27842
> Project: Spark
> Issue Type: Bug
> Components: ML, Spark Core, Windows
> Affects Versions: 2.3.1
> Reporter: Peter Nijem
> Priority: Major
> Attachments: vectorList.txt
>
> Hi,
> I am working with the Spark Java API in local mode (1 node, 8 cores). Spark versions as declared in my pom.xml:
> *MLLib*: _spark-mllib_2.11_ _2.3.1_
> *Core*: _spark-core_2.11_ _2.3.1_
> I am experiencing inconsistent correlation results when starting my Spark application with 8 cores vs. 1, 2, or 3 cores.
> I created a Main class which reads a list of vectors from a file: 240 vectors, each of length 226.
> As the code below shows, I first initialize Spark with local[*], run the correlation, save the matrix, and stop Spark. Then I do the same with local[3].
> I expect to get the same matrix on both runs, but this is not the case. The input file is attached.
> I also tried to compute the correlation via PearsonCorrelation.computeCorrelationMatrix(), but I hit the same issue there as well.
> My work depends on the resulting correlation matrix, so the inconsistent results lead to bad results in my application. As a workaround, I am now running with local[1].
>
> {code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.DenseVector;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.stat.Statistics;
import org.apache.spark.rdd.RDD;
import org.junit.Assert;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TestSparkCorr {

    private static JavaSparkContext ctx;

    public static void main(String[] args) {
        List<List<Double>> doublesLists = readInputFile();
        List<Vector> resultVectors = getVectorsList(doublesLists);
        //===========================================================
        initSpark("*");
        RDD<Vector> RDD_vector = ctx.parallelize(resultVectors).rdd();
        Matrix m = Statistics.corr(RDD_vector, "pearson");
        stopSpark();
        //===========================================================
        initSpark("3");
        RDD<Vector> RDD_vector_3 = ctx.parallelize(resultVectors).rdd();
        Matrix m3 = Statistics.corr(RDD_vector_3, "pearson");
        stopSpark();
        //===========================================================
        Assert.assertEquals(m3, m);
    }

    private static List<Vector> getVectorsList(List<List<Double>> doublesLists) {
        List<Vector> resultVectors = new ArrayList<>(doublesLists.size());
        for (List<Double> vector : doublesLists) {
            double[] x = new double[vector.size()];
            for (int i = 0; i < x.length; i++) {
                x[i] = vector.get(i);
            }
            resultVectors.add(new DenseVector(x));
        }
        return resultVectors;
    }

    private static List<List<Double>> readInputFile() {
        List<List<Double>> doublesLists = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(
                ".//output//vectorList.txt"))) {
            String line = reader.readLine();
            while (line != null) {
                String[] splitLine = line.substring(1, line.length() - 2).split(",");
                List<Double> doubles = Arrays.stream(splitLine)
                        .map(x -> Double.valueOf(x.trim()))
                        .collect(Collectors.toList());
                doublesLists.add(doubles);
                // read next line
                line = reader.readLine();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return doublesLists;
    }

    private static void initSpark(String coresNum) {
        final SparkConf sparkConf = new SparkConf().setAppName("Span");
        sparkConf.setMaster(String.format("local[%s]", coresNum));
        sparkConf.set("spark.ui.enabled", "false");
        ctx = new JavaSparkContext(sparkConf);
    }

    private static void stopSpark() {
        ctx.stop();
        if (ctx.sc().isStopped()) {
            ctx = null;
        }
    }
}
> {code}
[jira] [Commented] (SPARK-27842) Inconsistent results of Statistics.corr() and PearsonCorrelation.computeCorrelationMatrix()
[ https://issues.apache.org/jira/browse/SPARK-27842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884954#comment-16884954 ]

Peter Nijem commented on SPARK-27842:
-------------------------------------

Hi [~hyukjin.kwon],

Is there any update on this issue?

Thanks,
Peter

> Inconsistent results of Statistics.corr() and
> PearsonCorrelation.computeCorrelationMatrix()
> (quoted issue description and reproducer code unchanged; see the first comment above)

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
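If bit-for-bit reproducibility across partition counts cannot be guaranteed, the assertion in the reproducer can be relaxed to an element-wise tolerance. A sketch under the assumption that both matrices are flattened first (mllib's Matrix exposes toArray()); the arrays below are hypothetical stand-ins for the two runs' results:

```java
public class MatrixApproxCompare {
    // Element-wise comparison within an absolute tolerance, instead of the
    // exact equality that Assert.assertEquals(m3, m) performs.
    static boolean approxEquals(double[] a, double[] b, double tol) {
        if (a.length != b.length) return false;
        for (int i = 0; i < a.length; i++) {
            if (Math.abs(a[i] - b[i]) > tol) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-ins for m.toArray() and m3.toArray(): same correlations,
        // differing only in the last few bits.
        double[] m  = {1.0, 0.5000000000000001, 0.5000000000000001, 1.0};
        double[] m3 = {1.0, 0.5000000000000003, 0.5000000000000003, 1.0};
        System.out.println(approxEquals(m, m3, 1e-9)); // true
        System.out.println(approxEquals(m, m3, 0.0));  // false
    }
}
```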
[jira] [Commented] (SPARK-27842) Inconsistent results of Statistics.corr() and PearsonCorrelation.computeCorrelationMatrix()
[ https://issues.apache.org/jira/browse/SPARK-27842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862739#comment-16862739 ]

Peter Nijem commented on SPARK-27842:
-------------------------------------

[~hyukjin.kwon], do you need any further input from my side? If so, please let me know and I will be glad to help.

Peter

> Inconsistent results of Statistics.corr() and
> PearsonCorrelation.computeCorrelationMatrix()
> (quoted issue description and reproducer code unchanged; see the first comment above)
[jira] [Commented] (SPARK-27842) Inconsistent results of Statistics.corr() and PearsonCorrelation.computeCorrelationMatrix()
[ https://issues.apache.org/jira/browse/SPARK-27842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850631#comment-16850631 ]

Peter Nijem commented on SPARK-27842:
-------------------------------------

The issue reproduces on the latest Spark version as well; I had already tested it before reporting this issue.

_spark-core_2.12_ _2.4.3_
_spark-mllib_2.12_ _2.4.3_

> Inconsistent results of Statistics.corr() and
> PearsonCorrelation.computeCorrelationMatrix()
> (quoted issue description and reproducer code unchanged; see the first comment above)
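As a general numerical mitigation (this is not a Spark API, and it assumes the residual differences come from rounding in the accumulated sums), compensated (Kahan) summation carries an error term alongside the running sum and is far less sensitive to accumulation order:

```java
public class KahanDemo {
    // Kahan compensated summation: `c` accumulates the low-order bits lost
    // by each addition and feeds them back into the next one.
    static double kahanSum(double[] values) {
        double sum = 0.0;
        double c = 0.0;
        for (double v : values) {
            double y = v - c;
            double t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] values = {1e16, 1.0, 1.0, -1e16, 1.0, 1.0};
        System.out.println(kahanSum(values)); // 4.0 -- the exact sum; naive left-to-right gives 2.0
    }
}
```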