Alberto Bonsanto created SPARK-11992:
----------------------------------------

             Summary: Several numbers in my spark shell (pyspark)
                 Key: SPARK-11992
                 URL: https://issues.apache.org/jira/browse/SPARK-11992
             Project: Spark
          Issue Type: Question
          Components: MLlib, PySpark
    Affects Versions: 1.5.2
         Environment: Linux Ubuntu 14.04 LTS
Jupyter 
Spark 1.5.2
            Reporter: Alberto Bonsanto
            Priority: Blocker


The problem is very weird. I am trying to fit several classifiers from the 
mllib library (SVM, LogisticRegression, RandomForest, DecisionTree and 
NaiveBayes) and compare their performance by evaluating their predictions on 
my validation data (the typical pipeline). Whenever I try to fit any of them, 
my spark-shell console prints millions of numeric entries, and after that the 
fitting process stops; you can see it 
[here|http://i.imgur.com/mohLnwr.png]

Some details:
- My data has around 15M entries.
- Each entry is a LabeledPoint whose features are a SparseVector with *104* 
features (dimensions).
- I don't print much to the console; here is my 
[log4j.properties|https://gist.github.com/Bonsanto/c487624db805f56882b8]
- The program runs locally on a machine with 16GB of RAM. 
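For context, console verbosity in spark-shell is normally controlled by the root logger threshold in conf/log4j.properties (the standard Spark template line, shown here as a sketch; my full file is in the gist linked above):

```properties
# conf/log4j.properties -- only warnings and errors reach the console
log4j.rootCategory=WARN, console
```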

I have already asked this on StackOverflow; you can see it here: [Crazy 
print|http://stackoverflow.com/questions/33807347/pyspark-shell-outputs-several-numbers-instead-of-the-loading-arrow]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
