How to read multiple libsvm files in Spark?

2018-09-20 Thread Md. Rezaul Karim
I'm getting the following exception while trying to read multiple libsvm
files using Spark 2.3.0:

Exception in thread "main" java.io.IOException: Multiple input paths are
not supported for libsvm data

Here's the code:

val URLs = spark.read.format("libsvm").load("url_svmlight.tar/url_svmlight/*.svm")

Any other alternatives?
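A possible workaround (not from the original thread, just a sketch): enumerate the .svm files yourself, load each one with the libsvm reader, and union the per-file DataFrames. Setting numFeatures explicitly keeps the vector size consistent across files; the path and feature count below are placeholders.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Local master only for the sketch; in a real job the session already exists.
val spark = SparkSession.builder().appName("MultiLibSVM").master("local[*]").getOrCreate()

// List the .svm files on the local filesystem (illustrative path).
val files = new java.io.File("url_svmlight.tar/url_svmlight")
  .listFiles()
  .filter(_.getName.endsWith(".svm"))
  .map(_.getPath)
  .toSeq

// Load each file separately and union the resulting DataFrames.
// numFeatures is a placeholder; set it to the real dimensionality so every
// per-file DataFrame ends up with the same vector size.
val numFeatures = 100
val all: DataFrame = files
  .map(f => spark.read.format("libsvm").option("numFeatures", numFeatures.toString).load(f))
  .reduce(_ union _)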


Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Md. Rezaul Karim
Hi All,

Thanks for the prompt response, really appreciated! Here's some more info:

1. Spark version: 2.3.0
2. vCore: 8
3. RAM: 32GB
4. Deploy mode: Spark standalone

*Operation performed:* I did transformations using StringIndexer on some
columns and null imputations. That's all.

*Why write back to CSV:* I need to write the DataFrame into CSV so that it
can be used by a non-Spark application. Moreover, I need to perform the
same pre-processing on a larger dataset (about 2GB), and this one is just a
sample. So writing to Parquet or ORC is not a viable option for me.

I was trying to use Spark only for the pre-processing. By the way, I tried
using Spark's built-in CSV library too.




Best,



Md. Rezaul Karim, BSc, MSc
Research Scientist, Fraunhofer FIT, Germany

Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany

eMail: rezaul.ka...@fit.fraunhofer.de
Tel: +49 241 80-21527

On 9 March 2018 at 13:41, Teemu Heikkilä <te...@emblica.fi> wrote:

> Sounds like you're doing something more than just writing the same file
> back to disk; what does your preprocessing consist of?
>
> Sometimes you can save lots of space by using other formats, but here we're
> talking about an over-200x increase in file size, so depending on the
> transformations applied to the data you might not get such huge savings just
> by switching formats.
>
> If you can give more details about what you are doing with the data we
> could probably help with your task.
>
> The slowness probably happens because Spark has to use disk while shuffling
> the data into a single partition to write one file. One thing to reconsider
> is whether you can merge the output files after the job, or even
> pre-partition the data for its final use case.
>
> - Teemu
>
> On 9.3.2018, at 12.23, Md. Rezaul Karim <rezaul.ka...@insight-centre.org>
> wrote:
>
> Dear All,
>
> I have a tiny CSV file, which is around 250MB. There are only 30 columns
> in the DataFrame. Now I'm trying to save the pre-processed DataFrame as
> another CSV file on disk for later usage.
>
> However, writing the resultant DataFrame is frustratingly slow, taking
> about 4 to 5 hours. Moreover, the file written to disk is about 58GB!
>
> Here's the sample code that I tried:
>
> # Using repartition()
> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>
> # Using coalesce()
> myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>
>
> Any better suggestion?
>
>
>
> 
> Md. Rezaul Karim, BSc, MSc
> Research Scientist, Fraunhofer FIT, Germany
> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
> eMail: rezaul.ka...@fit.fraunhofer.de
> Tel: +49 241 80-21527
>
>
>


Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Md. Rezaul Karim
Dear All,

I have a tiny CSV file, which is around 250MB. There are only 30 columns in
the DataFrame. Now I'm trying to save the pre-processed DataFrame as
another CSV file on disk for later usage.

However, writing the resultant DataFrame is frustratingly slow, taking
about 4 to 5 hours. Moreover, the file written to disk is about 58GB!

Here's the sample code that I tried:

# Using repartition()
myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")

# Using coalesce()
myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")


Any better suggestion?
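A couple of alternatives worth trying (my own sketch, not a confirmed fix from this thread): use Spark's built-in CSV writer rather than the external com.databricks.spark.csv package, and collapse to a single partition only at write time, or skip the collapse entirely and concatenate the part files afterwards. myDF is the DataFrame from the code above; paths are illustrative.

// Built-in CSV writer (Spark 2.x); keeps the default write parallelism,
// producing one part file per partition.
myDF.write
  .option("header", "true")
  .mode("overwrite")
  .csv("data/output_csv")

// If a single file is strictly required, coalesce only at write time,
// accepting that the final stage runs as one task.
myDF.coalesce(1)
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("data/single_file_csv")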




Md. Rezaul Karim, BSc, MSc
Research Scientist, Fraunhofer FIT, Germany

Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany

eMail: rezaul.ka...@fit.fraunhofer.de
Tel: +49 241 80-21527


Reinforcement Learning with Spark

2018-01-05 Thread Md. Rezaul Karim
Hi All,

Is there any Reinforcement Learning algorithm implemented on top of Spark,
i.e. any link to a GitHub/open-source project, etc.?


Best,



Md. Rezaul Karim, BSc, MSc
Research Scientist, Fraunhofer FIT, Germany

Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany

eMail: rezaul.ka...@fit.fraunhofer.de
Tel: +49 241 80-21527


SpecificColumnarIterator has grown past JVM limit of 0xFFF

2017-11-17 Thread Md. Rezaul Karim
Dear All,

I was training the RandomForest with an input dataset having 20,000 columns
and 12,000 rows.
But when I start the training, it shows an exception:

Constant pool for class
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator
has grown past JVM limit of 0xFFF

I understand that the current implementation cannot handle so many columns.
However, I was still wondering if there's any workaround to handle a
dataset like this?
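Two workarounds that may be worth trying (my own suggestions, not a confirmed fix from this thread): collapse the 20,000 feature columns into a single vector column with VectorAssembler before training, so code generation only has to deal with a couple of columns, and/or disable whole-stage code generation. In the sketch below, spark and df (with a "label" column) are assumed to exist.

import org.apache.spark.ml.feature.VectorAssembler

// Optionally turn off whole-stage codegen, which is what generates the
// oversized SpecificColumnarIterator class.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

// Assemble all feature columns into one vector column up front.
val featureCols = df.columns.filter(_ != "label")
val assembled = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .transform(df)
  .select("label", "features")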





Kind regards,
_

*Md. Rezaul Karim*, BSc, MSc

Research Scientist, Fraunhofer FIT, Germany
PhD Researcher, Information Systems, RWTH Aachen University, Germany
*Email:* rezaul.ka...@fit.fraunhofer.de
*Phone*: +49 241 80 21527

*Web:* http://www.reza-analytics.eu/index.html


Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Md. Rezaul Karim
Hi Nick,

Both approaches worked and I realized my silly mistake too. Thank you so
much.

@Xu, thanks for the update.





Best,

Regards,
_
*Md. Rezaul Karim*, BSc, MSc
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 30 October 2017 at 10:40, Weichen Xu <weichen...@databricks.com> wrote:

> Yes I am working on this. Sorry for late, but I will try to submit PR
> ASAP. Thanks!
>
> On Mon, Oct 30, 2017 at 5:19 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> For now, you must follow this approach of constructing a pipeline
>> consisting of a StringIndexer for each categorical column. See
>> https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA
>> to allow multiple columns for StringIndexer, which is being worked on
>> currently.
>>
>> The reason you're seeing an NPE is:
>>
>> var indexers: Array[StringIndexer] = null
>>
>> and then you're trying to append an element to something that is null.
>>
>> Try this instead:
>>
>> var indexers: Array[StringIndexer] = Array()
>>
>>
>> But even better is a more functional approach:
>>
>> val indexers = featureCol.map { colName =>
>>
>>   new StringIndexer().setInputCol(colName).setOutputCol(colName + "_indexed")
>>
>> }
>>
>>
>> On Fri, 27 Oct 2017 at 22:29 Md. Rezaul Karim <
>> rezaul.ka...@insight-centre.org> wrote:
>>
>>> Hi All,
>>>
>>> There are several categorical columns in my dataset as follows:
>>> [image: grafik.png]
>>>
>>> How can I transform values in each (categorical) columns into numeric
>>> using StringIndexer so that the resulting DataFrame can be feed into
>>> VectorAssembler to generate a feature vector?
>>>
>>> A naive approach that I can try using StringIndexer for each
>>> categorical column. But that sounds hilarious, I know.
>>> A possible workaround
>>> <https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe>in
>>> PySpark is combining several StringIndexer on a list and use a Pipeline
>>> to execute them all as follows:
>>>
>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.feature import StringIndexer
>>> indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df)
>>>             for column in list(set(df.columns) - set(['date']))]
>>> pipeline = Pipeline(stages=indexers)
>>> df_r = pipeline.fit(df).transform(df)
>>> df_r.show()
>>>
>>> How I can do the same in Scala? I tried the following:
>>>
>>> val featureCol = trainingDF.columns
>>> var indexers: Array[StringIndexer] = null
>>>
>>> for (colName <- featureCol) {
>>>   val index = new StringIndexer()
>>> .setInputCol(colName)
>>> .setOutputCol(colName + "_indexed")
>>> //.fit(trainDF)
>>>   indexers = indexers :+ index
>>> }
>>>
>>>  val pipeline = new Pipeline()
>>> .setStages(indexers)
>>> val newDF = pipeline.fit(trainingDF).transform(trainingDF)
>>> newDF.show()
>>>
>>> However, I am experiencing NullPointerException at
>>>
>>> for (colName <- featureCol)
>>>
>>> I am sure, I am doing something wrong. Any suggestion?
>>>
>>>
>>>
>>> Regards,
>>> _
>>> *Md. Rezaul Karim*, BSc, MSc
>>> Researcher, INSIGHT Centre for Data Analytics
>>> National University of Ireland, Galway
>>> IDA Business Park, Dangan, Galway, Ireland
>>> Web: http://www.reza-analytics.eu/index.html
>>>
>>
>


StringIndexer on several columns in a DataFrame with Scala

2017-10-27 Thread Md. Rezaul Karim
Hi All,

There are several categorical columns in my dataset as follows:
[image: Inline images 1]

How can I transform values in each (categorical) columns into numeric using
StringIndexer so that the resulting DataFrame can be feed into
VectorAssembler to generate a feature vector?

A naive approach would be to use a separate StringIndexer for each
categorical column, but that sounds tedious, I know.
A possible workaround in PySpark
<https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe>
is to combine several StringIndexers in a list and use a Pipeline to
execute them all, as follows:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df)
            for column in list(set(df.columns) - set(['date']))]
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)
df_r.show()

How can I do the same in Scala? I tried the following:

val featureCol = trainingDF.columns
var indexers: Array[StringIndexer] = null

for (colName <- featureCol) {
  val index = new StringIndexer()
.setInputCol(colName)
.setOutputCol(colName + "_indexed")
//.fit(trainDF)
  indexers = indexers :+ index
}

 val pipeline = new Pipeline()
.setStages(indexers)
val newDF = pipeline.fit(trainingDF).transform(trainingDF)
newDF.show()

However, I am getting a NullPointerException at:

for (colName <- featureCol)

I am sure I am doing something wrong. Any suggestion?
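For reference, the fix that worked (per Nick Pentreath's reply quoted earlier in this digest) is to build the indexers functionally instead of appending to a null array; a sketch, reusing trainingDF from the code above. Note that in later Spark releases (3.0+) a single StringIndexer can also handle several columns via setInputCols/setOutputCols.

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer

val featureCol = trainingDF.columns

// Build one StringIndexer per column; the explicit Array[PipelineStage]
// type keeps setStages happy across Spark 2.x versions.
val indexers: Array[PipelineStage] = featureCol.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
}

val pipeline = new Pipeline().setStages(indexers)
val newDF = pipeline.fit(trainingDF).transform(trainingDF)
newDF.show()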



Regards,
_____________
*Md. Rezaul Karim*, BSc, MSc
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


WARN: Truncated the string representation with df.describe()

2017-10-16 Thread Md. Rezaul Karim
Hi,

When I try to see the statistics of a DataFrame using the df.describe()
method, I get the following WARN and, as a result, nothing gets printed:

17/10/16 18:37:54 WARN Utils: Truncated the string representation of a plan
since it was too large. This behavior can be adjusted by setting
'spark.debug.maxToStringFields' in SparkEnv.conf.

Now, is it possible to configure this from the Spark application or an IDE
such as Eclipse, rather than in SparkEnv.conf? I tried the following in my
Spark application from Eclipse, but it does not work:

*Try 1: *
spark.conf.set("spark.debug.maxToStringFields", 1)

*Try 2: *
val DEFAULT_MAX_TO_STRING_FIELDS = 2500
if (SparkEnv.get != null) {
  SparkEnv.get.conf.getInt("spark.debug.maxToStringFields",
DEFAULT_MAX_TO_STRING_FIELDS)
} else {
  DEFAULT_MAX_TO_STRING_FIELDS
}

Any clue?
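One thing worth trying (my suggestion, not a confirmed answer from this thread): in Spark 2.x, spark.debug.maxToStringFields is read from the SparkConf of the running SparkEnv, so setting it via spark.conf.set after the session is up (Try 1) has no effect. Passing it as a config while building the SparkSession should work; a sketch with illustrative values and input path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DescribeExample")
  .master("local[*]")
  .config("spark.debug.maxToStringFields", "10000") // must be set before the session starts
  .getOrCreate()

spark.read.option("header", "true").csv("data/input.csv").describe().show()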


Bayesian network with Spark

2017-09-11 Thread Md. Rezaul Karim
Hi All,

I am planning to use a Bayesian network to integrate and infer the links
between miRNA and proteins based on their expression.

Is there any Bayesian network implementation in Spark that I can adapt to
feed my data into?




Regards,
_
*Md. Rezaul Karim*, BSc, MSc
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Re: [Spark ML] LogisticRegressionWithSGD

2017-06-29 Thread Md. Rezaul Karim
+1


On Jun 29, 2017 10:46 PM, "Kevin Quinn"  wrote:

> Hello,
>
> I'd like to build a system that leverages semi-online updates and I wanted
> to use stochastic gradient descent.  However, after looking at the
> documentation it looks like that method is deprecated.  Is there a reason
> why it was deprecated?  Is there a planned replacement?  As far as I know
> L-BFGS cannot perform online updates (at least not in the code I read).
> Any help would be appreciated!
>
> Thanks.
>


RE: IDE for python

2017-06-28 Thread Md. Rezaul Karim
By the way, PyCharm from JetBrains also has a Community Edition, which is
free and open source.

Moreover, if you are a student, you can use the Professional Edition as
well.

For more, see here: https://www.jetbrains.com/student/

On Jun 28, 2017 11:18 AM, "Sotola, Radim"  wrote:

> PyCharm is a good choice. I buy a monthly subscription and can see that
> PyCharm development continues (I mean that this is not a tool which somebody
> developed and then left without any upgrades).
>
>
>
> *From:* Abhinay Mehta [mailto:abhinay.me...@gmail.com]
> *Sent:* Wednesday, June 28, 2017 11:06 AM
> *To:* ayan guha 
> *Cc:* User ; Xiaomeng Wan 
> *Subject:* Re: IDE for python
>
>
>
> I use Pycharm and it works a treat. The big advantage I find is that I can
> use the same command shortcuts that I do when developing with IntelliJ IDEA
> when doing Scala or Java.
>
>
>
>
>
> On 27 June 2017 at 23:29, ayan guha  wrote:
>
> Depends on the need. For data exploration, I use notebooks whenever I can.
> For development, any good text editor should work; I use Sublime. If you
> want auto-completion and all, you can use Eclipse or PyCharm, I do not :)
>
>
>
> On Wed, 28 Jun 2017 at 7:17 am, Xiaomeng Wan  wrote:
>
> Hi,
>
> I recently switched from Scala to Python, and wondered which IDE people
> are using for Python. I have heard about PyCharm, Spyder, etc. How do they
> compare with each other?
>
>
>
> Thanks,
>
> Shawn
>
> --
>
> Best Regards,
> Ayan Guha
>
>
>


Re: Could you please add a book info on Spark website?

2017-06-25 Thread Md. Rezaul Karim
Thanks, Sean. I will ask them to do so.







Regards,
_
*Md. Rezaul Karim*, BSc, MSc, PhD
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 25 June 2017 at 12:39, Sean Owen <so...@cloudera.com> wrote:

> Please get Packt to fix their existing PR. It's been open for months
> https://github.com/apache/spark-website/pull/35
>
> On Sun, Jun 25, 2017 at 12:33 PM Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Hi Sean,
>>
>> Last time, you helped me add a book info (in the books section) on this
>> page https://spark.apache.org/documentation.html.
>>
>> Could you please add another book info. Here's necessary information
>> about the book:
>>
>> *Title*: Scala and Spark for Big Data Analytics
>> *Authors*: Md. Rezaul Karim, Sridhar Alla
>> *Publisher*: Packt Publishing
>> *URL*: https://www.packtpub.com/big-data-and-business-intelligence/scala-and-spark-big-data-analytics
>>
>>
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim*, BSc, MSc, PhD
>> Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>>
>


Could you please add a book info on Spark website?

2017-06-25 Thread Md. Rezaul Karim
Hi Sean,

Last time, you helped me add a book's info (in the books section) on this
page: https://spark.apache.org/documentation.html.

Could you please add another book? Here's the necessary information about
the book:

*Title*: Scala and Spark for Big Data Analytics
*Authors*: Md. Rezaul Karim, Sridhar Alla
*Publisher*: Packt Publishing
*URL*:
https://www.packtpub.com/big-data-and-business-intelligence/scala-and-spark-big-data-analytics





Regards,
_
*Md. Rezaul Karim*, BSc, MSc, PhD
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Re: How to convert Spark MLlib vector to ML Vector?

2017-04-10 Thread Md. Rezaul Karim
Hi Yan, Ryan, and Nick,

Actually, for a special use case, I had to use the RDD-based Spark MLlib
API, which eventually did not work out. Therefore, I switched to Spark ML
later on.

Thanks for your support, guys.




Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 10 April 2017 at 06:45, 颜发才(Yan Facai) <facai@gmail.com> wrote:

> how about using
>
> val dataset = spark.read.format("libsvm")
>   .option("numFeatures", "780")
>   .load("data/mllib/sample_libsvm_data.txt")
>
> instead of
> val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")
>
>
>
>
>
> On Mon, Apr 10, 2017 at 11:19 AM, Ryan <ryan.hd@gmail.com> wrote:
>
>> you could write a udf using the asML method along with some type casting,
>> then apply the udf to the data after PCA.
>>
>> when using a pipeline, that udf needs to be wrapped in a customized
>> transformer, I think.
>>
>> On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath <nick.pentre...@gmail.com
>> > wrote:
>>
>>> Why not use the RandomForest from Spark ML?
>>>
>>> On Sun, 9 Apr 2017 at 16:01, Md. Rezaul Karim <
>>> rezaul.ka...@insight-centre.org> wrote:
>>>
>>>> I have already posted this question to the StackOverflow
>>>> <http://stackoverflow.com/questions/43263942/how-to-convert-spark-mllib-vector-to-ml-vector>.
>>>> However, not getting any response from someone else. I'm trying to use
>>>> RandomForest algorithm for the classification after applying the PCA
>>>> technique since the dataset is pretty high-dimensional. Here's my source
>>>> code:
>>>>
>>>> import org.apache.spark.mllib.util.MLUtils
>>>> import org.apache.spark.mllib.tree.RandomForest
>>>> import org.apache.spark.mllib.tree.model.RandomForestModel
>>>> import org.apache.spark.mllib.regression.LabeledPoint
>>>> import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
>>>> import org.apache.spark.sql._
>>>> import org.apache.spark.sql.SQLContext
>>>> import org.apache.spark.sql.SparkSession
>>>>
>>>> import org.apache.spark.ml.feature.PCA
>>>> import org.apache.spark.rdd.RDD
>>>>
>>>> object PCAExample {
>>>>   def main(args: Array[String]): Unit = {
>>>> val spark = SparkSession
>>>>   .builder
>>>>   .master("local[*]")
>>>>   .config("spark.sql.warehouse.dir", "E:/Exp/")
>>>>   .appName(s"OneVsRestExample")
>>>>   .getOrCreate()
>>>>
>>>> val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, 
>>>> "data/mnist.bz2")
>>>>
>>>> val splits = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)
>>>> val (trainingData, testData) = (splits(0), splits(1))
>>>>
>>>> val sqlContext = new SQLContext(spark.sparkContext)
>>>> import sqlContext.implicits._
>>>> val trainingDF = trainingData.toDF("label", "features")
>>>>
>>>> val pca = new PCA()
>>>>   .setInputCol("features")
>>>>   .setOutputCol("pcaFeatures")
>>>>   .setK(100)
>>>>   .fit(trainingDF)
>>>>
>>>> val pcaTrainingData = pca.transform(trainingDF)
>>>> //pcaTrainingData.show()
>>>>
>>>> val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(
>>>>   row.getAs[Double]("label"),
>>>>   row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")))
>>>>
>>>> //val labeled = pca.transform(trainingDF).rdd.map(row => 
>>>> LabeledPoint(row.getAs[Double]("label"),
>>>> //  
>>>> Vector.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"
>>>>
>>>> val numClasses = 10
>>>> val categoricalFeaturesInfo = Map[Int, Int]()
>>>> val numTrees = 10 // Use more in practice.
>>>> val featureSubsetStrategy = "auto" // Let the algorithm choose.
>>>> val impurity = "gini"
>>>> val maxDepth = 20
>>>> val maxBins = 32
>>>>
>>>> val model = RandomForest.trainClassifier(labeled, numClasses, 
>>>> categoricalFeaturesInfo,
>>>>   numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
>>>>   }
>>>> }
>>>>
>>>> However, I'm getting the following error:
>>>>
>>>> Exception in thread "main" java.lang.IllegalArgumentException:
>>>> requirement failed: Column features must be of type
>>>> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
>>>> org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
>>>>
>>>> What am I doing wrong in my code?  Actually, I'm getting the above
>>>> exception in this line:
>>>>
>>>> val pca = new PCA()
>>>>   .setInputCol("features")
>>>>   .setOutputCol("pcaFeatures")
>>>>   .setK(100)
>>>>   .fit(trainingDF) /// GETTING EXCEPTION HERE
>>>>
>>>> Please, someone, help me to solve the problem.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Kind regards,
>>>> *Md. Rezaul Karim*
>>>>
>>>
>>
>


How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Md. Rezaul Karim
I have already posted this question to StackOverflow
<http://stackoverflow.com/questions/43263942/how-to-convert-spark-mllib-vector-to-ml-vector>,
but have not gotten any response yet. I'm trying to use the RandomForest
algorithm for classification after applying PCA, since the dataset is
pretty high-dimensional. Here's my source code:

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SparkSession

import org.apache.spark.ml.feature.PCA
import org.apache.spark.rdd.RDD

object PCAExample {
  def main(args: Array[String]): Unit = {
val spark = SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "E:/Exp/")
  .appName(s"OneVsRestExample")
  .getOrCreate()

val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2")

val splits = dataset.randomSplit(Array(0.7, 0.3), seed = 12345L)
val (trainingData, testData) = (splits(0), splits(1))

val sqlContext = new SQLContext(spark.sparkContext)
import sqlContext.implicits._
val trainingDF = trainingData.toDF("label", "features")

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(100)
  .fit(trainingDF)

val pcaTrainingData = pca.transform(trainingDF)
//pcaTrainingData.show()

val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(
  row.getAs[Double]("label"),
  row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")))

// val labeled = pca.transform(trainingDF).rdd.map(row => LabeledPoint(row.getAs[Double]("label"),
//   Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"))))

val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32

val model = RandomForest.trainClassifier(labeled, numClasses,
categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
  }
}

However, I'm getting the following error:

Exception in thread "main" java.lang.IllegalArgumentException: requirement
failed: Column features must be of type
org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually
org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.

What am I doing wrong in my code?  Actually, I'm getting the above
exception in this line:

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(100)
  .fit(trainingDF) /// GETTING EXCEPTION HERE

Could someone please help me solve this problem?
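Following up on the replies earlier in this digest: the mismatch is that MLUtils.loadLibSVMFile produces old mllib.linalg vectors, while ml.feature.PCA expects the new ml.linalg vectors. Either load the data with spark.read.format("libsvm") as suggested in those replies, or convert the vector column explicitly. A sketch of the conversion route, reusing trainingDF from the code above:

import org.apache.spark.mllib.util.MLUtils

// Convert every mllib.linalg.Vector column in the DataFrame to the new
// ml.linalg.Vector type expected by spark.ml transformers such as PCA.
val trainingML = MLUtils.convertVectorColumnsToML(trainingDF)

val pca = new org.apache.spark.ml.feature.PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(100)
  .fit(trainingML)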





Kind regards,
*Md. Rezaul Karim*


Research paper used in GraphX

2017-03-31 Thread Md. Rezaul Karim
Hi All,

Could anyone please tell me which research paper(s) were used to implement
metrics like strongly connected components, PageRank, triangle count,
closeness centrality, clustering coefficient, etc. in Spark GraphX?




Regards,
_
*Md. Rezaul Karim*, BSc, MSc
Ph.D. Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Re: Question on Spark's graph libraries

2017-03-10 Thread Md. Rezaul Karim
+1

Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 10 March 2017 at 12:10, Robin East <robin.e...@xense.co.uk> wrote:

> I would love to know the answer to that too.
> 
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 9 Mar 2017, at 17:42, enzo <e...@smartinsightsfromdata.com> wrote:
>
> I am a bit confused by the current roadmap for graph and graph analytics
> in Apache Spark.
>
> I understand that we have had for some time two libraries (the following
> is my understanding - please amend as appropriate!):
>
> . GraphX, part of the Spark project.  This library is based on RDDs and it
> is only accessible via Scala.  It doesn't look like this library has been
> enhanced recently.
> . GraphFrames, independent (at the moment?) library for Spark.  This
> library is based on Spark DataFrames and accessible by Scala & Python. Last
> commit on GitHub was 2 months ago.
>
> GraphFrames came about with the promise of being integrated into Apache
> Spark at some point.
>
> I can see other projects coming up with interesting libraries and ideas
> (e.g. Graphulo on Accumulo, a new project with the goal of implementing
> the GraphBlas building blocks for graph algorithms on top of Accumulo).
>
> Where is Apache Spark going?
>
> Where are graph libraries in the roadmap?
>
>
>
> Thanks for any clarity brought to this matter.
>
> Enzo
>
>
>


Re: Debugging Spark application

2017-02-16 Thread Md. Rezaul Karim
Thanks, Sam. I will have a look at it.

On Feb 16, 2017 10:06 PM, "Sam Elamin" <hussam.ela...@gmail.com> wrote:

> I recommend running Spark in local mode when you're first debugging your
> code, just to understand what's happening and step through it; you'll
> perhaps catch a few errors when you first start off.
>
> I personally use intellij because it's my preference You can follow this
> guide.
> http://www.bigendiandata.com/2016-08-26-How-to-debug-remote-spark-jobs-with-IntelliJ/
>
> Although it's for IntelliJ, you can apply the same concepts to Eclipse, *I
> think*.
>
>
> Regards
> Sam
>
>
> On Thu, 16 Feb 2017 at 22:00, Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Hi,
>>
>> I was looking for some URLs/documents for getting started on debugging
>> Spark applications.
>>
>> I prefer developing Spark applications with Scala on Eclipse and then
>> package the application jar before submitting.
>>
>>
>>
>> Kind regards,
>> Reza
>>
>>
>>
>>
>


Debugging Spark application

2017-02-16 Thread Md. Rezaul Karim
Hi,

I was looking for some URLs/documents for getting started on debugging
Spark applications.

I prefer developing Spark applications with Scala in Eclipse and then
packaging the application jar before submitting it.
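Echoing the advice in the reply earlier in this digest: the simplest setup is a plain main method with a local master, so the whole job runs inside the IDE's JVM and ordinary breakpoints work; debugging a submitted jar remotely needs the JDWP agent approach from the linked guide instead. A minimal sketch:

import org.apache.spark.sql.SparkSession

object DebugMe {
  def main(args: Array[String]): Unit = {
    // Local master: driver and executors run in the IDE's JVM, so you can
    // set breakpoints and step through transformations directly.
    val spark = SparkSession.builder()
      .appName("DebugMe")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = spark.range(0, 10).toDF("n")
    df.filter($"n" % 2 === 0).show()   // put a breakpoint here

    spark.stop()
  }
}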



Kind regards,
Reza


Re: EC2 script is missing in Spark 2.0.0~2.1.0

2017-02-11 Thread Md. Rezaul Karim
Thanks for the great help. Appreciated!

Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 11 February 2017 at 13:11, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Moved to https://github.com/amplab/spark-ec2.
> Yea, I think the script just was moved there, so you can use it in the
> same way.
>
> On Sat, Feb 11, 2017 at 9:59 PM, Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Hi Takeshi,
>>
>> Now  I understand that spark-ec2 script was moved to AMPLab. How could I
>> use that one i.e. new location/URL, please? Alternatively, can I use the
>> same script provided with prior Spark releases?
>>
>> Regards,
>> _
>> *Md. Rezaul Karim*, BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>>
>> On 11 February 2017 at 12:43, Takeshi Yamamuro <linguin@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Have you checked this?
>>> https://issues.apache.org/jira/browse/SPARK-12735
>>>
>>> // maropu
>>>
>>> On Sat, Feb 11, 2017 at 9:34 PM, Md. Rezaul Karim <
>>> rezaul.ka...@insight-centre.org> wrote:
>>>
>>>> Dear Spark Users,
>>>>
>>>> I was wondering why the EC2 script is missing in Spark release
>>>> 2.0.0.~2.1.0? Is there any specific reason for that?
>>>>
>>>> Please note that I have chosen the package type: Pre-built for Hadoop
>>>> 2.7 and later for Spark 2.1.0 for example. Am I doing something wrong?
>>>>
>>>>
>>>>
>>>> Regards,
>>>> _
>>>> *Md. Rezaul Karim*, BSc, MSc
>>>> PhD Researcher, INSIGHT Centre for Data Analytics
>>>> National University of Ireland, Galway
>>>> IDA Business Park, Dangan, Galway, Ireland
>>>> Web: http://www.reza-analytics.eu/index.html
>>>>
>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: EC2 script is missing in Spark 2.0.0~2.1.0

2017-02-11 Thread Md. Rezaul Karim
Hi Takeshi,

Now I understand that the spark-ec2 script was moved to AMPLab. How can I
use it from its new location/URL, please? Alternatively, can I use the same
script provided with prior Spark releases?

Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 11 February 2017 at 12:43, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Hi,
>
> Have you checked this?
> https://issues.apache.org/jira/browse/SPARK-12735
>
> // maropu
>
> On Sat, Feb 11, 2017 at 9:34 PM, Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Dear Spark Users,
>>
>> I was wondering why the EC2 script is missing in Spark release
>> 2.0.0.~2.1.0? Is there any specific reason for that?
>>
>> Please note that I have chosen the package type: Pre-built for Hadoop 2.7
>> and later for Spark 2.1.0 for example. Am I doing something wrong?
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim*, BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>


EC2 script is missing in Spark 2.0.0~2.1.0

2017-02-11 Thread Md. Rezaul Karim
Dear Spark Users,

I was wondering why the EC2 script is missing in Spark releases
2.0.0~2.1.0. Is there any specific reason for that?

Please note that I have chosen the package type "Pre-built for Hadoop 2.7
and later" for Spark 2.1.0, for example. Am I doing something wrong?



Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Re: How to specify "verbose GC" in Spark submit?

2017-02-06 Thread Md. Rezaul Karim
Thanks, Bryan. Got your point.

Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 6 February 2017 at 13:21, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:

> Hello.
>
> When specifying GC options for Spark you must determine where you want the
> GC options specified - on the executors or on the driver. When you submit
> your job, for the driver, specify '--driver-java-options
> "-XX:+PrintFlagsFinal  -verbose:gc", etc.  For the executor specify --conf
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal  -verbose:gc", etc.
>
> Bryan Jeffrey
>
> On Mon, Feb 6, 2017 at 8:02 AM, Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Dear All,
>>
>> Is there any way to specify verbose GC -i.e. “-verbose:gc
>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps” in Spark submit?
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim*, BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>>
>
>


How to specify "verbose GC" in Spark submit?

2017-02-06 Thread Md. Rezaul Karim
Dear All,

Is there any way to specify verbose GC, i.e. "-verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps", in spark-submit?
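For the executor side, the same options can also be set programmatically when building the session (equivalent to the --conf flag described in Bryan's reply earlier in this digest); the driver-side flags still have to go on the spark-submit command line via --driver-java-options, since the driver JVM is already running by the time this code executes. A sketch:

import org.apache.spark.sql.SparkSession

// spark.executor.extraJavaOptions is applied when executors are launched,
// so it can be set here; it does not affect the already-running driver.
val spark = SparkSession.builder()
  .appName("GcLogging")
  .config("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  .getOrCreate()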



Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Re: DAG Visualization option is missing on Spark Web UI

2017-01-30 Thread Md. Rezaul Karim
Hi Mark,

That worked for me! Thanks a million.

Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 29 January 2017 at 01:53, Mark Hamstra <m...@clearstorydata.com> wrote:

> Try selecting a particular Job instead of looking at the summary page for
> all Jobs.
>
> On Sat, Jan 28, 2017 at 4:25 PM, Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Hi Jacek,
>>
>> I tried accessing Spark web UI on both Firefox and Google Chrome browsers
>> with ad blocker enabled. I do see other options like* User, Total
>> Uptime, Scheduling Mode, **Active Jobs, Completed Jobs and* Event
>> Timeline. However, I don't see an option for DAG visualization.
>>
>> Please note that I am experiencing the same issue with Spark 2.x (i.e.
>> 2.0.0, 2.0.1, 2.0.2 and 2.1.0). Refer the attached screenshot of the UI
>> that I am seeing on my machine:
>>
>> [image: Inline images 1]
>>
>>
>> Please suggest.
>>
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim*, BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>>
>> On 28 January 2017 at 18:51, Jacek Laskowski <ja...@japila.pl> wrote:
>>
>>> Hi,
>>>
>>> Wonder if you have any adblocker enabled in your browser? Is this the
>>> only version giving you this behavior? All Spark jobs have no
>>> visualization?
>>>
>>> Jacek
>>>
>>> On 28 Jan 2017 7:03 p.m., "Md. Rezaul Karim" <
>>> rezaul.ka...@insight-centre.org> wrote:
>>>
>>> Hi All,
>>>
>>> I am running a Spark job on my local machine written in Scala with Spark
>>> 2.1.0. However, I am not seeing any option of "*DAG Visualization*" at 
>>> http://localhost:4040/jobs/
>>>
>>>
>>> Suggestion, please.
>>>
>>>
>>>
>>>
>>> Regards,
>>> _
>>> *Md. Rezaul Karim*, BSc, MSc
>>> PhD Researcher, INSIGHT Centre for Data Analytics
>>> National University of Ireland, Galway
>>> IDA Business Park, Dangan, Galway, Ireland
>>> Web: http://www.reza-analytics.eu/index.html
>>>
>>>
>>>
>>
>


Pruning decision tree in Spark

2017-01-30 Thread Md. Rezaul Karim
Hi there,

Say I have a deep tree that needs to be pruned to create an optimal
tree. For example, in R this can be done using the rpart/prune function.

Is it possible to prune a Spark MLlib/ML-based decision tree while
performing a classification or regression task?
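As far as I know, Spark ML has no post-pruning equivalent to rpart/prune, but you can approximate it with the pre-pruning parameters on the tree learners; a sketch, where the DataFrame training with label/features columns and the parameter values are assumptions:

import org.apache.spark.ml.classification.DecisionTreeClassifier

// Pre-pruning controls: cap the depth and require a minimum number of
// instances per node and a minimum information gain for each split.
val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxDepth(5)
  .setMinInstancesPerNode(20)
  .setMinInfoGain(0.01)

val model = dt.fit(training)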




Regards,
_
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Re: DAG Visualization option is missing on Spark Web UI

2017-01-28 Thread Md. Rezaul Karim
Hi Jacek,

I tried accessing the Spark web UI on both Firefox and Google Chrome
browsers with the ad blocker enabled. I do see other options like User,
Total Uptime, Scheduling Mode, Active Jobs, Completed Jobs and Event
Timeline. However, I don't see an option for DAG visualization.

Please note that I am experiencing the same issue with Spark 2.x (i.e.
2.0.0, 2.0.1, 2.0.2 and 2.1.0). Refer the attached screenshot of the UI
that I am seeing on my machine:

[image: Inline images 1]


Please suggest.




Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 28 January 2017 at 18:51, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> Wonder if you have any adblocker enabled in your browser? Is this the only
> version giving you this behavior? All Spark jobs have no visualization?
>
> Jacek
>
> On 28 Jan 2017 7:03 p.m., "Md. Rezaul Karim" <rezaul.karim@insight-centre.
> org> wrote:
>
> Hi All,
>
> I am running a Spark job on my local machine written in Scala with Spark
> 2.1.0. However, I am not seeing any option of "*DAG Visualization*" at 
> http://localhost:4040/jobs/
>
>
> Suggestion, please.
>
>
>
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
>
>
>


DAG Visualization option is missing on Spark Web UI

2017-01-28 Thread Md. Rezaul Karim
Hi All,

I am running a Spark job on my local machine, written in Scala, with Spark
2.1.0. However, I am not seeing any option for "DAG Visualization" at
http://localhost:4040/jobs/


Suggestion, please.




Regards,
_____
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Re: Text

2017-01-27 Thread Md. Rezaul Karim
Some operations like map, filter, flatMap and coalesce (with shuffle=false)
usually preserve the order. However, sortBy, reduceByKey, partitionBy, join,
etc. do not.
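If a hard guarantee is needed, one option (my suggestion) is to pair each line with its index via zipWithIndex, carry the index through the transformation, and sort by it just before saving; a sketch, where sc is the SparkContext, the paths are illustrative, and corrections is an assumed replacement map:

// Tag each line with its position, transform only the text, then restore
// the original order before writing.
val corrections = Map("teh" -> "the")

val indexed = sc.textFile("data/input.txt").zipWithIndex()   // (line, index)
val fixed = indexed.map { case (line, idx) =>
  (idx, line.split(" ").map(w => corrections.getOrElse(w, w)).mkString(" "))
}
fixed.sortBy(_._1).map(_._2).saveAsTextFile("data/output")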

Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 27 January 2017 at 09:44, Soheila S. <soheila...@gmail.com> wrote:

> Hi All,
> I read a text file using sparkContext.textFile(filename), assign it to
> an RDD, process the RDD (replacing some words), and finally write it to
> a text file using rdd.saveAsTextFile(output).
> Is there any way to be sure the order of the sentences will not be
> changed? I need to have the same text with some corrected words.
>
> thanks!
>
> Soheila
>


Re: How to tune number of tesks

2017-01-26 Thread Md. Rezaul Karim
Hi,

If you require all the partitions to be saved as a single output with
saveAsTextFile, you can use coalesce(1, true).saveAsTextFile(). This
basically means: do the computation, then coalesce to only 1 partition. You
can also use repartition(1), which is just a wrapper for coalesce that sets
the shuffle argument to true.

val yourRDD = ...   // the result of your computation
yourRDD.coalesce(1).saveAsTextFile("data/output")


Hope that helps.



Regards,
_____
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 26 January 2017 at 16:21, Soheila S. <soheila...@gmail.com> wrote:

> Hi all,
>
> Please tell me how I can tune the number of output partitions.
> I run my Spark job on my local machine with 8 cores and the input data is
> 6.5GB. It creates 193 tasks and puts the output into 193 partitions.
> How can I change the number of tasks and, consequently, the number of
> output files?
>
> Best,
> Soheila
>


How to reduce number of tasks and partitions in Spark job?

2017-01-26 Thread Md. Rezaul Karim
Hi All,

When I run a Spark job on my local machine (having 8 cores and 16GB of RAM)
on an input of 6.5GB, it creates 193 parallel tasks and puts the output
into 193 partitions.

How can I change the number of tasks and, consequently, the number of
output files, say to just one?
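For what it's worth, a sketch of the usual knobs (consistent with the coalesce/repartition advice in the related reply earlier in this digest): the ~193 tasks come from how the input is split, and the number of output files equals the number of partitions at write time, so reduce the partition count just before saving. Paths and the transformation are placeholders; sc is the SparkContext.

val data = sc.textFile("data/input", minPartitions = 8)   // hint for input partitions

val processed = data.map(_.toUpperCase)   // placeholder transformation

// One output file: coalesce to a single partition only at write time.
processed.coalesce(1).saveAsTextFile("data/output")

// Or keep some write parallelism and produce a handful of files instead:
processed.coalesce(8).saveAsTextFile("data/output8")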





Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Re: "Unable to load native-hadoop library for your platform" while running Spark jobs

2017-01-19 Thread Md. Rezaul Karim
Thanks, Sean. I will explore online more.

Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 19 January 2017 at 10:59, Sean Owen <so...@cloudera.com> wrote:

> It's a message from Hadoop libs, not Spark. It can be safely ignored. It's
> just saying you haven't installed the additional (non-Apache-licensed)
> native libs that can accelerate some operations. This is something you can
> easily have read more about online.
>
> On Thu, Jan 19, 2017 at 10:57 AM Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Hi All,
>>
>> I'm the getting the following WARNING while running Spark jobs  in
>> standalone mode:
>> Unable to load native-hadoop library for your platform... using
>> builtin-java classes where applicable
>>
>> Please note that I have configured the native path and the other ENV
>> variables as follows:
>> export JAVA_HOME=/usr/lib/jvm/java-8-oracle
>> export HADOOP_HOME=/usr/local/hadoop
>> export HADOOP_COMMON_LIB_NATIVE_DIR=/usr/local/hadoop/lib/native
>> export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH
>> export JAVA_LIBRARY_PATH=/usr/local/hadoop/lib/native
>> export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
>>
>>
>> Although my Spark job executes successfully and writes the results to a
>> file at the end. However, I am not getting any logs to track the progress.
>>
>> Could someone help me to solve this problem?
>>
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim*, BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>>
>


"Unable to load native-hadoop library for your platform" while running Spark jobs

2017-01-19 Thread Md. Rezaul Karim
Hi All,

I'm getting the following WARNING while running Spark jobs in
standalone mode:
Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable

Please note that I have configured the native path and the other ENV
variables as follows:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=/usr/local/hadoop/lib/native
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH
export JAVA_LIBRARY_PATH=/usr/local/hadoop/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"


My Spark job executes successfully and writes the results to a file at the
end; however, I am not getting any logs to track the progress.

Could someone help me to solve this problem?




Regards,
_____
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Parsing RDF data with Spark

2017-01-18 Thread Md. Rezaul Karim
Hi All,

Is there any way to parse Linked Data in RDF formats (.n3, .ttl, .nq, .nt)
with Spark?
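In case it helps, for the simplest of these formats, N-Triples (.nt), each line is just "subject predicate object .", so a quick-and-dirty parse with plain Spark is possible; Turtle/N3/N-Quads really need a proper RDF parser (e.g. Apache Jena). A rough sketch of my own, assuming an existing SparkSession named spark and an illustrative path; it is not a full RDF parser:

// Rough N-Triples parsing: split each line into subject, predicate, and the
// remainder as object, dropping the trailing " .".
case class Triple(subject: String, predicate: String, obj: String)

val triples = spark.sparkContext.textFile("data/sample.nt")
  .filter(line => line.nonEmpty && !line.startsWith("#"))
  .map { line =>
    val Array(s, p, o) = line.split(" ", 3)
    Triple(s, p, o.stripSuffix(" ."))
  }

import spark.implicits._
val triplesDF = triples.toDF()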



Kind regards,
Reza


Re: Old version of Spark [v1.2.0]

2017-01-15 Thread Md. Rezaul Karim
Hi Ayan,

Thanks a million.

Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 15 January 2017 at 22:48, ayan guha <guha.a...@gmail.com> wrote:

> archive.apache.org will always have all the releases:
> http://archive.apache.org/dist/spark/
>
> @Spark users: it may be a good idea to have a "To download older versions,
> click here" link to Spark Download page?
>
> On Mon, Jan 16, 2017 at 8:16 AM, Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Hi,
>>
>> I am looking for Spark 1.2.0 version. I tried to download in the Spark
>> website but it's no longer available.
>>
>> Any suggestion?
>>
>>
>>
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim*, BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Old version of Spark [v1.2.0]

2017-01-15 Thread Md. Rezaul Karim
Hi,

I am looking for the Spark 1.2.0 release. I tried to download it from the
Spark website, but it's no longer available.

Any suggestion?






Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


H2O DataFrame to Spark RDD/DataFrame

2017-01-12 Thread Md. Rezaul Karim
Hi there,

Is there any way to convert an H2O DataFrame to an equivalent Spark RDD or
DataFrame? I found good documentation on "Machine Learning with
Sparkling Water: H2O + Spark" here:
<https://h2o-release.s3.amazonaws.com/h2o/rel-turan/4/docs-website/h2o-docs/booklets/SparklingWaterVignette.pdf>

However, it discusses how to convert a Spark RDD or DataFrame to an H2O
DataFrame, but not vice versa.




Regards,
_____________
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Re: How to save spark-ML model in Java?

2017-01-12 Thread Md. Rezaul Karim
Hi Malshan,

The error says that one (or more) of the stages in your pipeline is not
writable, i.e. it does not support the overwrite/model-write operation.

Suppose you want to configure an ML pipeline consisting of three stages
(i.e. estimators/transformers): tokenizer, hashingTF, and nb:

val nb = new NaiveBayes().setSmoothing(0.1)
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, nb))


Now check whether all the stages are writable. To make that easier, try
saving the stages individually, e.g.:

tokenizer.write.save("path")
hashingTF.write.save("path")

After that, suppose you want to perform 10-fold cross-validation as follows:

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(10)

where:

val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(nb.smoothing, Array(0.001, 0.0001))
  .build()

Now the model that you trained using the training set should be writable if
all of the stages are okay:
val model = cv.fit(trainingData)
model.write.overwrite().save("output/NBModel")



Hope that helps.







Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 12 January 2017 at 09:09, Minudika Malshan <minudika...@gmail.com> wrote:

> Hi,
>
> When I try to save a pipeline model using spark ML (Java) , the following
> exception is thrown.
>
>
> java.lang.UnsupportedOperationException: Pipeline write will fail on this
> Pipeline because it contains a stage which does not implement Writable.
> Non-Writable stage: rfc_98f8c9e0bd04 of type class org.apache.spark.ml.
> classification.Rand
>
>
> Here is my code segment.
>
>
> model.write().overwrite().save("mypath");
>
>
> How to resolve this?
>
> Thanks and regards!
>
> Minudika
>
>


Re: Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Md. Rezaul Karim
Hi,

Currently, I have been using Spark 2.1.0 for ML and so far have not
experienced any critical issues. It's much more stable compared to Spark
2.0.1/2.0.2, I would say.

Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 9 January 2017 at 16:36, Ankur Jain <ankur.j...@yash.com> wrote:

>
>
> Thanks Rezaul…
>
>
>
> Does Spark 2.1.0 still have any issues w.r.t. stability?
>
>
>
> Regards,
>
> Ankur
>
>
>
> *From:* Md. Rezaul Karim [mailto:rezaul.ka...@insight-centre.org]
> *Sent:* Monday, January 09, 2017 5:02 PM
> *To:* Ankur Jain <ankur.j...@yash.com>
> *Cc:* user@spark.apache.org
> *Subject:* Re: Machine Learning in Spark 1.6 vs Spark 2.0
>
>
>
> Hello Jain,
>
> I would recommend using Spark MLlib
> <http://spark.apache.org/docs/latest/ml-guide.html>(and ML) of *Spark
> 2.1.0* with the following features:
>
>- ML Algorithms: common learning algorithms such as classification,
>regression, clustering, and collaborative filtering
>- Featurization: feature extraction, transformation, dimensionality
>reduction, and selection
>- Pipelines: tools for constructing, evaluating, and tuning ML
>Pipelines
>- Persistence: saving and load algorithms, models, and Pipelines
>- Utilities: linear algebra, statistics, data handling, etc.
>
> These features will help make your machine learning scalable and easy too.
>
>
> Regards,
> _
>
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
>
> IDA Business Park, Dangan, Galway, Ireland
>
> Web: http://www.reza-analytics.eu/index.html
>
>
>
> On 9 January 2017 at 10:19, Ankur Jain <ankur.j...@yash.com> wrote:
>
> Hi Team,
>
>
>
> I want to start a new project with ML, but wanted to know which version of
> Spark is more stable and has more features w.r.t. ML.
>
> Please suggest your opinion…
>
>
>
> Thanks in Advance…
>
>
>
>
> *Thanks & Regards*
>
> Ankur Jain
>
> Technical Architect – Big Data | IoT | Innovation Group
>
> Board: +91-731-663-6363
>
> Direct: +91-731-663-6125
>
> *www.yash.com <http://www.yash.com/>*
>
>
>
>
> 'Information transmitted by this e-mail is proprietary to YASH
> Technologies and/ or its Customers and is intended for use only by the
> individual or entity to which it is addressed, and may contain information
> that is privileged, confidential or exempt from disclosure under applicable
> law. If you are not the intended recipient or it appears that this mail has
> been forwarded to you without proper authority, you are notified that any
> use or dissemination of this information in any manner is strictly
> prohibited. In such cases, please notify us immediately at i...@yash.com
> and delete this mail from your records.
>
>
>


Re: Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Md. Rezaul Karim
Hello Jain,

I would recommend using Spark MLlib (and ML) of Spark 2.1.0
<http://spark.apache.org/docs/latest/ml-guide.html> with the following
features:

   - ML Algorithms: common learning algorithms such as classification,
   regression, clustering, and collaborative filtering
   - Featurization: feature extraction, transformation, dimensionality
   reduction, and selection
   - Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
   - Persistence: saving and load algorithms, models, and Pipelines
   - Utilities: linear algebra, statistics, data handling, etc.

These features will help make your machine learning scalable and easy too.
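A minimal end-to-end sketch of how these pieces fit together (illustrative only; the column names and trainingDF are assumptions, not from this thread):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Featurization + an algorithm wired together as a Pipeline (Spark 2.x ML API).
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIdx")
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryIdx", "amount"))
  .setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
val model = pipeline.fit(trainingDF)

// Persistence: save and later reload the fitted PipelineModel.
model.write.overwrite().save("output/lr-pipeline")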



Regards,
_____
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 9 January 2017 at 10:19, Ankur Jain <ankur.j...@yash.com> wrote:

> Hi Team,
>
>
>
> I want to start a new project with ML, but wanted to know which version of
> Spark is more stable and has more features w.r.t. ML.
>
> Please suggest your opinion…
>
>
>
> Thanks in Advance…
>
>
>
>
> *Thanks & Regards*
>
> Ankur Jain
>
> Technical Architect – Big Data | IoT | Innovation Group
>
> Board: +91-731-663-6363
>
> Direct: +91-731-663-6125
>
> *www.yash.com <http://www.yash.com/>*
>
>
>
> 'Information transmitted by this e-mail is proprietary to YASH
> Technologies and/ or its Customers and is intended for use only by the
> individual or entity to which it is addressed, and may contain information
> that is privileged, confidential or exempt from disclosure under applicable
> law. If you are not the intended recipient or it appears that this mail has
> been forwarded to you without proper authority, you are notified that any
> use or dissemination of this information in any manner is strictly
> prohibited. In such cases, please notify us immediately at i...@yash.com
> and delete this mail from your records.
>


Re: Issue with SparkR setup on RStudio

2017-01-04 Thread Md. Rezaul Karim
Cheung,

The problem has been solved after switching from Windows to a Linux
environment.

Thanks.



Regards,
_
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>

On 2 January 2017 at 18:59, Felix Cheung <felixcheun...@hotmail.com> wrote:

> Perhaps it is with
>
> spark.sql.warehouse.dir="E:/Exp/"
>
> That you have in the sparkConfig parameter.
>
> Unfortunately the exception stack is fairly far away from the actual
> error, but off the top of my head spark.sql.warehouse.dir and HADOOP_HOME
> are the two pieces that are not set in the Windows tests.
>
>
> _
> From: Md. Rezaul Karim <rezaul.ka...@insight-centre.org>
> Sent: Monday, January 2, 2017 7:58 AM
> Subject: Re: Issue with SparkR setup on RStudio
> To: Felix Cheung <felixcheun...@hotmail.com>
> Cc: spark users <user@spark.apache.org>
>
>
> Hello Cheung,
>
> Happy New Year!
>
No, I did not configure Hive on my machine. I have even tried not setting
HADOOP_HOME, but I am still getting the same error.
>
>
>
> Regards,
> _
> *Md. Rezaul Karim* BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web:http://www.reza-analytics.eu/index.html
> <http://139.59.184.114/index.html>
>
> On 29 December 2016 at 19:16, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> Any reason you are setting HADOOP_HOME?
>>
>> From the error it seems you are running into an issue with the Hive config,
>> likely while trying to load hive-site.xml. Could you try not setting
>> HADOOP_HOME?
>>
>>
>> --
>> *From:* Md. Rezaul Karim <rezaul.ka...@insight-centre.org>
>> *Sent:* Thursday, December 29, 2016 10:24:57 AM
>> *To:* spark users
>> *Subject:* Issue with SparkR setup on RStudio
>>
>>
>> Dear Spark users,
>>
>> I am trying to set up SparkR on RStudio to perform some basic data
>> manipulations and ML modeling. However, I am getting a strange error while
>> creating a SparkR session or DataFrame that says:
>> java.lang.IllegalArgumentException:
>> Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'.
>>
>> According to the Spark documentation at http://spark.apache.org/
>> docs/latest/sparkr.html#starting-up-sparksession, I don’t need to
>> configure the Hive path or related variables.
>>
>> I have the following source code:
>>
>> SPARK_HOME = "C:/spark-2.1.0-bin-hadoop2.7"
>> HADOOP_HOME= "C:/spark-2.1.0-bin-hadoop2.7/bin/"
>>
>> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R",
>> "lib")))
>> sparkR.session(appName = "SparkR-DataFrame-example", master = "local[*]",
>> sparkConfig = list(spark.sql.warehouse.dir="E:/Exp/",
>> spark.driver.memory = "8g"), enableHiveSupport = TRUE)
>>
>> # Create a simple local data.frame
>> localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
>> # Convert local data frame to a SparkDataFrame
>> df <- createDataFrame(localDF)
>> print(df)
>> head(df)
>> sparkR.session.stop()
>>
>> Please note that HADOOP_HOME contains the ‘*winutils.exe*’ file. The
>> details of the error are as follows:
>>
>> Error in handleErrors(returnStatus, conn) :  
>> java.lang.IllegalArgumentException: Error while instantiating 
>> 'org.apache.spark.sql.hive.HiveSessionState':
>>
>>at 
>> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
>>
>>at 
>> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
>>
>>at 
>> org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>>
>>at 
>> org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:67)
>>
>>at 
>> org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:66)
>>
>>at 
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>>
>>at scala.colle

RBackendHandler Error while running ML algorithms with SparkR on RStudio

2017-01-03 Thread Md. Rezaul Karim
Dear Spark Users,

I was trying to execute the RandomForest and NaiveBayes algorithms in RStudio
but am experiencing the following error:

17/01/03 15:04:11 ERROR RBackendHandler: fit on
org.apache.spark.ml.r.NaiveBayesWrapper
failed
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.
java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(
RBackendHandler.scala:141)
Caused by: java.io.IOException: Class not found
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.io.IOException: Class not found

Here's my source code:

Sys.setenv(SPARK_HOME = "spark-2.1.0-bin-hadoop2.7")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)
sparkR.session(appName = "SparkR-NB", master = "local[*]", sparkConfig =
list(spark.driver.memory = "2g"))

# Fit a Bernoulli naive Bayes model with spark.naiveBayes
titanic <- as.data.frame(Titanic)
titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
nbDF <- titanicDF
nbTestDF <- titanicDF
nbModel <- spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age)

# Model summary
summary(nbModel)

# Prediction
nbPredictions <- predict(nbModel, nbTestDF)
showDF(nbPredictions)




Could someone please help me get rid of this error?


Regards,
_
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>


Re: Issue with SparkR setup on RStudio

2017-01-02 Thread Md. Rezaul Karim
Hello Cheung,

Happy New Year!

No, I did not configure Hive on my machine. I have even tried not setting
HADOOP_HOME, but I am still getting the same error.



Regards,
_
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>

On 29 December 2016 at 19:16, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Any reason you are setting HADOOP_HOME?
>
> From the error it seems you are running into an issue with the Hive config,
> likely while trying to load hive-site.xml. Could you try not setting HADOOP_HOME?
>
>
> ----------
> *From:* Md. Rezaul Karim <rezaul.ka...@insight-centre.org>
> *Sent:* Thursday, December 29, 2016 10:24:57 AM
> *To:* spark users
> *Subject:* Issue with SparkR setup on RStudio
>
>
> Dear Spark users,
>
> I am trying to set up SparkR on RStudio to perform some basic data
> manipulations and ML modeling. However, I am getting a strange error while
> creating a SparkR session or DataFrame that says:
> java.lang.IllegalArgumentException:
> Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'.
>
> According to the Spark documentation at http://spark.apache.org/docs/
> latest/sparkr.html#starting-up-sparksession, I don’t need to configure
> the Hive path or related variables.
>
> I have the following source code:
>
> SPARK_HOME = "C:/spark-2.1.0-bin-hadoop2.7"
> HADOOP_HOME= "C:/spark-2.1.0-bin-hadoop2.7/bin/"
>
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R",
> "lib")))
> sparkR.session(appName = "SparkR-DataFrame-example", master = "local[*]",
> sparkConfig = list(spark.sql.warehouse.dir="E:/Exp/", spark.driver.memory
> = "8g"), enableHiveSupport = TRUE)
>
> # Create a simple local data.frame
> localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
> # Convert local data frame to a SparkDataFrame
> df <- createDataFrame(localDF)
> print(df)
> head(df)
> sparkR.session.stop()
>
> Please note that HADOOP_HOME contains the ‘*winutils.exe*’ file. The
> details of the error are as follows:
>
> Error in handleErrors(returnStatus, conn) :  
> java.lang.IllegalArgumentException: Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':
>
>at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
>
>at 
> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
>
>at 
> org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>
>at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:67)
>
>at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:66)
>
>at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>
>at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>
>at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>
>at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>
>at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>
>at scala.collection.Traversabl
>
>
>  Any kind of help would be appreciated.
>
>
>
>
> Regards,
> _
> *Md. Rezaul Karim* BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> <http://139.59.184.114/index.html>
>


Issue with SparkR setup on RStudio

2016-12-29 Thread Md. Rezaul Karim
Dear Spark users,

I am trying to set up SparkR on RStudio to perform some basic data
manipulations and ML modeling. However, I am getting a strange error while
creating a SparkR session or DataFrame that says:
java.lang.IllegalArgumentException:
Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'.

According to the Spark documentation at
http://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession, I
don’t need to configure the Hive path or related variables.

I have the following source code:

SPARK_HOME = "C:/spark-2.1.0-bin-hadoop2.7"
HADOOP_HOME= "C:/spark-2.1.0-bin-hadoop2.7/bin/"

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R",
"lib")))
sparkR.session(appName = "SparkR-DataFrame-example", master = "local[*]",
sparkConfig = list(spark.sql.warehouse.dir="E:/Exp/", spark.driver.memory =
"8g"), enableHiveSupport = TRUE)

# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Convert local data frame to a SparkDataFrame
df <- createDataFrame(localDF)
print(df)
head(df)
sparkR.session.stop()

Please note that HADOOP_HOME contains the ‘*winutils.exe*’ file. The
details of the error are as follows:

Error in handleErrors(returnStatus, conn) :
java.lang.IllegalArgumentException: Error while instantiating
'org.apache.spark.sql.hive.HiveSessionState':

   at
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)

   at
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)

   at
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)

   at
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:67)

   at
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:66)

   at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)

   at scala.collection.Iterator$class.foreach(Iterator.scala:893)

   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)

   at
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)

   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)

       at scala.collection.Traversabl


 Any kind of help would be appreciated.




Regards,
_
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>


Re: Running spark from Eclipse and then Jar

2016-12-10 Thread Md. Rezaul Karim
Hello Iman,

Finally, I managed to solve the problem. I had been experiencing it because
of a locking issue in the "*metastore_db*" directory under the project tree
in Eclipse.

If you look at the project tree, under the "*metastore_db*" folder you should
see a file named "*db.lck*", which was preventing the jar from being executed
from the command line.

I just deleted that file, packaged my project as a jar again, and the problem
was resolved.




Regards,
_________
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>

On 8 December 2016 at 01:15, Iman Mohtashemi <iman.mohtash...@gmail.com>
wrote:

> yes exactly. I run mine fine in Eclipse but when I run it from a
> corresponding jar I get the same error!
>
> On Wed, Dec 7, 2016 at 5:04 PM Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> I believe, it's not about the location (i.e., local machine or HDFS) but
>> it's all about the format of the input file. For example, I am getting the
>> following error while trying to read an input file in libsvm format:
>>
>> *Exception in thread "main" java.lang.ClassNotFoundException: Failed to
>> find data  source: libsvm. *
>>
>> The application works fine on Eclipse. However, while packaging the
>> corresponding jar file, I am getting the above error which is really weird!
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim* BSc, MSc
>>
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>> <http://139.59.184.114/index.html>
>>
>> On 7 December 2016 at 23:39, Iman Mohtashemi <iman.mohtash...@gmail.com>
>> wrote:
>>
>> No but I tried that too and still didn't work. Where are the files being
>> read from? From the local machine or HDFS? Do I need to get the files to
>> HDFS first? In Eclipse I just point to the location of the directory?
>>
>> On Wed, Dec 7, 2016 at 3:34 PM Md. Rezaul Karim <
>> rezaul.ka...@insight-centre.org> wrote:
>>
>> Hi,
>>
>> You should prepare your jar file (from your Spark application written in
>> Java) with all the necessary dependencies. You can create a Maven project
>> on Eclipse by specifying the dependencies in a Maven friendly pom.xml file.
>>
>> For building the jar with the dependencies and *main class (since you
>> are getting the **ClassNotFoundException)* your pom.xml should contain
>> the following in the *build *tag (example main class is marked in Red
>> color):
>>
>> 
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-eclipse-plugin
>> 2.9
>> 
>> true
>> false
>> 
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-compiler-plugin
>> 3.5.1
>> 
>> ${jdk.version}
>> ${jdk.version}
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-shade-plugin
>> 2.4.3
>> 
>> true
>> 
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-assembly-plugin
>> 2.4.1
>> 
>> 
>> 
>> jar-with-
>> dependencies
>> 
>> 
>> 
>> 
>> com.example.
>> RandomForest.SongPrediction
>> 
>> 
>>
>>         
>> oozie.launcher.
>> mapreduce.job.user.classpath.first
>> true
>> 
>>
>> 
>> 
>> 
>> make-assembly
>> 
>>   

Re: Random Forest hangs without trace of error

2016-12-09 Thread Md. Rezaul Karim
I had a similar experience last week. I could not find any error trace either.

Later on, I did the following to get rid of the problem:
i) I downgraded to Spark 2.0.0
ii) I decreased the values of maxBins and maxDepth

Additionally, make sure that you set the featureSubsetStrategy as "auto" to
let the algorithm choose the best feature subset strategy for your data.
Finally, set the impurity as "gini" for the information gain.

However, setting the number of trees to just 1 gives you neither the real
advantage of a forest nor better predictive performance.
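
For reference, those settings map onto the DataFrame-based API roughly as in the sketch
below (the input path and the concrete parameter values are placeholders, not tuned for
any particular dataset):

import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RandomForestSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("RF-sketch").master("local[*]").getOrCreate();

    // Assumed input: a libsvm-format file that Spark loads into "label"/"features" columns.
    Dataset<Row> data = spark.read().format("libsvm").load("data/sample_libsvm_data.txt");

    RandomForestClassifier rf = new RandomForestClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setNumTrees(20)                  // more than 1 tree, so it is an actual forest
        .setMaxDepth(10)                  // a moderate depth keeps training time manageable
        .setMaxBins(32)                   // fewer bins reduce memory pressure
        .setFeatureSubsetStrategy("auto") // let Spark pick the feature subset strategy
        .setImpurity("gini")              // impurity measure used for the information gain
        .setSubsamplingRate(0.8);         // row subsampling per tree

    RandomForestClassificationModel model = rf.fit(data);
    System.out.println("Trained an ensemble of " + model.trees().length + " trees");
    spark.stop();
  }
}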



Best,
Karim


On Dec 9, 2016 11:29 PM, "mhornbech"  wrote:

> Hi
>
> I have spent quite some time trying to debug an issue with the Random
> Forest
> algorithm on Spark 2.0.2. The input dataset is relatively large at around
> 600k rows and 200MB, but I use subsampling to make each tree manageable.
> However even with only 1 tree and a low sample rate of 0.05 the job hangs
> at
> one of the final stages (see attached). I have checked the logs on all
> executors and the driver and find no traces of error. Could it be a memory
> issue even though no error appears? The error does seem sporadic to some
> extent so I also wondered whether it could be a data issue, that only
> occurs
> if the subsample includes the bad data rows.
>
> Please comment if you have a clue.
>
> Morten
>
>  file/n28192/Sk%C3%A6rmbillede_2016-12-10_kl.png>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Random-Forest-hangs-without-trace-of-
> error-tp28192.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


"Failed to find data source: libsvm" while running Spark application with jar

2016-12-08 Thread Md. Rezaul Karim
Hi there,

I am getting the following error while trying to read an input file in libsvm
format when running a Spark application jar.


*Exception in thread "main" java.lang.ClassNotFoundException: Failed to
find data  source: libsvm. *
*at*


*org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)*
The rest of the error log contains similar messages.

The Java application works fine in Eclipse. However, when packaging and
running the corresponding jar file, I get the above error, which is really
weird!

I believe it's all about the format of the input file. Any kind of help is
appreciated.


Regards,
_____
*Md. Rezaul Karim* BSc, MSc
Ph.D. Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>


Re: Running spark from Eclipse and then Jar

2016-12-07 Thread Md. Rezaul Karim
I believe it's not about the location (i.e., local machine or HDFS) but
about the format of the input file. For example, I am getting the
following error while trying to read an input file in libsvm format:

*Exception in thread "main" java.lang.ClassNotFoundException: Failed to
find data source: libsvm. *

The application works fine in Eclipse. However, after packaging the
corresponding jar file, I get the above error, which is really weird!



Regards,
_____
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>

On 7 December 2016 at 23:39, Iman Mohtashemi <iman.mohtash...@gmail.com>
wrote:

> No but I tried that too and still didn't work. Where are the files being
> read from? From the local machine or HDFS? Do I need to get the files to
> HDFS first? In Eclipse I just point to the location of the directory?
>
> On Wed, Dec 7, 2016 at 3:34 PM Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Hi,
>>
>> You should prepare your jar file (from your Spark application written in
>> Java) with all the necessary dependencies. You can create a Maven project
>> on Eclipse by specifying the dependencies in a Maven friendly pom.xml file.
>>
>> For building the jar with the dependencies and *main class (since you
>> are getting the **ClassNotFoundException)* your pom.xml should contain
>> the following in the *build *tag (example main class is marked in Red
>> color):
>>
>> 
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-eclipse-plugin
>> 2.9
>> 
>> true
>> false
>> 
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-compiler-plugin
>> 3.5.1
>> 
>> ${jdk.version}
>> ${jdk.version}
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-shade-plugin
>> 2.4.3
>> 
>> true
>> 
>> 
>> 
>> 
>> org.apache.maven.plugins
>> maven-assembly-plugin
>> 2.4.1
>> 
>> 
>> 
>> jar-with-
>> dependencies
>> 
>> 
>> 
>> 
>> com.example.
>> RandomForest.SongPrediction
>> 
>> 
>>
>> 
>> oozie.launcher.
>> mapreduce.job.user.classpath.first
>> true
>> 
>>
>> 
>> 
>>     
>> make-assembly
>> 
>> package
>> 
>> single
>> 
>> 
>> 
>> 
>> 
>> 
>>
>>
>> An example pom.xml file has been attached for your reference. Feel free
>> to reuse it.
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim,* BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>> <http://139.59.184.114/index.html>
>>
>> On 7 December 2016 at 23:18, im281 <iman.mohtash...@gmail.com> wrote:
>>
>> Hello,
>> I have a simple word count example in Java and I can run this in Eclipse
>> (code at the bottom)
>>
>> I then create a jar file from it and try to run it from the cmd
>>
>>
>> java -jar C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt
>>
>> But I get this error?
>>
>> I think the main error is:
>> *Exception in thread "main" java.lang.ClassNotFoundException: Failed t

Re: Running spark from Eclipse and then Jar

2016-12-07 Thread Md. Rezaul Karim
Hi,

You should prepare your jar file (from your Spark application written in
Java) with all the necessary dependencies. You can create a Maven project
in Eclipse by specifying the dependencies in a Maven-friendly pom.xml file.

For building the jar with the dependencies and the *main class* (since you are
getting the *ClassNotFoundException*), your pom.xml should contain the
following in the *build* tag (the example main class here is
com.example.RandomForest.SongPrediction):





<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-eclipse-plugin</artifactId>
            <version>2.9</version>
            <configuration>
                <downloadSources>true</downloadSources>
                <downloadJavadocs>false</downloadJavadocs>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.5.1</version>
            <configuration>
                <source>${jdk.version}</source>
                <target>${jdk.version}</target>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <configuration>
                <shadedArtifactAttached>true</shadedArtifactAttached>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.4.1</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.example.RandomForest.SongPrediction</mainClass>
                    </manifest>
                    <manifestEntries>
                        <oozie.launcher.mapreduce.job.user.classpath.first>true</oozie.launcher.mapreduce.job.user.classpath.first>
                    </manifestEntries>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
An example pom.xml file has been attached for your reference. Feel free to
reuse it.


Regards,
_
*Md. Rezaul Karim,* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>

On 7 December 2016 at 23:18, im281 <iman.mohtash...@gmail.com> wrote:

> Hello,
> I have a simple word count example in Java and I can run this in Eclipse
> (code at the bottom)
>
> I then create a jar file from it and try to run it from the cmd
>
>
> java -jar C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt
>
> But I get this error?
>
> I think the main error is:
> *Exception in thread "main" java.lang.ClassNotFoundException: Failed to
> find
> data source: text*
>
> Any advise on how to run this jar file in spark would be appreciated
>
>
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 16/12/07 15:16:41 INFO SparkContext: Running Spark version 2.0.2
> 16/12/07 15:16:42 INFO SecurityManager: Changing view acls to: Owner
> 16/12/07 15:16:42 INFO SecurityManager: Changing modify acls to: Owner
> 16/12/07 15:16:42 INFO SecurityManager: Changing view acls groups to:
> 16/12/07 15:16:42 INFO SecurityManager: Changing modify acls groups to:
> 16/12/07 15:16:42 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(Owner);
> groups
> with view permissions: Set(); users  with modify permissions: Set(Owner);
> groups with modify permissions: Set()
> 16/12/07 15:16:44 INFO Utils: Successfully started service 'sparkDriver' on
> port 10211.
> 16/12/07 15:16:44 INFO SparkEnv: Registering MapOutputTracker
> 16/12/07 15:16:44 INFO SparkEnv: Registering BlockManagerMaster
> 16/12/07 15:16:44 INFO DiskBlockManager: Created local directory at
> C:\Users\Owner\AppData\Local\Temp\blockmgr-b4b1960b-08fc-
> 44fd-a75e-1a0450556873
> 16/12/07 15:16:44 INFO MemoryStore: MemoryStore started with capacity
> 1984.5
> MB
> 16/12/07 15:16:45 INFO SparkEnv: Registering OutputCommitCoordinator
> 16/12/07 15:16:45 INFO Utils: Successfully started service 'SparkUI' on
> port
> 4040.
> 16/12/07 15:16:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
> http://192.168.19.2:4040
> 16/12/07 15:16:45 INFO Executor: Starting executor ID driver on host
> localhost
> 16/12/07 15:16:45 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 10252.
> 16/12/07 15:16:45 INFO NettyBlockTransferService: Server created on
> 192.168.19.2:10252
> 16/12/07 15:16:45 INFO BlockManagerMaster: Registering BlockManager
> BlockManagerId(driver, 192.168.19.2, 10252)
> 16/12/07 15:16:45 INFO BlockManagerMasterEndpoint: Registering block
&

Pruning decision tree to create an optimal tree

2016-12-07 Thread Md. Rezaul Karim
Hi there,

Say I have a deep tree that needs to be pruned to create an optimal tree.
In R, for example, this can be done with the *rpart* package and its *prune*
function.

Is it possible to prune an MLlib-based decision tree while performing
classification or regression?
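
So far, the closest I have found are the pre-pruning parameters on the tree estimator
itself, which only limit growth up front rather than pruning afterwards. A minimal
sketch in Java (the input path and parameter values are placeholders):

import org.apache.spark.ml.classification.DecisionTreeClassifier;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TreePrePruningSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("tree-pre-pruning-sketch").master("local[*]").getOrCreate();

    Dataset<Row> data = spark.read().format("libsvm").load("data/sample_libsvm_data.txt");

    // Pre-pruning: stop the tree from growing instead of pruning it afterwards.
    DecisionTreeClassifier dt = new DecisionTreeClassifier()
        .setMaxDepth(5)             // hard cap on tree depth
        .setMinInstancesPerNode(10) // each child must keep at least 10 training rows
        .setMinInfoGain(0.01);      // reject splits whose information gain is below this

    System.out.println(dt.fit(data).toDebugString());
    spark.stop();
  }
}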




Regards,
_
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>


Re: How to compute the recall and F1-score in Linear Regression based model

2016-12-06 Thread Md. Rezaul Karim
Hi Sean,

According to the Spark documentation, precision, recall, F1, true positive
rate, false positive rate, etc. can also be calculated using the
MulticlassMetrics evaluator for multiclass classifiers. For example, with a
*Random Forest* based classifier or regressor:

// Get evaluation metrics.
MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabels.rdd());
// System.out.println(metrics.confusionMatrix());
double precision = metrics.precision(metrics.labels()[0]);
double recall = metrics.recall(metrics.labels()[0]);
double f_measure = metrics.fMeasure();
double query_label = 2001; // a label/class to query
double TP = metrics.truePositiveRate(query_label);
double FP = metrics.falsePositiveRate(query_label);
double WTP = metrics.weightedTruePositiveRate();
double WFP = metrics.weightedFalsePositiveRate();

Here, the predictions and true labels are collected in the
'*predictionAndLabels*' RDD as follows:

JavaRDD<Tuple2<Object, Object>> predictionAndLabels = testData.map(
    new Function<LabeledPoint, Tuple2<Object, Object>>() {
      public Tuple2<Object, Object> call(LabeledPoint p) {
        // Predict a label for each test point and pair it with the true label.
        Double prediction = model.predict(p.features());
        return new Tuple2<Object, Object>(prediction, p.label());
      }
    });
Here, *'model'* is a Random Forest model instance trained on a multiclass
classification or regression dataset.

The current implementation of Logistic Regression supports only binary
classification, whereas Linear Regression works on datasets with multiple
classes.

I was wondering if it's possible to compute similar metrics using a Linear
Regression based model for a multiclass or binary-class dataset.
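
As a rough experiment (purely a sketch, assuming the continuous regression output can be
mapped back to discrete class labels, e.g. by rounding, and that 'regressionModel' is a
hypothetical trained regression model exposing predict(Vector)), something like the
following could be fed into MulticlassMetrics:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.evaluation.MulticlassMetrics;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;

// Assumes 'regressionModel' is an already-trained (hypothetical) regression model and
// 'testData' is a JavaRDD<LabeledPoint> whose labels are integer-valued class ids.
JavaRDD<Tuple2<Object, Object>> roundedPredictionAndLabels = testData.map(
    new Function<LabeledPoint, Tuple2<Object, Object>>() {
      public Tuple2<Object, Object> call(LabeledPoint p) {
        // Map the continuous prediction back to the nearest class label.
        double predictedClass = (double) Math.round(regressionModel.predict(p.features()));
        return new Tuple2<Object, Object>(predictedClass, p.label());
      }
    });

MulticlassMetrics metrics = new MulticlassMetrics(roundedPredictionAndLabels.rdd());
System.out.println("Recall for label 2001: " + metrics.recall(2001.0));
System.out.println("Overall F-measure: " + metrics.fMeasure());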



Regards,
_________
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>

On 6 December 2016 at 11:37, Sean Owen <so...@cloudera.com> wrote:

> Precision, recall and F1 are metrics for binary classifiers, not
> regression models. Can you clarify what you intend to do?
>
> On Tue, Dec 6, 2016, 19:14 Md. Rezaul Karim <rezaul.karim@insight-centre.
> org> wrote:
>
>> Hi Folks,
>>
>> I have the following code snippet in Java that can calculate the
>> precision in Linear Regressor based model.
>>
>> Dataset predictions = model.transform(testData);
>> long count = 0;
>>  for (Row r : predictions.select("features", "label",
>> "prediction").collectAsList()) {
>>count++;
>> }
>>   System.out.println("precision: " + (double) (count * 100) /
>> predictions.count());
>>
>> Now, I would like to compute other evaluation metrics like *Recall *and 
>> *F1-score
>> *etc. How could I do that?
>>
>>
>>
>> Regards,
>> _
>> *Md. Rezaul Karim* BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>> <http://139.59.184.114/index.html>
>>
>


How to compute the recall and F1-score in Linear Regression based model

2016-12-06 Thread Md. Rezaul Karim
Hi Folks,

I have the following code snippet in Java that calculates the precision
of a Linear Regression based model.

Dataset<Row> predictions = model.transform(testData);
long count = 0;
for (Row r : predictions.select("features", "label", "prediction").collectAsList()) {
  count++;
}
System.out.println("precision: " + (double) (count * 100) / predictions.count());

Now, I would like to compute other evaluation metrics like *Recall
*and *F1-score
*etc. How could I do that?



Regards,
_____
*Md. Rezaul Karim* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>


Multilabel classification with Spark MLlib

2016-11-29 Thread Md. Rezaul Karim
Hello All,

Is there anyone who has developed multilabel classification applications
with Spark?

I found an example class in the Spark distribution
(*JavaMultiLabelClassificationMetricsExample.java*), which is not a
classifier but an evaluator for multilabel classification. Moreover, the
example is not well documented (i.e., I did not understand which field is a
label and which one is a feature).

More specifically, I was looking for some example implemented in
Java/Scala/Python so that I can develop my own multi-label classification
applications.
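
For what it's worth, as far as I can tell that evaluator never sees features at all:
each input record is simply a pair of (predicted label set, true label set). A minimal
sketch of the expected input in Java (the label values below are made up):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.evaluation.MultilabelMetrics;
import scala.Tuple2;

public class MultilabelMetricsSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("multilabel-metrics-sketch").setMaster("local[*]"));

    // Each record is (predicted label set, true label set) for one instance; no features here.
    JavaRDD<Tuple2<double[], double[]>> predictionAndLabels = sc.parallelize(Arrays.asList(
        new Tuple2<double[], double[]>(new double[]{0.0, 1.0}, new double[]{0.0, 2.0}),
        new Tuple2<double[], double[]>(new double[]{0.0, 2.0}, new double[]{0.0, 1.0}),
        new Tuple2<double[], double[]>(new double[]{},         new double[]{0.0}),
        new Tuple2<double[], double[]>(new double[]{2.0},      new double[]{2.0})));

    MultilabelMetrics metrics = new MultilabelMetrics(predictionAndLabels.rdd());
    System.out.println("Micro precision = " + metrics.microPrecision());
    System.out.println("Micro recall    = " + metrics.microRecall());
    System.out.println("Hamming loss    = " + metrics.hammingLoss());

    sc.stop();
  }
}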



Any kind of help would be highly appreciated.






Regards,
_
*Md. Rezaul Karim,* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>


Multilabel classification with Spark MLlib

2016-11-25 Thread Md. Rezaul Karim
Hello All,

Is there anyone who has developed multilabel classification applications
with Spark?

I found an example class in the Spark distribution
(*JavaMultiLabelClassificationMetricsExample.java*), which is not a
classifier but an evaluator for multilabel classification. Moreover, the
example is not well documented (i.e., I did not understand which field is a
label and which one is a feature).

More specifically, I was looking for some example implemented in
Java/Scala/Python so that I can develop my own multi-label classification
applications.



Any kind of help would be highly appreciated.






Regards,
_
*Md. Rezaul Karim,* BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>