Re: Linear regression + Janino Exception

2016-11-20 Thread janardhan shetty
Seems like this is associated with:
https://issues.apache.org/jira/browse/SPARK-16845
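
Until that JIRA is resolved, one workaround that has helped with this family
of 64 KB codegen errors is to split the UDF-heavy feature steps into chunks
and materialize between them, so the collapsed projection that gets generated
for any one step stays small. This is only a rough sketch; raw, firstStages
and remainingStages are hypothetical names for the transformers that produce
the train DataFrame in the quoted mail below:

// apply the first group of transformers, then materialize the result
val partlyTransformed = firstStages.foldLeft(raw)((df, t) => t.transform(df)).cache()
partlyTransformed.count() // force the cache so later codegen starts from here

// apply the rest and fit on the materialized data
val train = remainingStages.foldLeft(partlyTransformed)((df, t) => t.transform(df))
val lrModel = lr.fit(train)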

On Sun, Nov 20, 2016 at 6:09 PM, janardhan shetty <janardhan...@gmail.com>
wrote:

> Hi,
>
> I am trying to execute Linear regression algorithm for Spark 2.02 and
> hitting the below error when I am fitting my training set:
>
> val lrModel = lr.fit(train)
>
>
> It happened on 2.0.0 as well. Any resolution steps is appreciated.
>
> *Error Snippet: *
> 16/11/20 18:03:45 *ERROR CodeGenerator: failed to compile:
> org.codehaus.janino.JaninoRuntimeException: Code of method
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
> of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
> grows beyond 64 KB*
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificUnsafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificUnsafeProjection extends
> org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF
> scalaUDF;
> /* 009 */   private scala.Function1 catalystConverter;
> /* 010 */   private scala.Function1 converter;
> /* 011 */   private scala.Function1 udf;
> /* 012 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF
> scalaUDF1;
> /* 013 */   private scala.Function1 catalystConverter1;
> /* 014 */   private scala.Function1 converter1;
> /* 015 */   private scala.Function1 udf1;
> /* 016 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF
> scalaUDF2;
> /* 017 */   private scala.Function1 catalystConverter2;
>
>
>


Linear regression + Janino Exception

2016-11-20 Thread janardhan shetty
Hi,

I am trying to execute the linear regression algorithm on Spark 2.0.2 and am
hitting the error below when fitting my training set:

val lrModel = lr.fit(train)


It happened on 2.0.0 as well. Any resolution steps are appreciated.

*Error Snippet: *
16/11/20 18:03:45 *ERROR CodeGenerator: failed to compile:
org.codehaus.janino.JaninoRuntimeException: Code of method
"(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
of class
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
grows beyond 64 KB*
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificUnsafeProjection extends
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF
scalaUDF;
/* 009 */   private scala.Function1 catalystConverter;
/* 010 */   private scala.Function1 converter;
/* 011 */   private scala.Function1 udf;
/* 012 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF
scalaUDF1;
/* 013 */   private scala.Function1 catalystConverter1;
/* 014 */   private scala.Function1 converter1;
/* 015 */   private scala.Function1 udf1;
/* 016 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF
scalaUDF2;
/* 017 */   private scala.Function1 catalystConverter2;


Re: Usage of mllib api in ml

2016-11-20 Thread janardhan shetty
Hi Marco and Yanbo,

My question is not about the usage of MulticlassClassificationEvaluator; I was
probably not clear. Let me explain:

I am trying to use confusionMatrix, which is not present in the ml version,
MulticlassClassificationEvaluator, whereas it is present in mllib's
MulticlassMetrics.
How can I leverage the RDD-based version with ml DataFrames?

*mllib*: MulticlassMetrics
*ml*: MulticlassClassificationEvaluator
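
One way to bridge the two is to drop down to the RDD level and feed
MulticlassMetrics directly. This is only a sketch; it assumes the fitted
model's output DataFrame (called predictions here) has the default
"prediction" and "label" Double columns:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// predictions = model.transform(testData); column names are assumptions
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.confusionMatrix)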

On Sun, Nov 20, 2016 at 4:52 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Hi
>   you can also have a look at this example,
>
> https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/scala/com/cloudera/datascience/rdf/RunRDF.scala#L220
>
> kr
>  marco
>
> On Sun, Nov 20, 2016 at 9:09 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>
> >> You can refer to this example (http://spark.apache.org/docs/latest/ml-tuning.html#example-model-selection-via-cross-validation),
> >> which uses BinaryClassificationEvaluator, and it should be very
>> straightforward to switch to MulticlassClassificationEvaluator.
>>
>> Thanks
>> Yanbo
>>
>> On Sat, Nov 19, 2016 at 9:03 AM, janardhan shetty <janardhan...@gmail.com
>> > wrote:
>>
>>> Hi,
>>>
>>> I am trying to use the evaluation metrics offered by mllib
>>> multiclassmetrics in ml dataframe setting.
>>> Is there any examples how to use it?
>>>
>>
>>
>


Usage of mllib api in ml

2016-11-19 Thread janardhan shetty
Hi,

I am trying to use the evaluation metrics offered by mllib's MulticlassMetrics
in the ml DataFrame setting.
Are there any examples of how to use it?


Re: Log-loss for multiclass classification

2016-11-16 Thread janardhan shetty
I am sure some work might be in the pipeline, as it is a standard evaluation
criterion. Any thoughts or links?
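
In the meantime, log-loss can be computed by hand from the probability column
that the ml probabilistic classifiers emit. This is a rough sketch; the
predictions DataFrame and the "probability"/"label" column names are
assumptions:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.col

val eps = 1e-15
val logLoss = predictions
  .select(col("label").cast("int"), col("probability"))
  .rdd
  .map { row =>
    val label = row.getInt(0)
    val probs = row.getAs[Vector](1)
    -math.log(math.max(probs(label), eps)) // clip to avoid log(0)
  }
  .mean()
println(s"log-loss: $logLoss")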

On Nov 15, 2016 11:15 AM, "janardhan shetty" <janardhan...@gmail.com> wrote:

> Hi,
>
> Best practice for multi class classification technique is to evaluate the
> model by *log-loss*.
> Is there any jira or work going on to implement the same in
>
> *MulticlassClassificationEvaluator*
>
> Currently it supports following :
> (supports "f1" (default), "weightedPrecision", "weightedRecall",
> "accuracy")
>


Log-loss for multiclass classification

2016-11-15 Thread janardhan shetty
Hi,

A best practice for multi-class classification is to evaluate the
model by *log-loss*.
Is there any JIRA or ongoing work to implement the same in

*MulticlassClassificationEvaluator*?

Currently it supports the following:
"f1" (default), "weightedPrecision", "weightedRecall", "accuracy".


Re: Convert SparseVector column to Densevector column

2016-11-14 Thread janardhan shetty
This worked, thanks Maropu.

On Sun, Nov 13, 2016 at 9:34 PM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Hi,
>
> How about this?
>
> import org.apache.spark.ml.linalg._
> val toSV = udf((v: Vector) => v.toDense)
> val df = Seq((0.1, Vectors.sparse(16, Array(0, 3), Array(0.1, 0.3))),
> (0.2, Vectors.sparse(16, Array(0, 3), Array(0.1, 0.3)))).toDF("a", "b")
> df.select(toSV($"b"))
>
> // maropu
>
>
> On Mon, Nov 14, 2016 at 1:20 PM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Is there any easy way of converting a dataframe column from SparseVector
>> to DenseVector  using
>>
>> import org.apache.spark.ml.linalg.DenseVector API ?
>>
>> Spark ML 2.0
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Convert SparseVector column to Densevector column

2016-11-13 Thread janardhan shetty
Hi,

Is there any easy way of converting a DataFrame column from SparseVector to
DenseVector using the

org.apache.spark.ml.linalg.DenseVector API?

Spark ML 2.0


Re: Spark ML : One hot Encoding for multiple columns

2016-11-13 Thread janardhan shetty
These JIRAs are still unresolved:
https://issues.apache.org/jira/browse/SPARK-11215

Also there is https://issues.apache.org/jira/browse/SPARK-8418
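
Until those are resolved, the per-column approach Nisha describes in the
quoted mail below can be wrapped up mechanically. This is a sketch; the
column names and the input DataFrame df1 are illustrative:

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val categoricalCols = Array("category", "newone")

// one StringIndexer + OneHotEncoder pair per categorical column
val stages: Array[PipelineStage] = categoricalCols.flatMap { c =>
  Array[PipelineStage](
    new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"),
    new OneHotEncoder().setInputCol(c + "_idx").setOutputCol(c + "_vec"))
}

val encoded = new Pipeline().setStages(stages).fit(df1).transform(df1)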

On Wed, Aug 17, 2016 at 11:15 AM, Nisha Muktewar <ni...@cloudera.com> wrote:

>
> The OneHotEncoder does *not* accept multiple columns.
>
> You can use Michal's suggestion where he uses Pipeline to set the stages
> and then executes them.
>
> The other option is to write a function that performs one hot encoding on
> a column and returns a dataframe with the encoded column and then call it
> multiple times for the rest of the columns.
>
>
>
>
> On Wed, Aug 17, 2016 at 10:59 AM, janardhan shetty <janardhan...@gmail.com
> > wrote:
>
>> I had already tried this way :
>>
>> scala> val featureCols = Array("category","newone")
>> featureCols: Array[String] = Array(category, newone)
>>
>> scala> val indexer = new StringIndexer().setInputCol(featureCols).setOutputCol("categoryIndex").fit(df1)
>> <console>:29: error: type mismatch;
>>  found   : Array[String]
>>  required: String
>> val indexer = new StringIndexer().setInputCol(featureCols).setOutputCol("categoryIndex").fit(df1)
>>
>>
>> On Wed, Aug 17, 2016 at 10:56 AM, Nisha Muktewar <ni...@cloudera.com>
>> wrote:
>>
>>> I don't think it does. From the documentation:
>>> https://spark.apache.org/docs/2.0.0-preview/ml-features.html
>>> #onehotencoder, I see that it still accepts one column at a time.
>>>
>>> On Wed, Aug 17, 2016 at 10:18 AM, janardhan shetty <
>>> janardhan...@gmail.com> wrote:
>>>
>>>> 2.0:
>>>>
>>>> One hot encoding currently accepts single input column is there a way
>>>> to include multiple columns ?
>>>>
>>>
>>>
>>
>


Re: Deep learning libraries for scala

2016-10-19 Thread janardhan shetty
Agreed. But as it states, deeper integration with Scala is yet to be
developed.
Any thoughts on how to use TensorFlow with Scala? I think we need to write
wrappers.

On Oct 19, 2016 7:56 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:

> On that note, here is an article that Databricks made regarding using
> Tensorflow in conjunction with Spark.
>
> https://databricks.com/blog/2016/01/25/deep-learning-with-
> apache-spark-and-tensorflow.html
>
> Cheers,
> Ben
>
>
> On Oct 19, 2016, at 3:09 AM, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> while using Deep Learning you might want to stay as close to TensorFlow as
> possible. There is very little translation loss, you get to access stable,
> scalable and tested libraries from the best brains in the industry, and as
> far as Scala goes, it helps a lot to think about using the language as a
> tool to access algorithms in this instance, unless you want to start
> developing algorithms from the ground up (in which case you might not
> require any libraries at all).
>
> On Sat, Oct 1, 2016 at 3:30 AM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Are there any good libraries which can be used for scala deep learning
>> models ?
>> How can we integrate tensorflow with scala ML ?
>>
>
>
>


Re: Deep learning libraries for scala

2016-10-01 Thread janardhan shetty
Apparently there are no neural network implementations in TensorFrames
which we can use, right? Or am I missing something here?

I would like to apply neural networks in an NLP setting; are there any
implementations which can be looked into?

On Fri, Sep 30, 2016 at 8:14 PM, Suresh Thalamati <
suresh.thalam...@gmail.com> wrote:

> Tensor frames
>
> https://spark-packages.org/package/databricks/tensorframes
>
> Hope that helps
> -suresh
>
> On Sep 30, 2016, at 8:00 PM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
> Looking for scala dataframes in particular ?
>
> On Fri, Sep 30, 2016 at 7:46 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>
>> Skymind you could try. It is java
>>
>> I never test though.
>>
>> > On Sep 30, 2016, at 7:30 PM, janardhan shetty <janardhan...@gmail.com>
>> wrote:
>> >
>> > Hi,
>> >
>> > Are there any good libraries which can be used for scala deep learning
>> models ?
>> > How can we integrate tensorflow with scala ML ?
>>
>
>
>


Re: Deep learning libraries for scala

2016-09-30 Thread janardhan shetty
I am looking for Scala DataFrame support in particular.

On Fri, Sep 30, 2016 at 7:46 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> Skymind you could try. It is java
>
> I never test though.
>
> > On Sep 30, 2016, at 7:30 PM, janardhan shetty <janardhan...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > Are there any good libraries which can be used for scala deep learning
> models ?
> > How can we integrate tensorflow with scala ML ?
>


Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread janardhan shetty
It would be good to know which paper inspired the implementation of the
decision tree version we use in Spark 2.0.

On Fri, Sep 30, 2016 at 4:44 PM, Peter Figliozzi <pete.figlio...@gmail.com>
wrote:

> It's a good question.  People have been publishing papers on decision
> trees and various methods of constructing and pruning them for over 30
> years.  I think it's rather a question for a historian at this point.
>
> On Fri, Sep 30, 2016 at 5:08 PM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Read this explanation but wondering if this algorithm has the base from a
>> research paper for detail understanding.
>>
>> On Fri, Sep 30, 2016 at 1:36 PM, Kevin Mellott <kevin.r.mell...@gmail.com
>> > wrote:
>>
>>> The documentation details the algorithm being used at
>>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>>
>>> Thanks,
>>> Kevin
>>>
>>> On Fri, Sep 30, 2016 at 1:14 AM, janardhan shetty <
>>> janardhan...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Any help here is appreciated ..
>>>>
>>>> On Wed, Sep 28, 2016 at 11:34 AM, janardhan shetty <
>>>> janardhan...@gmail.com> wrote:
>>>>
>>>>> Is there a reference to the research paper which is implemented in
>>>>> spark 2.0 ?
>>>>>
>>>>> On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty <
>>>>> janardhan...@gmail.com> wrote:
>>>>>
>>>>>> Which algorithm is used under the covers while doing decision trees
>>>>>> FOR SPARK ?
>>>>>> for example: scikit-learn (python) uses an optimised version of the
>>>>>> CART algorithm.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Deep learning libraries for scala

2016-09-30 Thread janardhan shetty
Hi,

Are there any good libraries which can be used for scala deep learning
models ?
How can we integrate tensorflow with scala ML ?


Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread janardhan shetty
I read that explanation, but I am wondering whether this algorithm is based
on a research paper, for a more detailed understanding.

On Fri, Sep 30, 2016 at 1:36 PM, Kevin Mellott <kevin.r.mell...@gmail.com>
wrote:

> The documentation details the algorithm being used at
> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>
> Thanks,
> Kevin
>
> On Fri, Sep 30, 2016 at 1:14 AM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Any help here is appreciated ..
>>
>> On Wed, Sep 28, 2016 at 11:34 AM, janardhan shetty <
>> janardhan...@gmail.com> wrote:
>>
>>> Is there a reference to the research paper which is implemented in spark
>>> 2.0 ?
>>>
>>> On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty <
>>> janardhan...@gmail.com> wrote:
>>>
>>>> Which algorithm is used under the covers while doing decision trees FOR
>>>> SPARK ?
>>>> for example: scikit-learn (python) uses an optimised version of the
>>>> CART algorithm.
>>>>
>>>
>>>
>>
>


Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread janardhan shetty
Hi,

Any help here is appreciated ..

On Wed, Sep 28, 2016 at 11:34 AM, janardhan shetty <janardhan...@gmail.com>
wrote:

> Is there a reference to the research paper which is implemented in spark
> 2.0 ?
>
> On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Which algorithm is used under the covers while doing decision trees FOR
>> SPARK ?
>> for example: scikit-learn (python) uses an optimised version of the CART
>> algorithm.
>>
>
>


Re: Spark ML Decision Trees Algorithm

2016-09-28 Thread janardhan shetty
Is there a reference to the research paper whose method is implemented in
Spark 2.0?

On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty <janardhan...@gmail.com>
wrote:

> Which algorithm is used under the covers while doing decision trees FOR
> SPARK ?
> for example: scikit-learn (python) uses an optimised version of the CART
> algorithm.
>


Spark ML Decision Trees Algorithm

2016-09-28 Thread janardhan shetty
Which algorithm is used under the covers for decision trees in Spark?
For example, scikit-learn (Python) uses an optimised version of the CART
algorithm.


Re: SPARK-10835 in 2.0

2016-09-20 Thread janardhan shetty
Thanks Sean.
On Sep 20, 2016 7:45 AM, "Sean Owen" <so...@cloudera.com> wrote:

> Ah, I think that this was supposed to be changed with SPARK-9062. Let
> me see about reopening 10835 and addressing it.
>
> On Tue, Sep 20, 2016 at 3:24 PM, janardhan shetty
> <janardhan...@gmail.com> wrote:
> > Is this a bug?
> >
> > On Sep 19, 2016 10:10 PM, "janardhan shetty" <janardhan...@gmail.com>
> wrote:
> >>
> >> Hi,
> >>
> >> I am hitting this issue.
> >> https://issues.apache.org/jira/browse/SPARK-10835.
> >>
> >> Issue seems to be resolved but resurfacing in 2.0 ML. Any workaround is
> >> appreciated ?
> >>
> >> Note:
> >> Pipeline has Ngram before word2Vec.
> >>
> >> Error:
> >> val word2Vec = new
> >> Word2Vec().setInputCol("wordsGrams").setOutputCol("
> features").setVectorSize(128).setMinCount(10)
> >>
> >> scala> word2Vec.fit(grams)
> >> java.lang.IllegalArgumentException: requirement failed: Column
> wordsGrams
> >> must be of type ArrayType(StringType,true) but was actually
> >> ArrayType(StringType,false).
> >>   at scala.Predef$.require(Predef.scala:224)
> >>   at
> >> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(
> SchemaUtils.scala:42)
> >>   at
> >> org.apache.spark.ml.feature.Word2VecBase$class.
> validateAndTransformSchema(Word2Vec.scala:111)
> >>   at
> >> org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(
> Word2Vec.scala:121)
> >>   at
> >> org.apache.spark.ml.feature.Word2Vec.transformSchema(
> Word2Vec.scala:187)
> >>   at org.apache.spark.ml.PipelineStage.transformSchema(
> Pipeline.scala:70)
> >>   at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:170)
> >>
> >>
> >> Github code for Ngram:
> >>
> >>
> >> override protected def validateInputType(inputType: DataType): Unit = {
> >> require(inputType.sameType(ArrayType(StringType)),
> >>   s"Input type must be ArrayType(StringType) but got $inputType.")
> >>   }
> >>
> >>   override protected def outputDataType: DataType = new
> >> ArrayType(StringType, false)
> >> }
> >>
> >
>


Re: SPARK-10835 in 2.0

2016-09-20 Thread janardhan shetty
Is this a bug?
On Sep 19, 2016 10:10 PM, "janardhan shetty" <janardhan...@gmail.com> wrote:

> Hi,
>
> I am hitting this issue. https://issues.apache.org/jira/browse/SPARK-10835
> .
>
> Issue seems to be resolved but resurfacing in 2.0 ML. Any workaround is
> appreciated ?
>
> Note:
> Pipeline has Ngram before word2Vec.
>
> Error:
> val word2Vec = new Word2Vec().setInputCol("wordsGrams").setOutputCol("
> features").setVectorSize(128).setMinCount(10)
>
> scala> word2Vec.fit(grams)
> java.lang.IllegalArgumentException: requirement failed: Column wordsGrams
> must be of type ArrayType(StringType,true) but was actually
> ArrayType(StringType,false).
>   at scala.Predef$.require(Predef.scala:224)
>   at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(
> SchemaUtils.scala:42)
>   at org.apache.spark.ml.feature.Word2VecBase$class.
> validateAndTransformSchema(Word2Vec.scala:111)
>   at org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(
> Word2Vec.scala:121)
>   at org.apache.spark.ml.feature.Word2Vec.transformSchema(
> Word2Vec.scala:187)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
>   at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:170)
>
>
> Github code for Ngram:
>
>
> override protected def validateInputType(inputType: DataType): Unit = {
> require(inputType.sameType(ArrayType(StringType)),
>   s"Input type must be ArrayType(StringType) but got $inputType.")
>   }
>
>   override protected def outputDataType: DataType = new
> ArrayType(StringType, false)
> }
>
>


SPARK-10835 in 2.0

2016-09-19 Thread janardhan shetty
Hi,

I am hitting this issue. https://issues.apache.org/jira/browse/SPARK-10835.

The issue seems to be resolved there, but it is resurfacing in 2.0 ML. Any
workaround is appreciated.

Note:
Pipeline has Ngram before word2Vec.

Error:
val word2Vec = new
Word2Vec().setInputCol("wordsGrams").setOutputCol("features").setVectorSize(128).setMinCount(10)

scala> word2Vec.fit(grams)
java.lang.IllegalArgumentException: requirement failed: Column wordsGrams
must be of type ArrayType(StringType,true) but was actually
ArrayType(StringType,false).
  at scala.Predef$.require(Predef.scala:224)
  at
org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at
org.apache.spark.ml.feature.Word2VecBase$class.validateAndTransformSchema(Word2Vec.scala:111)
  at
org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(Word2Vec.scala:121)
  at
org.apache.spark.ml.feature.Word2Vec.transformSchema(Word2Vec.scala:187)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
  at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:170)


Github code for Ngram:


override protected def validateInputType(inputType: DataType): Unit = {
require(inputType.sameType(ArrayType(StringType)),
  s"Input type must be ArrayType(StringType) but got $inputType.")
  }

  override protected def outputDataType: DataType = new
ArrayType(StringType, false)
}
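
One workaround that has been used for this mismatch is to pass the NGram
output through an identity UDF, which comes back as ArrayType(StringType, true)
and satisfies Word2Vec's schema check. This is a sketch, not a proper fix;
the column name is taken from the snippet above:

import org.apache.spark.sql.functions.{col, udf}

// identity UDF: same data, but the inferred schema allows nulls
val asNullableArray = udf((tokens: Seq[String]) => tokens)

val relaxed = grams.withColumn("wordsGrams", asNullableArray(col("wordsGrams")))
val model = word2Vec.fit(relaxed)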


Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-19 Thread janardhan shetty
Yes Sujit, I have tried that option as well.
I also tried sbt assembly but am hitting the issue below:

http://stackoverflow.com/questions/35197120/java-outofmemoryerror-on-sbt-assembly

Just wondering whether there is any clean approach to including the
StanfordCoreNLP classes in Spark ML?
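
For the sbt assembly OutOfMemoryError, the usual remedy is simply to give the
sbt JVM more memory before running the task (the values below are
illustrative):

export SBT_OPTS="-Xmx4G -XX:MaxMetaspaceSize=512M"
sbt assembly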


On Mon, Sep 19, 2016 at 1:41 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:

> Hi Janardhan,
>
> You need the classifier "models" attribute on the second entry for
> stanford-corenlp to indicate that you want the models JAR, as shown below.
> Right now you are importing two instances of stanford-corenlp JARs.
>
> libraryDependencies ++= {
>   val sparkVersion = "2.0.0"
>   Seq(
> "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
> "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
> "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
> "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
> "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
> "com.google.protobuf" % "protobuf-java" % "2.6.1",
> "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models",
> "org.scalatest" %% "scalatest" % "2.2.6" % "test"
>   )
> }
>
> -sujit
>
>
> On Sun, Sep 18, 2016 at 5:12 PM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Hi Sujit,
>>
>> Tried that option but same error:
>>
>> java version "1.8.0_51"
>>
>>
>> libraryDependencies ++= {
>>   val sparkVersion = "2.0.0"
>>   Seq(
>> "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
>> "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
>> "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
>> "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
>> "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
>> "com.google.protobuf" % "protobuf-java" % "2.6.1",
>> "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
>> "org.scalatest" %% "scalatest" % "2.2.6" % "test"
>>   )
>> }
>>
>> Error:
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> edu/stanford/nlp/pipeline/StanfordCoreNLP
>> at transformers.ml.Lemmatizer$$anonfun$createTransformFunc$1.ap
>> ply(Lemmatizer.scala:37)
>> at transformers.ml.Lemmatizer$$anonfun$createTransformFunc$1.ap
>> ply(Lemmatizer.scala:33)
>> at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$
>> 2.apply(ScalaUDF.scala:88)
>> at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$
>> 2.apply(ScalaUDF.scala:87)
>> at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(Scal
>> aUDF.scala:1060)
>> at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedEx
>> pressions.scala:142)
>> at org.apache.spark.sql.catalyst.expressions.InterpretedProject
>> ion.apply(Projection.scala:45)
>> at org.apache.spark.sql.catalyst.expressions.InterpretedProject
>> ion.apply(Projection.scala:29)
>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:234)
>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:234)
>> at scala.collection.immutable.List.foreach(List.scala:381)
>> at scala.collection.TraversableLike$class.map(TraversableLike.
>> scala:234)
>>
>>
>>
>> On Sun, Sep 18, 2016 at 2:21 PM, Sujit Pal <sujitatgt...@gmail.com>
>> wrote:
>>
>>> Hi Janardhan,
>>>
>>> Maybe try removing the string "test" from this line in your build.sbt?
>>> IIRC, this restricts the models JAR to be called from a test.
>>>
>>> "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test"
>>> classifier "models",
>>>
>>> -sujit
>>>
>>>
>>> On Sun, Sep 18, 2016 at 11:01 AM, janardhan shetty <
>>> janardhan...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to use lemmatization as a transformer and added belwo to
>>>> the build.sbt
>>>>
>>>>  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
>>>> "com.google.protobuf" % "protobuf-java" % "2.6.1",
>>>> "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test"
>>>> classifier "models",
>>>> "org.scalatest" %% "scalatest" % "2.2.6" % "test"
>>>>
>>>>
>>>> Error:
>>>> *Exception in thread "main" java.lang.NoClassDefFoundError:
>>>> edu/stanford/nlp/pipeline/StanfordCoreNLP*
>>>>
>>>> I have tried other versions of this spark package.
>>>>
>>>> Any help is appreciated..
>>>>
>>>
>>>
>>
>


Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread janardhan shetty
Hi Sujit,

Tried that option but same error:

java version "1.8.0_51"


libraryDependencies ++= {
  val sparkVersion = "2.0.0"
  Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
"com.google.protobuf" % "protobuf-java" % "2.6.1",
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
"org.scalatest" %% "scalatest" % "2.2.6" % "test"
  )
}

Error:

Exception in thread "main" java.lang.NoClassDefFoundError:
edu/stanford/nlp/pipeline/StanfordCoreNLP
at
transformers.ml.Lemmatizer$$anonfun$createTransformFunc$1.apply(Lemmatizer.scala:37)
at
transformers.ml.Lemmatizer$$anonfun$createTransformFunc$1.apply(Lemmatizer.scala:33)
at
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
at
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:87)
at
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1060)
at
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142)
at
org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:45)
at
org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:29)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)



On Sun, Sep 18, 2016 at 2:21 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:

> Hi Janardhan,
>
> Maybe try removing the string "test" from this line in your build.sbt?
> IIRC, this restricts the models JAR to be called from a test.
>
> "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier
> "models",
>
> -sujit
>
>
> On Sun, Sep 18, 2016 at 11:01 AM, janardhan shetty <janardhan...@gmail.com
> > wrote:
>
>> Hi,
>>
>> I am trying to use lemmatization as a transformer and added belwo to the
>> build.sbt
>>
>>  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
>> "com.google.protobuf" % "protobuf-java" % "2.6.1",
>> "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier
>> "models",
>> "org.scalatest" %% "scalatest" % "2.2.6" % "test"
>>
>>
>> Error:
>> *Exception in thread "main" java.lang.NoClassDefFoundError:
>> edu/stanford/nlp/pipeline/StanfordCoreNLP*
>>
>> I have tried other versions of this spark package.
>>
>> Any help is appreciated..
>>
>
>


Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread janardhan shetty
I am also sometimes hitting this error when spark-shell is used:

Caused by: edu.stanford.nlp.io.RuntimeIOException: Error while loading a
tagger model (probably missing model file)
  at
edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:770)
  at
edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:298)
  at
edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:263)
  at
edu.stanford.nlp.pipeline.POSTaggerAnnotator.loadModel(POSTaggerAnnotator.java:97)
  at
edu.stanford.nlp.pipeline.POSTaggerAnnotator.<init>(POSTaggerAnnotator.java:77)
  at
edu.stanford.nlp.pipeline.AnnotatorImplementations.posTagger(AnnotatorImplementations.java:59)
  at
edu.stanford.nlp.pipeline.AnnotatorFactories$4.create(AnnotatorFactories.java:290)
  ... 114 more
Caused by: java.io.IOException: Unable to open
"edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger"
as class path, filename or URL
  at
edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:485)
  at
edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:765)
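
That stack trace means the CoreNLP classes are on the classpath but the models
jar is not. One way around it, assuming the models jar has been downloaded
locally first, is to hand it to spark-shell explicitly:

spark-shell --packages databricks:spark-corenlp:0.2.0-s_2.11 \
  --jars stanford-corenlp-3.6.0-models.jar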



On Sun, Sep 18, 2016 at 12:27 PM, janardhan shetty <janardhan...@gmail.com>
wrote:

> Using: spark-shell --packages databricks:spark-corenlp:0.2.0-s_2.11
>
> On Sun, Sep 18, 2016 at 12:26 PM, janardhan shetty <janardhan...@gmail.com
> > wrote:
>
>> Hi Jacek,
>>
>> Thanks for your response. This is the code I am trying to execute
>>
>> import org.apache.spark.sql.functions._
>> import com.databricks.spark.corenlp.functions._
>>
>> val inputd = Seq(
>>   (1, "Stanford University is located in California. ")
>> ).toDF("id", "text")
>>
>> val output = 
>> inputd.select(cleanxml(col("text"))).withColumnRenamed("UDF(text)",
>> "text")
>>
>> val out = output.select(lemma(col("text"))).withColumnRenamed("UDF(text)",
>> "text")
>>
>> output.show() works
>>
>> Error happens when I execute *out.show()*
>>
>>
>>
>> On Sun, Sep 18, 2016 at 11:58 AM, Jacek Laskowski <ja...@japila.pl>
>> wrote:
>>
>>> Hi Jonardhan,
>>>
>>> Can you share the code that you execute? What's the command? Mind
>>> sharing the complete project on github?
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>> On Sun, Sep 18, 2016 at 8:01 PM, janardhan shetty
>>> <janardhan...@gmail.com> wrote:
>>> > Hi,
>>> >
>>> > I am trying to use lemmatization as a transformer and added belwo to
>>> the
>>> > build.sbt
>>> >
>>> >  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
>>> > "com.google.protobuf" % "protobuf-java" % "2.6.1",
>>> > "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test"
>>> classifier
>>> > "models",
>>> > "org.scalatest" %% "scalatest" % "2.2.6" % "test"
>>> >
>>> >
>>> > Error:
>>> > Exception in thread "main" java.lang.NoClassDefFoundError:
>>> > edu/stanford/nlp/pipeline/StanfordCoreNLP
>>> >
>>> > I have tried other versions of this spark package.
>>> >
>>> > Any help is appreciated..
>>>
>>
>>
>


Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread janardhan shetty
Using: spark-shell --packages databricks:spark-corenlp:0.2.0-s_2.11

On Sun, Sep 18, 2016 at 12:26 PM, janardhan shetty <janardhan...@gmail.com>
wrote:

> Hi Jacek,
>
> Thanks for your response. This is the code I am trying to execute
>
> import org.apache.spark.sql.functions._
> import com.databricks.spark.corenlp.functions._
>
> val inputd = Seq(
>   (1, "Stanford University is located in California. ")
> ).toDF("id", "text")
>
> val output = 
> inputd.select(cleanxml(col("text"))).withColumnRenamed("UDF(text)",
> "text")
>
> val out = output.select(lemma(col("text"))).withColumnRenamed("UDF(text)",
> "text")
>
> output.show() works
>
> Error happens when I execute *out.show()*
>
>
>
> On Sun, Sep 18, 2016 at 11:58 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi Jonardhan,
>>
>> Can you share the code that you execute? What's the command? Mind
>> sharing the complete project on github?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sun, Sep 18, 2016 at 8:01 PM, janardhan shetty
>> <janardhan...@gmail.com> wrote:
>> > Hi,
>> >
>> > I am trying to use lemmatization as a transformer and added belwo to the
>> > build.sbt
>> >
>> >  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
>> > "com.google.protobuf" % "protobuf-java" % "2.6.1",
>> > "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test"
>> classifier
>> > "models",
>> > "org.scalatest" %% "scalatest" % "2.2.6" % "test"
>> >
>> >
>> > Error:
>> > Exception in thread "main" java.lang.NoClassDefFoundError:
>> > edu/stanford/nlp/pipeline/StanfordCoreNLP
>> >
>> > I have tried other versions of this spark package.
>> >
>> > Any help is appreciated..
>>
>
>


Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread janardhan shetty
Hi Jacek,

Thanks for your response. This is the code I am trying to execute

import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

val inputd = Seq(
  (1, "Stanford University is located in California. ")
).toDF("id", "text")

val output =
inputd.select(cleanxml(col("text"))).withColumnRenamed("UDF(text)", "text")

val out = output.select(lemma(col("text"))).withColumnRenamed("UDF(text)",
"text")

output.show() works

Error happens when I execute *out.show()*



On Sun, Sep 18, 2016 at 11:58 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Jonardhan,
>
> Can you share the code that you execute? What's the command? Mind
> sharing the complete project on github?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Sep 18, 2016 at 8:01 PM, janardhan shetty
> <janardhan...@gmail.com> wrote:
> > Hi,
> >
> > I am trying to use lemmatization as a transformer and added belwo to the
> > build.sbt
> >
> >  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
> > "com.google.protobuf" % "protobuf-java" % "2.6.1",
> > "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier
> > "models",
> > "org.scalatest" %% "scalatest" % "2.2.6" % "test"
> >
> >
> > Error:
> > Exception in thread "main" java.lang.NoClassDefFoundError:
> > edu/stanford/nlp/pipeline/StanfordCoreNLP
> >
> > I have tried other versions of this spark package.
> >
> > Any help is appreciated..
>


Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread janardhan shetty
Hi,

I am trying to use lemmatization as a transformer and added the following to the
build.sbt

 "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
"com.google.protobuf" % "protobuf-java" % "2.6.1",
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier
"models",
"org.scalatest" %% "scalatest" % "2.2.6" % "test"


Error:
*Exception in thread "main" java.lang.NoClassDefFoundError:
edu/stanford/nlp/pipeline/StanfordCoreNLP*

I have tried other versions of this spark package.

Any help is appreciated..


Re: LDA spark ML visualization

2016-09-13 Thread janardhan shetty
Any help is appreciated to proceed in this problem.
On Sep 12, 2016 11:45 AM, "janardhan shetty" <janardhan...@gmail.com> wrote:

> Hi,
>
> I am trying to visualize the LDA model developed in spark scala (2.0 ML)
> in LDAvis.
>
> Is there any links to convert the spark model parameters to the following
> 5 params to visualize ?
>
> 1. φ, the K × W matrix containing the estimated probability mass function
> over the W terms in the vocabulary for each of the K topics in the model.
> Note that φkw > 0 for all k ∈ 1...K and all w ∈ 1...W, because of the
> priors. (Although our software allows values of zero due to rounding). Each
> of the K rows of φ must sum to one.
> 2. θ, the D × K matrix containing the estimated probability mass function
> over the K topics in the model for each of the D documents in the corpus.
> Note that θdk > 0 for all d ∈ 1...D and all k ∈ 1...K, because of the
> priors (although, as above, our software accepts zeroes due to rounding).
> Each of the D rows of θ must sum to one.
> 3. nd, the number of tokens observed in document d, where nd is required
> to be an integer greater than zero, for documents d = 1...D. Denoted
> doc.length in our code.
> 4. vocab, the length-W character vector containing the terms in the
> vocabulary (listed in the same order as the columns of φ).
> 5. Mw, the frequency of term w across the entire corpus, where Mw is
> required to be an integer greater than zero for each term w = 1...W.
> Denoted term.frequency in our code.
>


LDA spark ML visualization

2016-09-12 Thread janardhan shetty
Hi,

I am trying to visualize the LDA model developed in spark scala (2.0 ML) in
LDAvis.

Is there any links to convert the spark model parameters to the following 5
params to visualize ?

1. φ, the K × W matrix containing the estimated probability mass function
over the W terms in the vocabulary for each of the K topics in the model.
Note that φkw > 0 for all k ∈ 1...K and all w ∈ 1...W, because of the
priors. (Although our software allows values of zero due to rounding). Each
of the K rows of φ must sum to one.
2. θ, the D × K matrix containing the estimated probability mass function
over the K topics in the model for each of the D documents in the corpus.
Note that θdk > 0 for all d ∈ 1...D and all k ∈ 1...K, because of the
priors (although, as above, our software accepts zeroes due to rounding).
Each of the D rows of θ must sum to one.
3. nd, the number of tokens observed in document d, where nd is required to
be an integer greater than zero, for documents d = 1...D. Denoted
doc.length in our code.
4. vocab, the length-W character vector containing the terms in the
vocabulary (listed in the same order as the columns of φ).
5. Mw, the frequency of term w across the entire corpus, where Mw is
required to be an integer greater than zero for each term w = 1...W.
Denoted term.frequency in our code.
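
There is no built-in export, but the five inputs can be pulled out of a fitted
ml pipeline roughly as follows. This is only a sketch; cvModel (the
CountVectorizerModel that produced the "features" count vectors), ldaModel
(the fitted LDAModel) and countsDF (the vectorized corpus) are assumed names:

import org.apache.spark.ml.linalg.{Vector => MLVector}

// 4. vocab, in the same order as the count-vector indices
val vocab: Array[String] = cvModel.vocabulary

// 1. phi: topicsMatrix is vocabSize x K, so read each topic column and
// renormalize it to sum to one
val tm = ldaModel.topicsMatrix
val phi: Array[Array[Double]] = (0 until tm.numCols).map { k =>
  val topic = (0 until tm.numRows).map(w => tm(w, k)).toArray
  val s = topic.sum
  topic.map(_ / s)
}.toArray

// 2. theta: per-document topic distributions
val theta: Array[Array[Double]] = ldaModel.transform(countsDF)
  .select("topicDistribution").rdd
  .map(_.getAs[MLVector](0).toArray).collect()

// 3. doc.length: total token count per document
val docLengths: Array[Double] = countsDF.select("features").rdd
  .map(_.getAs[MLVector](0).toArray.sum).collect()

// 5. term.frequency: corpus-wide count of each term
val termFrequency: Array[Double] = countsDF.select("features").rdd
  .map(_.getAs[MLVector](0).toArray)
  .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })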


Re: Spark transformations

2016-09-12 Thread janardhan shetty
Thanks Thunder. To copy the code base is difficult since we need to copy in
entirety or transitive dependency files as well.
If we need to do complex operations of taking a column as a whole instead
of each element in a row is not possible as of now.

Trying to find few pointers to easily solve this.

On Mon, Sep 12, 2016 at 9:43 AM, Thunder Stumpges <
thunder.stump...@gmail.com> wrote:

> Hi Janardhan,
>
> I have run into similar issues and asked similar questions. I also ran
> into many problems with private code when trying to write my own
> Model/Transformer/Estimator. (you might be able to find my question to the
> group regarding this, I can't really tell if my emails are getting through,
> as I don't get any responses). For now I have resorted to copying out the
> code that I need from the spark codebase and into mine. I'm certain this is
> not the best, but it has to be better than "implementing it myself" which
> was what the only response to my question said to do.
>
> As for the transforms, I also asked a similar question. The only way I've
> seen it done in code is using a UDF. As you mention, the UDF can only
> access fields on a "row by row" basis. I have not gotten any replies at all
> on my question, but I also need to do some more complicated operation in my
> work (join to another model RDD, flat-map, calculate, reduce) in order to
> get the value for the new column. So far no great solution.
>
> Sorry I don't have any answers, but wanted to chime in that I am also a
> bit stuck on similar issues. Hope we can find a workable solution soon.
> Cheers,
> Thunder
>
>
>
> On Tue, Sep 6, 2016 at 1:32 PM janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Noticed few things about Spark transformers just wanted to be clear.
>>
>> Unary transformer:
>>
>> createTransformFunc: IN => OUT  = { *item* => }
>> Here *item *is single element and *NOT* entire column.
>>
>> I would like to get the number of elements in that particular column.
>> Since there is *no forward checking* how can we get this information ?
>> We have visibility into single element and not the entire column.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Sun, Sep 4, 2016 at 9:30 AM, janardhan shetty <janardhan...@gmail.com>
>> wrote:
>>
>>> In scala Spark ML Dataframes.
>>>
>>> On Sun, Sep 4, 2016 at 9:16 AM, Somasundaram Sekar <somasundar.sekar@
>>> tigeranalytics.com> wrote:
>>>
>>>> Can you try this
>>>>
>>>> https://www.linkedin.com/pulse/hive-functions-udfudaf-
>>>> udtf-examples-gaurav-singh
>>>>
>>>> On 4 Sep 2016 9:38 pm, "janardhan shetty" <janardhan...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Is there any chance that we can send entire multiple columns to an udf
>>>>> and generate a new column for Spark ML.
>>>>> I see similar approach as VectorAssembler but not able to use few
>>>>> classes /traitslike HasInputCols, HasOutputCol, DefaultParamsWritable 
>>>>> since
>>>>> they are private.
>>>>>
>>>>> Any leads/examples is appreciated in this regard..
>>>>>
>>>>> Requirement:
>>>>> *Input*: Multiple columns of a Dataframe
>>>>> *Output*:  Single new modified column
>>>>>
>>>>
>>>
>>


Re: Using spark package XGBoost

2016-09-08 Thread janardhan shetty
I tried to use the spark package
https://spark-packages.org/package/rotationsymmetry/sparkxgboost
in 2.0, but it is throwing this error:

error: not found: type SparkXGBoostClassifier

On Tue, Sep 6, 2016 at 11:26 AM, janardhan shetty <janardhan...@gmail.com>
wrote:

> Is this merged to Spark ML ? If so which version ?
>
> On Tue, Sep 6, 2016 at 12:58 AM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Hi,
>>
>> Sorry to bother you, but I'd like to inform you of our activities.
>> We'll start incubating our product, Hivemall, in Apache and this is a
>> scalable ML library
>> for Hive/Spark/Pig.
>>
>> - http://wiki.apache.org/incubator/HivemallProposal
>> - http://markmail.org/thread/mjwyyd4btthk3626
>>
>> I made a pr for XGBoost integration on DataFrame/Spark (https://github.com/myui/hivemall/pull/281)
>> and this pr has already been merged in a master.
>> I wrote how to use the integration on my gist:
>> https://gist.github.com/maropu/33794b293ee937e99b8fb0788843fa3f
>>
>> If you are interested in the integration, could you please you try using
>> it and
>> let me know the issues that you get stuck in?
>>
>> Best regards,
>> takeshi
>>
>> // maropu
>>
>>
>>
>> On Mon, Aug 15, 2016 at 1:04 PM, Brandon White <bwwintheho...@gmail.com>
>> wrote:
>>
>>> The XGBoost integration with Spark is currently only supported for RDDs,
>>> there is a ticket for dataframe and folks calm to be working on it.
>>>
>>> On Aug 14, 2016 8:15 PM, "Jacek Laskowski" <ja...@japila.pl> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've never worked with the library and speaking about sbt setup only.
>>>>
>>>> It appears that the project didn't release 2.11-compatible jars (only
>>>> 2.10) [1] so you need to build the project yourself and uber-jar it
>>>> (using sbt-assembly plugin).
>>>>
>>>> [1] https://spark-packages.org/package/rotationsymmetry/sparkxgboost
>>>>
>>>> Pozdrawiam,
>>>> Jacek Laskowski
>>>> 
>>>> https://medium.com/@jaceklaskowski/
>>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>>> Follow me at https://twitter.com/jaceklaskowski
>>>>
>>>>
>>>> On Sun, Aug 14, 2016 at 7:13 AM, janardhan shetty
>>>> <janardhan...@gmail.com> wrote:
>>>> > Any leads on how to achieve this?
>>>> >
>>>> > On Aug 12, 2016 6:33 PM, "janardhan shetty" <janardhan...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> I tried using  sparkxgboost package in build.sbt file but it failed.
>>>> >> Spark 2.0
>>>> >> Scala 2.11.8
>>>> >>
>>>> >> Error:
>>>> >>  [warn]
>>>> >> http://dl.bintray.com/spark-packages/maven/rotationsymmetry/
>>>> sparkxgboost/0.2.1-s_2.10/sparkxgboost-0.2.1-s_2.10-javadoc.jar
>>>> >>[warn] ::
>>>> >>[warn] ::  FAILED DOWNLOADS::
>>>> >>[warn] :: ^ see resolution messages for details  ^ ::
>>>> >>[warn] ::
>>>> >>[warn] ::
>>>> >> rotationsymmetry#sparkxgboost;0.2.1-s_2.10!sparkxgboost.jar(src)
>>>> >>[warn] ::
>>>> >> rotationsymmetry#sparkxgboost;0.2.1-s_2.10!sparkxgboost.jar(doc)
>>>> >>
>>>> >> build.sbt:
>>>> >>
>>>> >> scalaVersion := "2.11.8"
>>>> >>
>>>> >> libraryDependencies ++= {
>>>> >>   val sparkVersion = "2.0.0-preview"
>>>> >>   Seq(
>>>> >> "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
>>>> >> "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
>>>> >> "org.apache.spark" %% "spark-streaming" % sparkVersion %
>>>> "provided",
>>>> >> "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
>>>> >>   )
>>>> >> }
>>>> >>
>>>> >> resolvers += "Spark Packages Repo" at
>>>> >> "http://dl.bintray.com/spark-packages/maven"
>>>> >>
>>>> >> libraryDependencies += "rotationsymmetry" % "sparkxgboost" %
>>>> >> "0.2.1-s_2.10"
>>>> >>
>>>> >> assemblyMergeStrategy in assembly := {
>>>> >>   case PathList("META-INF", "MANIFEST.MF")   =>
>>>> >> MergeStrategy.discard
>>>> >>   case PathList("javax", "servlet", xs @ _*) =>
>>>> >> MergeStrategy.first
>>>> >>   case PathList(ps @ _*) if ps.last endsWith ".html" =>
>>>> >> MergeStrategy.first
>>>> >>   case "application.conf"=>
>>>> >> MergeStrategy.concat
>>>> >>   case "unwanted.txt"=>
>>>> >> MergeStrategy.discard
>>>> >>
>>>> >>   case x => val oldStrategy = (assemblyMergeStrategy in
>>>> assembly).value
>>>> >> oldStrategy(x)
>>>> >>
>>>> >> }
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Fri, Aug 12, 2016 at 3:35 PM, janardhan shetty <
>>>> janardhan...@gmail.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> Is there a dataframe version of XGBoost in spark-ml ?.
>>>> >>> Has anyone used sparkxgboost package ?
>>>> >>
>>>> >>
>>>> >
>>>>
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


Difference between UDF and Transformer in Spark ML

2016-09-06 Thread janardhan shetty
Apart from the creation of a new column, what are the other differences
between a Transformer and a UDF in Spark ML?


Re: Spark ML 2.1.0 new features

2016-09-06 Thread janardhan shetty
Thanks Jacek.

On Tue, Sep 6, 2016 at 1:44 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> https://issues.apache.org/jira/browse/SPARK-17363?jql=
> project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0%
> 20AND%20component%20%3D%20MLlib
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Sep 6, 2016 at 10:27 PM, janardhan shetty
> <janardhan...@gmail.com> wrote:
> > Any links ?
> >
> > On Mon, Sep 5, 2016 at 1:50 PM, janardhan shetty <janardhan...@gmail.com
> >
> > wrote:
> >>
> >> Is there any documentation or links on the new features which we can
> >> expect for Spark ML 2.1.0 release ?
> >
> >
>


Re: Spark transformations

2016-09-06 Thread janardhan shetty
I noticed a few things about Spark transformers and just wanted to be clear.

Unary transformer:

createTransformFunc: IN => OUT  = { *item* => }
Here *item* is a single element and *NOT* the entire column.

I would like to get the number of elements in that particular column. Since
there is *no forward checking*, how can we get this information?
We have visibility into a single element and not the entire column.
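
One pattern that works today is to run an aggregation first and close over the
result in the row-wise function. This is a sketch, assuming the column-level
statistic can be computed up front; the column names are illustrative:

import org.apache.spark.sql.functions.{col, max, size, udf}

// whole-column information computed once...
val maxLen = df.agg(max(size(col("tokens")))).head.getInt(0)

// ...then used inside the per-row transform
val padToMax = udf((xs: Seq[String]) => xs.padTo(maxLen, ""))
val padded = df.withColumn("padded", padToMax(col("tokens")))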

On Sun, Sep 4, 2016 at 9:30 AM, janardhan shetty <janardhan...@gmail.com>
wrote:

> In scala Spark ML Dataframes.
>
> On Sun, Sep 4, 2016 at 9:16 AM, Somasundaram Sekar <somasundar.sekar@
> tigeranalytics.com> wrote:
>
>> Can you try this
>>
>> https://www.linkedin.com/pulse/hive-functions-udfudaf-udtf-
>> examples-gaurav-singh
>>
>> On 4 Sep 2016 9:38 pm, "janardhan shetty" <janardhan...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Is there any chance that we can send entire multiple columns to an udf
>>> and generate a new column for Spark ML.
>>> I see similar approach as VectorAssembler but not able to use few
>>> classes /traitslike HasInputCols, HasOutputCol, DefaultParamsWritable since
>>> they are private.
>>>
>>> Any leads/examples is appreciated in this regard..
>>>
>>> Requirement:
>>> *Input*: Multiple columns of a Dataframe
>>> *Output*:  Single new modified column
>>>
>>
>


Re: Spark ML 2.1.0 new features

2016-09-06 Thread janardhan shetty
Any links ?

On Mon, Sep 5, 2016 at 1:50 PM, janardhan shetty <janardhan...@gmail.com>
wrote:

> Is there any documentation or links on the new features which we can
> expect for Spark ML 2.1.0 release ?
>


Re: Using spark package XGBoost

2016-09-06 Thread janardhan shetty
Is this merged into Spark ML? If so, which version?

On Tue, Sep 6, 2016 at 12:58 AM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Hi,
>
> Sorry to bother you, but I'd like to inform you of our activities.
> We'll start incubating our product, Hivemall, in Apache and this is a
> scalable ML library
> for Hive/Spark/Pig.
>
> - http://wiki.apache.org/incubator/HivemallProposal
> - http://markmail.org/thread/mjwyyd4btthk3626
>
> I made a pr for XGBoost integration on DataFrame/Spark (https://github.com/myui/hivemall/pull/281)
> and this pr has already been merged in a master.
> I wrote how to use the integration on my gist:
> https://gist.github.com/maropu/33794b293ee937e99b8fb0788843fa3f
>
> If you are interested in the integration, could you please you try using
> it and
> let me know the issues that you get stuck in?
>
> Best regards,
> takeshi
>
> // maropu
>
>
>
> On Mon, Aug 15, 2016 at 1:04 PM, Brandon White <bwwintheho...@gmail.com>
> wrote:
>
>> The XGBoost integration with Spark is currently only supported for RDDs,
>> there is a ticket for dataframe and folks calm to be working on it.
>>
>> On Aug 14, 2016 8:15 PM, "Jacek Laskowski" <ja...@japila.pl> wrote:
>>
>>> Hi,
>>>
>>> I've never worked with the library and speaking about sbt setup only.
>>>
>>> It appears that the project didn't release 2.11-compatible jars (only
>>> 2.10) [1] so you need to build the project yourself and uber-jar it
>>> (using sbt-assembly plugin).
>>>
>>> [1] https://spark-packages.org/package/rotationsymmetry/sparkxgboost
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>> On Sun, Aug 14, 2016 at 7:13 AM, janardhan shetty
>>> <janardhan...@gmail.com> wrote:
>>> > Any leads on how to achieve this?
>>> >
>>> > On Aug 12, 2016 6:33 PM, "janardhan shetty" <janardhan...@gmail.com>
>>> wrote:
>>> >>
>>> >> I tried using  sparkxgboost package in build.sbt file but it failed.
>>> >> Spark 2.0
>>> >> Scala 2.11.8
>>> >>
>>> >> Error:
>>> >>  [warn]
>>> >> http://dl.bintray.com/spark-packages/maven/rotationsymmetry/
>>> sparkxgboost/0.2.1-s_2.10/sparkxgboost-0.2.1-s_2.10-javadoc.jar
>>> >>[warn] ::
>>> >>[warn] ::  FAILED DOWNLOADS::
>>> >>[warn] :: ^ see resolution messages for details  ^ ::
>>> >>[warn] ::
>>> >>[warn] ::
>>> >> rotationsymmetry#sparkxgboost;0.2.1-s_2.10!sparkxgboost.jar(src)
>>> >>[warn] ::
>>> >> rotationsymmetry#sparkxgboost;0.2.1-s_2.10!sparkxgboost.jar(doc)
>>> >>
>>> >> build.sbt:
>>> >>
>>> >> scalaVersion := "2.11.8"
>>> >>
>>> >> libraryDependencies ++= {
>>> >>   val sparkVersion = "2.0.0-preview"
>>> >>   Seq(
>>> >> "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
>>> >> "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
>>> >> "org.apache.spark" %% "spark-streaming" % sparkVersion %
>>> "provided",
>>> >> "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
>>> >>   )
>>> >> }
>>> >>
>>> >> resolvers += "Spark Packages Repo" at
>>> >> "http://dl.bintray.com/spark-packages/maven"
>>> >>
>>> >> libraryDependencies += "rotationsymmetry" % "sparkxgboost" %
>>> >> "0.2.1-s_2.10"
>>> >>
>>> >> assemblyMergeStrategy in assembly := {
>>> >>   case PathList("META-INF", "MANIFEST.MF")   =>
>>> >> MergeStrategy.discard
>>> >>   case PathList("javax", "servlet", xs @ _*) =>
>>> >> MergeStrategy.first
>>> >>   case PathList(ps @ _*) if ps.last endsWith ".html" =>
>>> >> MergeStrategy.first
>>> >>   case "application.conf"=>
>>> >> MergeStrategy.concat
>>> >>   case "unwanted.txt"=>
>>> >> MergeStrategy.discard
>>> >>
>>> >>   case x => val oldStrategy = (assemblyMergeStrategy in
>>> assembly).value
>>> >> oldStrategy(x)
>>> >>
>>> >> }
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Fri, Aug 12, 2016 at 3:35 PM, janardhan shetty <
>>> janardhan...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Is there a dataframe version of XGBoost in spark-ml ?.
>>> >>> Has anyone used sparkxgboost package ?
>>> >>
>>> >>
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Spark ML 2.1.0 new features

2016-09-05 Thread janardhan shetty
Is there any documentation, or are there links, on the new features we can
expect for the Spark ML 2.1.0 release?


Re: Spark transformations

2016-09-04 Thread janardhan shetty
In Scala Spark ML DataFrames.

On Sun, Sep 4, 2016 at 9:16 AM, Somasundaram Sekar <
somasundar.se...@tigeranalytics.com> wrote:

> Can you try this
>
> https://www.linkedin.com/pulse/hive-functions-udfudaf-
> udtf-examples-gaurav-singh
>
> On 4 Sep 2016 9:38 pm, "janardhan shetty" <janardhan...@gmail.com> wrote:
>
>> Hi,
>>
>> Is there any chance that we can send entire multiple columns to an udf
>> and generate a new column for Spark ML.
>> I see similar approach as VectorAssembler but not able to use few classes
>> /traitslike HasInputCols, HasOutputCol, DefaultParamsWritable since they
>> are private.
>>
>> Any leads/examples is appreciated in this regard..
>>
>> Requirement:
>> *Input*: Multiple columns of a Dataframe
>> *Output*:  Single new modified column
>>
>


Spark transformations

2016-09-04 Thread janardhan shetty
Hi,

Is there any chance that we can send multiple entire columns to a UDF and
generate a new column for Spark ML?
I see a similar approach in VectorAssembler, but I am not able to use a few
classes/traits like HasInputCols, HasOutputCol, and DefaultParamsWritable,
since they are private.

Any leads/examples are appreciated in this regard.

Requirement:
*Input*: multiple columns of a DataFrame
*Output*: a single new modified column
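
Without touching the private traits, a plain multi-argument UDF already covers
the multiple-columns-in, one-column-out case. This is a sketch; the column
names and the combining function are illustrative:

import org.apache.spark.sql.functions.{col, udf}

// Spark hands the UDF the named columns row by row, and the result becomes
// the new column
val combine = udf((category: String, x: Double, y: Double) => category + ":" + (x + y))

val withDerived = df.withColumn("derived", combine(col("category"), col("x"), col("y")))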


Re: Combining multiple models in Spark-ML 2.0

2016-08-23 Thread janardhan shetty
Any methods to achieve this?
On Aug 22, 2016 3:40 PM, "janardhan shetty" <janardhan...@gmail.com> wrote:

> Hi,
>
> Are there any pointers, links on stacking multiple models in spark
> dataframes ?. WHat strategies can be employed if we need to combine greater
> than 2 models  ?
>


Combining multiple models in Spark-ML 2.0

2016-08-22 Thread janardhan shetty
Hi,

Are there any pointers or links on stacking multiple models with Spark
DataFrames? What strategies can be employed if we need to combine more
than 2 models?
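
For reference, a minimal two-model stack can be wired up with nothing but
renamed output columns and a VectorAssembler. This is only a sketch; train is
assumed to have "features"/"label" columns, and a proper setup would fit the
meta-learner on a holdout split rather than on the same rows:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// two differently-regularized base learners, each writing to its own columns
val baseA = new LogisticRegression().setRegParam(0.01)
  .setPredictionCol("predA").setRawPredictionCol("rawA").setProbabilityCol("probA")
  .fit(train)
val baseB = new LogisticRegression().setRegParam(0.3).setElasticNetParam(1.0)
  .setPredictionCol("predB").setRawPredictionCol("rawB").setProbabilityCol("probB")
  .fit(train)

// stack their probability outputs as features for a meta-learner
val withBasePreds = baseB.transform(baseA.transform(train))
val metaInput = new VectorAssembler()
  .setInputCols(Array("probA", "probB"))
  .setOutputCol("metaFeatures")
  .transform(withBasePreds)

val metaModel = new LogisticRegression()
  .setFeaturesCol("metaFeatures")
  .fit(metaInput)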


Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-22 Thread janardhan shetty
Thanks Nick.
This JIRA seems to have been in a stagnant state for a while; any update on
when this will be released?
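
To make the train/test dimensions agree regardless of which categories show
up, the hashing route Nick mentions below can look roughly like this. This is
a sketch; the column names and the feature count are illustrative:

import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.functions.{array, col}

// wrap the categorical value in an array so HashingTF can consume it,
// then hash it into a fixed-size vector
val withTokens = df.withColumn("categoryTokens", array(col("category")))

val hasher = new HashingTF()
  .setInputCol("categoryTokens")
  .setOutputCol("categoryHashed")
  .setNumFeatures(1024)

val hashed = hasher.transform(withTokens)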

On Mon, Aug 22, 2016 at 5:07 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> I believe it may be because of this issue (https://issues.apache.org/
> jira/browse/SPARK-13030). OHE is not an estimator - hence in cases where
> the number of categories differ between train and test, it's not usable in
> the current form.
>
> It's tricky to work around, though one option is to use feature hashing
> instead of the StringIndexer -> OHE combo (see https://lists.apache.org/
> thread.html/a7e06426fd958665985d2c4218ea2f9bf9ba136ddefe83e1ad6f1727@%
> 3Cuser.spark.apache.org%3E for some details).
>
>
>
> On Mon, 22 Aug 2016 at 03:20 janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Thanks Krishna for your response.
>> Features in the training set has more categories than test set so when
>> vectorAssembler is used these numbers are usually different and I believe
>> it is as expected right ?
>>
>> Test dataset usually will not have so many categories in their features
>> as Train is the belief here.
>>
>> On Sun, Aug 21, 2016 at 4:44 PM, Krishna Sankar <ksanka...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>Just after I sent the mail, I realized that the error might be with
>>> the training-dataset not the test-dataset.
>>>
>>>1. it might be that you are feeding the full Y vector for training.
>>>2. Which could mean, you are using ~50-50 training-test split.
>>>3. Take a good look at the code that does the data split and the
>>>datasets where they are allocated to.
>>>
>>> Cheers
>>> 
>>>
>>> On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar <ksanka...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>   Looks like the test-dataset has different sizes for X & Y. Possible
>>>> steps:
>>>>
>>>>1. What is the test-data-size ?
>>>>   - If it is 15,909, check the prediction variable vector - it is
>>>>   now 29,471, should be 15,909
>>>>   - If you expect it to be 29,471, then the X Matrix is not right.
>>>>   2. It is also probable that the size of the test-data is
>>>>something else. If so, check the data pipeline.
>>>>3. If you print the count() of the various vectors, I think you can
>>>>find the error.
>>>>
>>>> Cheers & Good Luck
>>>> 
>>>>
>>>> On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty <
>>>> janardhan...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have built the logistic regression model using training-dataset.
>>>>> When I am predicting on a test-dataset, it is throwing the below error
>>>>> of size mismatch.
>>>>>
>>>>> Steps done:
>>>>> 1. String indexers on categorical features.
>>>>> 2. One hot encoding on these indexed features.
>>>>>
>>>>> Any help is appreciated to resolve this issue or is it a bug ?
>>>>>
>>>>> SparkException: *Job aborted due to stage failure: Task 0 in stage
>>>>> 635.0 failed 1 times, most recent failure: Lost task 0.0 in stage 635.0
>>>>> (TID 19421, localhost): java.lang.IllegalArgumentException: requirement
>>>>> failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching
>>>>> sizes: x.size = 15909, y.size = 29471* at
>>>>> scala.Predef$.require(Predef.scala:224) at 
>>>>> org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104)
>>>>> at org.apache.spark.ml.classification.LogisticRegressionModel$$
>>>>> anonfun$19.apply(LogisticRegression.scala:505) at org.apache.spark.ml.
>>>>> classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
>>>>> at org.apache.spark.ml.classification.LogisticRegressionModel.
>>>>> predictRaw(LogisticRegression.scala:594) at org.apache.spark.ml.
>>>>> classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
>>>>> at org.apache.spark.ml.classification.ProbabilisticClassificationMod
>>>>> el$$anonfun$1.apply(ProbabilisticClassifier.scala:112) at
>>>>> org.apache.spark.ml.classification.ProbabilisticClassificationMod
>>>>> el$$anonfun$1.apply(ProbabilisticClassifier.scala:111) at
>>>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$
>>>>> SpecificUnsafeProjection.evalExpr137$(Unknown Source) at
>>>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$
>>>>> SpecificUnsafeProjection.apply(Unknown Source) at
>>>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$
>>>>> SpecificUnsafeProjection.apply(Unknown Source) at
>>>>> scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>>>>
>>>>
>>>>
>>>
>>
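
A rough sketch of the feature-hashing workaround mentioned above, using
HashingTF over "column=value" tokens so that categories unseen at training
time still hash into the same fixed-size vector (trainDF/testDF and the column
names are hypothetical):

import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.functions.{array, col, concat_ws, lit}

// Turn each categorical column into a "name=value" token and collect them into one array column.
val catCols = Seq("cat1", "cat2", "cat3")
val tokens  = array(catCols.map(c => concat_ws("=", lit(c), col(c))): _*)

val hasher = new HashingTF()
  .setInputCol("catTokens")
  .setOutputCol("catFeatures")
  .setNumFeatures(1 << 18)

val trainHashed = hasher.transform(trainDF.withColumn("catTokens", tokens))
// The same (stateless) transformer applies to the test set; unseen values just hash to a bucket.
val testHashed  = hasher.transform(testDF.withColumn("catTokens", tokens))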


Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread janardhan shetty
Thanks Krishna for your response.
Features in the training set have more categories than in the test set, so
when VectorAssembler is used these numbers are usually different, and I
believe that is expected, right?

The assumption here is that the test dataset usually will not have as many
categories in its features as the training set.

On Sun, Aug 21, 2016 at 4:44 PM, Krishna Sankar <ksanka...@gmail.com> wrote:

> Hi,
>Just after I sent the mail, I realized that the error might be with the
> training-dataset not the test-dataset.
>
>1. it might be that you are feeding the full Y vector for training.
>2. Which could mean, you are using ~50-50 training-test split.
>3. Take a good look at the code that does the data split and the
>datasets where they are allocated to.
>
> Cheers
> 
>
> On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar <ksanka...@gmail.com>
> wrote:
>
>> Hi,
>>   Looks like the test-dataset has different sizes for X & Y. Possible
>> steps:
>>
>>1. What is the test-data-size ?
>>   - If it is 15,909, check the prediction variable vector - it is
>>   now 29,471, should be 15,909
>>   - If you expect it to be 29,471, then the X Matrix is not right.
>>   2. It is also probable that the size of the test-data is something
>>else. If so, check the data pipeline.
>>3. If you print the count() of the various vectors, I think you can
>>find the error.
>>
>> Cheers & Good Luck
>> 
>>
>> On Sun, Aug 21, 2016 at 3:16 PM, janardhan shetty <janardhan...@gmail.com
>> > wrote:
>>
>>> Hi,
>>>
>>> I have built the logistic regression model using training-dataset.
>>> When I am predicting on a test-dataset, it is throwing the below error
>>> of size mismatch.
>>>
>>> Steps done:
>>> 1. String indexers on categorical features.
>>> 2. One hot encoding on these indexed features.
>>>
>>> Any help is appreciated to resolve this issue or is it a bug ?
>>>
>>> SparkException: *Job aborted due to stage failure: Task 0 in stage
>>> 635.0 failed 1 times, most recent failure: Lost task 0.0 in stage 635.0
>>> (TID 19421, localhost): java.lang.IllegalArgumentException: requirement
>>> failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching
>>> sizes: x.size = 15909, y.size = 29471* at 
>>> scala.Predef$.require(Predef.scala:224)
>>> at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104) at
>>> org.apache.spark.ml.classification.LogisticRegressionModel$$
>>> anonfun$19.apply(LogisticRegression.scala:505) at org.apache.spark.ml
>>> .classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
>>> at org.apache.spark.ml.classification.LogisticRegressionModel.p
>>> redictRaw(LogisticRegression.scala:594) at org.apache.spark.ml
>>> .classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
>>> at org.apache.spark.ml.classification.ProbabilisticClassificati
>>> onModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112) at
>>> org.apache.spark.ml.classification.ProbabilisticClassificati
>>> onModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111) at
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe
>>> cificUnsafeProjection.evalExpr137$(Unknown Source) at
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe
>>> cificUnsafeProjection.apply(Unknown Source) at
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Spe
>>> cificUnsafeProjection.apply(Unknown Source) at
>>> scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>>
>>
>>
>


Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread janardhan shetty
Hi,

I have built a logistic regression model using the training dataset.
When I predict on a test dataset, it throws the size-mismatch error below.

Steps done:
1. String indexers on categorical features.
2. One hot encoding on these indexed features.

Any help is appreciated to resolve this issue or is it a bug ?

SparkException: *Job aborted due to stage failure: Task 0 in stage 635.0
failed 1 times, most recent failure: Lost task 0.0 in stage 635.0 (TID
19421, localhost): java.lang.IllegalArgumentException: requirement failed:
BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:
x.size = 15909, y.size = 29471* at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.linalg.BLAS$.dot(BLAS.scala:104) at
org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:505)
at
org.apache.spark.ml.classification.LogisticRegressionModel$$anonfun$19.apply(LogisticRegression.scala:504)
at
org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:594)
at
org.apache.spark.ml.classification.LogisticRegressionModel.predictRaw(LogisticRegression.scala:484)
at
org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:112)
at
org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$1.apply(ProbabilisticClassifier.scala:111)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr137$(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)


Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-18 Thread janardhan shetty
There is a spark-ts package, developed by Sandy, which has an RDD version.
Not sure about the DataFrame roadmap.

http://sryza.github.io/spark-timeseries/0.3.0/index.html
On Aug 18, 2016 12:42 AM, "ayan guha"  wrote:

> Thanks a lot. I resolved it using an UDF.
>
> Qs: does spark support any time series model? Is there any roadmap to know
> when a feature will be roughly available?
> On 18 Aug 2016 16:46, "Yanbo Liang"  wrote:
>
>> If you want to tie them with other data, I think the best way is to use
>> DataFrame join operation on condition that they share an identity column.
>>
>> Thanks
>> Yanbo
>>
>> 2016-08-16 20:39 GMT-07:00 ayan guha :
>>
>>> Hi
>>>
>>> Thank you for your reply. Yes, I can get prediction and original
>>> features together. My question is how to tie them back to other parts of
>>> the data, which was not in LP.
>>>
>>> For example, I have a bunch of other dimensions which are not part of
>>> features or label.
>>>
>>> Sorry if this is a stupid question.
>>>
>>> On Wed, Aug 17, 2016 at 12:57 PM, Yanbo Liang 
>>> wrote:
>>>
 MLlib will keep the original dataset during transformation, it just
 append new columns to existing DataFrame. That is you can get both
 prediction value and original features from the output DataFrame of
 model.transform.

 Thanks
 Yanbo

 2016-08-16 17:48 GMT-07:00 ayan guha :

> Hi
>
> I have a dataset as follows:
>
> DF:
> amount:float
> date_read:date
> meter_number:string
>
> I am trying to predict future amount based on past 3 weeks consumption
> (and a heaps of weather data related to date).
>
> My Labelpoint looks like
>
> label (populated from DF.amount)
> features (populated from a bunch of other stuff)
>
> Model.predict output:
> label
> prediction
>
> Now, I am trying to put together this prediction value back to meter
> number and date_read from original DF?
>
> One way to assume order of records in DF and Model.predict will be
> exactly same and zip two RDDs. But any other (possibly better) solution?
>
> --
> Best Regards,
> Ayan Guha
>


>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
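
A minimal sketch of the identity-column join Yanbo suggests: tag each row with
an id before the label/features are built, then join the predictions back to
the remaining dimensions (rawDF, model and the column names are hypothetical):

import org.apache.spark.sql.functions.monotonically_increasing_id

// Tag every row up front so nothing is lost when only label/features are kept.
val withId   = rawDF.withColumn("row_id", monotonically_increasing_id())
val training = withId.select("row_id", "label", "features")

// ... fit the model on `training`, then score ...
val predictions = model.transform(training)

// Re-attach meter_number, date_read, etc. by joining back on the id.
val result = predictions.join(withId.select("row_id", "meter_number", "date_read"), "row_id")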


Re: Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread janardhan shetty
I had already tried it this way:

scala> val featureCols = Array("category","newone")
featureCols: Array[String] = Array(category, newone)

scala>  val indexer = new
StringIndexer().setInputCol(featureCols).setOutputCol("categoryIndex").fit(df1)
:29: error: type mismatch;
 found   : Array[String]
 required: String
val indexer = new
StringIndexer().setInputCol(featureCols).setOutputCol("categoryIndex").fit(df1)


On Wed, Aug 17, 2016 at 10:56 AM, Nisha Muktewar <ni...@cloudera.com> wrote:

> I don't think it does. From the documentation:
> https://spark.apache.org/docs/2.0.0-preview/ml-features.html#onehotencoder,
> I see that it still accepts one column at a time.
>
> On Wed, Aug 17, 2016 at 10:18 AM, janardhan shetty <janardhan...@gmail.com
> > wrote:
>
>> 2.0:
>>
>> One hot encoding currently accepts single input column is there a way to
>> include multiple columns ?
>>
>
>
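
Since setInputCol takes a single column name, the usual pattern is one
StringIndexer/OneHotEncoder pair per column, generated programmatically and
chained in a Pipeline; a minimal sketch reusing the column names and df1 from
above:

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val featureCols = Array("category", "newone")

// One indexer + encoder per input column.
val stages: Array[PipelineStage] = featureCols.flatMap { c =>
  Seq[PipelineStage](
    new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"),
    new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec")
  )
}

val encoded = new Pipeline().setStages(stages).fit(df1).transform(df1)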


Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread janardhan shetty
Spark 2.0:

OneHotEncoder currently accepts a single input column; is there a way to
include multiple columns?


Re: Using spark package XGBoost

2016-08-14 Thread janardhan shetty
Any leads on how to achieve this?
On Aug 12, 2016 6:33 PM, "janardhan shetty" <janardhan...@gmail.com> wrote:

> I tried using  *sparkxgboost package *in build.sbt file but it failed.
> Spark 2.0
> Scala 2.11.8
>
> Error:
>  [warn]   http://dl.bintray.com/spark-packages/maven/
> rotationsymmetry/sparkxgboost/0.2.1-s_2.10/sparkxgboost-0.2.
> 1-s_2.10-javadoc.jar
>[warn] ::
>[warn] ::  FAILED DOWNLOADS::
>[warn] :: ^ see resolution messages for details  ^ ::
>[warn] ::
>[warn] :: rotationsymmetry#sparkxgboost;
> 0.2.1-s_2.10!sparkxgboost.jar(src)
>[warn] :: rotationsymmetry#sparkxgboost;
> 0.2.1-s_2.10!sparkxgboost.jar(doc)
>
> build.sbt:
>
> scalaVersion := "2.11.8"
>
> libraryDependencies ++= {
>   val sparkVersion = "2.0.0-preview"
>   Seq(
> "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
> "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
> "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
> "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
>   )
> }
>
>
>
> *resolvers += "Spark Packages Repo" at
> "http://dl.bintray.com/spark-packages/maven
> <http://dl.bintray.com/spark-packages/maven>"libraryDependencies +=
> "rotationsymmetry" % "sparkxgboost" % "0.2.1-s_2.10"*
>
> assemblyMergeStrategy in assembly := {
>   case PathList("META-INF", "MANIFEST.MF")   =>
> MergeStrategy.discard
>   case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
>   case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
>   case "application.conf"=>
> MergeStrategy.concat
>   case "unwanted.txt"=>
> MergeStrategy.discard
>
>   case x => val oldStrategy = (assemblyMergeStrategy in assembly).value
> oldStrategy(x)
>
> }
>
>
>
>
> On Fri, Aug 12, 2016 at 3:35 PM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Is there a dataframe version of XGBoost in spark-ml ?.
>> Has anyone used sparkxgboost package ?
>>
>
>


Re: Using spark package XGBoost

2016-08-12 Thread janardhan shetty
I tried using the *sparkxgboost* package in the build.sbt file but it failed.
Spark 2.0
Scala 2.11.8

Error:
 [warn]
http://dl.bintray.com/spark-packages/maven/rotationsymmetry/sparkxgboost/0.2.1-s_2.10/sparkxgboost-0.2.1-s_2.10-javadoc.jar
   [warn] ::
   [warn] ::  FAILED DOWNLOADS::
   [warn] :: ^ see resolution messages for details  ^ ::
   [warn] ::
   [warn] ::
rotationsymmetry#sparkxgboost;0.2.1-s_2.10!sparkxgboost.jar(src)
   [warn] ::
rotationsymmetry#sparkxgboost;0.2.1-s_2.10!sparkxgboost.jar(doc)

build.sbt:

scalaVersion := "2.11.8"

libraryDependencies ++= {
  val sparkVersion = "2.0.0-preview"
  Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
  )
}



resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"

libraryDependencies += "rotationsymmetry" % "sparkxgboost" % "0.2.1-s_2.10"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "MANIFEST.MF")           => MergeStrategy.discard
  case PathList("javax", "servlet", xs @ _*)         => MergeStrategy.first
  case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
  case "application.conf"                            => MergeStrategy.concat
  case "unwanted.txt"                                => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}




On Fri, Aug 12, 2016 at 3:35 PM, janardhan shetty <janardhan...@gmail.com>
wrote:

> Is there a dataframe version of XGBoost in spark-ml ?.
> Has anyone used sparkxgboost package ?
>


Using spark package XGBoost

2016-08-12 Thread janardhan shetty
Is there a DataFrame version of XGBoost in spark-ml?
Has anyone used the sparkxgboost package?


Re: Symbol HasInputCol is inaccesible from this place

2016-08-08 Thread janardhan shetty
Can some experts shed light on this one? I am still facing issues with
extending HasInputCol and DefaultParamsWritable.

On Mon, Aug 8, 2016 at 9:56 AM, janardhan shetty <janardhan...@gmail.com>
wrote:

> you mean is it deprecated ?
>
> On Mon, Aug 8, 2016 at 5:02 AM, Strange, Nick <nick.stra...@fmr.com>
> wrote:
>
>> What possible reason do they have to think its fragmentation?
>>
>>
>>
>> *From:* janardhan shetty [mailto:janardhan...@gmail.com]
>> *Sent:* Saturday, August 06, 2016 2:01 PM
>> *To:* Ted Yu
>> *Cc:* user
>> *Subject:* Re: Symbol HasInputCol is inaccesible from this place
>>
>>
>>
>> Yes seems like, wondering if this can be made public in order to develop
>> custom transformers or any other alternatives ?
>>
>>
>>
>> On Sat, Aug 6, 2016 at 10:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> Is it because HasInputCol is private ?
>>
>>
>>
>> private[ml] trait HasInputCol extends Params {
>>
>>
>>
>> On Thu, Aug 4, 2016 at 1:18 PM, janardhan shetty <janardhan...@gmail.com>
>> wrote:
>>
>> Version : 2.0.0-preview
>>
>>
>> import org.apache.spark.ml.param._
>> import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
>>
>>
>> class CustomTransformer(override val uid: String) extends Transformer
>> with HasInputCol with HasOutputCol with DefaultParamsWritableimport
>> org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
>> HasInputCol, HasOutputCol}
>>
>> *Error in IntelliJ *
>> Symbol HasInputCol is inaccessible from this place
>>
>>  similairly for HasOutputCol and DefaultParamsWritable
>>
>> Any thoughts on this error as it is not allowing the compile
>>
>>
>>
>>
>>
>>
>>
>
>


Re: Symbol HasInputCol is inaccesible from this place

2016-08-08 Thread janardhan shetty
Do you mean it is deprecated?

On Mon, Aug 8, 2016 at 5:02 AM, Strange, Nick <nick.stra...@fmr.com> wrote:

> What possible reason do they have to think its fragmentation?
>
>
>
> *From:* janardhan shetty [mailto:janardhan...@gmail.com]
> *Sent:* Saturday, August 06, 2016 2:01 PM
> *To:* Ted Yu
> *Cc:* user
> *Subject:* Re: Symbol HasInputCol is inaccesible from this place
>
>
>
> Yes seems like, wondering if this can be made public in order to develop
> custom transformers or any other alternatives ?
>
>
>
> On Sat, Aug 6, 2016 at 10:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> Is it because HasInputCol is private ?
>
>
>
> private[ml] trait HasInputCol extends Params {
>
>
>
> On Thu, Aug 4, 2016 at 1:18 PM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
> Version : 2.0.0-preview
>
>
> import org.apache.spark.ml.param._
> import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
>
>
> class CustomTransformer(override val uid: String) extends Transformer with
> HasInputCol with HasOutputCol with DefaultParamsWritableimport
> org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
> HasInputCol, HasOutputCol}
>
> *Error in IntelliJ *
> Symbol HasInputCol is inaccessible from this place
>
>  similairly for HasOutputCol and DefaultParamsWritable
>
> Any thoughts on this error as it is not allowing the compile
>
>
>
>
>
>
>


Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread janardhan shetty
Can you try the 'or' keyword instead?
On Aug 7, 2016 7:43 AM, "Divya Gehlot"  wrote:

> Hi,
> I have use case where I need to use or[||] operator in filter condition.
> It seems its not working its taking the condition before the operator and
> ignoring the other filter condition after or operator.
> As any body faced similar issue .
>
> Psuedo code :
> df.filter(col("colName").notEqual("no_value") ||
> col("colName").notEqual(""))
>
> Am I missing something.
> Would really appreciate the help.
>
>
> Thanks,
> Divya
>
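
For reference, a small sketch of both spellings on Column (df and colName as
in the pseudo code above). Note that `colName != "no_value" OR colName != ""`
is true for every non-null row, since a value cannot equal both constants, so
the intended logic here may actually be an AND:

import org.apache.spark.sql.functions.col

// Equivalent OR spellings on Column:
val f1 = df.filter(col("colName").notEqual("no_value") || col("colName").notEqual(""))
val f2 = df.filter(col("colName").notEqual("no_value").or(col("colName").notEqual("")))

// "Neither 'no_value' nor empty" needs AND:
val f3 = df.filter(col("colName").notEqual("no_value").and(col("colName").notEqual("")))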


Re: Symbol HasInputCol is inaccesible from this place

2016-08-06 Thread janardhan shetty
Yes, it seems so. I am wondering if this can be made public in order to
develop custom transformers, or whether there are any other alternatives?

On Sat, Aug 6, 2016 at 10:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Is it because HasInputCol is private ?
>
> private[ml] trait HasInputCol extends Params {
>
> On Thu, Aug 4, 2016 at 1:18 PM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Version : 2.0.0-preview
>>
>> import org.apache.spark.ml.param._
>> import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
>>
>>
>> class CustomTransformer(override val uid: String) extends Transformer
>> with HasInputCol with HasOutputCol with DefaultParamsWritableimport
>> org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
>> HasInputCol, HasOutputCol}
>>
>> *Error in IntelliJ *
>> Symbol HasInputCol is inaccessible from this place
>>  similairly for HasOutputCol and DefaultParamsWritable
>>
>> Any thoughts on this error as it is not allowing the compile
>>
>>
>>
>


Re: Symbol HasInputCol is inaccesible from this place

2016-08-06 Thread janardhan shetty
Any thoughts or suggestions on this error?

On Thu, Aug 4, 2016 at 1:18 PM, janardhan shetty <janardhan...@gmail.com>
wrote:

> Version : 2.0.0-preview
>
> import org.apache.spark.ml.param._
> import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
>
>
> class CustomTransformer(override val uid: String) extends Transformer with
> HasInputCol with HasOutputCol with DefaultParamsWritableimport
> org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
> HasInputCol, HasOutputCol}
>
> *Error in IntelliJ *
> Symbol HasInputCol is inaccessible from this place
>  similairly for HasOutputCol and DefaultParamsWritable
>
> Any thoughts on this error as it is not allowing the compile
>
>
>


Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread janardhan shetty
Mike,

Any suggestions on doing it for consecutive IDs?
On Aug 5, 2016 9:08 AM, "Tony Lane"  wrote:

> Mike.
>
> I have figured how to do this .  Thanks for the suggestion. It works
> great.  I am trying to figure out the performance impact of this.
>
> thanks again
>
>
> On Fri, Aug 5, 2016 at 9:25 PM, Tony Lane  wrote:
>
>> @mike  - this looks great. How can i do this in java ?   what is the
>> performance implication on a large dataset  ?
>>
>> @sonal  - I can't have a collision in the values.
>>
>> On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger 
>> wrote:
>>
>>> You can use the monotonically_increasing_id method to generate
>>> guaranteed unique (but not necessarily consecutive) IDs.  Calling something
>>> like:
>>>
>>> df.withColumn("id", monotonically_increasing_id())
>>>
>>> You don't mention which language you're using but you'll need to pull in
>>> the sql.functions library.
>>>
>>> Mike
>>>
>>> On Aug 5, 2016, at 9:11 AM, Tony Lane  wrote:
>>>
>>> Ayan - basically i have a dataset with structure, where bid are unique
>>> string values
>>>
>>> bid: String
>>> val : integer
>>>
>>> I need unique int values for these string bid''s to do some processing
>>> in the dataset
>>>
>>> like
>>>
>>> id:int   (unique integer id for each bid)
>>> bid:String
>>> val:integer
>>>
>>>
>>>
>>> -Tony
>>>
>>> On Fri, Aug 5, 2016 at 6:35 PM, ayan guha  wrote:
>>>
 Hi

 Can you explain a little further?

 best
 Ayan

 On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane 
 wrote:

> I have a row with structure like
>
> identifier: String
> value: int
>
> All identifier are unique and I want to generate a unique long id for
> the data and get a row object back for further processing.
>
> I understand using the zipWithUniqueId function on RDD, but that would
> mean first converting to RDD and then joining back the RDD and dataset
>
> What is the best way to do this ?
>
> -Tony
>
>


 --
 Best Regards,
 Ayan Guha

>>>
>>>
>>
>
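
Two sketches for the consecutive-id follow-up; both cost more than
monotonically_increasing_id (a full pass or a single-partition window), which
is the price of gap-free ids. `df`, `bid` and the SparkSession `spark` are
taken from the discussion above:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Option 1: zipWithIndex on the underlying RDD, then rebuild the DataFrame with an id column.
val withIndex = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
val schema    = StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
val indexedDF = spark.createDataFrame(withIndex, schema)

// Option 2: row_number over a window (no partitioning, so all rows go to one
// partition -- only reasonable for small data).
val numbered = df.withColumn("id", row_number().over(Window.orderBy("bid")))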


Symbol HasInputCol is inaccesible from this place

2016-08-04 Thread janardhan shetty
Version : 2.0.0-preview

import org.apache.spark.ml.param._
import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}


class CustomTransformer(override val uid: String) extends Transformer
  with HasInputCol with HasOutputCol with DefaultParamsWritable

*Error in IntelliJ*
Symbol HasInputCol is inaccessible from this place
 similarly for HasOutputCol and DefaultParamsWritable

Any thoughts on this error? It is not allowing the code to compile.
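
Since HasInputCol/HasOutputCol are private[ml], one workaround is to declare
equivalent Params directly on the transformer; a minimal sketch (the
upper-casing transform is just a placeholder):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class CustomTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("customTransformer"))

  // Our own input/output column params instead of the private shared traits.
  final val inputCol  = new Param[String](this, "inputCol", "input column name")
  final val outputCol = new Param[String](this, "outputCol", "output column name")
  def setInputCol(value: String): this.type  = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn($(outputCol), upper(col($(inputCol))))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField($(outputCol), StringType, nullable = true))

  override def copy(extra: ParamMap): CustomTransformer = defaultCopy(extra)
}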


Re: decribe function limit of columns

2016-08-02 Thread janardhan shetty
If you are referring to limiting the number of columns, you can select the
columns and then describe:

df.select("col1", "col2").describe().show()

On Tue, Aug 2, 2016 at 6:39 AM, pseudo oduesp  wrote:

> Hi
>  in spark 1.5.0  i used  descibe function with more than 100 columns .
> someone can tell me if any limit exsiste now ?
>
> thanks
>
>


Re: Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-08-01 Thread janardhan shetty
What is the difference between the UnaryTransformer and Transformer classes?
In which scenarios should we use one or the other?

On Sun, Jul 31, 2016 at 8:27 PM, janardhan shetty <janardhan...@gmail.com>
wrote:

> Developing in scala but any help with difference between UnaryTransformer
> (Is this experimental still ?)and Transformer class is appreciated.
>
> Right now encountering  error for the code which extends UnaryTransformer
>
> override protected def outputDataType: DataType = new StringType
>
> Error:(26, 53) constructor StringType in class StringType cannot be accessed 
> in class Capitalizer
>   override protected def outputDataType: DataType = new StringType
> ^
>
>
>
> On Thu, Jul 28, 2016 at 8:20 PM, Phuong LE-HONG <phuon...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I've developed a simple ML estimator (in Java) that implements
>> conditional Markov model for sequence labelling in Vitk toolkit. You
>> can check it out here:
>>
>>
>> https://github.com/phuonglh/vn.vitk/blob/master/src/main/java/vn/vitk/tag/CMM.java
>>
>> Phuong Le-Hong
>>
>> On Fri, Jul 29, 2016 at 9:01 AM, janardhan shetty
>> <janardhan...@gmail.com> wrote:
>> > Thanks Steve.
>> >
>> > Any pointers to custom estimators development as well ?
>> >
>> > On Wed, Jul 27, 2016 at 11:35 AM, Steve Rowe <sar...@gmail.com> wrote:
>> >>
>> >> You can see the source for my transformer configurable bridge to Lucene
>> >> analysis components here, in my company Lucidworks’ spark-solr project:
>> >> <
>> https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/ml/feature/LuceneTextAnalyzerTransformer.scala
>> >.
>> >>
>> >> Here’s a blog I wrote about using this transformer, as well as
>> >> non-ML-context use in Spark of the underlying analysis component, here:
>> >> <https://lucidworks.com/blog/2016/04/13/spark-solr-lucenetextanalyzer/
>> >.
>> >>
>> >> --
>> >> Steve
>> >> www.lucidworks.com
>> >>
>> >> > On Jul 27, 2016, at 1:31 PM, janardhan shetty <
>> janardhan...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > 1.  Any links or blogs to develop custom transformers ? ex: Tokenizer
>> >> >
>> >> > 2. Any links or blogs to develop custom estimators ? ex: any ml
>> >> > algorithm
>> >>
>> >
>>
>
>
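
On the `new StringType` error quoted above: StringType in spark.sql.types has
a private constructor and is meant to be used as the StringType singleton, so
returning the object (no `new`) compiles. A minimal UnaryTransformer sketch
along those lines, with a placeholder upper-casing function:

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// UnaryTransformer already carries inputCol/outputCol for one-column-in, one-column-out cases.
class Capitalizer(override val uid: String)
  extends UnaryTransformer[String, String, Capitalizer] {

  def this() = this(Identifiable.randomUID("capitalizer"))

  override protected def createTransformFunc: String => String = _.toUpperCase

  // Use the StringType object directly instead of `new StringType`.
  override protected def outputDataType: DataType = StringType
}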


Re: Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-07-31 Thread janardhan shetty
I am developing in Scala, but any help with the difference between the
UnaryTransformer (is this still experimental?) and Transformer classes is
appreciated.

Right now I am encountering an error for code which extends UnaryTransformer:

override protected def outputDataType: DataType = new StringType

Error:(26, 53) constructor StringType in class StringType cannot be
accessed in class Capitalizer
  override protected def outputDataType: DataType = new StringType
^



On Thu, Jul 28, 2016 at 8:20 PM, Phuong LE-HONG <phuon...@gmail.com> wrote:

> Hi,
>
> I've developed a simple ML estimator (in Java) that implements
> conditional Markov model for sequence labelling in Vitk toolkit. You
> can check it out here:
>
>
> https://github.com/phuonglh/vn.vitk/blob/master/src/main/java/vn/vitk/tag/CMM.java
>
> Phuong Le-Hong
>
> On Fri, Jul 29, 2016 at 9:01 AM, janardhan shetty
> <janardhan...@gmail.com> wrote:
> > Thanks Steve.
> >
> > Any pointers to custom estimators development as well ?
> >
> > On Wed, Jul 27, 2016 at 11:35 AM, Steve Rowe <sar...@gmail.com> wrote:
> >>
> >> You can see the source for my transformer configurable bridge to Lucene
> >> analysis components here, in my company Lucidworks’ spark-solr project:
> >> <
> https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/ml/feature/LuceneTextAnalyzerTransformer.scala
> >.
> >>
> >> Here’s a blog I wrote about using this transformer, as well as
> >> non-ML-context use in Spark of the underlying analysis component, here:
> >> <https://lucidworks.com/blog/2016/04/13/spark-solr-lucenetextanalyzer/
> >.
> >>
> >> --
> >> Steve
> >> www.lucidworks.com
> >>
> >> > On Jul 27, 2016, at 1:31 PM, janardhan shetty <janardhan...@gmail.com
> >
> >> > wrote:
> >> >
> >> > 1.  Any links or blogs to develop custom transformers ? ex: Tokenizer
> >> >
> >> > 2. Any links or blogs to develop custom estimators ? ex: any ml
> >> > algorithm
> >>
> >
>


Re: Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-07-28 Thread janardhan shetty
Thanks Steve.

Any pointers to custom estimator development as well?

On Wed, Jul 27, 2016 at 11:35 AM, Steve Rowe <sar...@gmail.com> wrote:

> You can see the source for my transformer configurable bridge to Lucene
> analysis components here, in my company Lucidworks’ spark-solr project: <
> https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/ml/feature/LuceneTextAnalyzerTransformer.scala
> >.
>
> Here’s a blog I wrote about using this transformer, as well as
> non-ML-context use in Spark of the underlying analysis component, here: <
> https://lucidworks.com/blog/2016/04/13/spark-solr-lucenetextanalyzer/>.
>
> --
> Steve
> www.lucidworks.com
>
> > On Jul 27, 2016, at 1:31 PM, janardhan shetty <janardhan...@gmail.com>
> wrote:
> >
> > 1.  Any links or blogs to develop custom transformers ? ex: Tokenizer
> >
> > 2. Any links or blogs to develop custom estimators ? ex: any ml algorithm
>
>


Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread janardhan shetty
It seems like the Parquet format compares favorably to ORC when the dataset
is log data without nested structures? Is this a fair understanding?
On Jul 27, 2016 1:30 PM, "Jörn Franke" <jornfra...@gmail.com> wrote:

> Kudu has been from my impression be designed to offer somethings between
> hbase and parquet for write intensive loads - it is not faster for
> warehouse type of querying compared to parquet (merely slower, because that
> is not its use case).   I assume this is still the strategy of it.
>
> For some scenarios it could make sense together with parquet and Orc.
> However I am not sure what the advantage towards using hbase + parquet and
> Orc.
>
> On 27 Jul 2016, at 11:47, "u...@moosheimer.com <u...@moosheimer.com>" <
> u...@moosheimer.com <u...@moosheimer.com>> wrote:
>
> Hi Gourav,
>
> Kudu (if you mean Apache Kuda, the Cloudera originated project) is a in
> memory db with data storage while Parquet is "only" a columnar
> storage format.
>
> As I understand, Kudu is a BI db to compete with Exasol or Hana (ok ...
> that's more a wish :-).
>
> Regards,
> Uwe
>
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
>
> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <gourav.sengu...@gmail.com
> >:
>
> Gosh,
>
> whether ORC came from this or that, it runs queries in HIVE with TEZ at a
> speed that is better than SPARK.
>
> Has anyone heard of KUDA? Its better than Parquet. But I think that
> someone might just start saying that KUDA has difficult lineage as well.
> After all dynastic rules dictate.
>
> Personally I feel that if something stores my data compressed and makes me
> access it faster I do not care where it comes from or how difficult the
> child birth was :)
>
>
> Regards,
> Gourav
>
> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <
> sbpothin...@gmail.com> wrote:
>
>> Just correction:
>>
>> ORC Java libraries from Hive are forked into Apache ORC. Vectorization
>> default.
>>
>> Do not know If Spark leveraging this new repo?
>>
>> 
>>  org.apache.orc
>> orc
>> 1.1.2
>> pom
>> 
>>
>>
>>
>>
>>
>>
>>
>>
>> Sent from my iPhone
>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> parquet was inspired by dremel but written from the ground up as a
>> library with support for a variety of big data systems (hive, pig, impala,
>> cascading, etc.). it is also easy to add new support, since its a proper
>> library.
>>
>> orc bas been enhanced while deployed at facebook in hive and at yahoo in
>> hive. just hive. it didn't really exist by itself. it was part of the big
>> java soup that is called hive, without an easy way to extract it. hive does
>> not expose proper java apis. it never cared for that.
>>
>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> Interesting opinion, thank you
>>>
>>> Still, on the website parquet is basically inspired by Dremel (Google)
>>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo
>>> [2].
>>>
>>> Other than this presentation [3], do you guys know any other benchmark?
>>>
>>> [1]https://parquet.apache.org/documentation/latest/
>>> [2]https://orc.apache.org/docs/
>>> [3]
>>> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>
>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> when parquet came out it was developed by a community of companies, and
>>> was designed as a library to be supported by multiple big data projects.
>>> nice
>>>
>>> orc on the other hand initially only supported hive. it wasn't even
>>> designed as a library that can be re-used. even today it brings in the
>>> kitchen sink of transitive dependencies. yikes
>>>
>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>
>>>> I think both are very similar, but with slightly different goals. While
>>>> they work transparently for each Hadoop application you need to enable
>>>> specific support in the application for predicate push down.
>>>> In the end you have to check which application you are using and do
>>>> some tests (with correct predicate push down configuration). Keep in mind
>>>> that both formats work best if they are sorted.

Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-07-27 Thread janardhan shetty
1.  Any links or blogs to develop *custom* transformers ? ex: Tokenizer

2. Any links or blogs to develop *custom* estimators ? ex: any ml algorithm


Re: Maintaining order of pair rdd

2016-07-26 Thread janardhan shetty
Let me provide step-wise details:

1.
I have an RDD  = {
(ID2,18159) - *element 1  *
(ID1,18159) - *element 2*
(ID3,18159) - *element 3*
(ID2,36318) - *element 4 *
(ID1,36318) - *element 5*
(ID3,36318)
(ID2,54477)
(ID1,54477)
(ID3,54477)
}

2. RDD.groupByKey().mapValues(v => v.toArray())

Array(
(ID1,Array(*18159*, 308703, 72636, 64544, 39244, 107937, *54477*, 145272,
100079, *36318*, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
45431, 100136)),
(ID3,Array(100079, 19622, *18159*, 212064, 107937, 44683, 150022, 39244,
100136, 58866, 72636, 145272, 817, 89366, * 54477*, *36318*, 308703,
160992, 45431, 162076)),
(ID2,Array(308703, * 54477*, 89366, 39244, 150022, 72636, 817, 58866,
44683, 19622, 160992, 107937, 100079, 100136, 145272, 64544, *18159*,
45431, *36318*, 162076))
)


whereas in step 2 I need the output below:

Array(
(ID1,Array(*18159*,*36318*, *54477,...*)),
(ID3,Array(*18159*,*36318*, *54477, ...*)),
(ID2,Array(*18159*,*36318*, *54477, ...*))
)

Does this help ?

On Tue, Jul 26, 2016 at 2:25 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Apologies janardhan, i always get confused on this
> Ok. so you have a  (key, val) RDD (val is irrelevant here)
>
> then you can do this
> val reduced = myRDD.reduceByKey((first, second) => first  ++ second)
>
> val sorted = reduced.sortBy(tpl => tpl._1)
>
> hth
>
>
>
> On Tue, Jul 26, 2016 at 3:31 AM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> groupBy is a shuffle operation and index is already lost in this process
>> if I am not wrong and don't see *sortWith* operation on RDD.
>>
>> Any suggestions or help ?
>>
>> On Mon, Jul 25, 2016 at 12:58 AM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> Hi
>>>  after you do a groupBy you should use a sortWith.
>>> Basically , a groupBy reduces your structure to (anyone correct me if i
>>> m wrong) a RDD[(key,val)], which you can see as a tuple.so you could
>>> use sortWith (or sortBy, cannot remember which one) (tpl=> tpl._1)
>>> hth
>>>
>>> On Mon, Jul 25, 2016 at 1:21 AM, janardhan shetty <
>>> janardhan...@gmail.com> wrote:
>>>
>>>> Thanks Marco. This solved the order problem. Had another question which
>>>> is prefix to this.
>>>>
>>>> As you can see below ID2,ID1 and ID3 are in order and I need to
>>>> maintain this index order as well. But when we do groupByKey 
>>>> operation(*rdd.distinct.groupByKey().mapValues(v
>>>> => v.toArray*))
>>>> everything is *jumbled*.
>>>> Is there any way we can maintain this order as well ?
>>>>
>>>> scala> RDD.foreach(println)
>>>> (ID2,18159)
>>>> (ID1,18159)
>>>> (ID3,18159)
>>>>
>>>> (ID2,18159)
>>>> (ID1,18159)
>>>> (ID3,18159)
>>>>
>>>> (ID2,36318)
>>>> (ID1,36318)
>>>> (ID3,36318)
>>>>
>>>> (ID2,54477)
>>>> (ID1,54477)
>>>> (ID3,54477)
>>>>
>>>> *Jumbled version : *
>>>> Array(
>>>> (ID1,Array(*18159*, 308703, 72636, 64544, 39244, 107937, *54477*,
>>>> 145272, 100079, *36318*, 160992, 817, 89366, 150022, 19622, 44683,
>>>> 58866, 162076, 45431, 100136)),
>>>> (ID3,Array(100079, 19622, *18159*, 212064, 107937, 44683, 150022,
>>>> 39244, 100136, 58866, 72636, 145272, 817, 89366, * 54477*, *36318*,
>>>> 308703, 160992, 45431, 162076)),
>>>> (ID2,Array(308703, * 54477*, 89366, 39244, 150022, 72636, 817, 58866,
>>>> 44683, 19622, 160992, 107937, 100079, 100136, 145272, 64544, *18159*,
>>>> 45431, *36318*, 162076))
>>>> )
>>>>
>>>> *Expected output:*
>>>> Array(
>>>> (ID1,Array(*18159*,*36318*, *54477,...*)),
>>>> (ID3,Array(*18159*,*36318*, *54477, ...*)),
>>>> (ID2,Array(*18159*,*36318*, *54477, ...*))
>>>> )
>>>>
>>>> As you can see after *groupbyKey* operation is complete item 18519 is
>>>> in index 0 for ID1, index 2 for ID3 and index 16 for ID2 where as expected
>>>> is index 0
>>>>
>>>>
>>>> On Sun, Jul 24, 2016 at 12:43 PM, Marco Mistroni <mmistr...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello
>>>>>  Uhm you have an array containing 3 tuples?
>>>>> If all the arrays have same length, you can just zip all of them,
>>>>> creatings a list of tuples
>>>>> then you can scan the list 5 by 5...?
>>
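
One sketch for carrying the arrival order through the shuffle: tag every
record with its position before grouping, keep the position through
groupByKey, and sort on it inside each key (rdd is the (ID, value) RDD from
step 1; the trailing distinct de-duplicates while keeping first occurrences):

// rdd: RDD[(String, Int)] in arrival order, e.g. (ID2,18159), (ID1,18159), (ID3,18159), ...
val ordered = rdd
  .zipWithIndex()                                          // ((id, value), position)
  .map { case ((id, value), pos) => (id, (pos, value)) }
  .groupByKey()
  .mapValues(_.toArray.sortBy(_._1).map(_._2).distinct)    // sort by position, drop it, dedupe

// ordered: RDD[(String, Array[Int])], each array starting 18159, 36318, 54477, ...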

Re: ORC v/s Parquet for Spark 2.0

2016-07-25 Thread janardhan shetty
Thanks Timur for the explanation.
What about if the data is log data which is delimited (CSV or TSV), doesn't
have much nesting, and is in plain file formats?

On Mon, Jul 25, 2016 at 7:38 PM, Timur Shenkao <t...@timshenkao.su> wrote:

> 1) The opinions on StackOverflow are correct, not biased.
> 2) Cloudera promoted Parquet, Hortonworks - ORC + Tez. When it became
> obvious that just file format is not enough and Impala sucks, then Cloudera
> announced https://vision.cloudera.com/one-platform/ and focused on Spark
> 3) There is a race between ORC & Parquet: after some perfect release ORC
> becomes better & faster, then, several months later, Parquet may outperform.
> 4) If you use "flat" tables --> ORC is better. If you have highly nested
> data with arrays inside of dictionaries (for instance, json that isn't
> flattened) then may be one should choose Parquet
> 5) AFAIK, Parquet has its metadata at the end of the file (correct me if
> something has changed) . It means that Parquet file must be completely read
> & put into RAM. If there is no enough RAM or file somehow is corrupted -->
> problems arise
>
> On Tue, Jul 26, 2016 at 5:09 AM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Just wondering advantages and disadvantages to convert data into ORC or
>> Parquet.
>>
>> In the documentation of Spark there are numerous examples of Parquet
>> format.
>>
>> Any strong reasons to chose Parquet over ORC file format ?
>>
>> Also : current data compression is bzip2
>>
>>
>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>> This seems like biased.
>>
>
>


Re: Maintaining order of pair rdd

2016-07-25 Thread janardhan shetty
groupBy is a shuffle operation, and the index is already lost in this
process if I am not wrong; I also don't see a *sortWith* operation on RDD.

Any suggestions or help ?

On Mon, Jul 25, 2016 at 12:58 AM, Marco Mistroni <mmistr...@gmail.com>
wrote:

> Hi
>  after you do a groupBy you should use a sortWith.
> Basically , a groupBy reduces your structure to (anyone correct me if i m
> wrong) a RDD[(key,val)], which you can see as a tuple.so you could use
> sortWith (or sortBy, cannot remember which one) (tpl=> tpl._1)
> hth
>
> On Mon, Jul 25, 2016 at 1:21 AM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
>> Thanks Marco. This solved the order problem. Had another question which
>> is prefix to this.
>>
>> As you can see below ID2,ID1 and ID3 are in order and I need to maintain
>> this index order as well. But when we do groupByKey 
>> operation(*rdd.distinct.groupByKey().mapValues(v
>> => v.toArray*))
>> everything is *jumbled*.
>> Is there any way we can maintain this order as well ?
>>
>> scala> RDD.foreach(println)
>> (ID2,18159)
>> (ID1,18159)
>> (ID3,18159)
>>
>> (ID2,18159)
>> (ID1,18159)
>> (ID3,18159)
>>
>> (ID2,36318)
>> (ID1,36318)
>> (ID3,36318)
>>
>> (ID2,54477)
>> (ID1,54477)
>> (ID3,54477)
>>
>> *Jumbled version : *
>> Array(
>> (ID1,Array(*18159*, 308703, 72636, 64544, 39244, 107937, *54477*,
>> 145272, 100079, *36318*, 160992, 817, 89366, 150022, 19622, 44683,
>> 58866, 162076, 45431, 100136)),
>> (ID3,Array(100079, 19622, *18159*, 212064, 107937, 44683, 150022, 39244,
>> 100136, 58866, 72636, 145272, 817, 89366, * 54477*, *36318*, 308703,
>> 160992, 45431, 162076)),
>> (ID2,Array(308703, * 54477*, 89366, 39244, 150022, 72636, 817, 58866,
>> 44683, 19622, 160992, 107937, 100079, 100136, 145272, 64544, *18159*,
>> 45431, *36318*, 162076))
>> )
>>
>> *Expected output:*
>> Array(
>> (ID1,Array(*18159*,*36318*, *54477,...*)),
>> (ID3,Array(*18159*,*36318*, *54477, ...*)),
>> (ID2,Array(*18159*,*36318*, *54477, ...*))
>> )
>>
>> As you can see after *groupbyKey* operation is complete item 18519 is in
>> index 0 for ID1, index 2 for ID3 and index 16 for ID2 where as expected is
>> index 0
>>
>>
>> On Sun, Jul 24, 2016 at 12:43 PM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> Hello
>>>  Uhm you have an array containing 3 tuples?
>>> If all the arrays have same length, you can just zip all of them,
>>> creatings a list of tuples
>>> then you can scan the list 5 by 5...?
>>>
>>> so something like
>>>
>>> (Array(0)_2,Array(1)._2,Array(2)._2).zipped.toList
>>>
>>> this will give you a list of tuples of 3 elements containing each items
>>> from ID1, ID2 and ID3  ... sample below
>>> res: List((18159,100079,308703), (308703, 19622, 54477), (72636,18159,
>>> 89366)..)
>>>
>>> then you can use a recursive function to compare each element such as
>>>
>>> def iterate(lst:List[(Int, Int, Int)]):T = {
>>> if (lst.isEmpty): /// return your comparison
>>> else {
>>>  val splits = lst.splitAt(5)
>>>  // do sometjhing about it using splits._1
>>>  iterate(splits._2)
>>>}
>>>
>>> will this help? or am i still missing something?
>>>
>>> kr
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 24 Jul 2016 5:52 pm, "janardhan shetty" <janardhan...@gmail.com>
>>> wrote:
>>>
>>>> Array(
>>>> (ID1,Array(18159, 308703, 72636, 64544, 39244, 107937, 54477, 145272,
>>>> 100079, 36318, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
>>>> 45431, 100136)),
>>>> (ID3,Array(100079, 19622, 18159, 212064, 107937, 44683, 150022, 39244,
>>>> 100136, 58866, 72636, 145272, 817, 89366, 54477, 36318, 308703, 160992,
>>>> 45431, 162076)),
>>>> (ID2,Array(308703, 54477, 89366, 39244, 150022, 72636, 817, 58866,
>>>> 44683, 19622, 160992, 107937, 100079, 100136, 145272, 64544, 18159, 45431,
>>>> 36318, 162076))
>>>> )
>>>>
>>>> I need to compare first 5 elements of ID1 with first five element of
>>>> ID3  next first 5 elements of ID1 to ID2. Similarly next 5 elements in that
>>>> order until the end of number of element

ORC v/s Parquet for Spark 2.0

2016-07-25 Thread janardhan shetty
Just wondering about the advantages and disadvantages of converting data
into ORC or Parquet.

In the documentation of Spark there are numerous examples of Parquet
format.

Any strong reasons to choose Parquet over the ORC file format?

Also : current data compression is bzip2

http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
This seems biased.


Re: Bzip2 to Parquet format

2016-07-25 Thread janardhan shetty
Andrew,

2.0

I tried
val inputR = sc.textFile(file)
val inputS = inputR.map(x => x.split("`"))
val inputDF = inputS.toDF()

inputDF.write.format("parquet").save("result.parquet")

The result part files end with *.snappy.parquet*; is that expected ?

On Sun, Jul 24, 2016 at 8:00 PM, Andrew Ehrlich <and...@aehrlich.com> wrote:

> You can load the text with sc.textFile() to an RDD[String], then use
> .map() to convert it into an RDD[Row]. At this point you are ready to
> apply a schema. Use sqlContext.createDataFrame(rddOfRow, structType)
>
> Here is an example on how to define the StructType (schema) that you will
> combine with the RDD[Row] to create a DataFrame.
>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType
>
> Once you have the DataFrame, save it to parquet with
> dataframe.save(“/path”) to create a parquet file.
>
> Reference for SQLContext / createDataFrame:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
>
>
>
> On Jul 24, 2016, at 5:34 PM, janardhan shetty <janardhan...@gmail.com>
> wrote:
>
> We have data in Bz2 compression format. Any links in Spark to convert into
> Parquet and also performance benchmarks and uses study materials ?
>
>
>
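
On the *.snappy.parquet* suffix: snappy appears to be the default Parquet
codec here, so that naming is expected. A sketch of switching codecs, either
session-wide or per write (gzip is just an example; the per-write option
depends on the Spark version):

// Session-wide (assuming a SparkSession named `spark`):
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

// Per write:
inputDF.write.option("compression", "gzip").parquet("result.parquet")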


Bzip2 to Parquet format

2016-07-24 Thread janardhan shetty
We have data in bzip2 compression format. Any links on converting it into
Parquet in Spark, and also performance benchmarks and study materials?


K-means Evaluation metrics

2016-07-24 Thread janardhan shetty
Hi,

I was trying to evaluate a k-means clustering prediction, since the exact
cluster numbers were provided beforehand for each data point.
I just tried Error = (predicted cluster number - given number) as a
brute-force method.

What evaluation metrics are available in Spark to validate and improve
k-means clustering?
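
For what it's worth, the ML KMeansModel exposes the within-cluster sum of
squared errors via computeCost, which is the usual internal metric for tuning
k; a small sketch (dataset and column names are hypothetical). Also note that
predicted cluster ids are arbitrary labels, so subtracting them from the given
cluster numbers is not meaningful without first matching clusters to labels:

import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans().setK(5).setSeed(1L).setFeaturesCol("features")
val model  = kmeans.fit(dataset)

// Within Set Sum of Squared Errors: lower means tighter clusters; compare across values of k.
val wssse = model.computeCost(dataset)
println(s"WSSSE = $wssse")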


Re: Maintaining order of pair rdd

2016-07-24 Thread janardhan shetty
Thanks Marco. This solved the order problem. I had another question which is
a precursor to this.

As you can see below, ID2, ID1 and ID3 are in order, and I need to maintain
this index order as well. But when we do the groupByKey operation
(*rdd.distinct.groupByKey().mapValues(v => v.toArray)*)
everything is *jumbled*.
Is there any way we can maintain this order as well?

scala> RDD.foreach(println)
(ID2,18159)
(ID1,18159)
(ID3,18159)

(ID2,18159)
(ID1,18159)
(ID3,18159)

(ID2,36318)
(ID1,36318)
(ID3,36318)

(ID2,54477)
(ID1,54477)
(ID3,54477)

*Jumbled version : *
Array(
(ID1,Array(*18159*, 308703, 72636, 64544, 39244, 107937, *54477*, 145272,
100079, *36318*, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
45431, 100136)),
(ID3,Array(100079, 19622, *18159*, 212064, 107937, 44683, 150022, 39244,
100136, 58866, 72636, 145272, 817, 89366, * 54477*, *36318*, 308703,
160992, 45431, 162076)),
(ID2,Array(308703, * 54477*, 89366, 39244, 150022, 72636, 817, 58866,
44683, 19622, 160992, 107937, 100079, 100136, 145272, 64544, *18159*,
45431, *36318*, 162076))
)

*Expected output:*
Array(
(ID1,Array(*18159*,*36318*, *54477,...*)),
(ID3,Array(*18159*,*36318*, *54477, ...*)),
(ID2,Array(*18159*,*36318*, *54477, ...*))
)

As you can see, after the *groupByKey* operation is complete, item 18159 is
at index 0 for ID1, index 2 for ID3 and index 16 for ID2, whereas index 0 is
expected for all of them.


On Sun, Jul 24, 2016 at 12:43 PM, Marco Mistroni <mmistr...@gmail.com>
wrote:

> Hello
>  Uhm you have an array containing 3 tuples?
> If all the arrays have same length, you can just zip all of them,
> creatings a list of tuples
> then you can scan the list 5 by 5...?
>
> so something like
>
> (Array(0)_2,Array(1)._2,Array(2)._2).zipped.toList
>
> this will give you a list of tuples of 3 elements containing each items
> from ID1, ID2 and ID3  ... sample below
> res: List((18159,100079,308703), (308703, 19622, 54477), (72636,18159,
> 89366)..)
>
> then you can use a recursive function to compare each element such as
>
> def iterate(lst:List[(Int, Int, Int)]):T = {
> if (lst.isEmpty): /// return your comparison
> else {
>  val splits = lst.splitAt(5)
>  // do sometjhing about it using splits._1
>  iterate(splits._2)
>}
>
> will this help? or am i still missing something?
>
> kr
>
>
>
>
>
>
>
>
>
>
>
>
> On 24 Jul 2016 5:52 pm, "janardhan shetty" <janardhan...@gmail.com> wrote:
>
>> Array(
>> (ID1,Array(18159, 308703, 72636, 64544, 39244, 107937, 54477, 145272,
>> 100079, 36318, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
>> 45431, 100136)),
>> (ID3,Array(100079, 19622, 18159, 212064, 107937, 44683, 150022, 39244,
>> 100136, 58866, 72636, 145272, 817, 89366, 54477, 36318, 308703, 160992,
>> 45431, 162076)),
>> (ID2,Array(308703, 54477, 89366, 39244, 150022, 72636, 817, 58866, 44683,
>> 19622, 160992, 107937, 100079, 100136, 145272, 64544, 18159, 45431, 36318,
>> 162076))
>> )
>>
>> I need to compare first 5 elements of ID1 with first five element of ID3
>> next first 5 elements of ID1 to ID2. Similarly next 5 elements in that
>> order until the end of number of elements.
>> Let me know if this helps
>>
>>
>> On Sun, Jul 24, 2016 at 7:45 AM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> Apologies I misinterpreted could you post two use cases?
>>> Kr
>>>
>>> On 24 Jul 2016 3:41 pm, "janardhan shetty" <janardhan...@gmail.com>
>>> wrote:
>>>
>>>> Marco,
>>>>
>>>> Thanks for the response. It is indexed order and not ascending or
>>>> descending order.
>>>> On Jul 24, 2016 7:37 AM, "Marco Mistroni" <mmistr...@gmail.com> wrote:
>>>>
>>>>> Use map values to transform to an rdd where values are sorted?
>>>>> Hth
>>>>>
>>>>> On 24 Jul 2016 6:23 am, "janardhan shetty" <janardhan...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I have a key,value pair rdd where value is an array of Ints. I need
>>>>>> to maintain the order of the value in order to execute downstream
>>>>>> modifications. How do we maintain the order of values?
>>>>>> Ex:
>>>>>> rdd = (id1,[5,2,3,15],
>>>>>> Id2,[9,4,2,5])
>>>>>>
>>>>>> Followup question how do we compare between one element in rdd with
>>>>>> all other elements ?
>>>>>>
>>>>>
>>


Frequent Item Pattern Spark ML Dataframes

2016-07-24 Thread janardhan shetty
Is there any implementation of FP-Growth and association rules for Spark
DataFrames?
We have it for RDDs, but are there any pointers for DataFrames?
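
At this point the FP-Growth implementation lives in spark.mllib (RDD-based),
so one sketch is to drive it from a DataFrame with an array-of-items column
and bring the results back (df, the "items" column and the thresholds are
hypothetical; assumes a SparkSession `spark`):

import org.apache.spark.mllib.fpm.FPGrowth

// df has an array-of-strings column "items" (e.g. built with collect_list or split).
val transactions = df.select("items").rdd.map(_.getSeq[String](0).toArray)

val model = new FPGrowth().setMinSupport(0.2).setNumPartitions(8).run(transactions)

// Frequent itemsets and association rules back as DataFrames for further work.
import spark.implicits._
val freqItemsets = model.freqItemsets
  .map(is => (is.items.mkString(","), is.freq))
  .toDF("itemset", "freq")
val rules = model.generateAssociationRules(0.8)
  .map(r => (r.antecedent.mkString(","), r.consequent.mkString(","), r.confidence))
  .toDF("antecedent", "consequent", "confidence")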


Re: Maintaining order of pair rdd

2016-07-24 Thread janardhan shetty
Array(
(ID1,Array(18159, 308703, 72636, 64544, 39244, 107937, 54477, 145272,
100079, 36318, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
45431, 100136)),
(ID3,Array(100079, 19622, 18159, 212064, 107937, 44683, 150022, 39244,
100136, 58866, 72636, 145272, 817, 89366, 54477, 36318, 308703, 160992,
45431, 162076)),
(ID2,Array(308703, 54477, 89366, 39244, 150022, 72636, 817, 58866, 44683,
19622, 160992, 107937, 100079, 100136, 145272, 64544, 18159, 45431, 36318,
162076))
)

I need to compare the first 5 elements of ID1 with the first 5 elements of
ID3, then the first 5 elements of ID1 with ID2. Similarly for the next 5
elements, in that order, until the end of the elements.
Let me know if this helps.


On Sun, Jul 24, 2016 at 7:45 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Apologies I misinterpreted could you post two use cases?
> Kr
>
> On 24 Jul 2016 3:41 pm, "janardhan shetty" <janardhan...@gmail.com> wrote:
>
>> Marco,
>>
>> Thanks for the response. It is indexed order and not ascending or
>> descending order.
>> On Jul 24, 2016 7:37 AM, "Marco Mistroni" <mmistr...@gmail.com> wrote:
>>
>>> Use map values to transform to an rdd where values are sorted?
>>> Hth
>>>
>>> On 24 Jul 2016 6:23 am, "janardhan shetty" <janardhan...@gmail.com>
>>> wrote:
>>>
>>>> I have a key,value pair rdd where value is an array of Ints. I need to
>>>> maintain the order of the value in order to execute downstream
>>>> modifications. How do we maintain the order of values?
>>>> Ex:
>>>> rdd = (id1,[5,2,3,15],
>>>> Id2,[9,4,2,5])
>>>>
>>>> Followup question how do we compare between one element in rdd with all
>>>> other elements ?
>>>>
>>>


Re: Maintaining order of pair rdd

2016-07-24 Thread janardhan shetty
Marco,

Thanks for the response. It is the indexed (original) order, not ascending or
descending order.
On Jul 24, 2016 7:37 AM, "Marco Mistroni" <mmistr...@gmail.com> wrote:

> Use map values to transform to an rdd where values are sorted?
> Hth
>
> On 24 Jul 2016 6:23 am, "janardhan shetty" <janardhan...@gmail.com> wrote:
>
>> I have a key,value pair rdd where value is an array of Ints. I need to
>> maintain the order of the value in order to execute downstream
>> modifications. How do we maintain the order of values?
>> Ex:
>> rdd = (id1,[5,2,3,15],
>> Id2,[9,4,2,5])
>>
>> Followup question how do we compare between one element in rdd with all
>> other elements ?
>>
>


Locality sensitive hashing

2016-07-24 Thread janardhan shetty
I was looking into implementing locality-sensitive hashing on DataFrames.
Any pointers or references?


Maintaining order of pair rdd

2016-07-23 Thread janardhan shetty
I have a (key, value) pair RDD where the value is an array of Ints. I need
to maintain the order of the values in order to execute downstream
modifications. How do we maintain the order of values?
Ex:
rdd = (id1,[5,2,3,15],
Id2,[9,4,2,5])

Follow-up question: how do we compare one element in the RDD with all
other elements?


Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread janardhan shetty
Adding this to build.sbt worked. Thanks Jacek

assemblyMergeStrategy in assembly := {
  case PathList("javax", "servlet", xs @ _*)         => MergeStrategy.first
  case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
  case "application.conf"                            => MergeStrategy.concat
  case "unwanted.txt"                                => MergeStrategy.discard
  case PathList("META-INF", "MANIFEST.MF")           => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
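
For completeness, the plugin itself goes in project/plugins.sbt (sbt-assembly
0.14.3, per Jacek's suggestion in this thread):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")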

On Fri, Jul 22, 2016 at 7:44 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> See https://github.com/sbt/sbt-assembly#merge-strategy
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jul 22, 2016 at 4:23 PM, janardhan shetty
> <janardhan...@gmail.com> wrote:
> > Changed to sbt.0.14.3 and it gave :
> >
> > [info] Packaging
> >
> /Users/jshetty/sparkApplications/MainTemplate/target/scala-2.11/maintemplate_2.11-1.0.jar
> > ...
> > java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF
> > at
> java.util.zip.ZipOutputStream.putNextEntry(ZipOutputStream.java:233)
> >
> > Do we need to create assembly.sbt file inside project directory if so
> what
> > will the the contents of it for this config ?
> >
> > On Fri, Jul 22, 2016 at 5:42 AM, janardhan shetty <
> janardhan...@gmail.com>
> > wrote:
> >>
> >> Is scala version also the culprit? 2.10 and 2.11.8
> >>
> >> Also Can you give the steps to create sbt package command just like
> maven
> >> install from within intellij to create jar file in target directory ?
> >>
> >> On Jul 22, 2016 5:16 AM, "Jacek Laskowski" <ja...@japila.pl> wrote:
> >>>
> >>> Hi,
> >>>
> >>> There has never been 0.13.8 for sbt-assembly AFAIK. Use 0.14.3 and
> >>> start over. See
> >>>
> >>>
> https://github.com/jaceklaskowski/spark-workshop/tree/master/solutions/spark-external-cluster-manager
> >>> for a sample Scala/sbt project with Spark 2.0 RC5.
> >>>
> >>> Pozdrawiam,
> >>> Jacek Laskowski
> >>> 
> >>> https://medium.com/@jaceklaskowski/
> >>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> >>> Follow me at https://twitter.com/jaceklaskowski
> >>>
> >>>
> >>> On Fri, Jul 22, 2016 at 2:08 PM, janardhan shetty
> >>> <janardhan...@gmail.com> wrote:
> >>> > Hi,
> >>> >
> >>> > I was setting up my development environment.
> >>> >
> >>> > Local Mac laptop setup
> >>> > IntelliJ IDEA 14CE
> >>> > Scala
> >>> > Sbt (Not maven)
> >>> >
> >>> > Error:
> >>> > $ sbt package
> >>> > [warn] ::
> >>> > [warn] ::  UNRESOLVED DEPENDENCIES ::
> >>> > [warn] ::
> >>> > [warn] :: com.eed3si9n#sbt-assembly;0.13.8: not found
> >>> > [warn] ::
> >>> > [warn]
> >>> > [warn] Note: Some unresolved dependencies have extra attributes.
> >>> > Check
> >>> > that these dependencies exist with the requested attributes.
> >>> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
> >>> > sbtVersion=0.13)
> >>> > [warn]
> >>> > [warn] Note: Unresolved dependencies path:
> >>> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
> >>> > sbtVersion=0.13)
> >>> >
> >>> >
> (/Users/jshetty/sparkApplications/MainTemplate/project/plugins.sbt#L2-3)
> >>> > [warn]   +- default:maintemplate-build:0.1-SNAPSHOT
> >>> > (scalaVersion=2.10, sbtVersion=0.13)
> >>> > sbt.ResolveException: unresolved dependency:
> >>> > com.eed3si9n#sbt-assembly;0.13.8: not found
> >>> > sbt.ResolveException: unresolved dependency:
> >>> > com.eed3si9n#sbt-assembly;0.13.8: not found
> >>> > at sbt.IvyActions$.sbt$IvyActions$$resolve(

Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread janardhan shetty
Changed to sbt-assembly 0.14.3 and it gave:

[info] Packaging
/Users/jshetty/sparkApplications/MainTemplate/target/scala-2.11/maintemplate_2.11-1.0.jar
...
java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF
at java.util.zip.ZipOutputStream.putNextEntry(ZipOutputStream.java:233)

Do we need to create an assembly.sbt file inside the project directory? If so,
what will the contents of it be for this config?
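
My current guess, and it is only an assumption at this point, is that sbt picks
up any .sbt file under project/, so either project/plugins.sbt or a separate
project/assembly.sbt containing just the plugin line should work, e.g.:

// project/assembly.sbt (or keep the same line in project/plugins.sbt)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")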

On Fri, Jul 22, 2016 at 5:42 AM, janardhan shetty <janardhan...@gmail.com>
wrote:

> Is the Scala version also the culprit? 2.10 vs. 2.11.8
>
> Also, can you give the steps to run the sbt package command, just like maven
> install, from within IntelliJ to create the jar file in the target directory?
> On Jul 22, 2016 5:16 AM, "Jacek Laskowski" <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> There has never been 0.13.8 for sbt-assembly AFAIK. Use 0.14.3 and
>> start over. See
>>
>> https://github.com/jaceklaskowski/spark-workshop/tree/master/solutions/spark-external-cluster-manager
>> for a sample Scala/sbt project with Spark 2.0 RC5.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Jul 22, 2016 at 2:08 PM, janardhan shetty
>> <janardhan...@gmail.com> wrote:
>> > Hi,
>> >
>> > I was setting up my development environment.
>> >
>> > Local Mac laptop setup
>> > IntelliJ IDEA 14CE
>> > Scala
>> > Sbt (Not maven)
>> >
>> > Error:
>> > $ sbt package
>> > [warn] ::
>> > [warn] ::  UNRESOLVED DEPENDENCIES ::
>> > [warn] ::
>> > [warn] :: com.eed3si9n#sbt-assembly;0.13.8: not found
>> > [warn] ::
>> > [warn]
>> > [warn] Note: Some unresolved dependencies have extra attributes.
>> Check
>> > that these dependencies exist with the requested attributes.
>> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
>> > sbtVersion=0.13)
>> > [warn]
>> > [warn] Note: Unresolved dependencies path:
>> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
>> > sbtVersion=0.13)
>> > (/Users/jshetty/sparkApplications/MainTemplate/project/plugins.sbt#L2-3)
>> > [warn]   +- default:maintemplate-build:0.1-SNAPSHOT
>> > (scalaVersion=2.10, sbtVersion=0.13)
>> > sbt.ResolveException: unresolved dependency:
>> > com.eed3si9n#sbt-assembly;0.13.8: not found
>> > sbt.ResolveException: unresolved dependency:
>> > com.eed3si9n#sbt-assembly;0.13.8: not found
>> > at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:291)
>> > at
>> sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:188)
>> > at
>> sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:165)
>> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
>> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
>> > at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:132)
>> > at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
>> > at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
>> > at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
>> >
>> >
>> >
>> > build.sbt:
>> >
>> > name := "MainTemplate"
>> > version := "1.0"
>> > scalaVersion := "2.11.8"
>> > libraryDependencies ++= {
>> >   val sparkVersion = "2.0.0-preview"
>> >   Seq(
>> > "org.apache.spark" %% "spark-core" % sparkVersion,
>> > "org.apache.spark" %% "spark-sql" % sparkVersion,
>> > "org.apache.spark" %% "spark-streaming" % sparkVersion,
>> > "org.apache.spark" %% "spark-mllib" % sparkVersion
>> >   )
>> > }
>> >
>> > assemblyMergeStrategy in assembly := {
>> >   case PathList("META-INF", xs @ _*) => MergeStrategy.discard
>> >   case x => MergeStrategy.first
>> > }
>> >
>> >
>> > plugins.sbt
>> >
>> > logLevel := Level.Warn
>> > addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.8")
>> >
>>
>


Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread janardhan shetty
Is the Scala version also the culprit? 2.10 vs. 2.11.8

Also, can you give the steps to run the sbt package command, just like maven
install, from within IntelliJ to create the jar file in the target directory?
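
For reference, the command-line equivalents I am assuming here (run from the
project root) are:

sbt package     # plain jar, lands under target/scala-2.11/
sbt assembly    # fat jar with dependencies, needs the sbt-assembly plugin

What I am after is how to trigger the same thing from inside IntelliJ.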
On Jul 22, 2016 5:16 AM, "Jacek Laskowski" <ja...@japila.pl> wrote:

> Hi,
>
> There has never been 0.13.8 for sbt-assembly AFAIK. Use 0.14.3 and
> start over. See
>
> https://github.com/jaceklaskowski/spark-workshop/tree/master/solutions/spark-external-cluster-manager
> for a sample Scala/sbt project with Spark 2.0 RC5.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jul 22, 2016 at 2:08 PM, janardhan shetty
> <janardhan...@gmail.com> wrote:
> > Hi,
> >
> > I was setting up my development environment.
> >
> > Local Mac laptop setup
> > IntelliJ IDEA 14CE
> > Scala
> > Sbt (Not maven)
> >
> > Error:
> > $ sbt package
> > [warn] ::
> > [warn] ::  UNRESOLVED DEPENDENCIES ::
> > [warn] ::
> > [warn] :: com.eed3si9n#sbt-assembly;0.13.8: not found
> > [warn] ::
> > [warn]
> > [warn] Note: Some unresolved dependencies have extra attributes.
> Check
> > that these dependencies exist with the requested attributes.
> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
> > sbtVersion=0.13)
> > [warn]
> > [warn] Note: Unresolved dependencies path:
> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
> > sbtVersion=0.13)
> > (/Users/jshetty/sparkApplications/MainTemplate/project/plugins.sbt#L2-3)
> > [warn]   +- default:maintemplate-build:0.1-SNAPSHOT
> > (scalaVersion=2.10, sbtVersion=0.13)
> > sbt.ResolveException: unresolved dependency:
> > com.eed3si9n#sbt-assembly;0.13.8: not found
> > sbt.ResolveException: unresolved dependency:
> > com.eed3si9n#sbt-assembly;0.13.8: not found
> > at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:291)
> > at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:188)
> > at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:165)
> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
> > at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:132)
> > at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
> > at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
> > at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
> >
> >
> >
> > build.sbt:
> >
> > name := "MainTemplate"
> > version := "1.0"
> > scalaVersion := "2.11.8"
> > libraryDependencies ++= {
> >   val sparkVersion = "2.0.0-preview"
> >   Seq(
> > "org.apache.spark" %% "spark-core" % sparkVersion,
> > "org.apache.spark" %% "spark-sql" % sparkVersion,
> > "org.apache.spark" %% "spark-streaming" % sparkVersion,
> > "org.apache.spark" %% "spark-mllib" % sparkVersion
> >   )
> > }
> >
> > assemblyMergeStrategy in assembly := {
> >   case PathList("META-INF", xs @ _*) => MergeStrategy.discard
> >   case x => MergeStrategy.first
> > }
> >
> >
> > plugins.sbt
> >
> > logLevel := Level.Warn
> > addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.8")
> >
>


Unresolved dependencies while creating spark application Jar

2016-07-22 Thread janardhan shetty
Hi,

I was setting up my development environment.

Local Mac laptop setup
IntelliJ IDEA 14CE
Scala
Sbt (Not maven)

Error:
$ sbt package
[warn] ::
[warn] ::  UNRESOLVED DEPENDENCIES ::
[warn] ::
[warn] :: com.eed3si9n#sbt-assembly;0.13.8: not found
[warn] ::
[warn]
[warn] Note: Some unresolved dependencies have extra attributes.  Check
that these dependencies exist with the requested attributes.
[warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
sbtVersion=0.13)
[warn]
[warn] Note: Unresolved dependencies path:
[warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
sbtVersion=0.13)
(/Users/jshetty/sparkApplications/MainTemplate/project/plugins.sbt#L2-3)
[warn]   +- default:maintemplate-build:0.1-SNAPSHOT
(scalaVersion=2.10, sbtVersion=0.13)
sbt.ResolveException: unresolved dependency:
com.eed3si9n#sbt-assembly;0.13.8: not found
sbt.ResolveException: unresolved dependency:
com.eed3si9n#sbt-assembly;0.13.8: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:291)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:188)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:165)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:132)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)



build.sbt:

name := "MainTemplate"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= {
  val sparkVersion = "2.0.0-preview"
  Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion,
    "org.apache.spark" %% "spark-sql" % sparkVersion,
    "org.apache.spark" %% "spark-streaming" % sparkVersion,
    "org.apache.spark" %% "spark-mllib" % sparkVersion
  )
}

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}


plugins.sbt

logLevel := Level.Warn
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.8")