Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-12 Thread Jayant Shekhar
Hello Chetan,

Sorry missed replying earlier. You can find some sample code here :

http://sparkflows.readthedocs.io/en/latest/user-guide/python/pipe-python.html

We will continue adding more there.

Feel free to ping me directly in case of questions.

Thanks,
Jayant


On Mon, Jul 9, 2018 at 9:56 PM, Chetan Khatri 
wrote:

> Hello Jayant,
>
> Thank you so much for the suggestion. My view was to use a Python function as
> a transformation which can take a couple of column names and return an object,
> which you explained. Would it be possible to point me to a similar codebase
> example?
>
> Thanks.
>
> On Fri, Jul 6, 2018 at 2:56 AM, Jayant Shekhar 
> wrote:
>
>> Hello Chetan,
>>
>> We have currently done it with .pipe(.py) as Prem suggested.
>>
>> That passes the RDD as CSV strings to the Python script. The Python script
>> can either process it line by line and return the result, or build something
>> like a Pandas DataFrame for processing and finally write the results back.
>>
>> In the Spark/Scala/Java code, you get back an RDD of strings, which we
>> convert to a DataFrame.
>>
>> Feel free to ping me directly in case of questions.
>>
>> Thanks,
>> Jayant
>>
>>
>> On Thu, Jul 5, 2018 at 3:39 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Prem sure, Thanks for suggestion.
>>>
>>> On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure 
>>> wrote:
>>>
>>>> try .pipe(.py) on RDD
>>>>
>>>> Thanks,
>>>> Prem
>>>>
>>>> On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri <
>>>> chetan.opensou...@gmail.com> wrote:
>>>>
>>>>> Can someone please suggest me , thanks
>>>>>
>>>>> On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, <
>>>>> chetan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Hello Dear Spark User / Dev,
>>>>>>
>>>>>> I would like to pass a Python user-defined function to a Spark job
>>>>>> developed in Scala, with the return value of that function surfaced
>>>>>> through the DataFrame / Dataset API.
>>>>>>
>>>>>> Can someone please guide me on the best approach to do this? The Python
>>>>>> function would mostly be a transformation function. I would also like to
>>>>>> pass a Java function as a String to the Spark / Scala job, have it applied
>>>>>> to an RDD / DataFrame, and get an RDD / DataFrame back.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>


Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-05 Thread Jayant Shekhar
Hello Chetan,

We have currently done it with .pipe(.py) as Prem suggested.

That passes the RDD as CSV strings to the Python script. The Python script
can either process it line by line and return the result, or build something
like a Pandas DataFrame for processing and finally write the results back.

In the Spark/Scala/Java code, you get back an RDD of strings, which we
convert to a DataFrame.
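
As a rough illustration, a minimal Scala-side sketch could look like the
following (untested; the script name udf.py, the column layout, and the CSV
format are placeholders you would adapt to your data):

import org.apache.spark.sql.SparkSession

object PipeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PipePythonExample").getOrCreate()
    import spark.implicits._

    // Input DataFrame with the columns the Python script expects.
    val input = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // Serialize each row as a CSV line and pipe it through the Python script.
    // udf.py is an illustrative name; it must be available on every worker
    // (e.g. shipped with --files) and should read lines from stdin and write
    // result lines to stdout.
    val pipedLines = input
      .map(row => s"${row.getString(0)},${row.getInt(1)}")
      .rdd
      .pipe("python udf.py")

    // The script emits CSV lines such as "key,transformedValue"; parse them
    // back and rebuild a DataFrame for the rest of the Scala pipeline.
    val result = pipedLines
      .map(_.split(","))
      .map(a => (a(0), a(1).toInt))
      .toDF("key", "transformed")

    result.show()
    spark.stop()
  }
}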

Feel free to ping me directly in case of questions.

Thanks,
Jayant


On Thu, Jul 5, 2018 at 3:39 AM, Chetan Khatri 
wrote:

> Prem sure, Thanks for suggestion.
>
> On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure  wrote:
>
>> try .pipe(.py) on RDD
>>
>> Thanks,
>> Prem
>>
>> On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Can someone please suggest me , thanks
>>>
>>> On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, 
>>> wrote:
>>>
 Hello Dear Spark User / Dev,

 I would like to pass a Python user-defined function to a Spark job
 developed in Scala, with the return value of that function surfaced
 through the DataFrame / Dataset API.

 Can someone please guide me on the best approach to do this? The Python
 function would mostly be a transformation function. I would also like to
 pass a Java function as a String to the Spark / Scala job, have it applied
 to an RDD / DataFrame, and get an RDD / DataFrame back.

 Thank you.




>>
>


Re: UI for spark machine learning.

2017-07-10 Thread Jayant Shekhar
Hello Mahesh,

We have built one. You can download from here :
https://www.sparkflows.io/download

Feel free to ping me for any questions, etc.

Best Regards,
Jayant


On Sun, Jul 9, 2017 at 9:35 PM, Mahesh Sawaiker <
mahesh_sawai...@persistent.com> wrote:

> Hi,
>
>
> 1) Is anyone aware of any workbench-style tool to run ML jobs in Spark?
> Specifically, the tool could be something like a web application that is
> configured to connect to a Spark cluster.
>
>
> User is able to select input training sets probably from hdfs , train and
> then run predictions, without having to write any Scala code.
>
>
> 2) If there is no such tool, is there value in having one, and what could be
> the challenges?
>
>
> Thanks,
>
> Mahesh
>


Re: Shall I use Apache Zeppelin for data analytics & visualization?

2017-04-17 Thread Jayant Shekhar
Hello Gaurav,

Pre-calculating the results would obviously be a great idea - and load the
results into a serving store from where you serve it out to your customers
- as suggested by Jorn.

And run it every hour/day, depending on your requirements.

Zeppelin (as mentioned by Ayan) would not be a good tool for this use case,
as it is more for interactive data exploration.

You can hand-code your Spark jobs, use SQL if it does the job, or use a
drag-and-drop tool to create the workflows for your reports and/or
incorporate ML into them.

Jayant




On Mon, Apr 17, 2017 at 7:17 AM, ayan guha  wrote:

> Zeppelin is more useful for interactive data exploration. If the reports
> are known beforehand then any good reporting tool should work, such as
> Tableau, Qlik, Power BI, etc. Zeppelin is not a fit for this use case.
>
> On Mon, 17 Apr 2017 at 6:57 pm, Gaurav Pandya 
> wrote:
>
>> Thanks Jorn. Yes, I will precalculate the results. Do you think Zeppelin
>> can work here?
>>
>> On Mon, Apr 17, 2017 at 1:41 PM, Jörn Franke 
>> wrote:
>>
>>> Processing through Spark is fine, but I do not recommend that each of
>>> the users triggers a Spark query. So either you precalculate the reports in
>>> Spark so that the reports themselves do not trigger Spark queries or you
>>> have a database that serves the report. For the latter case there are tons
>>> of commercial tools. Depending on the type of report you can also use a
>>> custom report tool or write your own dashboard with d3.js visualizations.
>>>
>>> On 17. Apr 2017, at 09:49, Gaurav Pandya 
>>> wrote:
>>>
>>> Thanks for the reply, Jorn.
>>> In my case, I am going to put the analysis on an e-commerce website, so
>>> naturally there will be more users, and that number will keep growing as the
>>> e-commerce website captures the market. Users will not be doing any analysis
>>> here. Reports will show their purchasing behaviour and patterns (kind of
>>> machine learning stuff).
>>> Please note that all processing will be done in Spark here. Please share
>>> your thoughts. Thanks again.
>>>
>>> On Mon, Apr 17, 2017 at 12:58 PM, Jörn Franke 
>>> wrote:
>>>
 I think it highly depends on your requirements. There are various tools
 for analyzing and visualizing data. How many concurrent users do you have?
 What analysis do they do? How much data is involved? Do they have to
 process the data all the time or can they live with sampling which
 increases performance and response time significantly.
 In lambda architecture terms you may want to think about different
 technologies in the serving layer.

 > On 17. Apr 2017, at 06:55, Gaurav1809 
 wrote:
 >
 > Hi All, I am looking for a data visualization (and analytics) tool. My
 > processing is done through Spark. There are many tools available
 around us.
 > I got some suggestions on Apache Zeppelin too? Can anybody throw some
 light
 > on its power and capabilities when it comes to data analytics and
 > visualization? If there are any better options than this, do suggest
 too.
 > One of the options came to me was Kibana (from ELK stack). Thanks.
 >
 >
 >
 > --
 > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Shall-I-use-Apache-Zeppelin-for-data-analytics-visualization-tp28604.html
 > Sent from the Apache Spark User List mailing list archive at Nabble.com.
 >
 > -
 > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
 >

>>>
>>>
>> --
> Best Regards,
> Ayan Guha
>


Re: Any NLP library for sentiment analysis in Spark?

2017-04-12 Thread Jayant Shekhar
Hello Gaurav,

Yes, Stanford CoreNLP is of course great to use too!

You can find sample code here and pull the UDFs into your project:
https://github.com/sparkflows/sparkflows-stanfordcorenlp
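
As a rough illustration (not the code from that repo), wrapping a sentiment
scorer as a Spark SQL UDF could look something like the sketch below;
scoreSentiment is a placeholder for whatever Stanford CoreNLP call you wire in:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object SentimentUdfSketch {
  // Placeholder: replace the body with the actual Stanford CoreNLP pipeline call.
  def scoreSentiment(text: String): String =
    if (text != null && text.toLowerCase.contains("great")) "POSITIVE" else "NEUTRAL"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SentimentUdfSketch").getOrCreate()
    import spark.implicits._

    // Register the scorer as a UDF and apply it column-wise.
    val sentimentUdf = udf((text: String) => scoreSentiment(text))

    val docs = Seq("Spark is great", "The weather is okay").toDF("text")
    docs.withColumn("sentiment", sentimentUdf($"text")).show(false)

    spark.stop()
  }
}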

Thanks,
Jayant


On Tue, Apr 11, 2017 at 8:44 PM, Gaurav Pandya 
wrote:

> Thanks guys.
> How about Stanford CoreNLP?
> Any reviews/ feedback?
> Please share the details if anyone has used it in past.
>
>
> On Tue, Apr 11, 2017 at 11:46 PM,  wrote:
>
>> I think team used this awhile ago, but there was some tweak that needed
>> to be made to get it to work.
>>
>> https://github.com/databricks/spark-corenlp
>>
>> From: Gabriel James <gabriel.ja...@heliase.com>
>> Organization: Heliase Genomics
>> Reply-To: "gabriel.ja...@heliase.com" <gabriel.ja...@heliase.com>
>> Date: Tuesday, April 11, 2017 at 2:13 PM
>> To: 'Kevin Wang', 'Alonso Isidoro Roman'
>> Cc: 'Gaurav1809', "user@spark.apache.org" <user@spark.apache.org>
>> Subject: RE: Any NLP library for sentiment analysis in Spark?
>>
>> Me too. Experiences and recommendations please.
>>
>> Gabriel
>>
>> From: Kevin Wang [mailto:buz...@gmail.com]
>> Sent: Wednesday, April 12, 2017 6:11 AM
>> To: Alonso Isidoro Roman
>> Cc: Gaurav1809; user@spark.apache.org
>> Subject: Re: Any NLP library for sentiment analysis in Spark?
>>
>> I am also interested in this topic.  Anything else anyone can recommend?
>> Thanks.
>>
>> Best,
>>
>> Kevin
>>
>> On Tue, Apr 11, 2017 at 5:00 AM, Alonso Isidoro Roman wrote:
>> I did not use it yet, but this library looks promising:
>>
>> https://github.com/databricks/spark-corenlp
>>
>>
>> Alonso Isidoro Roman
>> about.me/alonso.isidoro.roman
>>
>>
>> 2017-04-11 11:02 GMT+02:00 Gaurav1809 <gauravhpan...@gmail.com>:
>> Hi All,
>>
>> I need to determine sentiment for given document (statement, paragraph
>> etc.)
>> Is there any NLP library available with Apache Spark that I can use here?
>>
>> Any other pointers towards this would be highly appreciated.
>>
>> Thanks in advance.
>> Gaurav Pandya
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Any-NLP-library-for-sentiment-analysis-in-Spark-tp28586.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


Re: How to use scala.tools.nsc.interpreter.IMain in Spark, just like calling eval in Perl.

2016-06-30 Thread Jayant Shekhar
Hi Fanchao,

This is because it is unable to find the anonymous classes generated.

Adding the below code worked for me. I found the details here :
https://github.com/cloudera/livy/blob/master/repl/src/main/scala/com/cloudera/livy/repl/SparkInterpreter.scala

// Spark 1.6 does not have "classServerUri"; instead, the local
// directory where class files are stored needs to be registered
// in SparkConf. See comment in SparkILoop::createSparkContext().

// Needs: import java.io.File and import scala.util.{Try, Success, Failure}
Try(sparkIMain.getClass().getMethod("classServerUri")) match {
  case Success(method) =>
    method.setAccessible(true)
    conf.set("spark.repl.class.uri",
      method.invoke(sparkIMain).asInstanceOf[String])

  case Failure(_) =>
    val outputDir = sparkIMain.getClass().getMethod("getClassOutputDirectory")
    outputDir.setAccessible(true)
    conf.set("spark.repl.class.outputDir",
      outputDir.invoke(sparkIMain).asInstanceOf[File].getAbsolutePath())
}


Thanks,

Jayant



On Thu, Jun 30, 2016 at 12:34 AM, Fanchao Meng 
wrote:

> Hi Spark Community,
>
> I am trying to dynamically interpret code given as a String in Spark, just
> like calling the eval in Perl language. However, I got problem when running
> the program. Really appreciate for your help.
>
> **Requirement:**
>
> The requirement is to make the spark processing chain configurable. For
> example, customer could set the processing steps in configuration file as
> below. Steps:
>  1) textFile("file:///")
>  2) flatMap(line => line.split(" "))
>  3) map(word => (word, 1))
>  4) reduceByKey(_ + _)
>  5) foreach(println)
>
> All above steps are defined in a configuration file.
> Then, the spark driver will load the configuration file and make the
> processing steps as a string, such as:
>
>  val processFlow =
>  """
>  sc.textFile("file:///input.txt").flatMap(line => line.split("
> ")).map(word => (word, 1)).reduceByKey(_ + _).foreach(println)
>  """
>
> Then, Spark will execute the piece of code defined in above variable
> processFlow.
>
> **Here is my Spark source code:**
>
> It is from word count sample, I just make the RDD methods invoked by
> interpreter as a string.
>
>  import org.apache.spark.SparkConf
>  import org.apache.spark.SparkContext
>  import scala.collection.mutable.{Map, ArraySeq}
>  import scala.tools.nsc.GenericRunnerSettings
>  import scala.tools.nsc.interpreter.IMain
>  class TestMain {
>def exec(): Unit = {
>  val out = System.out
>  val flusher = new java.io.PrintWriter(out)
>  val interpreter = {
>val settings = new GenericRunnerSettings( println _ )
>settings.usejavacp.value = true
>new IMain(settings, flusher)
>  }
>  val conf = new SparkConf().setAppName("TestMain")
>  val sc = new SparkContext(conf)
>  val methodChain =
>"""
>val textFile = sc.textFile("file:///input.txt")
>textFile.flatMap(line => line.split(" ")).map(word => (word,
> 1)).reduceByKey(_ + _).foreach(println)
>"""
>  interpreter.bind("sc", sc);
>  val resultFlag = interpreter.interpret(methodChain)
>}
>  }
>  object TestMain {
>def main(args: Array[String]) {
>  val testMain = new TestMain()
>  testMain.exec()
>  System.exit(0)
>}
>  }
>
> **Problem:**
>
> However, I got an error when running above Spark code (master=local), logs
> as below.
>
>  sc: org.apache.spark.SparkContext =
> org.apache.spark.SparkContext@7d87addd
>  org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in
> stage 0.0 (TID 0, localhost): java.lang.ClassNotFoundException: $anonfun$1
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>  at java.lang.Class.forName0(Native Method)
>  at java.lang.Class.forName(Class.java:270)
>  at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>  at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
>  at
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>  at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>  at
> 

Re: Running into issue using SparkIMain

2016-06-29 Thread Jayant Shekhar
Hello,

Found a workaround: installed Scala and added the Scala jars to the
classpath before starting the web application.

Now it works smoothly - it just adds an extra step for the users to do.

Next I would look into making it work with the Scala jar files contained in
the war.

Thx

On Mon, Jun 27, 2016 at 5:53 PM, Jayant Shekhar <jayantbaya...@gmail.com>
wrote:

> I tried setting the classpath explicitly in the settings. Classpath gets
> printed properly, it has the scala jars in it like
> scala-compiler-2.10.4.jar, scala-library-2.10.4.jar.
>
> It did not help. Still runs great with IntelliJ, but runs into issues when
> running from the command line.
>
> val cl = this.getClass.getClassLoader
>
> val urls = cl match {
>
>   case cl: java.net.URLClassLoader => cl.getURLs.toList
>
>   case a => sys.error("oops: I was expecting an URLClassLoader, found
> a " + a.getClass)
>
> }
>
> val classpath = urls map {_.toString}
>
> println("classpath=" + classpath);
>
> settings.classpath.value =
> classpath.distinct.mkString(java.io.File.pathSeparator)
>
> settings.embeddedDefaults(cl)
>
>
> -Jayant
>
>
> On Mon, Jun 27, 2016 at 3:19 PM, Jayant Shekhar <jayantbaya...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I'm trying to run Scala code in a web application.
>>
>> It runs great when I run it in IntelliJ, but I run into an error when I
>> run it from the command line.
>>
>> Command used to run
>> --
>>
>> java -Dscala.usejavacp=true  -jar target/XYZ.war 
>> --spring.config.name=application,db,log4j
>> --spring.config.location=file:./conf/history
>>
>> Error
>> ---
>>
>> Failed to initialize compiler: object scala.runtime in compiler mirror
>> not found.
>>
>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>
>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>
>> ** object programatically, settings.usejavacp.value = true.
>>
>> 16/06/27 15:12:02 WARN SparkIMain: Warning: compiler accessed before init
>> set up.  Assuming no postInit code.
>>
>>
>> I'm also setting the following:
>> 
>>
>> val settings = new Settings()
>>
>>  settings.embeddedDefaults(Thread.currentThread().getContextClassLoader())
>>
>>  settings.usejavacp.value = true
>>
>> Any pointers to the solution would be great.
>>
>> Thanks,
>> Jayant
>>
>>
>


Re: Running into issue using SparkIMain

2016-06-27 Thread Jayant Shekhar
I tried setting the classpath explicitly in the settings. Classpath gets
printed properly, it has the scala jars in it like
scala-compiler-2.10.4.jar, scala-library-2.10.4.jar.

It did not help. Still runs great with IntelliJ, but runs into issues when
running from the command line.

val cl = this.getClass.getClassLoader

val urls = cl match {

  case cl: java.net.URLClassLoader => cl.getURLs.toList

  case a => sys.error("oops: I was expecting an URLClassLoader, found a
" + a.getClass)

}

val classpath = urls map {_.toString}

println("classpath=" + classpath);

settings.classpath.value =
classpath.distinct.mkString(java.io.File.pathSeparator)

settings.embeddedDefaults(cl)


-Jayant


On Mon, Jun 27, 2016 at 3:19 PM, Jayant Shekhar <jayantbaya...@gmail.com>
wrote:

> Hello,
>
> I'm trying to run Scala code in a web application.
>
> It runs great when I run it in IntelliJ, but I run into an error when I
> run it from the command line.
>
> Command used to run
> --
>
> java -Dscala.usejavacp=true  -jar target/XYZ.war 
> --spring.config.name=application,db,log4j
> --spring.config.location=file:./conf/history
>
> Error
> ---
>
> Failed to initialize compiler: object scala.runtime in compiler mirror not
> found.
>
> ** Note that as of 2.8 scala does not assume use of the java classpath.
>
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>
> ** object programatically, settings.usejavacp.value = true.
>
> 16/06/27 15:12:02 WARN SparkIMain: Warning: compiler accessed before init
> set up.  Assuming no postInit code.
>
>
> I'm also setting the following:
> 
>
> val settings = new Settings()
>
>  settings.embeddedDefaults(Thread.currentThread().getContextClassLoader())
>
>  settings.usejavacp.value = true
>
> Any pointers to the solution would be great.
>
> Thanks,
> Jayant
>
>


Running into issue using SparkIMain

2016-06-27 Thread Jayant Shekhar
Hello,

I'm trying to run Scala code in a web application.

It runs great when I run it in IntelliJ, but I run into an error when I run
it from the command line.

Command used to run
--

java -Dscala.usejavacp=true  -jar target/XYZ.war
--spring.config.name=application,db,log4j
--spring.config.location=file:./conf/history

Error
---

Failed to initialize compiler: object scala.runtime in compiler mirror not
found.

** Note that as of 2.8 scala does not assume use of the java classpath.

** For the old behavior pass -usejavacp to scala, or if using a Settings

** object programatically, settings.usejavacp.value = true.

16/06/27 15:12:02 WARN SparkIMain: Warning: compiler accessed before init
set up.  Assuming no postInit code.


I'm also setting the following:


val settings = new Settings()

 settings.embeddedDefaults(Thread.currentThread().getContextClassLoader())

 settings.usejavacp.value = true

Any pointers to the solution would be great.

Thanks,
Jayant


Re: Running JavaBased Implementation of StreamingKmeans Spark

2016-06-24 Thread Jayant Shekhar
Hi Biplob,

Can you try adding new files to the training/test directories after you
have started your streaming application? Especially the test directory, as
you are printing your predictions.

On Fri, Jun 24, 2016 at 2:32 PM, Biplob Biswas 
wrote:

>
> Hi,
>
> I implemented the streamingKmeans example provided in the spark website but
> in Java.
> The full implementation is here,
>
> http://pastebin.com/CJQfWNvk
>
> But i am not getting anything in the output except occasional timestamps
> like one below:
>
> ---
> Time: 1466176935000 ms
> ---
>
> Also, i have 2 directories:
> "D:\spark\streaming example\Data Sets\training"
> "D:\spark\streaming example\Data Sets\test"
>
> and inside these directories i have 1 file each "samplegpsdata_train.txt"
> and "samplegpsdata_test.txt" with training data having 500 datapoints and
> test data with 60 datapoints.
>
> I am very new to the spark systems and any help is highly appreciated.
>
> Thank you so much
> Biplob Biswas
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-JavaBased-Implementation-of-StreamingKmeans-Spark-tp27225.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark ml and PMML export

2016-06-23 Thread Jayant Shekhar
Thanks Philippe! Looking forward to trying it out. I am on >= 1.6

Jayant

On Thu, Jun 23, 2016 at 1:24 AM, philippe v  wrote:

> Hi,
>
> You can try this lib : https://github.com/jpmml/jpmml-sparkml
>
> I'll try it soon... you need to be in >=1.6
>
> Philippe
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ml-and-PMML-export-tp26773p27215.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark ml and PMML export

2016-06-23 Thread Jayant Shekhar
Thanks a lot Nick! Its very helpful.

On Wed, Jun 22, 2016 at 11:47 PM, Nick Pentreath 
wrote:

> Currently there is no way within Spark itself. You may want to check out
> this issue (https://issues.apache.org/jira/browse/SPARK-11171) and here
> is an external project working on it (
> https://github.com/jpmml/jpmml-sparkml), that covers quite a number of
> transformers and models (but not all).
>
>
> On Thu, 23 Jun 2016 at 02:37 jayantshekhar 
> wrote:
>
>> I have the same question on Spark ML and PMML export as Philippe.
>>
>> Is there a way to export Spark ML generated models to PMML?
>>
>> Jayant
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ml-and-PMML-export-tp26773p27213.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Getting a DataFrame back as result from SparkIMain

2016-06-21 Thread Jayant Shekhar
Hi,

I have written a program using SparkIMain which creates an RDD and I am
looking for a way to access that RDD in my normal spark/scala code for
further processing.


The code below binds the SparkContext::

sparkIMain.bind("sc", "org.apache.spark.SparkContext", sparkContext,
List("""@transient"""))


The code below creates an RDD::

val code = "val data = Array(1, 2, 3, 4, 5)"

sparkIMain.interpret(code)

val code1 = "val distData = sc.parallelize(data)"

sparkIMain.interpret(code1)


Looking for a way to get the RDD 'distData' back into the spark code for
further processing. Essentially I need to pass the RDD for the next steps
for processing.

Any pointers are welcome! The above framework is used in a more general
case where the code is supplied by the user at runtime.

Thanks,
Jayant


Re: MLLib: LinearRegressionWithSGD performance

2014-11-21 Thread Jayant Shekhar
Hi Sameer,

You can try increasing the number of executor-cores.

-Jayant





On Fri, Nov 21, 2014 at 11:18 AM, Sameer Tilak ssti...@live.com wrote:

 Hi All,
 I have been using MLLib's linear regression and I have some question
 regarding the performance. We have a cluster of 10 nodes -- each node has
 24 cores and 148GB memory. I am running my app as follows:

 time spark-submit --class medslogistic.MedsLogistic --master yarn-client
 --executor-memory 6G --num-executors 10 /pathtomyapp/myapp.jar

 I am also going to play with number of executors (reduce it) may be that
 will give us different results.

 The input is an 800MB sparse file in LibSVM format. Total number of
 features is 150K. It takes approximately 70 minutes for the regression to
 finish. The job imposes very little load on CPU, memory, network, and disk.
 Total number of tasks is 104. Total time gets divided fairly uniformly
 across these tasks. I was wondering, is it possible to reduce the
 execution time further?


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Re: MLLib: LinearRegressionWithSGD performance

2014-11-21 Thread Jayant Shekhar
Hi Sameer,

You can also use repartition to create a higher number of tasks.
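
For instance, something along these lines (a sketch; the partition count is
illustrative and sc is your existing SparkContext):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

// Spread the LibSVM input over more partitions so more tasks run in parallel.
// 240 is illustrative -- roughly num-executors * executor-cores.
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/input.libsvm").repartition(240)
val model = LinearRegressionWithSGD.train(data, 100)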

-Jayant


On Fri, Nov 21, 2014 at 12:02 PM, Jayant Shekhar jay...@cloudera.com
wrote:

 Hi Sameer,

 You can try increasing the number of executor-cores.

 -Jayant





 On Fri, Nov 21, 2014 at 11:18 AM, Sameer Tilak ssti...@live.com wrote:

 Hi All,
 I have been using MLLib's linear regression and I have some question
 regarding the performance. We have a cluster of 10 nodes -- each node has
 24 cores and 148GB memory. I am running my app as follows:

 time spark-submit --class medslogistic.MedsLogistic --master yarn-client
 --executor-memory 6G --num-executors 10 /pathtomyapp/myapp.jar

 I am also going to play with number of executors (reduce it) may be that
 will give us different results.

 The input is an 800MB sparse file in LibSVM format. Total number of
 features is 150K. It takes approximately 70 minutes for the regression to
 finish. The job imposes very little load on CPU, memory, network, and disk.
 Total number of tasks is 104. Total time gets divided fairly uniformly
 across these tasks. I was wondering, is it possible to reduce the
 execution time further?


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Is Spark streaming suitable for our architecture?

2014-10-23 Thread Jayant Shekhar
Hi Albert,

Have a couple of questions:

   - You mentioned near real-time. What exactly is your SLA for processing
   each document?
   - Which crawler are you using and are you looking to bring in Hadoop
   into your overall workflow. You might want to read up on how network
   traffic is minimized/managed on the Hadoop cluster - as you had run into
   network issues with your current architecture.

Thanks!

On Thu, Oct 23, 2014 at 12:07 AM, Albert Vila albert.v...@augure.com
wrote:

 Hi

 I'm evaluating Spark streaming to see if it fits to scale or current
 architecture.

 We are currently downloading and processing 6M documents per day from
 online and social media. We have a different workflow for each type of
 document, but some of the steps are keyword extraction, language detection,
 clustering, classification, indexation,  We are using Gearman to
 dispatch the job to workers and we have some queues on a database.
 Everything is in near real time.

 I'm wondering if we could integrate Spark streaming on the current
 workflow and if it's feasible. One of our main discussions are if we have
 to go to a fully distributed architecture or to a semi-distributed one. I
 mean, distribute everything or process some steps on the same machine
 (crawling, keyword extraction, language detection, indexation). We don't
 know which one scales more, each one has pros and cont.

 Now we have a semi-distributed one as we had network problems taking into
 account the amount of data we were moving around. So now, all documents
 crawled on server X, later on are dispatched through Gearman to the same
 server. What we dispatch on Gearman is only the document id, and the
 document data remains on the crawling server on a Memcached, so the network
 traffic is keep at minimum.

 It's feasible to remove all database queues and Gearman and move to Spark
 streaming? We are evaluating to add Kakta to the system too.
 Is anyone using Spark streaming for a system like ours?
 Should we worry about the network traffic? or it's something Spark can
 manage without problems. Every document is arround 50k (300Gb a day +/-).
 If we wanted to isolate some steps to be processed on the same machine/s
 (or give priority), is something we could do with Spark?

 Any help or comment will be appreciate. And If someone has had a similar
 problem and has knowledge about the architecture approach will be more than
 welcomed.

 Thanks




Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread Jayant Shekhar
+1 to Sean.

Is it possible to rewrite your code to not use SparkContext in RDD. Or why
does javaFunctions() need the SparkContext.

On Thu, Oct 23, 2014 at 10:53 AM, Localhost shell 
universal.localh...@gmail.com wrote:

 Bang On Sean

 Before sending the issue mail, I was able to remove the compilation error
 by making it final but then got the
 Caused by: java.io.NotSerializableException:
 org.apache.spark.api.java.JavaSparkContext   (As you mentioned)

 Now regarding your suggestion of changing the business logic,
 1. *Is the current approach possible if I write the code in Scala ?* I
 think probably not but wanted to check with you.

 2. Brief steps of what the code is doing:

   1. Get raw sessions data from datatsore (C*)
 2. Process the raw sessions data
 3. Iterate over the processed data(derive from #2) and fetch the
 previously aggregated data from store for those rowkeys
Add the values from this batch to previous batch values
 4. Save back the updated values

* This github gist might explain you more
 https://gist.github.com/rssvihla/6577359860858ccb0b33
 https://gist.github.com/rssvihla/6577359860858ccb0b33 and it does a
 similar thing in scala.*
 I am trying to achieve a similar thing in Java using Spark Batch with
 C* as the datastore.

 I have attached the java code file to provide you some code details. (If I
 was not able to explain you the problem so the code will be handy)


 The reason why I am fetching only selective data (that I will update
 later) because Cassanbdra doesn't provide range queries so I thought
 fetching complete data might be expensive.

 It will be great if you can share ur thoughts.

 On Thu, Oct 23, 2014 at 1:48 AM, Sean Owen so...@cloudera.com wrote:

 In Java, javaSparkContext would have to be declared final in order for
 it to be accessed inside an inner class like this. But this would
 still not work as the context is not serializable. You  should rewrite
 this so you are not attempting to use the Spark context inside  an
 RDD.

 On Thu, Oct 23, 2014 at 8:46 AM, Localhost shell
 universal.localh...@gmail.com wrote:
  Hey All,
 
  I am unable to access objects declared and initialized outside the
 call()
  method of JavaRDD.
 
  In the below code snippet, call() method makes a fetch call to C* but
 since
  javaSparkContext is defined outside the call method scope so compiler
 give a
  compilation error.
 
  stringRdd.foreach(new VoidFunctionString() {
  @Override
  public void call(String str) throws Exception {
  JavaRDDString vals =
  javaFunctions(javaSparkContext).cassandraTable(schema, table,
  String.class)
  .select(val);
  }
  });
 
  In other languages I have used closure to do this but not able to
 achieve
  the same here.
 
  Can someone suggest how to achieve this in the current code context?
 
 
  --Unilocal
 
 
 




 --
 --Unilocal


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Re: Oryx + Spark mllib

2014-10-19 Thread Jayant Shekhar
Hi Deb,

Do check out https://github.com/OryxProject/oryx.

It does integrate with Spark. Sean has put in quite a bit of neat details
on the page about the architecture. It has all the things you are thinking
about:)

Thanks,
Jayant


On Sat, Oct 18, 2014 at 8:49 AM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi,

 Is someone working on a project integrating the Oryx model serving layer
 with Spark? Models will be built using either streaming data / batch data
 in HDFS and cross-validated with mllib APIs, but the model serving layer
 will give API endpoints like Oryx
 and read the models, maybe from hdfs/impala/SparkSQL.

 One of the requirement is that the API layer should be scalable and
 elastic...as requests grow we should be able to add more nodes...using play
 and akka clustering module...

 If there is a ongoing project on github please point to it...

 Is there a plan of adding model serving and experimentation layer to mllib
 ?

 Thanks.
 Deb





Re: How does the Spark Accumulator work under the covers?

2014-10-10 Thread Jayant Shekhar
Hi Areg,

Check out
http://spark.apache.org/docs/latest/programming-guide.html#accumulators

val sum = sc.accumulator(0)   // accumulator created from an initial value
in the driver

The accumulator variable is created in the driver. Tasks running on the
cluster can then add to it. However, they cannot read its value. Only the
driver program can read the accumulator’s value, using its value method.

sum.value  // in the driver

 myRdd.map(x => sum += x)
 where is this function being run
This is being run by the tasks on the workers.

The driver accumulates the data from the various workers and merges it
to get the final result, as Haripriya mentioned.
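
A small, self-contained sketch of the flow (using foreach rather than map so
the side effect actually runs as an action):

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AccumulatorExample"))

    val sum = sc.accumulator(0)          // created on the driver
    val myRdd = sc.parallelize(1 to 100)

    // Tasks on the workers add to their local copy of the accumulator;
    // they cannot read its value.
    myRdd.foreach(x => sum += x)

    // The per-task updates are merged back, and only the driver reads the result.
    println("sum = " + sum.value)        // 5050

    sc.stop()
  }
}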

Thanks,
Jayant


On Fri, Oct 10, 2014 at 7:46 AM, HARIPRIYA AYYALASOMAYAJULA 
aharipriy...@gmail.com wrote:

 If you use parallelize, the data is distributed across multiple nodes
 available and sum is computed individually within each partition and later
 merged. The driver manages the entire process. Is my understanding correct?
 Can someone please correct me if I am wrong?

 On Fri, Oct 10, 2014 at 9:37 AM, Areg Baghdasaryan (BLOOMBERG/ 731 LEX -)
 abaghdasa...@bloomberg.net wrote:

 Hello,
 I was wondering on what does the Spark accumulator do under the covers.
 I’ve implemented my own associative addInPlace function for the
 accumulator, where is this function being run? Let’s say you call something
 like myRdd.map(x => sum += x), is “sum” being accumulated locally in any
 way, for each element or partition or node? Is “sum” a broadcast variable?
 Or does it only exist on the driver node? How does the driver node get
 access to the “sum”?
 Thanks,
 Areg




 --
 Regards,
 Haripriya Ayyalasomayajula



Re: window every n elements instead of time based

2014-10-08 Thread Jayant Shekhar
Hi Michael,

I think you mean the batch interval rather than windowing. It can be
helpful for cases where you do not want to process very small batches.

HDFS sink in Flume has the concept of rolling files based on time, number
of events or size.
https://flume.apache.org/FlumeUserGuide.html#hdfs-sink

The same could be applied to Spark if the use case demands it. The only major
catch is that it breaks the concept of window operations as they exist in
Spark.

Thanks,
Jayant




On Tue, Oct 7, 2014 at 10:19 PM, Michael Allman mich...@videoamp.com
wrote:

 Hi Andrew,

 The use case I have in mind is batch data serialization to HDFS, where
 sizing files to a certain HDFS block size is desired. In my particular use
 case, I want to process 10GB batches of data at a time. I'm not sure this
 is a sensible use case for spark streaming, and I was trying to test it.
 However, I had trouble getting it working and in the end I decided it was
 more trouble than it was worth. So I decided to split my task into two: one
 streaming job on small, time-defined batches of data, and a traditional
 Spark job aggregating the smaller files into a larger whole. In retrospect,
 I think this is the right way to go, even if a count-based window
 specification was possible. Therefore, I can't suggest my use case for a
 count-based window size.

 Cheers,

 Michael

 On Oct 5, 2014, at 4:03 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Michael,

 I couldn't find anything in Jira for it --
 https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22window%22%20AND%20component%20%3D%20Streaming

 Could you or Adrian please file a Jira ticket explaining the functionality
 and maybe a proposed API?  This will help people interested in count-based
 windowing to understand the state of the feature in Spark Streaming.

 Thanks!
 Andrew

 On Fri, Oct 3, 2014 at 4:09 PM, Michael Allman mich...@videoamp.com
 wrote:

 Hi,

 I also have a use for count-based windowing. I'd like to process data
 batches by size as opposed to time. Is this feature on the development
 roadmap? Is there a JIRA ticket for it?

 Thank you,

 Michael



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/window-every-n-elements-instead-of-time-based-tp2085p15701.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Using GraphX with Spark Streaming?

2014-10-06 Thread Jayant Shekhar
Arko,

It would be useful to know more details on the use case you are trying to
solve. As Tobias wrote, Spark Streaming works on DStream, which is a
continuous series of RDDs.

Do check out performance tuning :
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
It is important to reduce the processing time of each batch of data.
Ideally you would want data processing to keep up with the data ingestion.
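
If the graph only needs the data from a single batch, a minimal sketch could
look like the one below (the socket source and the "srcId dstId" line format
are just placeholders for your actual stream):

import org.apache.spark.SparkConf
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GraphPerBatch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GraphPerBatch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Illustrative source: lines of "srcId dstId" pairs arriving on a socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each batch interval yields one RDD; build a GraphX graph from just that batch.
    lines.foreachRDD { rdd =>
      val edges = rdd.map { line =>
        val Array(src, dst) = line.split(" ")
        Edge(src.toLong, dst.toLong, 1)
      }
      val graph = Graph.fromEdges(edges, 0)
      println("edges in this batch: " + graph.numEdges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}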

Thanks,
Jayant


On Sun, Oct 5, 2014 at 6:45 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Arko,

 On Sat, Oct 4, 2014 at 1:40 AM, Arko Provo Mukherjee 
 arkoprovomukher...@gmail.com wrote:

 Apologies if this is a stupid question but I am trying to understand
 why this can or cannot be done. As far as I understand that streaming
 algorithms need to be different from batch algorithms as the streaming
 algorithms are generally incremental. Hence the question whether the
 RDD transformations can be extended to streaming or not.


 I don't think that streaming algorithms are generally incremental in
 Spark Streaming. In fact, data is collected and every N seconds
 (minutes/...), the data collected during that interval is batch-processed
 as with normal batch operations. In fact, using data previously obtained
 from the stream (in previous intervals) is a bit more complicated than
 plain batch processing. If the graph you want to create only uses data from
 one interval/batch, that should be dead simple. You might want to have a
 look at
 https://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams

 Tobias




Re: Debugging cluster stability, configuration issues

2014-08-21 Thread Jayant Shekhar
Hi Shay,

You can try setting spark.storage.blockManagerSlaveTimeoutMs to a higher
value.
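
For example (the value shown is illustrative only -- tune it to your job):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MyJob")
  .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")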

Cheers,
Jayant



On Thu, Aug 21, 2014 at 1:33 PM, Shay Seng s...@urbanengines.com wrote:

 Unfortunately it doesn't look like my executors are OOM. On the slave
 machines I checked both the logs in /spark/log (which I assume are from the
 slave driver?) and in /spark/work/... which I assume are from each
 worker/executor.




 On Thu, Aug 21, 2014 at 11:19 AM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 Whenever I've seen this exception it has ALWAYS been the case of an
 executor running out of memory. I don't use checkpointing so not too sure
 about the first item. The rest of them I believe would happen if an
 executor fails and the worker spawns a new executor. Usually a good way to
 verify this is if you look in the driver log, where it says Lost TID
 102135 to see where TID 102135 was sent to (which worker). If I'm
 correct and an executor has rolled you would see two executor logs for your
 application -- the first one usually contains an OOM. I run 0.9.1 but I
 believe it should be a pretty similar setup.


 On Thu, Aug 21, 2014 at 1:23 PM, Shay Seng s...@urbanengines.com wrote:

 Hi,

 I am running Spark 0.9.2 on an EC2 cluster with about 16 r3.4xlarge
 machines
 The cluster is running Spark standalone and is launched with the ec2
 scripts.
 In my Spark job, I am using ephemeral HDFS to checkpoint some of my
 RDDs. I'm also reading and writing to S3. My jobs also involve a large
 amount of shuffling.

 I run the same job on multiple set of data and for 50-70% of these runs,
 the job completes with no issues. (Typically a rerun will allow the
 failures to complete as well)

 However on the rest of the 30%, I see a bunch of different kinds of
 issues pop up. (which will go away if I rerun the same job)

 (1) Checkpointing silently fails (I assume). the checkpoint dir exists
 in HDFS, but no data files are written out. And a later step in the job
 tries to reload these RDDs and I get a failure about not being able to read
 from HDFS. -- Usually a start, stop-dfs cures this.
 *Q: What could be the cause of this? Timeouts? *


 (2) Other times I get ... no idea who or what is causing this...
 in master /spark/logs:
 2014-08-21 16:46:15 ERROR EndpointWriter: AssociationError [akka.tcp://
 sparkmas...@ec2-54-218-216-19.us-west-2.compute.amazonaws.com:7077] -
 [akka.tcp://sp...@ip-10-34-2-246.us-west-2.compute.internal:37681]:
 Error [Association failed with [akka.tcp://spark@ip-10
 -34-2-246.us-west-2.compute.internal:37681]] [
 akka.remote.EndpointAssociationException: Association failed with
 [akka.tcp://sp...@ip-10-34-2-246.us-west-2.compute.internal:37681]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: ip-10-34-2-246.us-west-2.compute.internal/
 10.34.2.246:37681
 ]

 Slave Log:
 2014-08-21 16:46:47 INFO ConnectionManager: Removing SendingConnection
 to ConnectionManagerId(ip-10-33-7-4.us-west-2.compute.internal,33242)
 2014-08-21 16:46:47 ERROR SendingConnection: Exception while reading
 SendingConnection to
 ConnectionManagerId(ip-10-33-7-4.us-west-2.compute.internal,33242)
 java.nio.channels.ClosedChannelException
 at
 sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
 at
 org.apache.spark.network.SendingConnection.read(Connection.scala:398)
 at
 org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 *Q: Where do I even start debugging this kind of issues? Are the
 machines too loaded and so timeouts are getting hit? Am I not setting some
 configuration number correctly? I would be grateful for some hints on where
 to start looking!*


 (3) Often (2) will be preceeded by the following in spark.logs..
 2014-08-21 16:34:10 WARN TaskSetManager: Lost TID 102135 (task 398.0:147)
 2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure
 from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371,
 0)
 2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure
 from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371,
 0)
 2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure
 from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371,
 0)
 Not sure if this is an indication...



 I'll be very grateful for any ideas on how to start debugging these.
 Is there anything I should be noting -- CPU using on Master/Slave.
 Number of executors/cpu, akka threads etc?

 Cheers,
 shay