Re: lift coefficient

2016-07-22 Thread ndjido
Just apply the Lift = Recall / Support formula with respect to a given threshold
on your population distribution. The computation is quite straightforward.
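
For example, a minimal sketch in plain Python (assuming you have collected
(score, label) pairs from your PySpark predictions into a local list; the names
are illustrative):

def lift_at(scored, fraction):
    """Lift = recall / support for the top `fraction` of the population,
    ranked by predicted score."""
    ranked = sorted(scored, key=lambda x: x[0], reverse=True)  # rank by score
    cutoff = int(len(ranked) * fraction)                       # size of the targeted group
    total_pos = sum(label for _, label in ranked)              # positives in the whole population
    captured = sum(label for _, label in ranked[:cutoff])      # positives in the targeted group
    recall = captured / float(total_pos)                       # share of positives captured
    return recall / fraction                                   # lift at this support level

# e.g. lift_at([(0.9, 1), (0.7, 0), (0.6, 1), (0.2, 0)], 0.5) returns 1.0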

Cheers,
Ardo

> On 20 Jul 2016, at 15:05, pseudo oduesp  wrote:
> 
> Hi ,
> how can we calculate the lift coefficient from a PySpark prediction result?
> 
> thanks

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread Pedro Rodriguez
You could either use monotonically_increasing_id or a window function with
rank. The first is a simple Spark SQL function; Databricks has a
pretty helpful post on how to use window functions (in this case the whole
data set is the window).
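
For reference, a PySpark sketch of both options (the question was in Java, but
the same functions exist there; note the global window funnels everything
through a single partition):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

business = sqlContext.sql("SELECT bid, name FROM business")

# Option 1: unique (but not necessarily consecutive) ids, no shuffle
with_id = business.withColumn("id", F.monotonically_increasing_id())

# Option 2: consecutive ranks via a window over the whole data set
w = Window.orderBy("name")
ranked = business.withColumn("id", F.row_number().over(w))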

On Fri, Jul 22, 2016 at 12:20 PM, Marco Mistroni 
wrote:

> Hi
> So you have a data frame, then use zipWithIndex and create a tuple.
> I'm not sure if the DataFrame API has something useful for zip with index,
> but you can
> - get a data frame
> - convert it to an RDD (there's a .rdd / toJavaRDD)
> - do a zipWithIndex
>
> That will give you an RDD with 3 fields...
> I don't think you can update DataFrame columns
> Hth
> On 22 Jul 2016 5:19 pm, "VG"  wrote:
>
> >
>
> > Hi All,
> >
> > Any suggestions for this
> >
> > Regards,
> > VG
> >
> > On Fri, Jul 22, 2016 at 6:40 PM, VG  wrote:
>
> >>
>
> >> Hi All,
> >>
> >> I am really confused how to proceed further. Please help.
> >>
> >> I have a dataset created as follows:
> >> Dataset b = sqlContext.sql("SELECT bid, name FROM business");
> >>
> >> Now I need to map each name with a unique index and I did the following
> >> JavaPairRDD indexedBId = business.javaRDD()
> >>
>  .zipWithIndex();
> >>
> >> In later part of the code I need to change a datastructure and update
> name with index value generated above .
> >> I am unable to figure out how to do a look up here..
> >>
> >> Please suggest /.
> >>
> >> If there is a better way to do this please suggest that.
> >>
> >> Regards
> >> VG
> >>
> >
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Pedro Rodriguez
I haven't used SparkR/R before, only Scala/Python APIs so I don't know for
sure.

I am guessing if things are in a DataFrame they were read either from some
disk source (S3/HDFS/file/etc) or they were created from parallelize. If
you are using the first, Spark will for the most part choose a reasonable
number of partitions while for parallelize I think it depends on what your
min parallelism is set to.

In my brief Google search it looks like dapply is an analogue of mapPartitions.
Usually the reason to use this is if your map operation has some expensive
initialization function. For example, if you need to open a connection to a
database, it's better to re-use that connection for one partition's elements
than to create it for each element.
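
A minimal PySpark sketch of that pattern (open_connection and lookup are
hypothetical placeholders, not a real API):

def enrich_partition(rows):
    conn = open_connection()         # expensive setup, done once per partition
    try:
        for row in rows:
            yield lookup(conn, row)  # reuse the connection for every element
    finally:
        conn.close()

enriched = rdd.mapPartitions(enrich_partition)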

What are you trying to accomplish with dapply?

On Fri, Jul 22, 2016 at 8:05 PM, Neil Chang  wrote:

> Thanks Pedro,
>   so to use SparkR dapply on a SparkDataFrame, don't we need to partition the
> DataFrame first? The example in the docs doesn't seem to do this.
> Without knowing how it is partitioned, how can one write the function to
> process each partition?
>
> Neil
>
> On Fri, Jul 22, 2016 at 5:56 PM, Pedro Rodriguez 
> wrote:
>
>> This should work and I don't think triggers any actions:
>>
>> df.rdd.partitions.length
>>
>> On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang  wrote:
>>
>>> Seems no function does this in Spark 2.0 preview?
>>>
>>
>>
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>


-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Neil Chang
Thanks Pedro,
  so to use SparkR dapply on a SparkDataFrame, don't we need to partition the
DataFrame first? The example in the docs doesn't seem to do this.
Without knowing how it is partitioned, how can one write the function to
process each partition?

Neil

On Fri, Jul 22, 2016 at 5:56 PM, Pedro Rodriguez 
wrote:

> This should work and I don't think triggers any actions:
>
> df.rdd.partitions.length
>
> On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang  wrote:
>
>> Seems no function does this in Spark 2.0 preview?
>>
>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


Re: spark and plot data

2016-07-22 Thread Taotao.Li
hi, pseudo,

  I've posted a blog before, spark-dataframe-introduction, and for
me, I use a Spark DataFrame [or RDD] to do the logic and calculation on all the
datasets, then transform the result into a pandas DataFrame and do the data
visualization with the pandas DataFrame; sometimes you may also need
matplotlib or seaborn.
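
A minimal sketch of that workflow (assuming df is a Spark DataFrame; the column
names are illustrative):

import matplotlib.pyplot as plt

aggregated = df.groupBy("day").count()    # heavy lifting stays in Spark
pdf = aggregated.toPandas()               # only the small result is collected

pdf.plot(x="day", y="count", kind="bar")  # plot locally with pandas/matplotlib
plt.show()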

-- 
*___*
Quant | Engineer | Boy
*___*
*blog*:http://litaotao.github.io

*github*: www.github.com/litaotao


Re: How to connect HBase and Spark using Python?

2016-07-22 Thread Benjamin Kim
It is included in Cloudera’s CDH 5.8.

> On Jul 22, 2016, at 6:13 PM, Mail.com  wrote:
> 
> HBase Spark module will be available with HBase 2.0. Is that out yet?
> 
>> On Jul 22, 2016, at 8:50 PM, Def_Os  wrote:
>> 
>> So it appears it should be possible to use HBase's new hbase-spark module, if
>> you follow this pattern:
>> https://hbase.apache.org/book.html#_sparksql_dataframes
>> 
>> Unfortunately, when I run my example from PySpark, I get the following
>> exception:
>> 
>> 
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o120.save.
>>> : java.lang.RuntimeException: org.apache.hadoop.hbase.spark.DefaultSource
>>> does not allow create table as select.
>>>   at scala.sys.package$.error(package.scala:27)
>>>   at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:259)
>>>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>   at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>>>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>>>   at py4j.Gateway.invoke(Gateway.java:259)
>>>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>>>   at java.lang.Thread.run(Thread.java:745)
>> 
>> Even when I created the table in HBase first, it still failed.
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-connect-HBase-and-Spark-using-Python-tp27372p27397.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to connect HBase and Spark using Python?

2016-07-22 Thread Mail.com
HBase Spark module will be available with HBase 2.0. Is that out yet?

> On Jul 22, 2016, at 8:50 PM, Def_Os  wrote:
> 
> So it appears it should be possible to use HBase's new hbase-spark module, if
> you follow this pattern:
> https://hbase.apache.org/book.html#_sparksql_dataframes
> 
> Unfortunately, when I run my example from PySpark, I get the following
> exception:
> 
> 
>> py4j.protocol.Py4JJavaError: An error occurred while calling o120.save.
>> : java.lang.RuntimeException: org.apache.hadoop.hbase.spark.DefaultSource
>> does not allow create table as select.
>>at scala.sys.package$.error(package.scala:27)
>>at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:259)
>>at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>at java.lang.reflect.Method.invoke(Method.java:606)
>>at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>>at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>>at py4j.Gateway.invoke(Gateway.java:259)
>>at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>>at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>at py4j.GatewayConnection.run(GatewayConnection.java:209)
>>at java.lang.Thread.run(Thread.java:745)
> 
> Even when I created the table in HBase first, it still failed.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-connect-HBase-and-Spark-using-Python-tp27372p27397.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to connect HBase and Spark using Python?

2016-07-22 Thread Def_Os
So it appears it should be possible to use HBase's new hbase-spark module, if
you follow this pattern:
https://hbase.apache.org/book.html#_sparksql_dataframes

Unfortunately, when I run my example from PySpark, I get the following
exception:


> py4j.protocol.Py4JJavaError: An error occurred while calling o120.save.
> : java.lang.RuntimeException: org.apache.hadoop.hbase.spark.DefaultSource
> does not allow create table as select.
>   at scala.sys.package$.error(package.scala:27)
>   at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:259)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>   at java.lang.Thread.run(Thread.java:745)

Even when I created the table in HBase first, it still failed.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-connect-HBase-and-Spark-using-Python-tp27372p27397.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-22 Thread RK Aduri
I can see a large number of collect operations happening on the driver and, 
eventually, the driver runs out of memory. (I am not sure whether you have 
persisted any RDD or data frame.) You may want to avoid doing so many collects, 
or avoid persisting unwanted data in memory.

To begin with, you may want to re-run the job with the following config: 
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails 
-XX:+PrintGCTimeStamps"; this will give you an idea of how you are hitting 
the OOM.
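
For example, a sketch of the same settings applied through SparkConf in
PySpark (the app name is illustrative; the equivalent values can also be passed
as --conf / --executor-memory flags to spark-submit):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("rf-training")
        .set("spark.executor.extraJavaOptions",
             "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
        .set("spark.executor.memory", "8g"))
sc = SparkContext(conf=conf)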


> On Jul 22, 2016, at 3:52 PM, Ascot Moss  wrote:
> 
> Hi
> 
> Please help!
> 
>  When running random forest training phase in cluster mode, I got GC overhead 
> limit exceeded.
> 
> I have used two parameters when submitting the job to cluster
> --driver-memory 64g \
> 
> --executor-memory 8g \
> 
> 
> My Current settings:
> (spark-defaults.conf)
> 
> spark.executor.memory   8g
> 
> 
> (spark-env.sh)
> export SPARK_WORKER_MEMORY=8g
> 
> export HADOOP_HEAPSIZE=8000
> 
> 
> 
> Any idea how to resolve it?
> 
> Regards
> 
> 
> 
> 
> 
> 
> ###  (the error log) ###
> 16/07/23 04:34:04 WARN TaskSetManager: Lost task 2.0 in stage 6.1 (TID 30, 
> n1794): java.lang.OutOfMemoryError: GC overhead limit exceeded
> 
> at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
> 
> at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)
> 
> at 
> org.apache.spark.util.collection.CompactBuffer.growToSize(CompactBuffer.scala:144)
> 
> at 
> org.apache.spark.util.collection.CompactBuffer.$plus$plus$eq(CompactBuffer.scala:90)
> 
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$1$$anonfun$10.apply(PairRDDFunctions.scala:505)
> 
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$1$$anonfun$10.apply(PairRDDFunctions.scala:505)
> 
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.mergeIfKeyExists(ExternalAppendOnlyMap.scala:318)
> 
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:365)
> 
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:265)
> 
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> 
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> 
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> 
> at scala.collection.TraversableOnce$class.to 
> (TraversableOnce.scala:273)
> 
> at scala.collection.AbstractIterator.to 
> (Iterator.scala:1157)
> 
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> 
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> 
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> 
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> 
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
> 
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
> 
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> 
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> 
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> 
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> 
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 
> at java.lang.Thread.run(Thread.java:745)
> 




ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-22 Thread Ascot Moss
Hi

Please help!

 When running the random forest training phase in cluster mode, I got a GC
overhead limit exceeded error.

I have used two parameters when submitting the job to cluster

--driver-memory 64g \

--executor-memory 8g \

My Current settings:

(spark-defaults.conf)

spark.executor.memory   8g

(spark-env.sh)

export SPARK_WORKER_MEMORY=8g

export HADOOP_HEAPSIZE=8000


Any idea how to resolve it?

Regards






###  (the error log) ###

16/07/23 04:34:04 WARN TaskSetManager: Lost task 2.0 in stage 6.1 (TID 30,
n1794): java.lang.OutOfMemoryError: GC overhead limit exceeded

at
scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)

at
scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)

at
org.apache.spark.util.collection.CompactBuffer.growToSize(CompactBuffer.scala:144)

at
org.apache.spark.util.collection.CompactBuffer.$plus$plus$eq(CompactBuffer.scala:90)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$1$$anonfun$10.apply(PairRDDFunctions.scala:505)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$groupByKey$1$$anonfun$10.apply(PairRDDFunctions.scala:505)

at
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.mergeIfKeyExists(ExternalAppendOnlyMap.scala:318)

at
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:365)

at
org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:265)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

at scala.collection.TraversableOnce$class.to
(TraversableOnce.scala:273)

at scala.collection.AbstractIterator.to(Iterator.scala:1157)

at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)

at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)

at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

at org.apache.spark.scheduler.Task.run(Task.scala:89)

at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)


Re: Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Andy Davidson
Hi Ted

In general I want this application to use all available resources. I just
bumped the driver memory to 2G. I also bumped the executor memory up to 2G.

It will take a couple of hours before I know if this made a difference or
not.

I am not sure if setting executor memory is a good idea. I am concerned that
this will reduce concurrency.

Thanks

Andy

From:  Ted Yu 
Date:  Friday, July 22, 2016 at 2:54 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: Exception in thread "dispatcher-event-loop-1"
java.lang.OutOfMemoryError: Java heap space

> How much heap memory do you give the driver ?
> 
> On Fri, Jul 22, 2016 at 2:17 PM, Andy Davidson 
> wrote:
>> Given I get a stack trace in my python notebook I am guessing the driver is
>> running out of memory?
>> 
>> My app is simple it creates a list of dataFrames from s3://, and counts each
>> one. I would not think this would take a lot of driver memory
>> 
>> I am not running my code locally. Its using 12 cores. Each node has 6G.
>> 
>> Any suggestions would be greatly appreciated
>> 
>> Andy
>> 
>> def work():
>> 
>> constituentDFS = getDataFrames(constituentDataSets)
>> 
>> results = ["{} {}".format(name, constituentDFS[name].count()) for name in
>> constituentDFS]
>> 
>> print(results)
>> 
>> return results
>> 
>> 
>> 
>> %timeit -n 1 -r 1 results = work()
>> 
>> 
>>  in (.0)  1 def work():  2
>> constituentDFS = getDataFrames(constituentDataSets)> 3 results = ["{}
>> {}".format(name, constituentDFS[name].count()) for name in constituentDFS]
>> 4 print(results)  5 return results
>> 
>> 16/07/22 17:54:38 WARN TaskSetManager: Stage 146 contains a task of very
>> large size (145 KB). The maximum recommended task size is 100 KB.
>> 
>> 16/07/22 18:39:47 WARN HeartbeatReceiver: Removing executor 2 with no recent
>> heartbeats: 153037 ms exceeds timeout 12 ms
>> 
>> Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError:
>> Java heap space
>> 
>> at java.util.jar.Manifest$FastInputStream.(Manifest.java:332)
>> 
>> at java.util.jar.Manifest$FastInputStream.(Manifest.java:327)
>> 
>> at java.util.jar.Manifest.read(Manifest.java:195)
>> 
>> at java.util.jar.Manifest.(Manifest.java:69)
>> 
>> at java.util.jar.JarFile.getManifestFromReference(JarFile.java:199)
>> 
>> at java.util.jar.JarFile.getManifest(JarFile.java:180)
>> 
>> at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:944)
>> 
>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
>> 
>> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>> 
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>> 
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>> 
>> at java.security.AccessController.doPrivileged(Native Method)
>> 
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>> 
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> 
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> 
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> 
>> at 
>> org.apache.spark.scheduler.TaskSchedulerImpl.logExecutorLoss(TaskSchedulerImp
>> l.scala:510)
>> 
>> at 
>> org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.s
>> cala:473)
>> 
>> at 
>> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceive
>> r$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:199)
>> 
>> at 
>> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceive
>> r$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:195)
>> 
>> at 
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(Traversa
>> bleLike.scala:772)
>> 
>> at 
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>> 
>> at 
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>> 
>> at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>> 
>> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>> 
>> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>> 
>> at 
>> 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771>>
)
>> 
>> at org.apache.spark.HeartbeatReceiver.org
>> 
>> $apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:195)
>> 
>> at 
>> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(Hea
>> rtbeatReceiver.scala:118)
>> 
>> at 
>> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:
>> 104)
>> 
>> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>> 
>> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>> 
>> 16/07/22 19:08:29 WARN NettyRpcEnv: Ignored message: true
>> 
>> 
> 




Re: spark and plot data

2016-07-22 Thread Pedro Rodriguez
As of the most recent 0.6.0 release it's partially alleviated, but still not
great (compared to something like Jupyter).

They can be "downloaded", but that is only really meaningful for importing them
back into Zeppelin. It would be great if they could be exported as HTML or
PDF, but at present they can't be. I know they have some sort of git
support, but it was never clear to me how it was supposed to be used since
the docs are sparse on that. So far what works best for us is S3 storage,
but you don't get the benefit of GitHub using that (history + commits etc).

There are a couple of other notebooks floating around; Apache Toree seems the
most promising for portability since it's based on Jupyter:
https://github.com/apache/incubator-toree

On Fri, Jul 22, 2016 at 3:53 PM, Gourav Sengupta 
wrote:

> The biggest stumbling block to using Zeppelin has been that we cannot
> download the notebooks, cannot export them and certainly cannot sync them
> back to Github, without mind numbing and sometimes irritating hacks. Have
> those issues been resolved?
>
>
> Regards,
> Gourav
>
>
> On Fri, Jul 22, 2016 at 2:22 PM, Pedro Rodriguez 
> wrote:
>
>> Zeppelin works great. The other thing that we have done in notebooks
>> (like Zeppelin or Databricks) which support multiple types of spark session
>> is register Spark SQL temp tables in our scala code then escape hatch to
>> python for plotting with seaborn/matplotlib when the built in plots are
>> insufficient.
>>
>> —
>> Pedro Rodriguez
>> PhD Student in Large-Scale Machine Learning | CU Boulder
>> Systems Oriented Data Scientist
>> UC Berkeley AMPLab Alumni
>>
>> pedrorodriguez.io | 909-353-4423
>> github.com/EntilZha | LinkedIn
>> 
>>
>> On July 22, 2016 at 3:04:48 AM, Marco Colombo (
>> ing.marco.colo...@gmail.com) wrote:
>>
>> Take a look at zeppelin
>>
>> http://zeppelin.apache.org
>>
>>> On Thursday, 21 July 2016, Andy Davidson 
>>> wrote:
>>
>>> Hi Pseudo
>>>
>>> Plotting, graphing, data visualization, report generation are common
>>> needs in scientific and enterprise computing.
>>>
>>> Can you tell me more about your use case? What is it about the current
>>> process / workflow do you think could be improved by pushing plotting (I
>>> assume you mean plotting and graphing) into spark.
>>>
>>>
>>> In my personal work all the graphing is done in the driver on summary
>>> stats calculated using spark. So for me using standard python libs has not
>>> been a problem.
>>>
>>> Andy
>>>
>>> From: pseudo oduesp 
>>> Date: Thursday, July 21, 2016 at 8:30 AM
>>> To: "user @spark" 
>>> Subject: spark and plot data
>>>
>>> Hi ,
>>> i know spark  it s engine  to compute large data set but for me i work
>>> with pyspark and it s very wonderful machine
>>>
>>> my question  we  don't have tools for ploting data each time we have to
>>> switch and go back to python for using plot.
>>> but when you have large result scatter plot or roc curve  you cant use
>>> collect to take data .
>>>
>>> somone have propostion for plot .
>>>
>>> thanks
>>>
>>>
>>
>> --
>> Ing. Marco Colombo
>>
>>
>


-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: How to search on a Dataset / RDD <Row, Long >

2016-07-22 Thread Pedro Rodriguez
You might look at monotonically_increasing_id() here
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions
instead of converting it to an RDD, since you pay a performance penalty for
that.

If you want to change the name you can do something like this (in Scala, since
I am not familiar with the Java API, but it should be similar in Java):

val df = sqlContext.sql("select bid, name from business")
  .withColumn("id", monotonically_increasing_id())
// some steps later on
df.withColumn("name", $"id")

I am not 100% sure what you mean by updating the data structure; I am guessing
you mean replace the name column with the id column? Note that on the second
line the withColumn call uses $"id", which in Scala converts to a Column. In
Java maybe it's something like new Column("id"), not sure.

Pedro

On Fri, Jul 22, 2016 at 12:21 PM, VG  wrote:

> Any suggestions here  please
>
> I basically need an ability to look up *name -> index* and *index -> name*
> in the code
>
> -VG
>
> On Fri, Jul 22, 2016 at 6:40 PM, VG  wrote:
>
>> Hi All,
>>
>> I am really confused how to proceed further. Please help.
>>
>> I have a dataset created as follows:
>> Dataset b = sqlContext.sql("SELECT bid, name FROM business");
>>
>> Now I need to map each name with a unique index and I did the following
>> JavaPairRDD indexedBId = business.javaRDD()
>>
>>  .zipWithIndex();
>>
>> In later part of the code I need to change a datastructure and update
>> name with index value generated above .
>> I am unable to figure out how to do a look up here..
>>
>> Please suggest /.
>>
>> If there is a better way to do this please suggest that.
>>
>> Regards
>> VG
>>
>>
>


-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Pedro Rodriguez
This should work and I don't think triggers any actions:

df.rdd.partitions.length
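
(For completeness, the PySpark analogue of the Scala one-liner above, not
SparkR:)

df.rdd.getNumPartitions()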

On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang  wrote:

> Seems no function does this in Spark 2.0 preview?
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Ted Yu
How much heap memory do you give the driver ?

On Fri, Jul 22, 2016 at 2:17 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Given I get a stack trace in my python notebook I am guessing the driver
> is running out of memory?
>
> My app is simple it creates a list of dataFrames from s3://, and counts
> each one. I would not think this would take a lot of driver memory
>
> I am not running my code locally. Its using 12 cores. Each node has 6G.
>
> Any suggestions would be greatly appreciated
>
> Andy
>
> def work():
>
> constituentDFS = getDataFrames(constituentDataSets)
>
> results = ["{} {}".format(name, constituentDFS[name].count()) for name
> in constituentDFS]
>
> print(results)
>
> return results
>
>
> %timeit -n 1 -r 1 results = work()
>
>
>  in (.0)  1 def work():  2
>  constituentDFS = getDataFrames(constituentDataSets)> 3 results = 
> ["{} {}".format(name, constituentDFS[name].count()) for name in 
> constituentDFS]  4 print(results)  5 return results
>
>
> 16/07/22 17:54:38 WARN TaskSetManager: Stage 146 contains a task of very
> large size (145 KB). The maximum recommended task size is 100 KB.
>
> 16/07/22 18:39:47 WARN HeartbeatReceiver: Removing executor 2 with no
> recent heartbeats: 153037 ms exceeds timeout 12 ms
>
> Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError:
> Java heap space
>
> at java.util.jar.Manifest$FastInputStream.(Manifest.java:332)
>
> at java.util.jar.Manifest$FastInputStream.(Manifest.java:327)
>
> at java.util.jar.Manifest.read(Manifest.java:195)
>
> at java.util.jar.Manifest.(Manifest.java:69)
>
> at java.util.jar.JarFile.getManifestFromReference(JarFile.java:199)
>
> at java.util.jar.JarFile.getManifest(JarFile.java:180)
>
> at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:944)
>
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
>
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.logExecutorLoss(TaskSchedulerImpl.scala:510)
>
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:473)
>
> at
> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:199)
>
> at
> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:195)
>
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>
> at org.apache.spark.HeartbeatReceiver.org
> $apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:195)
>
> at
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:118)
>
> at
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>
> 16/07/22 19:08:29 WARN NettyRpcEnv: Ignored message: true
>
>
>


Re: spark and plot data

2016-07-22 Thread Gourav Sengupta
The biggest stumbling block to using Zeppelin has been that we cannot
download the notebooks, cannot export them, and certainly cannot sync them
back to GitHub without mind-numbing and sometimes irritating hacks. Have
those issues been resolved?


Regards,
Gourav


On Fri, Jul 22, 2016 at 2:22 PM, Pedro Rodriguez 
wrote:

> Zeppelin works great. The other thing that we have done in notebooks (like
> Zeppelin or Databricks) which support multiple types of spark session is
> register Spark SQL temp tables in our scala code then escape hatch to
> python for plotting with seaborn/matplotlib when the built in plots are
> insufficient.
>
> —
> Pedro Rodriguez
> PhD Student in Large-Scale Machine Learning | CU Boulder
> Systems Oriented Data Scientist
> UC Berkeley AMPLab Alumni
>
> pedrorodriguez.io | 909-353-4423
> github.com/EntilZha | LinkedIn
> 
>
> On July 22, 2016 at 3:04:48 AM, Marco Colombo (ing.marco.colo...@gmail.com)
> wrote:
>
> Take a look at zeppelin
>
> http://zeppelin.apache.org
>
> On Thursday, 21 July 2016, Andy Davidson 
> wrote:
>
>> Hi Pseudo
>>
>> Plotting, graphing, data visualization, report generation are common
>> needs in scientific and enterprise computing.
>>
>> Can you tell me more about your use case? What is it about the current
>> process / workflow do you think could be improved by pushing plotting (I
>> assume you mean plotting and graphing) into spark.
>>
>>
>> In my personal work all the graphing is done in the driver on summary
>> stats calculated using spark. So for me using standard python libs has not
>> been a problem.
>>
>> Andy
>>
>> From: pseudo oduesp 
>> Date: Thursday, July 21, 2016 at 8:30 AM
>> To: "user @spark" 
>> Subject: spark and plot data
>>
>> Hi ,
>> i know spark  it s engine  to compute large data set but for me i work
>> with pyspark and it s very wonderful machine
>>
>> my question  we  don't have tools for ploting data each time we have to
>> switch and go back to python for using plot.
>> but when you have large result scatter plot or roc curve  you cant use
>> collect to take data .
>>
>> somone have propostion for plot .
>>
>> thanks
>>
>>
>
> --
> Ing. Marco Colombo
>
>


Re: Fatal error when using broadcast variables and checkpointing in Spark Streaming

2016-07-22 Thread Joe Panciera
I realized that there's an error in the code. Corrected:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream



sc = SparkContext(appName="FileAutomation")

# Create streaming context from existing spark context
ssc = StreamingContext(sc, 10)
alert_stream = KinesisUtils.createStream(ssc,
 "Events",  # App Name
 "Event_Test",  # Stream Name

"https://kinesis.us-west-2.amazonaws.com;,
 "us-west-2",
 InitialPositionInStream.LATEST,
 1
 )

events = sc.broadcast(25)


def test(rdd):

global events
num = events.value
print num

events.unpersist()
events = sc.broadcast(num + 1)


alert_stream.foreachRDD(test)

# Comment this line and no error occurs
ssc.checkpoint('dir')
ssc.start()
ssc.awaitTermination()
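
A commonly suggested workaround for combining broadcast variables with
checkpointing is to keep the broadcast in a lazily initialized singleton and to
fetch the SparkContext from the RDD rather than from the enclosing scope, so
the checkpointed closure never references the context directly. A rough,
untested sketch of that idea:

class BroadcastWrapper(object):
    """Holds the broadcast outside the checkpointed closure."""
    _instance = None

    @classmethod
    def get(cls, sc, initial=25):
        if cls._instance is None:
            cls._instance = sc.broadcast(initial)
        return cls._instance

    @classmethod
    def update(cls, sc, new_value):
        if cls._instance is not None:
            cls._instance.unpersist()
        cls._instance = sc.broadcast(new_value)
        return cls._instance


def test(rdd):
    sc = rdd.context                  # context taken from the RDD, not the closure
    bc = BroadcastWrapper.get(sc)
    print bc.value
    BroadcastWrapper.update(sc, bc.value + 1)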


On Fri, Jul 22, 2016 at 1:50 PM, Joe Panciera 
wrote:

> Hi,
>
> I'm attempting to use broadcast variables to update stateful values used
> across the cluster for processing. Essentially, I have a function that is
> executed in .foreachRDD that updates the broadcast variable by calling
> unpersist() and then rebroadcasting. This works without issues when I
> execute the code without checkpointing, but as soon as I include
> checkpointing it seems to be unable to pickle the function. I get this
> error:
>
> *It appears that you are attempting to reference SparkContext from a
> broadcast *
>
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line
> 268, in __getnewargs__
> "It appears that you are attempting to reference SparkContext from a
> broadcast "
> Exception: It appears that you are attempting to reference SparkContext
> from a broadcast variable, action, or transformation. SparkContext can only
> be used on the driver, not in code that it run on workers. For more
> information, see SPARK-5063.
>
> at
> org.apache.spark.streaming.api.python.PythonTransformFunctionSerializer$.serialize(PythonDStream.scala:144)
> at
> org.apache.spark.streaming.api.python.TransformFunction$$anonfun$writeObject$1.apply$mcV$sp(PythonDStream.scala:101)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
> ... 61 more
>
>
> Here's some simple code that shows this occurring:
>
> from pyspark import SparkContext
> from pyspark.streaming import StreamingContext
> from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
>
>
>
> sc = SparkContext(appName="FileAutomation")
>
> # Create streaming context from existing spark context
> ssc = StreamingContext(sc, 10)
> alert_stream = KinesisUtils.createStream(ssc,
>  "Events",  # App Name
>  "Event_Test",  # Stream Name
>  
> "https://kinesis.us-west-2.amazonaws.com;,
>  "us-west-2",
>  InitialPositionInStream.LATEST,
>  1
>  )
>
> events = sc.broadcast(25)
>
>
> def test(rdd):
>
> global events
> num = events.value
> print num
>
> events.unpersist()
> events = sc.broadcast(num + 1)
>
>
> events.foreachRDD(test)
>
> # Comment this line and no error occurs
> ssc.checkpoint('dir')
> ssc.start()
> ssc.awaitTermination()
>
>


Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Andy Davidson
Given I get a stack trace in my python notebook I am guessing the driver is
running out of memory?

My app is simple: it creates a list of DataFrames from s3:// and counts each
one. I would not think this would take a lot of driver memory.

I am not running my code locally. It's using 12 cores. Each node has 6G.

Any suggestions would be greatly appreciated

Andy

def work():

constituentDFS = getDataFrames(constituentDataSets)

results = ["{} {}".format(name, constituentDFS[name].count()) for name
in constituentDFS]

print(results)

return results



%timeit -n 1 -r 1 results = work()


 in (.0)  1 def work():  2
constituentDFS = getDataFrames(constituentDataSets)> 3 results =
["{} {}".format(name, constituentDFS[name].count()) for name in
constituentDFS]  4 print(results)  5 return results

16/07/22 17:54:38 WARN TaskSetManager: Stage 146 contains a task of very
large size (145 KB). The maximum recommended task size is 100 KB.

16/07/22 18:39:47 WARN HeartbeatReceiver: Removing executor 2 with no recent
heartbeats: 153037 ms exceeds timeout 12 ms

Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError:
Java heap space

at java.util.jar.Manifest$FastInputStream.(Manifest.java:332)

at java.util.jar.Manifest$FastInputStream.(Manifest.java:327)

at java.util.jar.Manifest.read(Manifest.java:195)

at java.util.jar.Manifest.(Manifest.java:69)

at java.util.jar.JarFile.getManifestFromReference(JarFile.java:199)

at java.util.jar.JarFile.getManifest(JarFile.java:180)

at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:944)

at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)

at java.net.URLClassLoader.access$100(URLClassLoader.java:73)

at java.net.URLClassLoader$1.run(URLClassLoader.java:368)

at java.net.URLClassLoader$1.run(URLClassLoader.java:362)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:361)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

at org.apache.spark.scheduler.TaskSchedulerImpl.logExecutorLoss(TaskSchedulerImpl.scala:510)

at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:473)

at org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:199)

at org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:195)

at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)

at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)

at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)

at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

at org.apache.spark.HeartbeatReceiver.org$apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:195)

at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:118)

at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)

at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)

at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)

16/07/22 19:08:29 WARN NettyRpcEnv: Ignored message: true






Re: MLlib, Java, and DataFrame

2016-07-22 Thread Marco Mistroni
Hi Inam,
  I sorted it.
I am replying to all, in case anyone else follows the blog and gets into the
same issue.

- First off, the environment: I have tested the sample using purely
spark-1.6.1, no Hive, no Hadoop. I launched pyspark as follows: pyspark
--packages com.databricks:spark-csv_2.10:1.4.0

- Secondly, please note that when I do printSchema (at step 1) the column
'Churn' is listed as 'boolean', not as a string like in the blog. This might
be due to the spark-csv version I am using (1.4.0):

>>> CV_data.printSchema()
root
 |-- State: string (nullable = true)
 |-- Account length: integer (nullable = true)
 |-- Area code: integer (nullable = true)
 |-- International plan: string (nullable = true)
 |-- Voice mail plan: string (nullable = true)
 |-- Number vmail messages: integer (nullable = true)
 |-- Total day minutes: double (nullable = true)
 |-- Total day calls: integer (nullable = true)
 |-- Total day charge: double (nullable = true)
 |-- Total eve minutes: double (nullable = true)
 |-- Total eve calls: integer (nullable = true)
 |-- Total eve charge: double (nullable = true)
 |-- Total night minutes: double (nullable = true)
 |-- Total night calls: integer (nullable = true)
 |-- Total night charge: double (nullable = true)
 |-- Total intl minutes: double (nullable = true)
 |-- Total intl calls: integer (nullable = true)
 |-- Total intl charge: double (nullable = true)
 |-- Customer service calls: integer (nullable = true)
 |-- Churn: boolean (nullable = true)



- Thirdly, at step 6, please replace the binary_map definition with the
following (a sketch of wiring it into the blog's toNum step follows).

As I said, Churn is not a string column but a boolean, and therefore the
blog's toNum function will fail big time.

binary_map = {'Yes': 1.0, 'No': 0.0, True: 1.0, False: 0.0}
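
A sketch of how that step then looks when wired into a UDF (toNum and CV_data
follow the blog's step 6 naming; treat the exact column list as illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

binary_map = {'Yes': 1.0, 'No': 0.0, True: 1.0, False: 0.0}
toNum = udf(lambda k: binary_map[k], DoubleType())

CV_data = CV_data.withColumn('Churn', toNum(CV_data['Churn'])) \
    .withColumn('International plan', toNum(CV_data['International plan'])) \
    .withColumn('Voice mail plan', toNum(CV_data['Voice mail plan']))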

I managed to arrive at step 7 without any issues (I don't have matplotlib, so
I skipped step 5, which I guess is irrelevant as it just displays the data
rather than doing any logic).

Pls let me know if this fixes your problems..

hth

 marco












On Fri, Jul 22, 2016 at 6:34 PM, Inam Ur Rehman 
wrote:

> Hello guys..i know its irrelevant to this topic but i've been looking
> desperately for the solution. I am facing en exception
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html
>
> plz help me.. I couldn't find any solution..plz
>
> On Fri, Jul 22, 2016 at 5:50 PM, Jean Georges Perrin  wrote:
>
>> Thanks Marco - I like the idea of sticking with DataFrames ;)
>>
>>
>> On Jul 22, 2016, at 7:07 AM, Marco Mistroni  wrote:
>>
>> Hello Jean
>>  you can take your current DataFrame and send it to mllib (I was doing
>> that because I didn't know the ml package), but the process is a little bit
>> cumbersome
>>
>>
>> 1. go from DataFrame to an RDD of [LabeledPoint]
>> 2. run your ML model
>>
>> i'd suggest you stick to DataFrame + ml package :)
>>
>> hth
>>
>>
>>
>> On Fri, Jul 22, 2016 at 4:41 AM, Jean Georges Perrin  wrote:
>>
>>> Hi,
>>>
>>> I am looking for some really super basic examples of MLlib (like a
>>> linear regression over a list of values) in Java. I have found a few, but I
>>> only saw them using JavaRDD... and not DataFrame.
>>>
>>> I was kind of hoping to take my current DataFrame and send them in
>>> MLlib. Am I too optimistic? Do you know/have any example like that?
>>>
>>> Thanks!
>>>
>>> jg
>>>
>>>
>>> Jean Georges Perrin
>>> j...@jgp.net / @jgperrin
>>>
>>>
>>>
>>>
>>>
>>
>>
>


Fatal error when using broadcast variables and checkpointing in Spark Streaming

2016-07-22 Thread Joe Panciera
Hi,

I'm attempting to use broadcast variables to update stateful values used
across the cluster for processing. Essentially, I have a function that is
executed in .foreachRDD that updates the broadcast variable by calling
unpersist() and then rebroadcasting. This works without issues when I
execute the code without checkpointing, but as soon as I include
checkpointing it seems to be unable to pickle the function. I get this
error:

*It appears that you are attempting to reference SparkContext from a
broadcast *

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line
268, in __getnewargs__
"It appears that you are attempting to reference SparkContext from a
broadcast "
Exception: It appears that you are attempting to reference SparkContext
from a broadcast variable, action, or transformation. SparkContext can only
be used on the driver, not in code that it run on workers. For more
information, see SPARK-5063.

at
org.apache.spark.streaming.api.python.PythonTransformFunctionSerializer$.serialize(PythonDStream.scala:144)
at
org.apache.spark.streaming.api.python.TransformFunction$$anonfun$writeObject$1.apply$mcV$sp(PythonDStream.scala:101)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
... 61 more


Here's some simple code that shows this occurring:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream



sc = SparkContext(appName="FileAutomation")

# Create streaming context from existing spark context
ssc = StreamingContext(sc, 10)
alert_stream = KinesisUtils.createStream(ssc,
 "Events",  # App Name
 "Event_Test",  # Stream Name

"https://kinesis.us-west-2.amazonaws.com;,
 "us-west-2",
 InitialPositionInStream.LATEST,
 1
 )

events = sc.broadcast(25)


def test(rdd):

global events
num = events.value
print num

events.unpersist()
events = sc.broadcast(num + 1)


events.foreachRDD(test)

# Comment this line and no error occurs
ssc.checkpoint('dir')
ssc.start()
ssc.awaitTermination()


How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Neil Chang
Seems no function does this in Spark 2.0 preview?


Distributed Matrices - spark mllib

2016-07-22 Thread Gourav Sengupta
Hi,

I have a sparse matrix and I want to add up the values of a particular row,
which is identified by a particular row number.

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

mat = CoordinateMatrix(
    all_scores_df.select('ID_1', 'ID_2', 'value').rdd.map(lambda row: MatrixEntry(*row)))


This gives me the number of rows and columns, but I am not able to extract
the values, and it always reports back the error:

AttributeError: 'NoneType' object has no attribute 'setCallSite'
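
One way to get at the values of a single row is through the matrix's entries
RDD (a sketch; row_id is illustrative):

row_id = 42
row_entries = (mat.entries
                  .filter(lambda e: e.i == row_id)   # keep that row's entries
                  .map(lambda e: (e.j, e.value)))    # (column, value) pairs
row_sum = row_entries.map(lambda kv: kv[1]).sum()    # add up the row's values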


Thanks and Regards,

Gourav Sengupta


Spark, Scala, and DNA sequencing

2016-07-22 Thread James McCabe

Hi!

I hope this may be of use/interest to someone:

Spark, a Worked Example: Speeding Up DNA Sequencing

http://scala-bility.blogspot.nl/2016/07/spark-worked-example-speeding-up-dna.html 



James


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: ml models distribution

2016-07-22 Thread Chris Fregly
hey everyone-

this concept of deploying your Spark ML Pipelines and Algos into Production
(user-facing production) has been coming up a lot recently.

so much so, that i've dedicated the last few months of my research and
engineering efforts to build out the infrastructure to support this in a
highly-scalable, highly-available way.

i've combined my Netflix + NetflixOSS work experience with my
Databricks/IBM + Spark work experience into an open source project,
PipelineIO, here:  http://pipeline.io

we're even serving up TensorFlow AI models using the same infrastructure -
incorporating key patterns from TensorFlow Distributed + TensorFlow Serving!

everything is open source, based on Docker + Kubernetes + NetflixOSS +
Spark + TensorFlow + Redis + Hybrid Cloud + On-Premise + Kafka + Zeppelin +
Jupyter/iPython with a heavy emphasis on metrics and monitoring of models
and server production statistics.

we're doing code generation directly from the saved Spark ML models (thanks
Spark 2.0 for giving us save/load parity across all models!) for optimized
model serving using both CPUs and GPUs, incremental training of models,
autoscaling, the whole works.

our friend from Netflix, Chaos Monkey, even makes a grim appearance from
time to time to prove that we're resilient to failure.

take a peek.  it's cool.  we've come a long way in the last couple months,
and we've got a lot of work left to do, but the core infrastructure is in
place, key features have been built, and we're moving quickly.

shoot me an email if you'd like to get involved.  lots of TODO's.

we're dedicating my upcoming Advanced Spark and TensorFlow Meetup on August
4th in SF to demo'ing this infrastructure to you all.

here's the link:
http://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/231457813/


video recording + screen capture will be posted afterward, as always.

we've got a workshop dedicated to building an end-to-end Spark ML and
Kafka-based Recommendation Pipeline - including the PipelineIO serving
platform.  link is here:  http://pipeline.io

and i'm finishing a blog post soon to detail everything we've done so far -
and everything we're actively building.  this post will be available on
http://pipeline.io - as well as cross-posted to a number of my favorite
engineering blogs.

global demo roadshow starts 8/8.  shoot me an email if you want to see all
this in action, otherwise i'll see you at a workshop or meetup near you!  :)



On Fri, Jul 22, 2016 at 10:34 AM, Inam Ur Rehman 
wrote:

> Hello guys..i know its irrelevant to this topic but i've been looking
> desperately for the solution. I am facing en exception
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html
>
> plz help me.. I couldn't find any solution.. plz
>
> On Fri, Jul 22, 2016 at 6:12 PM, Sean Owen  wrote:
>
>> No there isn't anything in particular, beyond the various bits of
>> serialization support that write out something to put in your storage
>> to begin with. What you do with it after reading and before writing is
>> up to your app, on purpose.
>>
>> If you mean you're producing data outside the model that your model
>> uses, your model data might be produced by an RDD operation, and saved
>> that way. There it's no different than anything else you do with RDDs.
>>
>> What part are you looking to automate beyond those things? that's most of
>> it.
>>
>> On Fri, Jul 22, 2016 at 2:04 PM, Sergio Fernández 
>> wrote:
>> > Hi Sean,
>> >
>> > On Fri, Jul 22, 2016 at 12:52 PM, Sean Owen  wrote:
>> >>
>> >> If you mean, how do you distribute a new model in your application,
>> >> then there's no magic to it. Just reference the new model in the
>> >> functions you're executing in your driver.
>> >>
>> >> If you implemented some other manual way of deploying model info, just
>> >> do that again. There's no special thing to know.
>> >
>> >
>> > Well, because some huge model, we typically bundle both logic
>> > (pipeline/application)  and models separately. Normally we use a shared
>> > stores (e.g., HDFS) or coordinated distribution of the models. But I
>> wanted
>> > to know if there is any infrastructure in Spark that specifically
>> addresses
>> > such need.
>> >
>> > Thanks.
>> >
>> > Cheers,
>> >
>> > P.S.: sorry Jacek, with "ml" I meant "Machine Learning". I thought is a
>> > quite spread acronym. Sorry for the possible confusion.
>> >
>> >
>> > --
>> > Sergio Fernández
>> > Partner Technology Manager
>> > Redlink GmbH
>> > m: +43 6602747925
>> > e: sergio.fernan...@redlink.co
>> > w: http://redlink.co
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


-- 
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
pipeline.io
advancedspark.com


Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
Scaladoc is already in the code, just not the html docs

On Fri, Jul 22, 2016 at 1:46 PM, Srikanth  wrote:
> Yeah, that's what I thought. We need to redefine not just restart.
> Thanks for the info!
>
> I do see the usage of subscribe[K,V] in your DStreams example.
> Looks simple but its not very obvious how it works :-)
> I'll watch out for the docs and ScalaDoc.
>
> Srikanth
>
> On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger  wrote:
>>
>> No, restarting from a checkpoint won't do it, you need to re-define the
>> stream.
>>
>> Here's the jira for the 0.10 integration
>>
>> https://issues.apache.org/jira/browse/SPARK-12177
>>
>> I haven't gotten docs completed yet, but there are examples at
>>
>> https://github.com/koeninger/kafka-exactly-once/tree/kafka-0.10
>>
>> On Fri, Jul 22, 2016 at 1:05 PM, Srikanth  wrote:
>> > In Spark 1.x, if we restart from a checkpoint, will it read from new
>> > partitions?
>> >
>> > If you can, pls point us to some doc/link that talks about Kafka 0.10
>> > integ
>> > in Spark 2.0.
>> >
>> > On Fri, Jul 22, 2016 at 1:33 PM, Cody Koeninger 
>> > wrote:
>> >>
>> >> For the integration for kafka 0.8, you are literally starting a
>> >> streaming job against a fixed set of topicapartitions,  It will not
>> >> change throughout the job, so you'll need to restart the spark job if
>> >> you change kafka partitions.
>> >>
>> >> For the integration for kafka 0.10 / spark 2.0, if you use subscribe
>> >> or subscribepattern, it should pick up new partitions as they are
>> >> added.
>> >>
>> >> On Fri, Jul 22, 2016 at 11:29 AM, Srikanth 
>> >> wrote:
>> >> > Hello,
>> >> >
>> >> > I'd like to understand how Spark Streaming(direct) would handle Kafka
>> >> > partition addition?
>> >> > Will a running job be aware of new partitions and read from it?
>> >> > Since it uses Kafka APIs to query offsets and offsets are handled
>> >> > internally.
>> >> >
>> >> > Srikanth
>> >
>> >
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
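
For reference, a minimal sketch of the Subscribe-based direct stream from Java, assuming the spark-streaming-kafka-0-10 artifact that ships with Spark 2.0 and Java 8 lambdas; the broker address, group id and topic name are placeholders:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class SubscribeExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("kafka-0-10-subscribe").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092");      // placeholder broker list
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "my-consumer-group");          // placeholder group id

        // Subscribe (rather than Assign) lets the underlying consumer pick up
        // partitions that are added to the topic after the stream starts.
        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Arrays.asList("my-topic"), kafkaParams));

        stream.foreachRDD(rdd -> System.out.println("records in this batch: " + rdd.count()));

        jssc.start();
        jssc.awaitTermination();
    }
}

Using Assign instead of Subscribe pins the stream to a fixed set of TopicPartitions, which is essentially the 0.8-style behaviour described above.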



Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Srikanth
Yeah, that's what I thought. We need to redefine not just restart.
Thanks for the info!

I do see the usage of subscribe[K,V] in your DStreams example.
Looks simple but it's not very obvious how it works :-)
I'll watch out for the docs and ScalaDoc.

Srikanth

On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger  wrote:

> No, restarting from a checkpoint won't do it, you need to re-define the
> stream.
>
> Here's the jira for the 0.10 integration
>
> https://issues.apache.org/jira/browse/SPARK-12177
>
> I haven't gotten docs completed yet, but there are examples at
>
> https://github.com/koeninger/kafka-exactly-once/tree/kafka-0.10
>
> On Fri, Jul 22, 2016 at 1:05 PM, Srikanth  wrote:
> > In Spark 1.x, if we restart from a checkpoint, will it read from new
> > partitions?
> >
> > If you can, pls point us to some doc/link that talks about Kafka 0.10
> integ
> > in Spark 2.0.
> >
> > On Fri, Jul 22, 2016 at 1:33 PM, Cody Koeninger 
> wrote:
> >>
> >> For the integration for kafka 0.8, you are literally starting a
> >> streaming job against a fixed set of topicapartitions,  It will not
> >> change throughout the job, so you'll need to restart the spark job if
> >> you change kafka partitions.
> >>
> >> For the integration for kafka 0.10 / spark 2.0, if you use subscribe
> >> or subscribepattern, it should pick up new partitions as they are
> >> added.
> >>
> >> On Fri, Jul 22, 2016 at 11:29 AM, Srikanth 
> wrote:
> >> > Hello,
> >> >
> >> > I'd like to understand how Spark Streaming(direct) would handle Kafka
> >> > partition addition?
> >> > Will a running job be aware of new partitions and read from it?
> >> > Since it uses Kafka APIs to query offsets and offsets are handled
> >> > internally.
> >> >
> >> > Srikanth
> >
> >
>


Hive Exception

2016-07-22 Thread Inam Ur Rehman
Hi All,
I am really stuck here. I know this has been asked before, but the fix just
won't work for me. I am using the Anaconda 3.5 distribution and I have built
spark-1.6.2 twice: the first time with Hive and JDBC support through this
command
*mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver
-DskipTests clean package*  which gives the Hive exception,
and the second time through this command
*./make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.4
-Phive -Phive-thriftserver -Pyarn *which is also giving me the exception.
I have also tried the pre-built version spark-1.6.1-bin-hadoop2.6, but the
exception remains the same.
The things I've tried to solve this:
1) place hive-site.xml in the spark\conf folder (it was not there before)
2) set SPARK_HIVE=true
3) run sbt assembly
but the problem is still there.

here is the full error
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt
assembly
---
Py4JJavaError Traceback (most recent call last)
 in ()
  3
  4 binary_map = {'Yes':1.0, 'No':0.0, 'True':1.0, 'False':0.0}
> 5 toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
  6
  7 CV_data = CV_data.drop('State').drop('Area code') .drop('Total
day charge').drop('Total eve charge') .drop('Total night
charge').drop('Total intl charge') .withColumn('Churn',
toNum(CV_data['Churn'])) .withColumn('International plan',
toNum(CV_data['International plan'])) .withColumn('Voice mail plan',
toNum(CV_data['Voice mail plan'])).cache()

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\pyspark\sql\functions.py
in __init__(self, func, returnType, name)
   1556 self.returnType = returnType
   1557 self._broadcast = None
-> 1558 self._judf = self._create_judf(name)
   1559
   1560 def _create_judf(self, name):

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\pyspark\sql\functions.py
in _create_judf(self, name)
   1567 pickled_command, broadcast_vars, env, includes =
_prepare_for_python_RDD(sc, command, self)
   1568 ctx = SQLContext.getOrCreate(sc)
-> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
   1570 if name is None:
   1571 name = f.__name__ if hasattr(f, '__name__') else
f.__class__.__name__

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\pyspark\sql\context.py
in _ssql_ctx(self)
681 try:
682 if not hasattr(self, '_scala_HiveContext'):
--> 683 self._scala_HiveContext = self._get_hive_ctx()
684 return self._scala_HiveContext
685 except Py4JError as e:

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\pyspark\sql\context.py
in _get_hive_ctx(self)
690
691 def _get_hive_ctx(self):
--> 692 return self._jvm.HiveContext(self._jsc.sc())
693
694 def refreshTable(self, tableName):

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py
in __call__(self, *args)
   1062 answer = self._gateway_client.send_command(command)
   1063 return_value = get_return_value(
-> 1064 answer, self._gateway_client, None, self._fqn)
   1065
   1066 for temp_arg in temp_args:

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\pyspark\sql\utils.py in
deco(*a, **kw)
 43 def deco(*a, **kw):
 44 try:
---> 45 return f(*a, **kw)
 46 except py4j.protocol.Py4JJavaError as e:
 47 s = e.java_exception.toString()

C:\Users\InAm-Ur-Rehman\Sparkkk\spark-1.6.2\python\lib\py4j-0.9-src.zip\py4j\protocol.py
in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(

Py4JJavaError: An error occurred while calling
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.NullPointerException
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at
org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204


Re: Load selected rows with sqlContext in the dataframe

2016-07-22 Thread sujeet jog
Thanks Todd.

On Thu, Jul 21, 2016 at 9:18 PM, Todd Nist  wrote:

> You can set the dbtable to this:
>
> .option("dbtable", "(select * from master_schema where 'TID' = '100_0')")
>
> HTH,
>
> Todd
>
>
> On Thu, Jul 21, 2016 at 10:59 AM, sujeet jog  wrote:
>
>> I have a table of size 5GB, and want to load selective rows into
>> dataframe instead of loading the entire table in memory,
>>
>>
>> For me memory is a constraint hence , and i would like to peridically
>> load few set of rows and perform dataframe operations on it,
>>
>> ,
>> for the "dbtable"  is there a way to perform select * from master_schema
>> where 'TID' = '100_0';
>> which can load only this to memory as dataframe .
>>
>>
>>
>> Currently  I'm using code as below
>> val df  =  sqlContext.read .format("jdbc")
>>   .option("url", url)
>>   .option("dbtable", "master_schema").load()
>>
>>
>> Thansk,
>> Sujeet
>>
>
>
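
For reference, a sketch of the suggestion above expressed with the Java DataFrameReader API (the Scala call chain is the same); the "AS t" alias is required by several databases when dbtable is a derived table, and only the matching rows are fetched:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SelectiveJdbcLoad {
    public static DataFrame loadOneTid(SQLContext sqlContext, String url, String tid) {
        return sqlContext.read()
                .format("jdbc")
                .option("url", url)
                .option("dbtable", "(SELECT * FROM master_schema WHERE TID = '" + tid + "') AS t")
                .load();
    }
}

If the eventual goal is to walk through the whole 5 GB table in bounded chunks, the JDBC reader's partitionColumn / lowerBound / upperBound / numPartitions options may also be worth a look.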


Re: MLlib, Java, and DataFrame

2016-07-22 Thread Marco Mistroni
How did you build your Spark distribution?
Could you detail the steps?
Hive, AFAIK, depends on Hadoop. If you don't configure your Spark
correctly it will assume Hadoop is your filesystem...
I'm not using Hadoop or Hive. You might want to get a Cloudera
distribution, which has Spark, Hadoop and Hive by default.
Hth

On 22 Jul 2016 6:34 pm, "Inam Ur Rehman"  wrote:

> Hello guys..i know its irrelevant to this topic but i've been looking
> desperately for the solution. I am facing en exception
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html
>
> plz help me.. I couldn't find any solution..plz
>
> On Fri, Jul 22, 2016 at 5:50 PM, Jean Georges Perrin  wrote:
>
>> Thanks Marco - I like the idea of sticking with DataFrames ;)
>>
>>
>> On Jul 22, 2016, at 7:07 AM, Marco Mistroni  wrote:
>>
>> Hello Jean
>>  you can take ur current DataFrame and send them to mllib (i was doing
>> that coz i dindt know the ml package),but the process is littlebit
>> cumbersome
>>
>>
>> 1. go from DataFrame to Rdd of Rdd of [LabeledVectorPoint]
>> 2. run your ML model
>>
>> i'd suggest you stick to DataFrame + ml package :)
>>
>> hth
>>
>>
>>
>> On Fri, Jul 22, 2016 at 4:41 AM, Jean Georges Perrin  wrote:
>>
>>> Hi,
>>>
>>> I am looking for some really super basic examples of MLlib (like a
>>> linear regression over a list of values) in Java. I have found a few, but I
>>> only saw them using JavaRDD... and not DataFrame.
>>>
>>> I was kind of hoping to take my current DataFrame and send them in
>>> MLlib. Am I too optimistic? Do you know/have any example like that?
>>>
>>> Thanks!
>>>
>>> jg
>>>
>>>
>>> Jean Georges Perrin
>>> j...@jgp.net / @jgperrin
>>>
>>>
>>>
>>>
>>>
>>
>>
>
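
For reference, a self-contained sketch of the DataFrame + ml route in Java, assuming Spark 1.6.x (where spark.ml still takes org.apache.spark.mllib.linalg vectors; in 2.0 the vector classes move to org.apache.spark.ml.linalg and DataFrame becomes Dataset<Row>). The toy data set is made up for illustration:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class TinyLinearRegression {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("TinyLinearRegression").setMaster("local");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(jsc);

        // A toy y = 2x data set, expressed as LabeledPoint beans so that
        // createDataFrame() produces the "label" and "features" columns ml expects.
        DataFrame training = sqlContext.createDataFrame(jsc.parallelize(Arrays.asList(
                new LabeledPoint(2.0, Vectors.dense(1.0)),
                new LabeledPoint(4.0, Vectors.dense(2.0)),
                new LabeledPoint(6.0, Vectors.dense(3.0)))), LabeledPoint.class);

        LinearRegression lr = new LinearRegression().setMaxIter(10).setRegParam(0.01);
        LinearRegressionModel model = lr.fit(training);

        System.out.println("coefficients=" + model.coefficients() + " intercept=" + model.intercept());
        jsc.stop();
    }
}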


How to search on a Dataset / RDD <Row, Long >

2016-07-22 Thread VG
Any suggestions here  please

I basically need an ability to look up *name -> index* and *index -> name*
in the code

-VG

On Fri, Jul 22, 2016 at 6:40 PM, VG  wrote:

> Hi All,
>
> I am really confused how to proceed further. Please help.
>
> I have a dataset created as follows:
> Dataset b = sqlContext.sql("SELECT bid, name FROM business");
>
> Now I need to map each name with a unique index and I did the following
> JavaPairRDD indexedBId = business.javaRDD()
>.zipWithIndex();
>
> In later part of the code I need to change a datastructure and update name
> with index value generated above .
> I am unable to figure out how to do a look up here..
>
> Please suggest /.
>
> If there is a better way to do this please suggest that.
>
> Regards
> VG
>
>
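
For reference, one way to get both directions of the lookup is to collect them as driver-side maps (a sketch against the Spark 1.6 Java API with Java 8 lambdas; it assumes the distinct names fit comfortably in driver memory and that "name" is the second column of the query):

import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

import scala.Tuple2;

public class NameIndexLookup {
    public static void buildLookups(DataFrame business) {
        // zipWithIndex gives (Row, index); collect both directions as driver-side maps.
        JavaPairRDD<Row, Long> indexed = business.javaRDD().zipWithIndex();

        // name -> index ("name" is assumed to be column 1, as in the original query)
        Map<String, Long> nameToIndex = indexed
                .mapToPair(t -> new Tuple2<>(t._1().getString(1), t._2()))
                .collectAsMap();

        // index -> name
        Map<Long, String> indexToName = indexed
                .mapToPair(t -> new Tuple2<>(t._2(), t._1().getString(1)))
                .collectAsMap();

        System.out.println(nameToIndex.size() + " names indexed; index 0 is " + indexToName.get(0L));
    }
}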


Re: Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread Marco Mistroni
Hi
So you have a DataFrame; then use zipWithIndex and create a tuple.
I'm not sure the DataFrame API has something useful for zip-with-index.
But you can
- get a DataFrame
- convert it to an RDD (there's a javaRDD() / toJavaRDD())
- do a zipWithIndex

That will give you an RDD with 3 fields...
I don't think you can update DataFrame columns in place.
Hth
On 22 Jul 2016 5:19 pm, "VG"  wrote:

>

> Hi All,
>
> Any suggestions for this
>
> Regards,
> VG
>
> On Fri, Jul 22, 2016 at 6:40 PM, VG  wrote:

>>

>> Hi All,
>>
>> I am really confused how to proceed further. Please help.
>>
>> I have a dataset created as follows:
>> Dataset b = sqlContext.sql("SELECT bid, name FROM business");
>>
>> Now I need to map each name with a unique index and I did the following
>> JavaPairRDD indexedBId = business.javaRDD()
>>
 .zipWithIndex();
>>
>> In later part of the code I need to change a datastructure and update
name with index value generated above .
>> I am unable to figure out how to do a look up here..
>>
>> Please suggest /.
>>
>> If there is a better way to do this please suggest that.
>>
>> Regards
>> VG
>>
>
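
For reference, a sketch of the chain described above (DataFrame to RDD, zipWithIndex, back to a DataFrame with an extra index column), against the Spark 1.6 Java API with Java 8 lambdas; it assumes the two-column (bid, name) result from the original query, and the column name "name_index" is arbitrary:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ZipWithIndexColumn {
    public static DataFrame withIndexColumn(SQLContext sqlContext, DataFrame business) {
        // DataFrame -> RDD -> zipWithIndex -> rebuild each Row with the index appended
        JavaRDD<Row> rows = business.javaRDD()
                .zipWithIndex()
                .map(t -> RowFactory.create(t._1().get(0), t._1().get(1), t._2()));

        // Extend the original two-column schema with the new long column.
        List<StructField> fields = new ArrayList<>(Arrays.asList(business.schema().fields()));
        fields.add(DataTypes.createStructField("name_index", DataTypes.LongType, false));
        StructType schema = DataTypes.createStructType(fields);

        return sqlContext.createDataFrame(rows, schema);
    }
}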


Re: Creating a DataFrame from scratch

2016-07-22 Thread Jean Georges Perrin
You're right, it's the same behavior... Oh well... I wanted something easy :(

> On Jul 22, 2016, at 12:41 PM, Everett Anderson  > wrote:
> 
> Actually, sorry, my mistake, you're calling
> 
>   DataFrame df = sqlContext.createDataFrame(data, 
> org.apache.spark.sql.types.NumericType.class);
> 
> and giving it a list of objects which aren't NumericTypes, but the wildcards 
> in the signature let it happen.
> 
> I'm curious what'd happen if you gave it Integer.class, but I suspect it 
> still won't work because Integer may not have the bean-style getters.
> 
> 
> On Fri, Jul 22, 2016 at 9:37 AM, Everett Anderson  > wrote:
> Hey,
> 
> I think what's happening is that you're calling this createDataFrame method 
> :
> 
> createDataFrame(java.util.List data, java.lang.Class beanClass) 
> 
> which expects a JavaBean-style class with get and set methods for the 
> members, but Integer doesn't have such a getter. 
> 
> I bet there's an easier way if you just want a single-column DataFrame of a 
> primitive type, but one way that would work is to manually construct the Rows 
> using RowFactory.create() 
> 
>  and assemble the DataFrame from that like
> 
> List rows = convert your List to this in a loop with 
> RowFactory.create()
> 
> StructType schema = DataTypes.createStructType(Collections.singletonList(
>  DataTypes.createStructField("int_field", DataTypes.IntegerType, true)));
> 
> DataFrame intDataFrame = sqlContext.createDataFrame(rows, schema);
> 
> 
> 
> On Fri, Jul 22, 2016 at 7:53 AM, Jean Georges Perrin  > wrote:
> 
> 
> I am trying to build a DataFrame from a list, here is the code:
> 
>   private void start() {
>   SparkConf conf = new SparkConf().setAppName("Data Set from 
> Array").setMaster("local");
>   SparkContext sc = new SparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sc);
> 
>   Integer[] l = new Integer[] { 1, 2, 3, 4, 5, 6, 7 };
>   List data = Arrays.asList(l);
> 
>   System.out.println(data);
>   
>   DataFrame df = sqlContext.createDataFrame(data, 
> org.apache.spark.sql.types.NumericType.class);
>   df.show();
>   }
> 
> My result is (unpleasantly):
> 
> [1, 2, 3, 4, 5, 6, 7]
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> 
> I also tried with:
> org.apache.spark.sql.types.NumericType.class
> org.apache.spark.sql.types.IntegerType.class
> org.apache.spark.sql.types.ArrayType.class
> 
> I am probably missing something super obvious :(
> 
> Thanks!
> 
> jg
> 
> 
> 
> 
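
For reference, a self-contained sketch of the RowFactory plus explicit-schema route described above, producing a single-column integer DataFrame with the Spark 1.6-style Java API; the column name "value" is arbitrary:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class IntListToDataFrame {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Data Set from Array").setMaster("local");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(jsc);

        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7);

        // Wrap each value in a Row, then describe the single column with an explicit schema.
        List<Row> rows = new ArrayList<>();
        for (Integer i : data) {
            rows.add(RowFactory.create(i));
        }
        StructType schema = DataTypes.createStructType(Collections.singletonList(
                DataTypes.createStructField("value", DataTypes.IntegerType, false)));

        DataFrame df = sqlContext.createDataFrame(jsc.parallelize(rows), schema);
        df.show();

        jsc.stop();
    }
}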



Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
No, restarting from a checkpoint won't do it, you need to re-define the stream.

Here's the jira for the 0.10 integration

https://issues.apache.org/jira/browse/SPARK-12177

I haven't gotten docs completed yet, but there are examples at

https://github.com/koeninger/kafka-exactly-once/tree/kafka-0.10

On Fri, Jul 22, 2016 at 1:05 PM, Srikanth  wrote:
> In Spark 1.x, if we restart from a checkpoint, will it read from new
> partitions?
>
> If you can, pls point us to some doc/link that talks about Kafka 0.10 integ
> in Spark 2.0.
>
> On Fri, Jul 22, 2016 at 1:33 PM, Cody Koeninger  wrote:
>>
>> For the integration for kafka 0.8, you are literally starting a
>> streaming job against a fixed set of topicapartitions,  It will not
>> change throughout the job, so you'll need to restart the spark job if
>> you change kafka partitions.
>>
>> For the integration for kafka 0.10 / spark 2.0, if you use subscribe
>> or subscribepattern, it should pick up new partitions as they are
>> added.
>>
>> On Fri, Jul 22, 2016 at 11:29 AM, Srikanth  wrote:
>> > Hello,
>> >
>> > I'd like to understand how Spark Streaming(direct) would handle Kafka
>> > partition addition?
>> > Will a running job be aware of new partitions and read from it?
>> > Since it uses Kafka APIs to query offsets and offsets are handled
>> > internally.
>> >
>> > Srikanth
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Srikanth
In Spark 1.x, if we restart from a checkpoint, will it read from new
partitions?

If you can, pls point us to some doc/link that talks about Kafka 0.10 integ
in Spark 2.0.

On Fri, Jul 22, 2016 at 1:33 PM, Cody Koeninger  wrote:

> For the integration for kafka 0.8, you are literally starting a
> streaming job against a fixed set of topicapartitions,  It will not
> change throughout the job, so you'll need to restart the spark job if
> you change kafka partitions.
>
> For the integration for kafka 0.10 / spark 2.0, if you use subscribe
> or subscribepattern, it should pick up new partitions as they are
> added.
>
> On Fri, Jul 22, 2016 at 11:29 AM, Srikanth  wrote:
> > Hello,
> >
> > I'd like to understand how Spark Streaming(direct) would handle Kafka
> > partition addition?
> > Will a running job be aware of new partitions and read from it?
> > Since it uses Kafka APIs to query offsets and offsets are handled
> > internally.
> >
> > Srikanth
>


Re: spark worker continuously trying to connect to master and failed in standalone mode

2016-07-22 Thread Neil Chang
Thank you guys, it is the port issue.

On Wed, Jul 20, 2016 at 11:03 AM, Igor Berman  wrote:

> in addition check what ip the master is binding to(with nestat)
>
> On 20 July 2016 at 06:12, Andrew Ehrlich  wrote:
>
>> Troubleshooting steps:
>>
>> $ telnet localhost 7077 (on master, to confirm port is open)
>> $ telnet  7077 (on slave, to confirm port is blocked)
>>
>> If the port is available on the master from the master, but not on the
>> master from the slave, check firewall settings on the master:
>> https://help.ubuntu.com/lts/serverguide/firewall.html
>>
>> On Jul 19, 2016, at 6:25 PM, Neil Chang  wrote:
>>
>> Hi,
>>   I have two virtual pcs on private cloud (ubuntu 14). I installed spark
>> 2.0 preview on both machines. I then tried to test it with standalone mode.
>> I have no problem start the master. However, when I start the worker
>> (slave) on another machine, it makes many attempts to connect to master and
>> failed at the end.
>>   I can ssh from each machine to another without any problem. I can also
>> run a master and worker at the same machine without any problem.
>>
>> What did I miss? Any clue?
>>
>> here are the messages:
>>
>> WARN NativeCodeLoader: Unable to load native-hadoop library for your
>> platform ... using builtin-java classes where applicable
>> ..
>> INFO Worker: Connecting to master ip:7077 ...
>> INFO Worker: Retrying connection to master (attempt #1)
>> ..
>> INFO Worker: Retrying connection to master (attempt #7)
>> java.lang.IllegalArgumentException: requirement failed: TransportClient
>> has not yet been set.
>>at scala.Predef$.require(Predef.scala:224)
>> ...
>> WARN NettyRocEnv: Ignored failure: java.io.IOException: Connecting to
>> ip:7077 timed out
>> WARN Worker: Failed to connect to master ip.7077
>>
>>
>>
>> Thanks,
>> Neil
>>
>>
>>
>


Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread VG
Great. Thanks a ton for helping out on this, Sean.
I somehow messed this up (and was running in loops for the last 2 hours).

thanks again

-VG

On Fri, Jul 22, 2016 at 11:28 PM, Sean Owen  wrote:

> You mark these provided, which is correct. If the version of Scala
> provided at runtime differs, you'll have a problem.
>
> In fact you can also see you mixed Scala versions in your dependencies
> here. MLlib is on 2.10.
>
> On Fri, Jul 22, 2016 at 6:49 PM, VG  wrote:
> > Sean,
> >
> > I am only using the maven dependencies for spark in my pom file.
> > I don't have anything else. I guess maven dependency should resolve to
> the
> > correct scala version .. isn;t it ? Any ideas.
> >
> > 
> > org.apache.spark
> > spark-core_2.11
> > 2.0.0-preview
> > provided
> > 
> > 
> > 
> > org.apache.spark
> > spark-sql_2.11
> > 2.0.0-preview
> > provided
> > 
> > 
> > 
> > org.apache.spark
> > spark-streaming_2.11
> > 2.0.0-preview
> > provided
> > 
> > 
> > 
> > org.apache.spark
> > spark-mllib_2.10
> > 2.0.0-preview
> > provided
> > 
> >
> >
> >
> > On Fri, Jul 22, 2016 at 11:16 PM, Sean Owen  wrote:
> >>
> >> -dev
> >> Looks like you are mismatching the version of Spark you deploy on at
> >> runtime then. Sounds like it was built for Scala 2.10
> >>
> >> On Fri, Jul 22, 2016 at 6:43 PM, VG  wrote:
> >> > Using 2.0.0-preview using maven
> >> > So all dependencies should be correct I guess
> >> >
> >> > 
> >> > org.apache.spark
> >> > spark-core_2.11
> >> > 2.0.0-preview
> >> > provided
> >> > 
> >> >
> >> > I see in maven dependencies that this brings in
> >> > scala-reflect-2.11.4
> >> > scala-compiler-2.11.0
> >> >
> >> > and so on
> >> >
> >> >
> >> >
> >> > On Fri, Jul 22, 2016 at 11:04 PM, Aaron Ilovici  >
> >> > wrote:
> >> >>
> >> >> What version of Spark/Scala are you running?
> >> >>
> >> >>
> >> >>
> >> >> -Aaron
> >> >
> >> >
> >
> >
>


Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread Aaron Ilovici
Your error stems from spark.ml, and in your pom mllib is the only dependency
that is 2.10. Is there a reason for this? I.e., you tell Maven that mllib 2.10 is
provided at runtime. Is 2.10 on the machine, or is 2.11?

-Aaron

From: VG 
Date: Friday, July 22, 2016 at 1:49 PM
To: Sean Owen 
Cc: User 
Subject: Re: Error in running JavaALSExample example from spark examples

Sean,

I am only using the maven dependencies for spark in my pom file.
I don't have anything else. I guess maven dependency should resolve to the 
correct scala version .. isn;t it ? Any ideas.


org.apache.spark
spark-core_2.11
2.0.0-preview
provided



org.apache.spark
spark-sql_2.11
2.0.0-preview
provided



org.apache.spark
spark-streaming_2.11
2.0.0-preview
provided



org.apache.spark
spark-mllib_2.10
2.0.0-preview
provided




On Fri, Jul 22, 2016 at 11:16 PM, Sean Owen 
> wrote:
-dev
Looks like you are mismatching the version of Spark you deploy on at
runtime then. Sounds like it was built for Scala 2.10

On Fri, Jul 22, 2016 at 6:43 PM, VG 
> wrote:
> Using 2.0.0-preview using maven
> So all dependencies should be correct I guess
>
> 
> org.apache.spark
> spark-core_2.11
> 2.0.0-preview
> provided
> 
>
> I see in maven dependencies that this brings in
> scala-reflect-2.11.4
> scala-compiler-2.11.0
>
> and so on
>
>
>
> On Fri, Jul 22, 2016 at 11:04 PM, Aaron Ilovici 
> >
> wrote:
>>
>> What version of Spark/Scala are you running?
>>
>>
>>
>> -Aaron
>
>



Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread Sean Owen
You mark these provided, which is correct. If the version of Scala
provided at runtime differs, you'll have a problem.

In fact you can also see you mixed Scala versions in your dependencies
here. MLlib is on 2.10.

On Fri, Jul 22, 2016 at 6:49 PM, VG  wrote:
> Sean,
>
> I am only using the maven dependencies for spark in my pom file.
> I don't have anything else. I guess maven dependency should resolve to the
> correct scala version .. isn;t it ? Any ideas.
>
> 
> org.apache.spark
> spark-core_2.11
> 2.0.0-preview
> provided
> 
> 
> 
> org.apache.spark
> spark-sql_2.11
> 2.0.0-preview
> provided
> 
> 
> 
> org.apache.spark
> spark-streaming_2.11
> 2.0.0-preview
> provided
> 
> 
> 
> org.apache.spark
> spark-mllib_2.10
> 2.0.0-preview
> provided
> 
>
>
>
> On Fri, Jul 22, 2016 at 11:16 PM, Sean Owen  wrote:
>>
>> -dev
>> Looks like you are mismatching the version of Spark you deploy on at
>> runtime then. Sounds like it was built for Scala 2.10
>>
>> On Fri, Jul 22, 2016 at 6:43 PM, VG  wrote:
>> > Using 2.0.0-preview using maven
>> > So all dependencies should be correct I guess
>> >
>> > 
>> > org.apache.spark
>> > spark-core_2.11
>> > 2.0.0-preview
>> > provided
>> > 
>> >
>> > I see in maven dependencies that this brings in
>> > scala-reflect-2.11.4
>> > scala-compiler-2.11.0
>> >
>> > and so on
>> >
>> >
>> >
>> > On Fri, Jul 22, 2016 at 11:04 PM, Aaron Ilovici 
>> > wrote:
>> >>
>> >> What version of Spark/Scala are you running?
>> >>
>> >>
>> >>
>> >> -Aaron
>> >
>> >
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread VG
Sean,

I am only using the maven dependencies for spark in my pom file.
I don't have anything else. I guess the Maven dependency should resolve to the
correct Scala version... shouldn't it? Any ideas?


<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.0.0-preview</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.0-preview</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.0.0-preview</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.10</artifactId>
    <version>2.0.0-preview</version>
    <scope>provided</scope>
</dependency>



On Fri, Jul 22, 2016 at 11:16 PM, Sean Owen  wrote:

> -dev
> Looks like you are mismatching the version of Spark you deploy on at
> runtime then. Sounds like it was built for Scala 2.10
>
> On Fri, Jul 22, 2016 at 6:43 PM, VG  wrote:
> > Using 2.0.0-preview using maven
> > So all dependencies should be correct I guess
> >
> > 
> > org.apache.spark
> > spark-core_2.11
> > 2.0.0-preview
> > provided
> > 
> >
> > I see in maven dependencies that this brings in
> > scala-reflect-2.11.4
> > scala-compiler-2.11.0
> >
> > and so on
> >
> >
> >
> > On Fri, Jul 22, 2016 at 11:04 PM, Aaron Ilovici 
> > wrote:
> >>
> >> What version of Spark/Scala are you running?
> >>
> >>
> >>
> >> -Aaron
> >
> >
>


Re: NoClassDefFoundError with ZonedDateTime

2016-07-22 Thread Jacek Laskowski
On Fri, Jul 22, 2016 at 6:43 AM, Ted Yu  wrote:
> You can use this command (assuming log aggregation is turned on):
>
> yarn logs --applicationId XX

I don't think it's going to work for an already-running application (and I
wish I were mistaken, since I needed it just yesterday) and you have to
fall back to the stderr of the ApplicationMaster in container 1.

Jacek

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread Sean Owen
-dev
Looks like you are mismatching the version of Spark you deploy on at
runtime then. Sounds like it was built for Scala 2.10

On Fri, Jul 22, 2016 at 6:43 PM, VG  wrote:
> Using 2.0.0-preview using maven
> So all dependencies should be correct I guess
>
> 
> org.apache.spark
> spark-core_2.11
> 2.0.0-preview
> provided
> 
>
> I see in maven dependencies that this brings in
> scala-reflect-2.11.4
> scala-compiler-2.11.0
>
> and so on
>
>
>
> On Fri, Jul 22, 2016 at 11:04 PM, Aaron Ilovici 
> wrote:
>>
>> What version of Spark/Scala are you running?
>>
>>
>>
>> -Aaron
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread Inam Ur Rehman
Hello guys, I know this is irrelevant to this topic, but I've been looking
desperately for a solution. I am facing an exception:
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html

Please help me. I couldn't find any solution.

On Fri, Jul 22, 2016 at 10:43 PM, VG  wrote:

> Using 2.0.0-preview using maven
> So all dependencies should be correct I guess
>
> 
> org.apache.spark
> spark-core_2.11
> 2.0.0-preview
> provided
> 
>
> I see in maven dependencies that this brings in
> scala-reflect-2.11.4
> scala-compiler-2.11.0
>
> and so on
>
>
>
> On Fri, Jul 22, 2016 at 11:04 PM, Aaron Ilovici 
> wrote:
>
>> What version of Spark/Scala are you running?
>>
>>
>>
>> -Aaron
>>
>
>


Re: Programmatic use of UDFs from Java

2016-07-22 Thread Everett Anderson
Thanks for the pointer, Bryan! Sounds like I was on the right track in
terms of what's available for now.

(And Gourav -- I'm certainly interested in migrating to Scala, but our team
is mostly Java, Python, and R based right now!)


On Thu, Jul 21, 2016 at 11:00 PM, Bryan Cutler  wrote:

> Everett, I had the same question today and came across this old thread.
> Not sure if there has been any more recent work to support this.
> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-UDFs-in-Java-without-registration-td12497.html
>
>
> On Thu, Jul 21, 2016 at 10:10 AM, Everett Anderson <
> ever...@nuna.com.invalid> wrote:
>
>> Hi,
>>
>> In the Java Spark DataFrames API, you can create a UDF, register it, and
>> then access it by string name by using the convenience UDF classes in
>> org.apache.spark.sql.api.java
>> 
>> .
>>
>> Example
>>
>> UDF1 testUdf1 = new UDF1<>() { ... }
>>
>> sqlContext.udf().register("testfn", testUdf1, DataTypes.LongType);
>>
>> DataFrame df2 = df.withColumn("new_col", *functions.callUDF("testfn"*,
>> df.col("old_col")));
>>
>> However, I'd like to avoid registering these by name, if possible, since
>> I have many of them and would need to deal with name conflicts.
>>
>> There are udf() methods like this that seem to be from the Scala API
>> ,
>> where you don't have to register everything by name first.
>>
>> However, using those methods from Java would require interacting with
>> Scala's scala.reflect.api.TypeTags.TypeTag. I'm having a hard time
>> figuring out how to create a TypeTag from Java.
>>
>> Does anyone have an example of using the udf() methods from Java?
>>
>> Thanks!
>>
>> - Everett
>>
>>
>
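
For reference, one pragmatic workaround until unregistered UDFs are usable from Java: keep the register()/callUDF() pattern from the thread, but generate the registered name so that many UDFs never collide. A sketch; the UDF body and the column names "old_col"/"new_col" are for illustration only:

import java.util.UUID;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

public class UdfWithoutNameClashes {
    public static DataFrame addDoubledColumn(SQLContext sqlContext, DataFrame df) {
        // The UDF itself (hypothetical: doubles a long column).
        UDF1<Long, Long> doubler = x -> x * 2;

        // Generate the registered name so that many UDFs never collide.
        String fnName = "udf_" + UUID.randomUUID().toString().replace("-", "");
        sqlContext.udf().register(fnName, doubler, DataTypes.LongType);

        Column newCol = functions.callUDF(fnName, df.col("old_col"));  // "old_col" as in the thread
        return df.withColumn("new_col", newCol);
    }
}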


Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread VG
Using 2.0.0-preview with Maven,
so all dependencies should be correct, I guess.

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.0.0-preview</version>
    <scope>provided</scope>
</dependency>

I see in maven dependencies that this brings in
scala-reflect-2.11.4
scala-compiler-2.11.0

and so on



On Fri, Jul 22, 2016 at 11:04 PM, Aaron Ilovici 
wrote:

> What version of Spark/Scala are you running?
>
>
>
> -Aaron
>


Re: running jupyter notebook server Re: spark and plot data

2016-07-22 Thread Inam Ur Rehman
Hello guys, I know this is irrelevant to this topic, but I've been looking
desperately for a solution. I am facing an exception:
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html

Please help me. I couldn't find any solution.

On Fri, Jul 22, 2016 at 10:07 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Hi Pseudo
>
> I do not know much about zeppelin . What languages are you using?
>
> I have been doing my data exploration and graphing using python mostly
> because early on spark had good support for python. Its easy to collect()
> data as a local PANDAS object. I think at this point R should work well.
> You should be able to easily collect() your data as a R dataframe. I have
> not tried to Rstudio.
>
> I typically run the Jupiter notebook server in my data center. I find the
> notebooks really nice. I typically use matplotlib to generates my graph.
> There are a lot of graphing packages.
>
> *Attached is the script I use to start the notebook server*. This script
> and process works but is a little hacky You call it as follows
>
>
> #
> # on a machine in your cluster
> #
> $ cd dirWithNotebooks
>
> # all the logs will be in startIPythonNotebook.sh.out
> # nohup allows you to log in start your notebook server and log out.
> $ nohup startIPythonNotebook.sh > startIPythonNotebook.sh.out &
>
> #
> # on you local machine
> #
>
> # because of firewalls I need to open an ssh tunnel
> $ ssh -o ServerAliveInterval=120 -N -f -L localhost:8889:localhost:7000
> myCluster
>
> # connect to the notebook server using the browser of you choice
>
> http://localhost:8889
>
>
>
> #
> # If you need to stop your notebooks server you may need to kill the server
> # there is probably a cleaner way to do this
> # $ ps -el | head -1; ps -efl | grep python
> #
>
> http://jupyter.org/
>
>
> P.S. Jupiter is in the process of being released. The new Juypter lab
> alpha was just announced it looks really sweet.
>
>
>
> From: pseudo oduesp 
> Date: Friday, July 22, 2016 at 2:08 AM
> To: Andrew Davidson 
> Subject: Re: spark and plot data
>
> HI andy  ,
> thanks for reply ,
> i tell it just hard to each time switch  between local concept and
> destributed concept , for example zepplin give easy way to interact with
> data ok , but it's hard to configure on huge cluster with lot of node in my
> case i have cluster with 69 nodes and i process huge volume of data with
> pyspark and it cool but when  i want to plot some chart  i get hard job to
> make it .
>
> i sampling my result or aggregate  , take for example if i user
> randomforest algorithme in machine learning  i want to retrive  most
> importante features with my version alerady installed in our cluster
> (1.5.0) i can't get this.
>
> do you have any solution.
>
> Thanks
>
> 2016-07-21 18:44 GMT+02:00 Andy Davidson :
>
>> Hi Pseudo
>>
>> Plotting, graphing, data visualization, report generation are common
>> needs in scientific and enterprise computing.
>>
>> Can you tell me more about your use case? What is it about the current
>> process / workflow do you think could be improved by pushing plotting (I
>> assume you mean plotting and graphing) into spark.
>>
>>
>> In my personal work all the graphing is done in the driver on summary
>> stats calculated using spark. So for me using standard python libs has not
>> been a problem.
>>
>> Andy
>>
>> From: pseudo oduesp 
>> Date: Thursday, July 21, 2016 at 8:30 AM
>> To: "user @spark" 
>> Subject: spark and plot data
>>
>> Hi ,
>> i know spark  it s engine  to compute large data set but for me i work
>> with pyspark and it s very wonderful machine
>>
>> my question  we  don't have tools for ploting data each time we have to
>> switch and go back to python for using plot.
>> but when you have large result scatter plot or roc curve  you cant use
>> collect to take data .
>>
>> somone have propostion for plot .
>>
>> thanks
>>
>>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>


Re: Integration tests for Spark Streaming

2016-07-22 Thread Lars Albertsson
You can find useful discussions in the list archives. I wrote this, which
might help you:
https://www.mail-archive.com/user%40spark.apache.org/msg48032.html

Regards,

Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109
Calendar: https://goo.gl/tV2hWF

On Jun 29, 2016 07:02, "Luciano Resende"  wrote:

> This thread might be useful for what you want:
> https://www.mail-archive.com/user%40spark.apache.org/msg34673.html
>
> On Tue, Jun 28, 2016 at 1:25 PM, SRK  wrote:
>
>> Hi,
>>
>> I need to write some integration tests for my Spark Streaming app. Any
>> example on how to do this would be of great help.
>>
>> Thanks,
>> Swetha
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Integration-tests-for-Spark-Streaming-tp27246.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: ml models distribution

2016-07-22 Thread Inam Ur Rehman
Hello guys, I know this is irrelevant to this topic, but I've been looking
desperately for a solution. I am facing an exception:
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html

Please help me. I couldn't find any solution.

On Fri, Jul 22, 2016 at 6:12 PM, Sean Owen  wrote:

> No there isn't anything in particular, beyond the various bits of
> serialization support that write out something to put in your storage
> to begin with. What you do with it after reading and before writing is
> up to your app, on purpose.
>
> If you mean you're producing data outside the model that your model
> uses, your model data might be produced by an RDD operation, and saved
> that way. There it's no different than anything else you do with RDDs.
>
> What part are you looking to automate beyond those things? that's most of
> it.
>
> On Fri, Jul 22, 2016 at 2:04 PM, Sergio Fernández 
> wrote:
> > Hi Sean,
> >
> > On Fri, Jul 22, 2016 at 12:52 PM, Sean Owen  wrote:
> >>
> >> If you mean, how do you distribute a new model in your application,
> >> then there's no magic to it. Just reference the new model in the
> >> functions you're executing in your driver.
> >>
> >> If you implemented some other manual way of deploying model info, just
> >> do that again. There's no special thing to know.
> >
> >
> > Well, because some huge model, we typically bundle both logic
> > (pipeline/application)  and models separately. Normally we use a shared
> > stores (e.g., HDFS) or coordinated distribution of the models. But I
> wanted
> > to know if there is any infrastructure in Spark that specifically
> addresses
> > such need.
> >
> > Thanks.
> >
> > Cheers,
> >
> > P.S.: sorry Jacek, with "ml" I meant "Machine Learning". I thought is a
> > quite spread acronym. Sorry for the possible confusion.
> >
> >
> > --
> > Sergio Fernández
> > Partner Technology Manager
> > Redlink GmbH
> > m: +43 6602747925
> > e: sergio.fernan...@redlink.co
> > w: http://redlink.co
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: MLlib, Java, and DataFrame

2016-07-22 Thread Inam Ur Rehman
Hello guys, I know this is irrelevant to this topic, but I've been looking
desperately for a solution. I am facing an exception:
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html

Please help me. I couldn't find any solution.

On Fri, Jul 22, 2016 at 5:50 PM, Jean Georges Perrin  wrote:

> Thanks Marco - I like the idea of sticking with DataFrames ;)
>
>
> On Jul 22, 2016, at 7:07 AM, Marco Mistroni  wrote:
>
> Hello Jean
>  you can take ur current DataFrame and send them to mllib (i was doing
> that coz i dindt know the ml package),but the process is littlebit
> cumbersome
>
>
> 1. go from DataFrame to Rdd of Rdd of [LabeledVectorPoint]
> 2. run your ML model
>
> i'd suggest you stick to DataFrame + ml package :)
>
> hth
>
>
>
> On Fri, Jul 22, 2016 at 4:41 AM, Jean Georges Perrin  wrote:
>
>> Hi,
>>
>> I am looking for some really super basic examples of MLlib (like a linear
>> regression over a list of values) in Java. I have found a few, but I only
>> saw them using JavaRDD... and not DataFrame.
>>
>> I was kind of hoping to take my current DataFrame and send them in MLlib.
>> Am I too optimistic? Do you know/have any example like that?
>>
>> Thanks!
>>
>> jg
>>
>>
>> Jean Georges Perrin
>> j...@jgp.net / @jgperrin
>>
>>
>>
>>
>>
>
>


Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread Aaron Ilovici
What version of Spark/Scala are you running?

-Aaron


Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
For the kafka 0.8 integration, you are literally starting a
streaming job against a fixed set of TopicPartitions. It will not
change throughout the job, so you'll need to restart the Spark job if
you change Kafka partitions.

For the integration for kafka 0.10 / spark 2.0, if you use subscribe
or subscribepattern, it should pick up new partitions as they are
added.

On Fri, Jul 22, 2016 at 11:29 AM, Srikanth  wrote:
> Hello,
>
> I'd like to understand how Spark Streaming(direct) would handle Kafka
> partition addition?
> Will a running job be aware of new partitions and read from it?
> Since it uses Kafka APIs to query offsets and offsets are handled
> internally.
>
> Srikanth

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Error in running JavaALSExample example from spark examples

2016-07-22 Thread VG
I am getting the following error

Exception in thread "main" java.lang.NoSuchMethodError:
scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:452)

Any suggestions to resolve this

VG


running jupyter notebook server Re: spark and plot data

2016-07-22 Thread Andy Davidson
Hi Pseudo

I do not know much about Zeppelin. What languages are you using?

I have been doing my data exploration and graphing using Python, mostly
because early on Spark had good support for Python. It's easy to collect()
data as a local pandas object. I think at this point R should work well. You
should be able to easily collect() your data as an R data frame. I have not
tried RStudio.

I typically run the Jupyter notebook server in my data center. I find the
notebooks really nice. I typically use matplotlib to generate my graphs.
There are a lot of graphing packages.

Attached is the script I use to start the notebook server. This script and
process works but is a little hacky. You call it as follows:


#
# on a machine in your cluster
#
$ cd dirWithNotebooks

# all the logs will be in startIPythonNotebook.sh.out
# nohup allows you to log in start your notebook server and log out.
$ nohup startIPythonNotebook.sh > startIPythonNotebook.sh.out &

#
# on you local machine
#

# because of firewalls I need to open an ssh tunnel
$ ssh -o ServerAliveInterval=120 -N -f -L localhost:8889:localhost:7000
myCluster

# connect to the notebook server using the browser of you choice
http://localhost:8889




#
# If you need to stop your notebooks server you may need to kill the server
# there is probably a cleaner way to do this
# $ ps -el | head -1; ps -efl | grep python
#

  http://jupyter.org/


P.S. Jupyter is in the process of being released. The new JupyterLab alpha
was just announced and it looks really sweet.



From:  pseudo oduesp 
Date:  Friday, July 22, 2016 at 2:08 AM
To:  Andrew Davidson 
Subject:  Re: spark and plot data

> HI andy  ,
> thanks for reply ,
> i tell it just hard to each time switch  between local concept and destributed
> concept , for example zepplin give easy way to interact with data ok , but
> it's hard to configure on huge cluster with lot of node in my case i have
> cluster with 69 nodes and i process huge volume of data with pyspark and it
> cool but when  i want to plot some chart  i get hard job to make it .
> 
> i sampling my result or aggregate  , take for example if i user randomforest
> algorithme in machine learning  i want to retrive  most importante features
> with my version alerady installed in our cluster (1.5.0) i can't get this.
> 
> do you have any solution.
> 
> Thanks 
> 
> 2016-07-21 18:44 GMT+02:00 Andy Davidson :
>> Hi Pseudo
>> 
>> Plotting, graphing, data visualization, report generation are common needs in
>> scientific and enterprise computing.
>> 
>> Can you tell me more about your use case? What is it about the current
>> process / workflow do you think could be improved by pushing plotting (I
>> assume you mean plotting and graphing) into spark.
>> 
>> 
>> In my personal work all the graphing is done in the driver on summary stats
>> calculated using spark. So for me using standard python libs has not been a
>> problem.
>> 
>> Andy
>> 
>> From:  pseudo oduesp 
>> Date:  Thursday, July 21, 2016 at 8:30 AM
>> To:  "user @spark" 
>> Subject:  spark and plot data
>> 
>>> Hi , 
>>> i know spark  it s engine  to compute large data set but for me i work with
>>> pyspark and it s very wonderful machine
>>> 
>>> my question  we  don't have tools for ploting data each time we have to
>>> switch and go back to python for using plot.
>>> but when you have large result scatter plot or roc curve  you cant use
>>> collect to take data .
>>> 
>>> somone have propostion for plot .
>>> 
>>> thanks 
> 




startIPythonNotebook.sh
Description: Binary data

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Creating a DataFrame from scratch

2016-07-22 Thread Everett Anderson
Actually, sorry, my mistake, you're calling

DataFrame df = sqlContext.createDataFrame(data,
org.apache.spark.sql.types.NumericType.class);

and giving it a list of objects which aren't NumericTypes, but the
wildcards in the signature let it happen.

I'm curious what'd happen if you gave it Integer.class, but I suspect it
still won't work because Integer may not have the bean-style getters.


On Fri, Jul 22, 2016 at 9:37 AM, Everett Anderson  wrote:

> Hey,
>
> I think what's happening is that you're calling this createDataFrame
> method
> 
> :
>
> createDataFrame(java.util.List data, java.lang.Class beanClass)
>
> which expects a JavaBean-style class with get and set methods for the
> members, but Integer doesn't have such a getter.
>
> I bet there's an easier way if you just want a single-column DataFrame of
> a primitive type, but one way that would work is to manually construct the
> Rows using RowFactory.create()
> 
> and assemble the DataFrame from that like
>
> List rows = convert your List to this in a loop with
> RowFactory.create()
>
> StructType schema = DataTypes.createStructType(Collections.singletonList(
>  DataTypes.createStructField("int_field", DataTypes.IntegerType,
> true)));
>
> DataFrame intDataFrame = sqlContext.createDataFrame(rows, schema);
>
>
>
> On Fri, Jul 22, 2016 at 7:53 AM, Jean Georges Perrin  wrote:
>
>>
>>
>> I am trying to build a DataFrame from a list, here is the code:
>>
>> private void start() {
>> SparkConf conf = new SparkConf().setAppName("Data Set from Array"
>> ).setMaster("local");
>> SparkContext sc = new SparkContext(conf);
>> SQLContext sqlContext = new SQLContext(sc);
>>
>> Integer[] l = new Integer[] { 1, 2, 3, 4, 5, 6, 7 };
>> List data = Arrays.asList(l);
>>
>> System.out.println(data);
>>
>>
>> DataFrame df = sqlContext.createDataFrame(data,
>> org.apache.spark.sql.types.NumericType.class);
>> df.show();
>> }
>>
>> My result is (unpleasantly):
>>
>> [1, 2, 3, 4, 5, 6, 7]
>> ++
>> ||
>> ++
>> ||
>> ||
>> ||
>> ||
>> ||
>> ||
>> ||
>> ++
>>
>> I also tried with:
>> org.apache.spark.sql.types.NumericType.class
>> org.apache.spark.sql.types.IntegerType.class
>> org.apache.spark.sql.types.ArrayType.class
>>
>> I am probably missing something super obvious :(
>>
>> Thanks!
>>
>> jg
>>
>>
>>
>


Rebalancing when adding kafka partitions

2016-07-22 Thread Srikanth
Hello,

I'd like to understand how Spark Streaming(direct) would handle Kafka
partition addition?
Will a running job be aware of new partitions and read from it?
Since it uses Kafka APIs to query offsets and offsets are handled
internally.

Srikanth


Re: Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread VG
Hi All,

Any suggestions for this

Regards,
VG

On Fri, Jul 22, 2016 at 6:40 PM, VG  wrote:

> Hi All,
>
> I am really confused how to proceed further. Please help.
>
> I have a dataset created as follows:
> Dataset b = sqlContext.sql("SELECT bid, name FROM business");
>
> Now I need to map each name with a unique index and I did the following
> JavaPairRDD indexedBId = business.javaRDD()
>.zipWithIndex();
>
> In later part of the code I need to change a datastructure and update name
> with index value generated above .
> I am unable to figure out how to do a look up here..
>
> Please suggest /.
>
> If there is a better way to do this please suggest that.
>
> Regards
> VG
>
>


Re: Fast database with writes per second and horizontal scaling

2016-07-22 Thread Marco Colombo
Yes, this is not a question for the Spark user list.
Btw, in the DB world, performance also depends on which data you have and
which schema you want to use.
First set a target, then evaluate the technology.
Cassandra can be really fast if you load data via sstableloader or COPY rather
than inserting line by line.
Every DB has a preferred path for data ingestion.

Il martedì 12 luglio 2016, Yash Sharma  ha scritto:

> Spark is more of an execution engine rather than a database. Hive is a
> data warehouse but I still like treating it as an execution engine.
>
> For databases, You could compare HBase and Cassandra as they both have
> very wide usage and proven performance. We have used Cassandra in the past
> and were very happy with the results. You should move this discussion on
> Cassandra's/HBase's mailing list for better advice.
>
> Cheers
>
> On Tue, Jul 12, 2016 at 3:23 PM, ayan guha  > wrote:
>
>> HI
>>
>> HBase is pretty neat itself. But speed is not the criteria to choose
>> Hbase over Cassandra (or vicey versa).. Slowness can very well because of
>> design issues, and unfortunately it will not help changing technology in
>> that case :)
>>
>> I would suggest you to quantify "slow"-ness in conjunction
>> with infrastructure you have and I am sure good people here will help.
>>
>> Best
>> Ayan
>>
>> On Tue, Jul 12, 2016 at 3:01 PM, Ashok Kumar <
>> ashok34...@yahoo.com.invalid
>> > wrote:
>>
>>> Anyone in Spark as well
>>>
>>> My colleague has been using Cassandra. However, he says it is too slow
>>> and not user friendly/
>>> MongodDB as a doc databases is pretty neat but not fast enough
>>>
>>> May main concern is fast writes per second and good scaling.
>>>
>>>
>>> Hive on Spark or Tez?
>>>
>>> How about Hbase. or anything else
>>>
>>> Any expert advice warmly acknowledged..
>>>
>>> thanking yo
>>>
>>>
>>> On Monday, 11 July 2016, 17:24, Ashok Kumar >> > wrote:
>>>
>>>
>>> Hi Gurus,
>>>
>>> Advice appreciated from Hive gurus.
>>>
>>> My colleague has been using Cassandra. However, he says it is too slow
>>> and not user friendly/
>>> MongodDB as a doc databases is pretty neat but not fast enough
>>>
>>> May main concern is fast writes per second and good scaling.
>>>
>>>
>>> Hive on Spark or Tez?
>>>
>>> How about Hbase. or anything else
>>>
>>> Any expert advice warmly acknowledged..
>>>
>>> thanking you
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>

-- 
Ing. Marco Colombo


Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread janardhan shetty
Adding this to build.sbt worked. Thanks Jacek

assemblyMergeStrategy in assembly := {
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
  case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
  case "application.conf" => MergeStrategy.concat
  case "unwanted.txt" => MergeStrategy.discard
  case PathList("META-INF", "MANIFEST.MF") => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

On Fri, Jul 22, 2016 at 7:44 AM, Jacek Laskowski  wrote:

> See https://github.com/sbt/sbt-assembly#merge-strategy
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jul 22, 2016 at 4:23 PM, janardhan shetty
>  wrote:
> > Changed to sbt.0.14.3 and it gave :
> >
> > [info] Packaging
> >
> /Users/jshetty/sparkApplications/MainTemplate/target/scala-2.11/maintemplate_2.11-1.0.jar
> > ...
> > java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF
> > at
> java.util.zip.ZipOutputStream.putNextEntry(ZipOutputStream.java:233)
> >
> > Do we need to create assembly.sbt file inside project directory if so
> what
> > will the the contents of it for this config ?
> >
> > On Fri, Jul 22, 2016 at 5:42 AM, janardhan shetty <
> janardhan...@gmail.com>
> > wrote:
> >>
> >> Is scala version also the culprit? 2.10 and 2.11.8
> >>
> >> Also Can you give the steps to create sbt package command just like
> maven
> >> install from within intellij to create jar file in target directory ?
> >>
> >> On Jul 22, 2016 5:16 AM, "Jacek Laskowski"  wrote:
> >>>
> >>> Hi,
> >>>
> >>> There has never been 0.13.8 for sbt-assembly AFAIK. Use 0.14.3 and
> >>> start over. See
> >>>
> >>>
> https://github.com/jaceklaskowski/spark-workshop/tree/master/solutions/spark-external-cluster-manager
> >>> for a sample Scala/sbt project with Spark 2.0 RC5.
> >>>
> >>> Pozdrawiam,
> >>> Jacek Laskowski
> >>> 
> >>> https://medium.com/@jaceklaskowski/
> >>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> >>> Follow me at https://twitter.com/jaceklaskowski
> >>>
> >>>
> >>> On Fri, Jul 22, 2016 at 2:08 PM, janardhan shetty
> >>>  wrote:
> >>> > Hi,
> >>> >
> >>> > I was setting up my development environment.
> >>> >
> >>> > Local Mac laptop setup
> >>> > IntelliJ IDEA 14CE
> >>> > Scala
> >>> > Sbt (Not maven)
> >>> >
> >>> > Error:
> >>> > $ sbt package
> >>> > [warn] ::
> >>> > [warn] ::  UNRESOLVED DEPENDENCIES ::
> >>> > [warn] ::
> >>> > [warn] :: com.eed3si9n#sbt-assembly;0.13.8: not found
> >>> > [warn] ::
> >>> > [warn]
> >>> > [warn] Note: Some unresolved dependencies have extra attributes.
> >>> > Check
> >>> > that these dependencies exist with the requested attributes.
> >>> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
> >>> > sbtVersion=0.13)
> >>> > [warn]
> >>> > [warn] Note: Unresolved dependencies path:
> >>> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
> >>> > sbtVersion=0.13)
> >>> >
> >>> >
> (/Users/jshetty/sparkApplications/MainTemplate/project/plugins.sbt#L2-3)
> >>> > [warn]   +- default:maintemplate-build:0.1-SNAPSHOT
> >>> > (scalaVersion=2.10, sbtVersion=0.13)
> >>> > sbt.ResolveException: unresolved dependency:
> >>> > com.eed3si9n#sbt-assembly;0.13.8: not found
> >>> > sbt.ResolveException: unresolved dependency:
> >>> > com.eed3si9n#sbt-assembly;0.13.8: not found
> >>> > at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:291)
> >>> > at
> >>> > sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:188)
> >>> > at
> >>> > sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:165)
> >>> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
> >>> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
> >>> > at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:132)
> >>> > at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
> >>> > at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
> >>> > at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
> >>> >
> >>> >
> >>> >
> >>> > build.sbt:
> >>> >
> >>> > name := "MainTemplate"
> >>> > version := "1.0"
> >>> > scalaVersion := "2.11.8"
> >>> > libraryDependencies ++= {
> >>> >   val sparkVersion = "2.0.0-preview"
> >>> >   Seq(
> >>> > "org.apache.spark" %% "spark-core" % sparkVersion,
> >>> > "org.apache.spark" %% "spark-sql" % sparkVersion,
> >>> > "org.apache.spark" %% "spark-streaming" % sparkVersion,
> >>> > "org.apache.spark" %% "spark-mllib" % 

Re: ml ALS.fit(..) issue

2016-07-22 Thread VG
Can someone please help here.

I tried both scala 2.10 and 2.11 on the system



On Fri, Jul 22, 2016 at 7:59 PM, VG  wrote:

> I am using version 2.0.0-preview
>
>
>
> On Fri, Jul 22, 2016 at 7:47 PM, VG  wrote:
>
>> I am running into the following error when running ALS
>>
>> Exception in thread "main" java.lang.NoSuchMethodError:
>> scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
>> at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:452)
>> at yelp.TestUser.main(TestUser.java:101)
>>
>> here line 101 in the above error is the following in code.
>>
>> ALSModel model = als.fit(training);
>>
>>
>> Does anyone has a suggestion what is going on here and where I might be
>> going wrong ?
>> Please suggest
>>
>> -VG
>>
>
>


Creating a DataFrame from scratch

2016-07-22 Thread Jean Georges Perrin


I am trying to build a DataFrame from a list, here is the code:

private void start() {
    SparkConf conf = new SparkConf().setAppName("Data Set from Array").setMaster("local");
    SparkContext sc = new SparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    Integer[] l = new Integer[] { 1, 2, 3, 4, 5, 6, 7 };
    List<Integer> data = Arrays.asList(l);

    System.out.println(data);

    DataFrame df = sqlContext.createDataFrame(data,
        org.apache.spark.sql.types.NumericType.class);
    df.show();
}

My result is (unpleasantly):

[1, 2, 3, 4, 5, 6, 7]
++
||
++
||
||
||
||
||
||
||
++

I also tried with:
org.apache.spark.sql.types.NumericType.class
org.apache.spark.sql.types.IntegerType.class
org.apache.spark.sql.types.ArrayType.class

I am probably missing something super obvious :(

Thanks!

jg




Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread Jacek Laskowski
See https://github.com/sbt/sbt-assembly#merge-strategy

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jul 22, 2016 at 4:23 PM, janardhan shetty
 wrote:
> Changed to sbt.0.14.3 and it gave :
>
> [info] Packaging
> /Users/jshetty/sparkApplications/MainTemplate/target/scala-2.11/maintemplate_2.11-1.0.jar
> ...
> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF
> at java.util.zip.ZipOutputStream.putNextEntry(ZipOutputStream.java:233)
>
> Do we need to create assembly.sbt file inside project directory if so what
> will the the contents of it for this config ?
>
> On Fri, Jul 22, 2016 at 5:42 AM, janardhan shetty 
> wrote:
>>
>> Is scala version also the culprit? 2.10 and 2.11.8
>>
>> Also Can you give the steps to create sbt package command just like maven
>> install from within intellij to create jar file in target directory ?
>>
>> On Jul 22, 2016 5:16 AM, "Jacek Laskowski"  wrote:
>>>
>>> Hi,
>>>
>>> There has never been 0.13.8 for sbt-assembly AFAIK. Use 0.14.3 and
>>> start over. See
>>>
>>> https://github.com/jaceklaskowski/spark-workshop/tree/master/solutions/spark-external-cluster-manager
>>> for a sample Scala/sbt project with Spark 2.0 RC5.
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>> On Fri, Jul 22, 2016 at 2:08 PM, janardhan shetty
>>>  wrote:
>>> > Hi,
>>> >
>>> > I was setting up my development environment.
>>> >
>>> > Local Mac laptop setup
>>> > IntelliJ IDEA 14CE
>>> > Scala
>>> > Sbt (Not maven)
>>> >
>>> > Error:
>>> > $ sbt package
>>> > [warn] ::
>>> > [warn] ::  UNRESOLVED DEPENDENCIES ::
>>> > [warn] ::
>>> > [warn] :: com.eed3si9n#sbt-assembly;0.13.8: not found
>>> > [warn] ::
>>> > [warn]
>>> > [warn] Note: Some unresolved dependencies have extra attributes.
>>> > Check
>>> > that these dependencies exist with the requested attributes.
>>> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
>>> > sbtVersion=0.13)
>>> > [warn]
>>> > [warn] Note: Unresolved dependencies path:
>>> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
>>> > sbtVersion=0.13)
>>> >
>>> > (/Users/jshetty/sparkApplications/MainTemplate/project/plugins.sbt#L2-3)
>>> > [warn]   +- default:maintemplate-build:0.1-SNAPSHOT
>>> > (scalaVersion=2.10, sbtVersion=0.13)
>>> > sbt.ResolveException: unresolved dependency:
>>> > com.eed3si9n#sbt-assembly;0.13.8: not found
>>> > sbt.ResolveException: unresolved dependency:
>>> > com.eed3si9n#sbt-assembly;0.13.8: not found
>>> > at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:291)
>>> > at
>>> > sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:188)
>>> > at
>>> > sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:165)
>>> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
>>> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
>>> > at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:132)
>>> > at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
>>> > at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
>>> > at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
>>> >
>>> >
>>> >
>>> > build.sbt:
>>> >
>>> > name := "MainTemplate"
>>> > version := "1.0"
>>> > scalaVersion := "2.11.8"
>>> > libraryDependencies ++= {
>>> >   val sparkVersion = "2.0.0-preview"
>>> >   Seq(
>>> > "org.apache.spark" %% "spark-core" % sparkVersion,
>>> > "org.apache.spark" %% "spark-sql" % sparkVersion,
>>> > "org.apache.spark" %% "spark-streaming" % sparkVersion,
>>> > "org.apache.spark" %% "spark-mllib" % sparkVersion
>>> >   )
>>> > }
>>> >
>>> > assemblyMergeStrategy in assembly := {
>>> >   case PathList("META-INF", xs @ _*) => MergeStrategy.discard
>>> >   case x => MergeStrategy.first
>>> > }
>>> >
>>> >
>>> > plugins.sbt
>>> >
>>> > logLevel := Level.Warn
>>> > addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.8")
>>> >
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: ml ALS.fit(..) issue

2016-07-22 Thread VG
I am using version 2.0.0-preview



On Fri, Jul 22, 2016 at 7:47 PM, VG  wrote:

> I am running into the following error when running ALS
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
> at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:452)
> at yelp.TestUser.main(TestUser.java:101)
>
> here line 101 in the above error is the following in code.
>
> ALSModel model = als.fit(training);
>
>
> Does anyone have a suggestion what is going on here and where I might be
> going wrong ?
> Please suggest
>
> -VG
>


WrappedArray in SparkSQL DF

2016-07-22 Thread KhajaAsmath Mohammed
Hi,

I am reading a JSON file and I am facing difficulties trying to get
individual elements of this array. Does anyone know how to get the
elements from a WrappedArray(WrappedArray(String))?

Schema:
++
|rows|
++
|[WrappedArray(Bon...|
++



WrappedArray(WrappedArray(Bondnotnotseven, 146517120, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0), WrappedArray(Bondnotnotseven, 146577600, 0.0,
0.0, 2.0, 2.0, 0.0, 0.0, 2.0, 2.0), WrappedArray(Bondnotnotseven,
146638080, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 10.0, 8.0),
WrappedArray(Bondnotnotseven, 146698560, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0,
20.0, 10.0))

Thanks,
Asmath
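One possible sketch with the DataFrame API, assuming the column is called "rows" and the schema is an array of arrays (the field positions and names below are illustrative only): explode the outer array into one row per inner array, then pull individual elements out with getItem.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

import org.apache.spark.sql.DataFrame;

// One row per inner WrappedArray
DataFrame inner = df.select(explode(col("rows")).as("r"));

// Pick individual elements of each inner array by position
DataFrame fields = inner.select(
    col("r").getItem(0).as("name"),       // e.g. "Bondnotnotseven"
    col("r").getItem(1).as("ts"),         // e.g. 146517120
    col("r").getItem(2).as("metric1"));
fields.show();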


Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread janardhan shetty
Changed to sbt-assembly 0.14.3 and it gave:

[info] Packaging
/Users/jshetty/sparkApplications/MainTemplate/target/scala-2.11/maintemplate_2.11-1.0.jar
...
java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF
at java.util.zip.ZipOutputStream.putNextEntry(ZipOutputStream.java:233)

Do we need to create an assembly.sbt file inside the project directory? If so, what
will the contents of it be for this config?

On Fri, Jul 22, 2016 at 5:42 AM, janardhan shetty 
wrote:

> Is scala version also the culprit? 2.10 and 2.11.8
>
> Also, can you give the steps to run the sbt package command from within IntelliJ,
> just like maven install, to create the jar file in the target directory?
> On Jul 22, 2016 5:16 AM, "Jacek Laskowski"  wrote:
>
>> Hi,
>>
>> There has never been 0.13.8 for sbt-assembly AFAIK. Use 0.14.3 and
>> start over. See
>>
>> https://github.com/jaceklaskowski/spark-workshop/tree/master/solutions/spark-external-cluster-manager
>> for a sample Scala/sbt project with Spark 2.0 RC5.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Jul 22, 2016 at 2:08 PM, janardhan shetty
>>  wrote:
>> > Hi,
>> >
>> > I was setting up my development environment.
>> >
>> > Local Mac laptop setup
>> > IntelliJ IDEA 14CE
>> > Scala
>> > Sbt (Not maven)
>> >
>> > Error:
>> > $ sbt package
>> > [warn] ::
>> > [warn] ::  UNRESOLVED DEPENDENCIES ::
>> > [warn] ::
>> > [warn] :: com.eed3si9n#sbt-assembly;0.13.8: not found
>> > [warn] ::
>> > [warn]
>> > [warn] Note: Some unresolved dependencies have extra attributes.
>> Check
>> > that these dependencies exist with the requested attributes.
>> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
>> > sbtVersion=0.13)
>> > [warn]
>> > [warn] Note: Unresolved dependencies path:
>> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
>> > sbtVersion=0.13)
>> > (/Users/jshetty/sparkApplications/MainTemplate/project/plugins.sbt#L2-3)
>> > [warn]   +- default:maintemplate-build:0.1-SNAPSHOT
>> > (scalaVersion=2.10, sbtVersion=0.13)
>> > sbt.ResolveException: unresolved dependency:
>> > com.eed3si9n#sbt-assembly;0.13.8: not found
>> > sbt.ResolveException: unresolved dependency:
>> > com.eed3si9n#sbt-assembly;0.13.8: not found
>> > at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:291)
>> > at
>> sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:188)
>> > at
>> sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:165)
>> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
>> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
>> > at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:132)
>> > at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
>> > at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
>> > at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
>> >
>> >
>> >
>> > build.sbt:
>> >
>> > name := "MainTemplate"
>> > version := "1.0"
>> > scalaVersion := "2.11.8"
>> > libraryDependencies ++= {
>> >   val sparkVersion = "2.0.0-preview"
>> >   Seq(
>> > "org.apache.spark" %% "spark-core" % sparkVersion,
>> > "org.apache.spark" %% "spark-sql" % sparkVersion,
>> > "org.apache.spark" %% "spark-streaming" % sparkVersion,
>> > "org.apache.spark" %% "spark-mllib" % sparkVersion
>> >   )
>> > }
>> >
>> > assemblyMergeStrategy in assembly := {
>> >   case PathList("META-INF", xs @ _*) => MergeStrategy.discard
>> >   case x => MergeStrategy.first
>> > }
>> >
>> >
>> > plugins.sbt
>> >
>> > logLevel := Level.Warn
>> > addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.8")
>> >
>>
>


ml ALS.fit(..) issue

2016-07-22 Thread VG
I am running into the following error when running ALS

Exception in thread "main" java.lang.NoSuchMethodError:
scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:452)
at yelp.TestUser.main(TestUser.java:101)

here line 101 in the above error is the following in code.

ALSModel model = als.fit(training);


Does anyone have a suggestion what is going on here and where I might be
going wrong ?
Please suggest

-VG


Is spark-submit a single point of failure?

2016-07-22 Thread Sivakumaran S
Hello,

I have a spark streaming process on a cluster ingesting a realtime data stream 
from Kafka. The aggregated, processed output is written to Cassandra and also 
used for dashboard display.

My question is - If the node running the driver program fails, I am guessing 
that the entire process fails and has to be restarted. Is there any way to 
obviate this? Is my understanding correct that the spark-submit in its current 
form is a Single Point of Vulnerability, much akin to the NameNode in HDFS?

regards

Sivakumaran S
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark and plot data

2016-07-22 Thread Pedro Rodriguez
Zeppelin works great. The other thing that we have done in notebooks (like 
Zeppelin or Databricks) which support multiple types of spark session is 
register Spark SQL temp tables in our scala code then escape hatch to python 
for plotting with seaborn/matplotlib when the built in plots are insufficient.

—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 22, 2016 at 3:04:48 AM, Marco Colombo (ing.marco.colo...@gmail.com) 
wrote:

Take a look at zeppelin

http://zeppelin.apache.org

On Thursday, 21 July 2016, Andy Davidson  wrote:
Hi Pseudo

Plotting, graphing, data visualization, report generation are common needs in 
scientific and enterprise computing.

Can you tell me more about your use case? What is it about the current process
/ workflow that you think could be improved by pushing plotting (I assume you
mean plotting and graphing) into Spark?


In my personal work all the graphing is done in the driver on summary stats 
calculated using spark. So for me using standard python libs has not been a 
problem.

Andy

From: pseudo oduesp 
Date: Thursday, July 21, 2016 at 8:30 AM
To: "user @spark" 
Subject: spark and plot data

Hi,
I know Spark is an engine to compute large data sets, but for me, working with
pyspark, it is a very wonderful machine.

My question: we don't have tools for plotting data, so each time we have to switch
and go back to Python to use plot.
But when you have a large result, a scatter plot or ROC curve, you can't use collect
to take the data.

Does someone have a proposal for plotting?

Thanks


--
Ing. Marco Colombo


Re: How can we control CPU and Memory per Spark job operation..

2016-07-22 Thread Pedro Rodriguez
Sorry, wasn’t very clear (looks like Pavan’s response was dropped from list for 
some reason as well).

I am assuming that:
1) the first map is CPU bound
2) the second map is heavily memory bound

To be specific, let's say you are using 4 m3.2xlarge instances which have 8 CPUs
and 30GB of ram each for a total of 32 cores and 120GB of ram. Since the NLP
model can't be distributed that means every worker/core must use 4GB of RAM. If
the cluster is fully utilized that means that just for the NLP model you are
consuming 32 * 4GB = 128GB of ram. The cluster at this point is out of memory
just for the NLP model, not considering the data set itself. My suggestion would
be to see if r3.8xlarge instances will work (or even X1s if you have access) since
the cpu/memory fraction is better. Here is the "hack" I proposed in more detail
(basically n partitions < total cores):

1) have the first map have a regular number of partitions, suppose 32 * 4 = 128
which is a reasonable starting place
2) repartition immediately after that map to 16 partitions. At this point,
spark is not guaranteed to distribute your work evenly across the 4 nodes, but
it probably will. The net result is that half the CPU cores are idle, but the
NLP model is at worst using 16 * 4GB = 64GB of RAM. To be sure, this is a hack,
since evenly distributing the work across the nodes is not guaranteed.
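For illustration, a rough sketch of that trick in Java; the element types and the two functions inside map are hypothetical placeholders, not anything from the actual pipeline:

import org.apache.spark.api.java.JavaRDD;

// 128 partitions for the CPU-bound stage so all 32 cores stay busy
JavaRDD<String> parsed = input
    .repartition(128)
    .map(line -> expensiveCpuWork(line));      // hypothetical CPU-heavy function

// Shrink to 16 partitions before the memory-hungry stage, so at most
// 16 copies of the ~4GB NLP model are live across the cluster at any one time
JavaRDD<String> scored = parsed
    .repartition(16)
    .map(doc -> nlpModel.annotate(doc));       // hypothetical memory-heavy function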

If you wanted to do this as not a hack, you could perform the map, checkpoint 
your work, end the job, then submit a new job where the cpu/memory ratio is 
more favorable which reads from the prior job’s output. I am guessing this 
heavily depends on how expensive reloading the data set from disk/network is. 

Hopefully one of these helps,
—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 17, 2016 at 6:16:41 AM, Jacek Laskowski (ja...@japila.pl) wrote:

Hi,

How would that help?! Why would you do that?

Jacek


On 17 Jul 2016 7:19 a.m., "Pedro Rodriguez"  wrote:
You could call map on an RDD which has “many” partitions, then call 
repartition/coalesce to drastically reduce the number of partitions so that 
your second map job has less things running.

—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 16, 2016 at 4:46:04 PM, Jacek Laskowski (ja...@japila.pl) wrote:

Hi,

My understanding is that these two map functions will end up as a job
with one stage (as if you wrote the two maps as a single map) so you
really need as much vcores and memory as possible for map1 and map2. I
initially thought about dynamic allocation of executors that may or
may not help you with the case, but since there's just one stage I
don't think you can do much.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jul 15, 2016 at 9:54 PM, Pavan Achanta  wrote:
> Hi All,
>
> Here is my use case:
>
> I have a pipeline job consisting of 2 map functions:
>
> CPU intensive map operation that does not require a lot of memory.
> Memory intensive map operation that requires upto 4 GB of memory. And this
> 4GB memory cannot be distributed since it is an NLP model.
>
> Ideally what I like to do is to use 20 nodes with 4 cores each and minimal
> memory for first map operation and then use only 3 nodes with minimal CPU
> but each having 4GB of memory for 2nd operation.
>
> While it is possible to control this parallelism for each map operation in
> spark. I am not sure how to control the resources for each operation.
> Obviously I don’t want to start off the job with 20 nodes with 4 cores and
> 4GB memory since I cannot afford that much memory.
>
> We use Yarn with Spark. Any suggestions ?
>
> Thanks and regards,
> Pavan
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: ml models distribution

2016-07-22 Thread Sean Owen
No there isn't anything in particular, beyond the various bits of
serialization support that write out something to put in your storage
to begin with. What you do with it after reading and before writing is
up to your app, on purpose.

If you mean you're producing data outside the model that your model
uses, your model data might be produced by an RDD operation, and saved
that way. There it's no different than anything else you do with RDDs.

What part are you looking to automate beyond those things? that's most of it.
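As a small sketch of the shared-store variant discussed in this thread (the HDFS path is made up, and this assumes the upgraded model was written out with the spark.ml persistence API):

import org.apache.spark.ml.PipelineModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Load whatever model version currently lives at the agreed location;
// upgrading the model is then just writing a new version to that path.
PipelineModel model = PipelineModel.load("hdfs:///models/my-pipeline/current");

// The loaded model is referenced from the driver and shipped to the executors
// like any other object used inside transformations.
Dataset<Row> scored = model.transform(input);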

On Fri, Jul 22, 2016 at 2:04 PM, Sergio Fernández  wrote:
> Hi Sean,
>
> On Fri, Jul 22, 2016 at 12:52 PM, Sean Owen  wrote:
>>
>> If you mean, how do you distribute a new model in your application,
>> then there's no magic to it. Just reference the new model in the
>> functions you're executing in your driver.
>>
>> If you implemented some other manual way of deploying model info, just
>> do that again. There's no special thing to know.
>
>
> Well, because some models are huge, we typically bundle the logic
> (pipeline/application) and the models separately. Normally we use a shared
> store (e.g., HDFS) or coordinated distribution of the models. But I wanted
> to know if there is any infrastructure in Spark that specifically addresses
> such a need.
>
> Thanks.
>
> Cheers,
>
> P.S.: sorry Jacek, with "ml" I meant "Machine Learning". I thought it is a
> quite widespread acronym. Sorry for the possible confusion.
>
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernan...@redlink.co
> w: http://redlink.co

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread VG
Hi All,

I am really confused how to proceed further. Please help.

I have a dataset created as follows:
Dataset b = sqlContext.sql("SELECT bid, name FROM business");

Now I need to map each name with a unique index and I did the following
JavaPairRDD indexedBId = business.javaRDD()
   .zipWithIndex();

In a later part of the code I need to change a data structure and update the name
with the index value generated above.
I am unable to figure out how to do a lookup here.

Please suggest.

If there is a better way to do this please suggest that.

Regards
VG
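One possible sketch, staying close to the zipWithIndex approach above: re-key the (row, index) pairs by name, collect them as a java.util.Map, and broadcast that map for lookups. This assumes the set of names fits comfortably in driver memory; the column position and the JavaSparkContext jsc are assumptions around the snippet above.

import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Row;

import scala.Tuple2;

// (row, index) pairs, as in the snippet above
JavaPairRDD<Row, Long> indexedBId = business.javaRDD().zipWithIndex();

// Re-key by the business name (assumed to be column 1 of "SELECT bid, name ...")
JavaPairRDD<String, Long> nameToIndex =
    indexedBId.mapToPair(t -> new Tuple2<>(t._1().getString(1), t._2()));

// Collect to the driver and broadcast, so later stages can look names up cheaply
Map<String, Long> lookup = nameToIndex.collectAsMap();
Broadcast<Map<String, Long>> lookupBc = jsc.broadcast(lookup);

Long index = lookupBc.value().get("someBusinessName");   // hypothetical lookup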


Re: ml models distribution

2016-07-22 Thread Sergio Fernández
Hi Sean,

On Fri, Jul 22, 2016 at 12:52 PM, Sean Owen  wrote:
>
> If you mean, how do you distribute a new model in your application,
> then there's no magic to it. Just reference the new model in the
> functions you're executing in your driver.
>
> If you implemented some other manual way of deploying model info, just
> do that again. There's no special thing to know.
>

Well, because some models are huge, we typically bundle the logic
(pipeline/application) and the models separately. Normally we use a shared
store (e.g., HDFS) or coordinated distribution of the models. But I wanted
to know if there is any infrastructure in Spark that specifically addresses
such a need.

Thanks.

Cheers,

P.S.: sorry Jacek, with "ml" I meant "Machine Learning". I thought it is a
quite widespread acronym. Sorry for the possible confusion.


-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co


Re: Create dataframe column from list

2016-07-22 Thread Inam Ur Rehman
Hello guys, I know it's irrelevant to this topic, but I've been looking
desperately for a solution. I am facing an exception:
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html

Please help me. I couldn't find any solution.

On Fri, Jul 22, 2016 at 5:26 PM, Ashutosh Kumar 
wrote:

>
> http://stackoverflow.com/questions/36382052/converting-list-to-column-in-spark
>
>
> On Fri, Jul 22, 2016 at 5:15 PM, Divya Gehlot 
> wrote:
>
>> Hi,
>> Can somebody help me by creating the dataframe column from the scala list
>> .
>> Would really appreciate the help .
>>
>> Thanks ,
>> Divya
>>
>
>


Re: what contribute to Task Deserialization Time

2016-07-22 Thread Silvio Fiorito
Are you referencing member variables or other objects of your driver in your 
transformations? Those would have to be serialized and shipped to each executor 
when that job kicks off.
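As an illustration of that point (the class and field names below are made up): copying a driver-side field into a local variable keeps the task closure small, because otherwise the whole enclosing object is pulled into every serialized task.

import java.util.Map;

import org.apache.spark.api.java.JavaRDD;

class Pipeline implements java.io.Serializable {
  private Map<String, Integer> bigLookup;   // hypothetical driver-side state

  JavaRDD<Integer> resolve(JavaRDD<String> lines) {
    // Copy the field to a local variable: only the map is captured by the
    // lambda, not the whole Pipeline object, so the serialized task stays small.
    final Map<String, Integer> localLookup = bigLookup;
    return lines.map(line -> localLookup.getOrDefault(line, -1));
  }
}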

On 7/22/16, 8:54 AM, "Jacek Laskowski"  wrote:

Hi,

I can't specifically answer your question, but my understanding of
Task Deserialization Time is that it's time to deserialize a
serialized task from the driver before it gets run. See
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L236
and on.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Jul 21, 2016 at 11:35 AM, patcharee  wrote:
> Hi,
>
> I'm running a simple job (reading sequential file and collect data at the
> driver) with yarn-client mode. When looking at the history server UI, Task
> Deserialization Time of tasks are quite different (5 ms to 5 s). What
> contribute to this Task Deserialization Time?
>
> Thank you in advance!
>
> Patcharee
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org





Re: what contribute to Task Deserialization Time

2016-07-22 Thread Jacek Laskowski
Hi,

I can't specifically answer your question, but my understanding of
Task Deserialization Time is that it's time to deserialize a
serialized task from the driver before it gets run. See
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L236
and on.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Jul 21, 2016 at 11:35 AM, patcharee  wrote:
> Hi,
>
> I'm running a simple job (reading sequential file and collect data at the
> driver) with yarn-client mode. When looking at the history server UI, Task
> Deserialization Time of tasks are quite different (5 ms to 5 s). What
> contribute to this Task Deserialization Time?
>
> Thank you in advance!
>
> Patcharee
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
Thanks Marco - I like the idea of sticking with DataFrames ;)


> On Jul 22, 2016, at 7:07 AM, Marco Mistroni  wrote:
> 
> Hello Jean
>  you can take your current DataFrame and send it to mllib (I was doing that
> because I didn't know the ml package), but the process is a little bit cumbersome
> 
> 
> 1. go from DataFrame to an RDD of [LabeledPoint]
> 2. run your ML model
> 
> i'd suggest you stick to DataFrame + ml package :)
> 
> hth
> 
> 
> 
> On Fri, Jul 22, 2016 at 4:41 AM, Jean Georges Perrin  > wrote:
> Hi,
> 
> I am looking for some really super basic examples of MLlib (like a linear 
> regression over a list of values) in Java. I have found a few, but I only saw 
> them using JavaRDD... and not DataFrame.
> 
> I was kind of hoping to take my current DataFrame and send them in MLlib. Am 
> I too optimistic? Do you know/have any example like that?
> 
> Thanks!
> 
> jg
> 
> 
> Jean Georges Perrin
> j...@jgp.net  / @jgperrin
> 
> 
> 
> 
> 



Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread janardhan shetty
Is scala version also the culprit? 2.10 and 2.11.8

Also, can you give the steps to run the sbt package command from within IntelliJ,
just like maven install, to create the jar file in the target directory?
On Jul 22, 2016 5:16 AM, "Jacek Laskowski"  wrote:

> Hi,
>
> There has never been 0.13.8 for sbt-assembly AFAIK. Use 0.14.3 and
> start over. See
>
> https://github.com/jaceklaskowski/spark-workshop/tree/master/solutions/spark-external-cluster-manager
> for a sample Scala/sbt project with Spark 2.0 RC5.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jul 22, 2016 at 2:08 PM, janardhan shetty
>  wrote:
> > Hi,
> >
> > I was setting up my development environment.
> >
> > Local Mac laptop setup
> > IntelliJ IDEA 14CE
> > Scala
> > Sbt (Not maven)
> >
> > Error:
> > $ sbt package
> > [warn] ::
> > [warn] ::  UNRESOLVED DEPENDENCIES ::
> > [warn] ::
> > [warn] :: com.eed3si9n#sbt-assembly;0.13.8: not found
> > [warn] ::
> > [warn]
> > [warn] Note: Some unresolved dependencies have extra attributes.
> Check
> > that these dependencies exist with the requested attributes.
> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
> > sbtVersion=0.13)
> > [warn]
> > [warn] Note: Unresolved dependencies path:
> > [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
> > sbtVersion=0.13)
> > (/Users/jshetty/sparkApplications/MainTemplate/project/plugins.sbt#L2-3)
> > [warn]   +- default:maintemplate-build:0.1-SNAPSHOT
> > (scalaVersion=2.10, sbtVersion=0.13)
> > sbt.ResolveException: unresolved dependency:
> > com.eed3si9n#sbt-assembly;0.13.8: not found
> > sbt.ResolveException: unresolved dependency:
> > com.eed3si9n#sbt-assembly;0.13.8: not found
> > at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:291)
> > at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:188)
> > at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:165)
> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
> > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
> > at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:132)
> > at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
> > at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
> > at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
> >
> >
> >
> > build.sbt:
> >
> > name := "MainTemplate"
> > version := "1.0"
> > scalaVersion := "2.11.8"
> > libraryDependencies ++= {
> >   val sparkVersion = "2.0.0-preview"
> >   Seq(
> > "org.apache.spark" %% "spark-core" % sparkVersion,
> > "org.apache.spark" %% "spark-sql" % sparkVersion,
> > "org.apache.spark" %% "spark-streaming" % sparkVersion,
> > "org.apache.spark" %% "spark-mllib" % sparkVersion
> >   )
> > }
> >
> > assemblyMergeStrategy in assembly := {
> >   case PathList("META-INF", xs @ _*) => MergeStrategy.discard
> >   case x => MergeStrategy.first
> > }
> >
> >
> > plugins.sbt
> >
> > logLevel := Level.Warn
> > addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.8")
> >
>


Re: Create dataframe column from list

2016-07-22 Thread Ashutosh Kumar
http://stackoverflow.com/questions/36382052/converting-list-to-column-in-spark


On Fri, Jul 22, 2016 at 5:15 PM, Divya Gehlot 
wrote:

> Hi,
> Can somebody help me by creating the dataframe column from the scala list .
> Would really appreciate the help .
>
> Thanks ,
> Divya
>


Re: getting null when calculating time diff with unix_timestamp + spark 1.6

2016-07-22 Thread Jacek Laskowski
Hi,

It appears that lag didn't work properly, right? I'm new to it, and
remember that in Scala you'd need to define a WindowSpec. I don't see
one in your SQL query.
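For reference, a rough sketch of what that could look like with an explicit WindowSpec through the DataFrame API in Java rather than raw SQL; the partitioning column "id", the format string and the DataFrame df are assumptions:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lag;
import static org.apache.spark.sql.functions.unix_timestamp;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

String timeFmt = "yyyy-MM-dd HH:mm:ss";                    // assumed format
WindowSpec w = Window.partitionBy("id").orderBy("time1");  // assumed columns

// Previous row's time2, in seconds since the epoch
Column prevTs = lag(unix_timestamp(col("time2"), timeFmt), 1).over(w);

// Note: the first row of each partition has no previous row, so its diff is null
DataFrame lags = df.withColumn("time_diff",
    unix_timestamp(col("time1"), timeFmt).minus(prevTs));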

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Jul 21, 2016 at 5:50 AM, Divya Gehlot  wrote:
> Hi,
>
> val lags=sqlContext.sql("select *,(unix_timestamp(time1,'$timeFmt') -
> lag(unix_timestamp(time2,'$timeFmt'))) as time_diff  from df_table");
>
> Instead of the time difference in seconds I am getting null.
>
> Would really appreciate the help.
>
>
> Thanks,
> Divya

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Create dataframe column from list

2016-07-22 Thread Jacek Laskowski
Hi,

Doh, just rebuilding Spark so...writing off the top of my head.

import org.apache.spark.sql.functions.col
val cols = Seq("hello", "world")
val columns = cols.map(col)   // Seq[Column]

See 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jul 22, 2016 at 1:45 PM, Divya Gehlot  wrote:
> Hi,
> Can somebody help me by creating the dataframe column from the scala list .
> Would really appreciate the help .
>
> Thanks ,
> Divya

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread Jacek Laskowski
Hi,

There has never been 0.13.8 for sbt-assembly AFAIK. Use 0.14.3 and
start over. See
https://github.com/jaceklaskowski/spark-workshop/tree/master/solutions/spark-external-cluster-manager
for a sample Scala/sbt project with Spark 2.0 RC5.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jul 22, 2016 at 2:08 PM, janardhan shetty
 wrote:
> Hi,
>
> I was setting up my development environment.
>
> Local Mac laptop setup
> IntelliJ IDEA 14CE
> Scala
> Sbt (Not maven)
>
> Error:
> $ sbt package
> [warn] ::
> [warn] ::  UNRESOLVED DEPENDENCIES ::
> [warn] ::
> [warn] :: com.eed3si9n#sbt-assembly;0.13.8: not found
> [warn] ::
> [warn]
> [warn] Note: Some unresolved dependencies have extra attributes.  Check
> that these dependencies exist with the requested attributes.
> [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
> sbtVersion=0.13)
> [warn]
> [warn] Note: Unresolved dependencies path:
> [warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
> sbtVersion=0.13)
> (/Users/jshetty/sparkApplications/MainTemplate/project/plugins.sbt#L2-3)
> [warn]   +- default:maintemplate-build:0.1-SNAPSHOT
> (scalaVersion=2.10, sbtVersion=0.13)
> sbt.ResolveException: unresolved dependency:
> com.eed3si9n#sbt-assembly;0.13.8: not found
> sbt.ResolveException: unresolved dependency:
> com.eed3si9n#sbt-assembly;0.13.8: not found
> at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:291)
> at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:188)
> at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:165)
> at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
> at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
> at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:132)
> at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
> at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
> at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
>
>
>
> build.sbt:
>
> name := "MainTemplate"
> version := "1.0"
> scalaVersion := "2.11.8"
> libraryDependencies ++= {
>   val sparkVersion = "2.0.0-preview"
>   Seq(
> "org.apache.spark" %% "spark-core" % sparkVersion,
> "org.apache.spark" %% "spark-sql" % sparkVersion,
> "org.apache.spark" %% "spark-streaming" % sparkVersion,
> "org.apache.spark" %% "spark-mllib" % sparkVersion
>   )
> }
>
> assemblyMergeStrategy in assembly := {
>   case PathList("META-INF", xs @ _*) => MergeStrategy.discard
>   case x => MergeStrategy.first
> }
>
>
> plugins.sbt
>
> logLevel := Level.Warn
> addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.8")
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: ml models distribution

2016-07-22 Thread Jacek Laskowski
Hehe, Sean. I knew that (and I knew the answer), but meant to ask a
co-question to help to find the answer *together* :)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jul 22, 2016 at 12:52 PM, Sean Owen  wrote:
> Machine Learning
>
> If you mean, how do you distribute a new model in your application,
> then there's no magic to it. Just reference the new model in the
> functions you're executing in your driver.
>
> If you implemented some other manual way of deploying model info, just
> do that again. There's no special thing to know.
>
> On Fri, Jul 22, 2016 at 11:39 AM, Jacek Laskowski  wrote:
>> Hi,
>>
>> What's a ML model?
>>
>> (I'm sure once we found out the answer you'd know the answer for your
>> question :))
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Jul 22, 2016 at 11:49 AM, Sergio Fernández  wrote:
>>> Hi,
>>>
>>>  I have one question:
>>>
>>> How is the ML models distribution done across all nodes of a Spark cluster?
>>>
>>> I'm thinking about scenarios where the pipeline implementation does not
>>> necessary need to change, but the models have been upgraded.
>>>
>>> Thanks in advance.
>>>
>>> Best regards,
>>>
>>> --
>>> Sergio Fernández
>>> Partner Technology Manager
>>> Redlink GmbH
>>> m: +43 6602747925
>>> e: sergio.fernan...@redlink.co
>>> w: http://redlink.co
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Unresolved dependencies while creating spark application Jar

2016-07-22 Thread janardhan shetty
Hi,

I was setting up my development environment.

Local Mac laptop setup
IntelliJ IDEA 14CE
Scala
Sbt (Not maven)

Error:
$ sbt package
[warn] ::
[warn] ::  UNRESOLVED DEPENDENCIES ::
[warn] ::
[warn] :: com.eed3si9n#sbt-assembly;0.13.8: not found
[warn] ::
[warn]
[warn] Note: Some unresolved dependencies have extra attributes.  Check
that these dependencies exist with the requested attributes.
[warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
sbtVersion=0.13)
[warn]
[warn] Note: Unresolved dependencies path:
[warn] com.eed3si9n:sbt-assembly:0.13.8 (scalaVersion=2.10,
sbtVersion=0.13)
(/Users/jshetty/sparkApplications/MainTemplate/project/plugins.sbt#L2-3)
[warn]   +- default:maintemplate-build:0.1-SNAPSHOT
(scalaVersion=2.10, sbtVersion=0.13)
sbt.ResolveException: unresolved dependency:
com.eed3si9n#sbt-assembly;0.13.8: not found
sbt.ResolveException: unresolved dependency:
com.eed3si9n#sbt-assembly;0.13.8: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:291)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:188)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:165)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:132)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)



build.sbt:

name := "MainTemplate"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= {
  val sparkVersion = "2.0.0-preview"
  Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-mllib" % sparkVersion
  )
}

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}


plugins.sbt

logLevel := Level.Warn
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.8")


Create dataframe column from list

2016-07-22 Thread Divya Gehlot
Hi,
Can somebody help me by creating the dataframe column from the scala list .
Would really appreciate the help .

Thanks ,
Divya


Re: MLlib, Java, and DataFrame

2016-07-22 Thread Marco Mistroni
Hello Jean
 you can take your current DataFrame and send it to mllib (I was doing that
because I didn't know the ml package), but the process is a little bit cumbersome


1. go from DataFrame to an RDD of [LabeledPoint]
2. run your ML model
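A rough sketch of step 1 in Java, assuming a DataFrame whose first column is a numeric label and the next two are numeric features (the column layout is an assumption):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.sql.Row;

JavaRDD<LabeledPoint> training = df.javaRDD().map((Row row) ->
    new LabeledPoint(
        row.getDouble(0),                                  // label
        Vectors.dense(row.getDouble(1), row.getDouble(2))  // features
    ));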

i'd suggest you stick to DataFrame + ml package :)

hth



On Fri, Jul 22, 2016 at 4:41 AM, Jean Georges Perrin  wrote:

> Hi,
>
> I am looking for some really super basic examples of MLlib (like a linear
> regression over a list of values) in Java. I have found a few, but I only
> saw them using JavaRDD... and not DataFrame.
>
> I was kind of hoping to take my current DataFrame and send them in MLlib.
> Am I too optimistic? Do you know/have any example like that?
>
> Thanks!
>
> jg
>
>
> Jean Georges Perrin
> j...@jgp.net / @jgperrin
>
>
>
>
>


Re: ml models distribution

2016-07-22 Thread Sean Owen
Machine Learning

If you mean, how do you distribute a new model in your application,
then there's no magic to it. Just reference the new model in the
functions you're executing in your driver.

If you implemented some other manual way of deploying model info, just
do that again. There's no special thing to know.

On Fri, Jul 22, 2016 at 11:39 AM, Jacek Laskowski  wrote:
> Hi,
>
> What's a ML model?
>
> (I'm sure once we found out the answer you'd know the answer for your
> question :))
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jul 22, 2016 at 11:49 AM, Sergio Fernández  wrote:
>> Hi,
>>
>>  I have one question:
>>
>> How is the ML models distribution done across all nodes of a Spark cluster?
>>
>> I'm thinking about scenarios where the pipeline implementation does not
>> necessary need to change, but the models have been upgraded.
>>
>> Thanks in advance.
>>
>> Best regards,
>>
>> --
>> Sergio Fernández
>> Partner Technology Manager
>> Redlink GmbH
>> m: +43 6602747925
>> e: sergio.fernan...@redlink.co
>> w: http://redlink.co
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
Thanks Bryan - I keep forgetting about the examples... This is almost it :) I 
can work with that :)


> On Jul 22, 2016, at 1:39 AM, Bryan Cutler  wrote:
> 
> Hi JG,
> 
> If you didn't know this, Spark MLlib has 2 APIs, one of which uses 
> DataFrames.  Take a look at this example 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
>  
> 
> 
> This example uses a Dataset<Row>, which is type equivalent to a DataFrame.
> 
> 
> On Thu, Jul 21, 2016 at 8:41 PM, Jean Georges Perrin  > wrote:
> Hi,
> 
> I am looking for some really super basic examples of MLlib (like a linear 
> regression over a list of values) in Java. I have found a few, but I only saw 
> them using JavaRDD... and not DataFrame.
> 
> I was kind of hoping to take my current DataFrame and send them in MLlib. Am 
> I too optimistic? Do you know/have any example like that?
> 
> Thanks!
> 
> jg
> 
> 
> Jean Georges Perrin
> j...@jgp.net  / @jgperrin
> 
> 
> 
> 
> 



Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
Hi Jules,

Thanks, but not really: I know what DataFrames are and I actually use them -
especially as the RDD will slowly fade. A lot of the examples I see are focusing
on cleaning / prepping the data, which is an important part, but not really on
the "after"... Sorry if I am not completely clear.

> On Jul 22, 2016, at 1:08 AM, Jules Damji  wrote:
> 
> Is this what you had in mind?
> 
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>  
> 
> 
> Cheers 
> Jules 
> 
> Sent from my iPhone
> Pardon the dumb thumb typos :)
> 
> On Jul 21, 2016, at 8:41 PM, Jean Georges Perrin  > wrote:
> 
>> Hi,
>> 
>> I am looking for some really super basic examples of MLlib (like a linear 
>> regression over a list of values) in Java. I have found a few, but I only 
>> saw them using JavaRDD... and not DataFrame.
>> 
>> I was kind of hoping to take my current DataFrame and send them in MLlib. Am 
>> I too optimistic? Do you know/have any example like that?
>> 
>> Thanks!
>> 
>> jg
>> 
>> 
>> Jean Georges Perrin
>> j...@jgp.net  / @jgperrin
>> 
>> 
>> 
>> 



Re: ml models distribution

2016-07-22 Thread Jacek Laskowski
Hi,

What's a ML model?

(I'm sure once we found out the answer you'd know the answer for your
question :))

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jul 22, 2016 at 11:49 AM, Sergio Fernández  wrote:
> Hi,
>
>  I have one question:
>
> How is the ML models distribution done across all nodes of a Spark cluster?
>
> I'm thinking about scenarios where the pipeline implementation does not
> necessary need to change, but the models have been upgraded.
>
> Thanks in advance.
>
> Best regards,
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernan...@redlink.co
> w: http://redlink.co

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



ml models distribution

2016-07-22 Thread Sergio Fernández
Hi,

 I have one question:

How is the ML models distribution done across all nodes of a Spark cluster?

I'm thinking about scenarios where the pipeline implementation does not
necessary need to change, but the models have been upgraded.

Thanks in advance.

Best regards,

-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co


Re: spark and plot data

2016-07-22 Thread Marco Colombo
Take a look at zeppelin

http://zeppelin.apache.org

On Thursday, 21 July 2016, Andy Davidson  wrote:

> Hi Pseudo
>
> Plotting, graphing, data visualization, report generation are common needs
> in scientific and enterprise computing.
>
> Can you tell me more about your use case? What is it about the current
> process / workflow that you think could be improved by pushing plotting (I
> assume you mean plotting and graphing) into Spark?
>
>
> In my personal work all the graphing is done in the driver on summary
> stats calculated using spark. So for me using standard python libs has not
> been a problem.
>
> Andy
>
> From: pseudo oduesp  >
> Date: Thursday, July 21, 2016 at 8:30 AM
> To: "user @spark"  >
> Subject: spark and plot data
>
> Hi,
> I know Spark is an engine to compute large data sets, but for me, working
> with pyspark, it is a very wonderful machine.
>
> My question: we don't have tools for plotting data, so each time we have to
> switch and go back to Python to use plot.
> But when you have a large result, a scatter plot or ROC curve, you can't use
> collect to take the data.
>
> Does someone have a proposal for plotting?
>
> Thanks
>
>

-- 
Ing. Marco Colombo


Re: How spark decides whether to do BroadcastHashJoin or SortMergeJoin

2016-07-22 Thread Matthias Niehoff
Hi,

there is a property you can set. Quoting the docs (
http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
)

spark.sql.autoBroadcastJoinThreshold (default: 10485760, i.e. 10 MB): Configures
the maximum size in bytes for a table that will be broadcast to all worker
nodes when performing a join. By setting this value to -1, broadcasting can
be disabled.
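So, to force the planner away from BroadcastHashJoin, you can disable the broadcast threshold; a minimal sketch, assuming a SQLContext is in scope:

// Disable automatic broadcast joins so the planner falls back to SortMergeJoin
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1");

// or equivalently at submit time:
//   spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1 ...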

2016-07-20 10:07 GMT+02:00 raaggarw :

> Hi,
>
> How does Spark decide/optimize internally as to when it needs to do a
> BroadcastHashJoin vs a SortMergeJoin? Is there any way we can guide it from
> outside, or through options, as to which join to use?
> Because in my case, when I am trying to do a join, Spark makes that join a
> BroadcastHashJoin internally, and when the join is actually being executed it
> waits for the broadcast to be done (which is big data), resulting in a timeout.
> I do not want to increase the value of the timeout, i.e.
> "spark.sql.broadcastTimeout". Rather, I want this to be done via
> SortMergeJoin. How can I enforce that?
>
> Thanks
> Ravi
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-spark-decides-whether-to-do-BroadcastHashJoin-or-SortMergeJoin-tp27369.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Matthias Niehoff | IT-Consultant | Agile Software Factory  | Consulting
codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Deutschland
tel: +49 (0) 721.9595-681 | fax: +49 (0) 721.9595-666 | mobil: +49 (0)
172.1702676
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
www.more4fi.de

Registered office: Solingen | HRB 25917 | Amtsgericht Wuppertal
Executive board: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
Supervisory board: Patric Fedlmeier (chairman) . Klaus Jäger . Jürgen Schütz

This e-mail, including any attached files, contains confidential and/or legally
protected information. If you are not the intended recipient or have received
this e-mail in error, please inform the sender immediately and delete this
e-mail and any attached files. The unauthorized copying, use or opening of
attached files, as well as the unauthorized forwarding of this e-mail, is not
permitted


Re: GraphX performance and settings

2016-07-22 Thread B YL
Hi,
We are also running a Connected Components test with GraphX. We ran experiments
using Spark 1.6.1 on machines which have 16 cores with 2-way SMT and ran only a
single executor per machine. We got this result:
On a Facebook-like graph with 2^24 edges, using 4 executors with 90GB each, it took
100 seconds to find the connected components. It took 600s when we tried to
increase the number of edges to 2^27. We are very interested in how you got
such good results.
We would really appreciate it if you could answer the following questions:
1. Which connected components code did you use? Did you use the default
org.apache.spark.graphx.ConnectedComponents lib, which is implemented using
Pregel? Have you made any changes?
2. By saying 20 cores with 2-way, did you mean 40 CPU threads in total?
3. In addition to the settings you have mentioned, have you made any other changes
in the files spark-defaults.conf and spark-env.sh? Could you please just paste the
two files so that we can compare?
4. When you mention Parallel GC, could you please give more detailed guidance on
how to optimize this setting? Which parameters should we set?

We appreciate any feedback!
Thank you,
Yilei

On 2016-06-16 09:01 (+0800), Maja Kabiljo wrote: 
> Hi,
>
> We are running some experiments with GraphX in order to compare it with other
> systems. There are multiple settings which significantly affect performance,
> and we experimented a lot in order to tune them well. I'll share here what
> are the best we found so far and which results we got with them, and would
> really appreciate if anyone who used GraphX before has any advice on what
> else can make it even better, or confirm that these results are as good as it
> gets.
>
> Algorithms we used are pagerank and connected components. We used Twitter and
> UK graphs from the GraphX paper
> (https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf), and
> also generated graphs with properties similar to Facebook social graph with
> various number of edges. Apart from performance we tried to see what is the
> minimum amount of resources it requires in order to handle graph of some
> size.
>
> We ran experiments using Spark 1.6.1, on machines which have 20 cores with
> 2-way SMT, always fixing number of executors (min=max=initial), giving 40GB
> or 80GB per executor, and making sure we run only a single executor per
> machine. Additionally we used:
>
> * spark.shuffle.manager=hash, spark.shuffle.service.enabled=false
> * Parallel GC
> * PartitionStrategy.EdgePartition2D
> * 8*numberOfExecutors partitions
>
> Here are some data points which we got:
>
> * Running on Facebook-like graph with 2 billion edges, using 4 executors with
> 80GB each it took 451 seconds to do 20 iterations of pagerank and 236 seconds
> to find connected components. It failed when we tried to use 2 executors, or
> 4 executors with 40GB each.
> * For graph with 10 billion edges we needed 16 executors with 80GB each (it
> failed with 8), 1041 seconds for 20 iterations of pagerank and 716 seconds
> for connected components.
> * Twitter-2010 graph (1.5 billion edges), 8 executors, 40GB each, pagerank
> 473s, connected components 264s. With 4 executors 80GB each it worked but was
> struggling (pr 2475s, cc 4499s), with 8 executors 80GB pr 362s, cc 255s.
>
> One more thing, we were not able to reproduce what's mentioned in the paper
> about fault tolerance (section 5.2). If we kill an executor during first few
> iterations it recovers successfully, but if killed in later iterations
> reconstruction of each iteration starts taking exponentially longer and
> doesn't finish after letting it run for a few hours. Are there some
> additional parameters which we need to set in order for this to work?
>
> Any feedback would be highly appreciated!
>
> Thank you,
> Maja
>


Sent from my iPhone
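On the question of how those settings are applied: apart from the partition strategy, they are plain Spark configuration, so they can go into spark-defaults.conf or onto the SparkConf. A sketch in Java, with Parallel GC enabled through the executor/driver JVM options (the exact flags to pass are an assumption):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("graphx-connected-components")
    .set("spark.shuffle.manager", "hash")
    .set("spark.shuffle.service.enabled", "false")
    // Parallel GC for the executor and driver JVMs
    .set("spark.executor.extraJavaOptions", "-XX:+UseParallelGC")
    .set("spark.driver.extraJavaOptions", "-XX:+UseParallelGC");

// PartitionStrategy.EdgePartition2D is not a configuration option; it is applied
// in the (Scala) GraphX job itself via graph.partitionBy(PartitionStrategy.EdgePartition2D).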




Re: MLlib, Java, and DataFrame

2016-07-22 Thread VG
Interesting. Thanks for this information.

On Fri, Jul 22, 2016 at 11:26 AM, Bryan Cutler  wrote:

> ML has a DataFrame based API, while MLlib is RDDs and will be deprecated
> as of Spark 2.0.
>
> On Thu, Jul 21, 2016 at 10:41 PM, VG  wrote:
>
>> Why do we have these 2 packages ... ml and mllib?
>> What is the difference between these?
>>
>>
>>
>> On Fri, Jul 22, 2016 at 11:09 AM, Bryan Cutler  wrote:
>>
>>> Hi JG,
>>>
>>> If you didn't know this, Spark MLlib has 2 APIs, one of which uses
>>> DataFrames.  Take a look at this example
>>> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
>>>
>>> This example uses a Dataset<Row>, which is type equivalent to a
>>> DataFrame.
>>>
>>>
>>> On Thu, Jul 21, 2016 at 8:41 PM, Jean Georges Perrin 
>>> wrote:
>>>
 Hi,

 I am looking for some really super basic examples of MLlib (like a
 linear regression over a list of values) in Java. I have found a few, but I
 only saw them using JavaRDD... and not DataFrame.

 I was kind of hoping to take my current DataFrame and send them in
 MLlib. Am I too optimistic? Do you know/have any example like that?

 Thanks!

 jg


 Jean Georges Perrin
 j...@jgp.net / @jgperrin





>>>
>>
>


Re: Programmatic use of UDFs from Java

2016-07-22 Thread Bryan Cutler
Everett, I had the same question today and came across this old thread.
Not sure if there has been any more recent work to support this.
http://apache-spark-developers-list.1001551.n3.nabble.com/Using-UDFs-in-Java-without-registration-td12497.html


On Thu, Jul 21, 2016 at 10:10 AM, Everett Anderson  wrote:

> Hi,
>
> In the Java Spark DataFrames API, you can create a UDF, register it, and
> then access it by string name by using the convenience UDF classes in
> org.apache.spark.sql.api.java
> 
> .
>
> Example
>
> UDF1 testUdf1 = new UDF1<>() { ... }
>
> sqlContext.udf().register("testfn", testUdf1, DataTypes.LongType);
>
> DataFrame df2 = df.withColumn("new_col", *functions.callUDF("testfn"*,
> df.col("old_col")));
>
> However, I'd like to avoid registering these by name, if possible, since I
> have many of them and would need to deal with name conflicts.
>
> There are udf() methods like this that seem to be from the Scala API
> ,
> where you don't have to register everything by name first.
>
> However, using those methods from Java would require interacting with
> Scala's scala.reflect.api.TypeTags.TypeTag. I'm having a hard time
> figuring out how to create a TypeTag from Java.
>
> Does anyone have an example of using the udf() methods from Java?
>
> Thanks!
>
> - Everett
>
>