Re: lift coefficient

2016-07-22 Thread ndjido
Just apply the Lift = Recall / Support formula with respect to a given threshold on your population distribution. The computation is quite straightforward. Cheers, Ardo > On 20 Jul 2016, at 15:05, pseudo oduesp wrote: > > Hi, > how can we calculate the lift coeff from
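
For anyone following along, a minimal Scala sketch of that computation, assuming a scored population and a hypothetical threshold (all names and values here are illustrative, not from the thread):

    // Cumulative lift at a threshold, taking Lift = Recall / Support at face value.
    // Each pair is (model score, true label), with 1.0 marking a positive.
    val scores = Seq((0.9, 1.0), (0.8, 1.0), (0.7, 0.0), (0.4, 1.0), (0.2, 0.0))
    val threshold = 0.5                                    // hypothetical cut-off

    val positives = scores.count(_._2 == 1.0)              // positives in the population
    val selected  = scores.filter(_._1 >= threshold)       // records above the threshold
    val support   = selected.size.toDouble / scores.size   // fraction of population targeted
    val recall    = selected.count(_._2 == 1.0).toDouble / positives
    val lift      = recall / support                       // > 1.0 beats random targeting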

Re: Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread Pedro Rodriguez
You could either do monotonically_increasing_id or use a window function and rank. The first is a simple Spark SQL function; Databricks has a pretty helpful post on how to use window functions (in this case the whole data set is the window). On Fri, Jul 22, 2016 at 12:20 PM, Marco Mistroni
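
A short Scala sketch of both options, assuming a Spark 2.0-style session (column names are made up for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

    val spark = SparkSession.builder.appName("index-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b", "c").toDF("name")

    // Option 1: unique (but not necessarily consecutive) ids, no shuffle.
    val withId = df.withColumn("id", monotonically_increasing_id())

    // Option 2: consecutive numbering via a window over the whole data set
    // (a single partition, so only suitable for small data).
    val w = Window.orderBy($"name")
    val numbered = df.withColumn("idx", row_number().over(w))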

Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Pedro Rodriguez
I haven't used SparkR/R before, only the Scala/Python APIs, so I don't know for sure. I am guessing that if things are in a DataFrame they were read either from some disk source (S3/HDFS/file/etc) or they were created from parallelize. If you are using the first, Spark will for the most part choose a

Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Neil Chang
Thanks Pedro, so to use SparkR dapply on a SparkDataFrame, don't we need to partition the DataFrame first? The example in the doc doesn't seem to do this. Without knowing how it is partitioned, how can one write the function to process each partition? Neil On Fri, Jul 22, 2016 at 5:56 PM, Pedro Rodriguez

Re: spark and plot data

2016-07-22 Thread Taotao.Li
hi, pseudo, I've posted a blog before, spark-dataframe-introduction; for me, I use Spark DataFrames [or RDDs] to do the logic/calculations on all the datasets, and then transform the result into a pandas dataframe, and make

Re: How to connect HBase and Spark using Python?

2016-07-22 Thread Benjamin Kim
It is included in Cloudera’s CDH 5.8. > On Jul 22, 2016, at 6:13 PM, Mail.com wrote: > > Hbase Spark module will be available with Hbase 2.0. Is that out yet? > >> On Jul 22, 2016, at 8:50 PM, Def_Os wrote: >> >> So it appears it should be possible

Re: How to connect HBase and Spark using Python?

2016-07-22 Thread Mail.com
Hbase Spark module will be available with Hbase 2.0. Is that out yet? > On Jul 22, 2016, at 8:50 PM, Def_Os wrote: > > So it appears it should be possible to use HBase's new hbase-spark module, if > you follow this pattern: >

Re: How to connect HBase and Spark using Python?

2016-07-22 Thread Def_Os
So it appears it should be possible to use HBase's new hbase-spark module, if you follow this pattern: https://hbase.apache.org/book.html#_sparksql_dataframes Unfortunately, when I run my example from PySpark, I get the following exception: > py4j.protocol.Py4JJavaError: An error occurred while
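
For reference, the read side of that pattern from the HBase book looks roughly like the Scala sketch below (the table name and column mapping are invented, an existing sqlContext is assumed, and the exact option keys may vary across hbase-spark versions):

    // Sketch only: reading an HBase table through the hbase-spark data source.
    val df = sqlContext.read
      .format("org.apache.hadoop.hbase.spark")
      .option("hbase.table", "t1")                        // hypothetical table
      .option("hbase.columns.mapping",
        "KEY_FIELD STRING :key, A_FIELD STRING c:a")      // rowkey plus one column
      .load()
    df.registerTempTable("t1")
    sqlContext.sql("SELECT KEY_FIELD, A_FIELD FROM t1").show()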

Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-22 Thread RK Aduri
I can see a large number of collects happening on the driver and eventually the driver is running out of memory. (I am not sure whether you have persisted any RDD or data frame.) Maybe you would want to avoid doing so many collects, or avoid persisting unneeded data in memory. To begin with, you may want
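
To make that concrete, a hedged sketch of the usual fix, replacing whole-dataset collects with bounded or distributed alternatives (df stands in for whatever is being collected):

    // Pulling everything to the driver can OOM it:
    // val all = df.collect()

    // Prefer a bounded sample for inspection...
    val preview = df.take(100)

    // ...or keep the work distributed and write results out instead.
    df.write.parquet("hdfs:///tmp/results")   // path is illustrative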

ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-22 Thread Ascot Moss
Hi, please help! When running the random forest training phase in cluster mode, I got "GC overhead limit exceeded". I have used two parameters when submitting the job to the cluster: --driver-memory 64g \ --executor-memory 8g \ My current settings: (spark-defaults.conf) spark.executor.memory

Re: Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Andy Davidson
Hi Ted, in general I want this application to use all available resources. I just bumped the driver memory to 2G. I also bumped the executor memory up to 2G. It will take a couple of hours before I know if this made a difference or not. I am not sure if setting executor memory is a good idea. I

Re: spark and plot data

2016-07-22 Thread Pedro Rodriguez
As of the most recent 0.6.0 release it's partially alleviated, but still not great (compared to something like Jupyter). They can be "downloaded" but it's only really meaningful for importing them back into Zeppelin. It would be great if they could be exported as HTML or PDF, but at present they can't

Re: How to search on a Dataset / RDD <Row, Long >

2016-07-22 Thread Pedro Rodriguez
You might look at monotonically_increasing_id() here http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions instead of converting it to an RDD, since you pay a performance penalty for that. If you want to change the name you can do something like this (in scala
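
The message is cut off; the rename Pedro is describing likely looks something like this Scala sketch (df and its column names are assumed from the thread's context):

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // Stay in the DataFrame API: add the id, then rename it.
    val indexed = df.withColumn("index", monotonically_increasing_id())
    val renamed = indexed.withColumnRenamed("index", "row_id")
    // equivalently: df.select(monotonically_increasing_id().alias("row_id"), df("name"))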

Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Pedro Rodriguez
This should work and I don't think triggers any actions: df.rdd.partitions.length On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang wrote: > Seems no function does this in Spark 2.0 preview? > -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC

Re: Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Ted Yu
How much heap memory do you give the driver ? On Fri, Jul 22, 2016 at 2:17 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > Given I get a stack trace in my python notebook I am guessing the driver > is running out of memory? > > My app is simple it creates a list of dataFrames from

Re: spark and plot data

2016-07-22 Thread Gourav Sengupta
The biggest stumbling block to using Zeppelin has been that we cannot download the notebooks, cannot export them, and certainly cannot sync them back to GitHub without mind-numbing and sometimes irritating hacks. Have those issues been resolved? Regards, Gourav On Fri, Jul 22, 2016 at 2:22 PM,

Re: Fatal error when using broadcast variables and checkpointing in Spark Streaming

2016-07-22 Thread Joe Panciera
I realized that there's an error in the code. Corrected: from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream sc = SparkContext(appName="FileAutomation") # Create streaming context from
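
The quoted code is truncated; for context, the unpersist-then-rebroadcast pattern Joe describes usually takes this shape (a Scala sketch with hypothetical helpers, since the original is PySpark):

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.streaming.dstream.DStream

    // `loadState` is a hypothetical function fetching the latest state.
    def run(stream: DStream[String], sc: SparkContext,
            loadState: () => Map[String, String]): Unit = {
      var state: Broadcast[Map[String, String]] = sc.broadcast(loadState())
      stream.foreachRDD { rdd =>
        val snapshot = state                      // capture for the executor closure
        rdd.foreach(r => println(s"$r -> ${snapshot.value.size} entries"))
        state.unpersist()                         // drop stale copies on executors
        state = sc.broadcast(loadState())         // re-broadcast the fresh state
      }
    }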

Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Andy Davidson
Given I get a stack trace in my Python notebook, I am guessing the driver is running out of memory? My app is simple: it creates a list of dataFrames from s3://, and counts each one. I would not think this would take a lot of driver memory. I am not running my code locally. It's using 12 cores. Each

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Marco Mistroni
Hi Inam, I sorted it. I am replying to all, in case anyone else follows the blog and gets into the same issue. First off, the environment: I have tested the sample using purely spark-1.6.1, no Hive, no Hadoop. I launched pyspark as follows: pyspark --packages com.databricks:spark-csv_2.10:1.4.0 -

Fatal error when using broadcast variables and checkpointing in Spark Streaming

2016-07-22 Thread Joe Panciera
Hi, I'm attempting to use broadcast variables to update stateful values used across the cluster for processing. Essentially, I have a function that is executed in .foreachRDD that updates the broadcast variable by calling unpersist() and then rebroadcasting. This works without issues when I

How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Neil Chang
Seems no function does this in Spark 2.0 preview?

Distributed Matrices - spark mllib

2016-07-22 Thread Gourav Sengupta
Hi, I have a sparse matrix and I wanted to add up the values of a particular row, which is identified by a particular number. from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry mat = CoordinateMatrix(all_scores_df.select('ID_1','ID_2','value').rdd.map(lambda row:
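
The post is truncated, but for the stated goal (summing one row of a sparse CoordinateMatrix) a Scala sketch against the same MLlib API would be (an existing SparkContext sc is assumed; data and row id are invented):

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    val entries = sc.parallelize(Seq(
      MatrixEntry(0, 1, 2.0), MatrixEntry(0, 2, 3.5), MatrixEntry(1, 0, 1.0)))
    val mat = new CoordinateMatrix(entries)

    val rowId = 0L                            // hypothetical row of interest
    val rowSum = mat.entries                  // RDD[MatrixEntry]
      .filter(_.i == rowId)                   // keep that row's entries
      .map(_.value)
      .sum()                                  // 5.5 for this sample data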

Spark, Scala, and DNA sequencing

2016-07-22 Thread James McCabe
Hi! I hope this may be of use/interest to someone: Spark, a Worked Example: Speeding Up DNA Sequencing http://scala-bility.blogspot.nl/2016/07/spark-worked-example-speeding-up-dna.html James - To unsubscribe e-mail:

Re: ml models distribution

2016-07-22 Thread Chris Fregly
hey everyone - this concept of deploying your Spark ML pipelines and algos into production (user-facing production) has been coming up a lot recently. So much so that I've dedicated the last few months of my research and engineering efforts to building out the infrastructure to support this in a

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
Scaladoc is already in the code, just not the html docs On Fri, Jul 22, 2016 at 1:46 PM, Srikanth wrote: > Yeah, that's what I thought. We need to redefine not just restart. > Thanks for the info! > > I do see the usage of subscribe[K,V] in your DStreams example. > Looks

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Srikanth
Yeah, that's what I thought. We need to redefine, not just restart. Thanks for the info! I do see the usage of subscribe[K,V] in your DStreams example. Looks simple but it's not very obvious how it works :-) I'll watch out for the docs and ScalaDoc. Srikanth On Fri, Jul 22, 2016 at 2:15 PM, Cody

Hive Exception

2016-07-22 Thread Inam Ur Rehman
Hi All, I am really stuck here. I know this has been asked before but it just won't resolve for me. I am using Anaconda distribution 3.5 and I have built spark-1.6.2 twice, the first time with Hive and JDBC support through this command: *mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive

Re: Load selected rows with sqlContext in the dataframe

2016-07-22 Thread sujeet jog
Thanks Todd. On Thu, Jul 21, 2016 at 9:18 PM, Todd Nist wrote: > You can set the dbtable to this: > > .option("dbtable", "(select * from master_schema where 'TID' = '100_0')") > > HTH, > > Todd > > > On Thu, Jul 21, 2016 at 10:59 AM, sujeet jog wrote:

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Marco Mistroni
How did you build your Spark distribution? Could you detail the steps? Hive AFAIK is dependent on Hadoop. If you don't configure your Spark correctly it will assume Hadoop is your filesystem... I'm not using Hadoop or Hive. You might want to get a Cloudera distribution, which has Spark, Hadoop and Hive

How to search on a Dataset / RDD <Row, Long >

2016-07-22 Thread VG
Any suggestions here, please. I basically need the ability to look up *name -> index* and *index -> name* in the code. -VG On Fri, Jul 22, 2016 at 6:40 PM, VG wrote: > Hi All, > > I am really confused how to proceed further. Please help. > > I have a dataset created as

Re: Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread Marco Mistroni
Hi, so you have a data frame; then use zipWithIndex and create a tuple. I'm not sure if the DF API has something useful for zipWithIndex, but you can: - get a data frame - convert it to an RDD (there's a toRDD) - do a zipWithIndex. That will give you an RDD with 3 fields... I don't think you can update a df

Re: Creating a DataFrame from scratch

2016-07-22 Thread Jean Georges Perrin
You're right, it's the same behavior... Oh well... I wanted something easy :( > On Jul 22, 2016, at 12:41 PM, Everett Anderson > wrote: > > Actually, sorry, my mistake, you're calling > > DataFrame df =

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
No, restarting from a checkpoint won't do it, you need to re-define the stream. Here's the jira for the 0.10 integration https://issues.apache.org/jira/browse/SPARK-12177 I haven't gotten docs completed yet, but there are examples at

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Srikanth
In Spark 1.x, if we restart from a checkpoint, will it read from new partitions? If you can, pls point us to some doc/link that talks about Kafka 0.10 integ in Spark 2.0. On Fri, Jul 22, 2016 at 1:33 PM, Cody Koeninger wrote: > For the integration for kafka 0.8, you are

Re: spark worker continuously trying to connect to master and failed in standalone mode

2016-07-22 Thread Neil Chang
Thank you guys, it is the port issue. On Wed, Jul 20, 2016 at 11:03 AM, Igor Berman wrote: > in addition check what ip the master is binding to(with nestat) > > On 20 July 2016 at 06:12, Andrew Ehrlich wrote: > >> Troubleshooting steps: >> >> $

Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread VG
Great, thanks a ton for helping out on this, Sean. I somehow messed this up (and was running in loops for the last 2 hours). Thanks again. -VG On Fri, Jul 22, 2016 at 11:28 PM, Sean Owen wrote: > You mark these provided, which is correct. If the version of Scala > provided at

Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread Aaron Ilovici
Your error stems from spark.ml, and in your pom mllib is the only dependency that is 2.10. Is there a reason for this? I.e., you tell Maven mllib 2.10 is provided at runtime. Is 2.10 on the machine, or is 2.11? -Aaron From: VG Date: Friday, July 22, 2016 at 1:49 PM To: Sean

Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread Sean Owen
You mark these provided, which is correct. If the version of Scala provided at runtime differs, you'll have a problem. In fact you can also see you mixed Scala versions in your dependencies here. MLlib is on 2.10. On Fri, Jul 22, 2016 at 6:49 PM, VG wrote: > Sean, > > I am
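
To make the fix concrete: every Spark artifact must carry the same Scala suffix. In sbt notation (where %% appends the suffix automatically), a consistent set looks like this sketch:

    // Mixing spark-mllib_2.10 with spark-core_2.11 produces exactly this
    // NoSuchMethodError at runtime; keep one Scala version everywhere.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "2.0.0-preview" % "provided",
      "org.apache.spark" %% "spark-sql"   % "2.0.0-preview" % "provided",
      "org.apache.spark" %% "spark-mllib" % "2.0.0-preview" % "provided"
    )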

Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread VG
Sean, I am only using the Maven dependencies for Spark in my pom file. I don't have anything else. I guess the Maven dependency should resolve to the correct Scala version... shouldn't it? Any ideas? org.apache.spark spark-core_2.11 2.0.0-preview provided org.apache.spark spark-sql_2.11

Re: NoClassDefFoundError with ZonedDateTime

2016-07-22 Thread Jacek Laskowski
On Fri, Jul 22, 2016 at 6:43 AM, Ted Yu wrote: > You can use this command (assuming log aggregation is turned on): > > yarn logs --applicationId XX I don't think it's gonna work for an already-running application (and I wish I were mistaken, since I needed it just yesterday) and

Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread Sean Owen
-dev Looks like you are mismatching the version of Spark you deploy on at runtime then. Sounds like it was built for Scala 2.10 On Fri, Jul 22, 2016 at 6:43 PM, VG wrote: > Using 2.0.0-preview using maven > So all dependencies should be correct I guess > > > org.apache.spark

Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread Inam Ur Rehman
Hello guys, I know it's irrelevant to this topic but I've been looking desperately for a solution. I am facing an exception: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html Please help me; I couldn't find any solution. On

Re: Programmatic use of UDFs from Java

2016-07-22 Thread Everett Anderson
Thanks for the pointer, Bryan! Sounds like I was on the right track in terms of what's available for now. (And Gourav -- I'm certainly interested in migrating to Scala, but our team is mostly Java, Python, and R based right now!) On Thu, Jul 21, 2016 at 11:00 PM, Bryan Cutler

Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread VG
Using 2.0.0-preview with Maven, so all dependencies should be correct, I guess. org.apache.spark spark-core_2.11 2.0.0-preview provided I see in the Maven dependencies that this brings in scala-reflect-2.11.4, scala-compiler-2.11.0 and so on. On Fri, Jul 22, 2016 at 11:04 PM, Aaron Ilovici

Re: running jupyter notebook server Re: spark and plot data

2016-07-22 Thread Inam Ur Rehman
Hello guys, I know it's irrelevant to this topic but I've been looking desperately for a solution. I am facing an exception: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html Please help me; I couldn't find any solution. On

Re: Integration tests for Spark Streaming

2016-07-22 Thread Lars Albertsson
You can find useful discussions in the list archives. I wrote this, which might help you: https://www.mail-archive.com/user%40spark.apache.org/msg48032.html Regards, Lars Albertsson Data engineering consultant www.mapflat.com +46 70 7687109 Calendar: https://goo.gl/tV2hWF On Jun 29, 2016 07:02,

Re: ml models distribution

2016-07-22 Thread Inam Ur Rehman
Hello guys, I know it's irrelevant to this topic but I've been looking desperately for a solution. I am facing an exception: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html Please help me; I couldn't find any solution. Please

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Inam Ur Rehman
Hello guys, I know it's irrelevant to this topic but I've been looking desperately for a solution. I am facing an exception: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html Please help me; I couldn't find any solution. On

Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread Aaron Ilovici
What version of Spark/Scala are you running? -Aaron

Re: Rebalancing when adding kafka partitions

2016-07-22 Thread Cody Koeninger
For the integration for Kafka 0.8, you are literally starting a streaming job against a fixed set of topics/partitions. It will not change throughout the job, so you'll need to restart the Spark job if you change Kafka partitions. For the integration for Kafka 0.10 / Spark 2.0, if you use
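
A sketch of the Kafka 0.10 direct stream Cody refers to, in Scala (an existing StreamingContext ssc is assumed; brokers, group id and topic are placeholders):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",                 // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group")                         // placeholder

    // Unlike the fixed-assignment 0.8 integration, a Subscribe-based consumer
    // can pick up partitions added after the job starts.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("mytopic"), kafkaParams))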

Error in running JavaALSExample example from spark examples

2016-07-22 Thread VG
I am getting the following error Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror; at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:452) Any suggestions to resolve this

running jupyter notebook server Re: spark and plot data

2016-07-22 Thread Andy Davidson
Hi Pseudo, I do not know much about Zeppelin. What languages are you using? I have been doing my data exploration and graphing using Python, mostly because early on Spark had good support for Python. It's easy to collect() data as a local pandas object. I think at this point R should work well.

Re: Creating a DataFrame from scratch

2016-07-22 Thread Everett Anderson
Actually, sorry, my mistake, you're calling DataFrame df = sqlContext.createDataFrame(data, org.apache.spark.sql.types.NumericType.class); and giving it a list of objects which aren't NumericTypes, but the wildcards in the signature let it happen. I'm curious what'd happen if you gave it

Rebalancing when adding kafka partitions

2016-07-22 Thread Srikanth
Hello, I'd like to understand how Spark Streaming (direct) would handle Kafka partition addition. Will a running job be aware of new partitions and read from them, given that it uses Kafka APIs to query offsets and offsets are handled internally? Srikanth

Re: Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread VG
Hi All, any suggestions for this? Regards, VG On Fri, Jul 22, 2016 at 6:40 PM, VG wrote: > Hi All, > > I am really confused how to proceed further. Please help. > > I have a dataset created as follows: > Dataset b = sqlContext.sql("SELECT bid, name FROM business"); > > Now I

Re: Fast database with writes per second and horizontal scaling

2016-07-22 Thread Marco Colombo
Yes, this is not a question for the Spark user list. BTW, in the DB world, performance depends also on which data you have and which schema you want to use. First set a target, then evaluate the technology. Cassandra can be really fast if you load data via sstableloader or COPY rather than inserting line by line.

Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread janardhan shetty
Adding this to build.sbt worked. Thanks Jacek assemblyMergeStrategy in assembly := { case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first case "application.conf"=>
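
The quoted build.sbt is cut off mid-case; the full block it appears to follow (from the sbt-assembly README) is sketched below, not necessarily janardhan's exact file:

    assemblyMergeStrategy in assembly := {
      case PathList("javax", "servlet", xs @ _*)         => MergeStrategy.first
      case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
      case "application.conf"                            => MergeStrategy.concat
      case x =>
        // Fall back to the default strategy for everything else.
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)
    }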

Re: ml ALS.fit(..) issue

2016-07-22 Thread VG
Can someone please help here? I tried both Scala 2.10 and 2.11 on the system. On Fri, Jul 22, 2016 at 7:59 PM, VG wrote: > I am using version 2.0.0-preview > > > > On Fri, Jul 22, 2016 at 7:47 PM, VG wrote: > >> I am running into the following error when

Creating a DataFrame from scratch

2016-07-22 Thread Jean Georges Perrin
I am trying to build a DataFrame from a list, here is the code: private void start() { SparkConf conf = new SparkConf().setAppName("Data Set from Array").setMaster("local"); SparkContext sc = new SparkContext(conf); SQLContext sqlContext
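
The post is truncated; for anyone landing here, a sketch of building a DataFrame from an in-memory list with an explicit schema (Scala here, although the thread is Java; values are invented):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    val conf = new SparkConf().setAppName("Data Set from Array").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Pass an explicit schema rather than a bean class.
    val schema = StructType(Seq(StructField("value", DoubleType, nullable = false)))
    val rows = sc.parallelize(Seq(1.0, 2.0, 3.0).map(Row(_)))
    val df = sqlContext.createDataFrame(rows, schema)
    df.show()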

Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread Jacek Laskowski
See https://github.com/sbt/sbt-assembly#merge-strategy Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Jul 22, 2016 at 4:23 PM, janardhan shetty

Re: ml ALS.fit(..) issue

2016-07-22 Thread VG
I am using version 2.0.0-preview On Fri, Jul 22, 2016 at 7:47 PM, VG wrote: > I am running into the following error when running ALS > > Exception in thread "main" java.lang.NoSuchMethodError: >

WrappedArray in SparkSQL DF

2016-07-22 Thread KhajaAsmath Mohammed
Hi, I am reading a JSON file and I am facing difficulties trying to get the individual elements of this array. Does anyone know how to get the elements from WrappedArray(WrappedArray(String))? Schema: ++ |rows| ++ |[WrappedArray(Bon...|
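
A common way out, sketched in Scala (the full schema isn't shown in the thread, so the field name is assumed): explode the outer array, then explode the inner one.

    import org.apache.spark.sql.functions.explode

    // Assume df has a column `rows` of type array<array<string>>.
    val outer = df.select(explode(df("rows")).alias("inner"))       // one row per inner array
    val items = outer.select(explode(outer("inner")).alias("item")) // one row per string
    items.show()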

Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread janardhan shetty
Changed to sbt-assembly 0.14.3 and it gave: [info] Packaging /Users/jshetty/sparkApplications/MainTemplate/target/scala-2.11/maintemplate_2.11-1.0.jar ... java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF at java.util.zip.ZipOutputStream.putNextEntry(ZipOutputStream.java:233) Do we

ml ALS.fit(..) issue

2016-07-22 Thread VG
I am running into the following error when running ALS Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror; at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:452) at

Is spark-submit a single point of failure?

2016-07-22 Thread Sivakumaran S
Hello, I have a Spark Streaming process on a cluster ingesting a realtime data stream from Kafka. The aggregated, processed output is written to Cassandra and also used for dashboard display. My question is: if the node running the driver program fails, I am guessing that the entire process

Re: spark and plot data

2016-07-22 Thread Pedro Rodriguez
Zeppelin works great. The other thing that we have done in notebooks (like Zeppelin or Databricks) which support multiple types of Spark session is to register Spark SQL temp tables in our Scala code, then escape-hatch to Python for plotting with seaborn/matplotlib when the built-in plots are

Re: How can we control CPU and Memory per Spark job operation..

2016-07-22 Thread Pedro Rodriguez
Sorry, I wasn't very clear (looks like Pavan's response was dropped from the list for some reason as well). I am assuming that: 1) the first map is CPU bound, 2) the second map is heavily memory bound. To be specific, let's say you are using 4 m3.2xlarge instances, which have 8 CPUs and 30GB of RAM each
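
As a concrete illustration of carving those machines up (values are illustrative, not a recommendation), the relevant knobs are the executor resource settings:

    import org.apache.spark.SparkConf

    // Sketch: 4 x m3.2xlarge (8 cores / 30GB each), leaving headroom
    // for the OS and cluster daemons.
    val conf = new SparkConf()
      .set("spark.executor.instances", "4")
      .set("spark.executor.cores", "7")
      .set("spark.executor.memory", "24g")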

Re: ml models distribution

2016-07-22 Thread Sean Owen
No there isn't anything in particular, beyond the various bits of serialization support that write out something to put in your storage to begin with. What you do with it after reading and before writing is up to your app, on purpose. If you mean you're producing data outside the model that your

Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread VG
Hi All, I am really confused how to proceed further. Please help. I have a dataset created as follows: Dataset b = sqlContext.sql("SELECT bid, name FROM business"); Now I need to map each name to a unique index, and I did the following: JavaPairRDD indexedBId = business.javaRDD()
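
Since both directions are needed (name -> index and index -> name), a Scala sketch along the lines of the replies in this thread (assuming the set of names fits on the driver):

    // `b` is the result of: SELECT bid, name FROM business
    val nameToIndex: Map[String, Long] = b.select("name").rdd
      .map(_.getString(0))
      .zipWithIndex()            // (name, index)
      .collectAsMap().toMap      // only safe for driver-sized data

    val indexToName: Map[Long, String] = nameToIndex.map(_.swap)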

Re: ml models distribution

2016-07-22 Thread Sergio Fernández
Hi Sean, On Fri, Jul 22, 2016 at 12:52 PM, Sean Owen wrote: > > If you mean, how do you distribute a new model in your application, > then there's no magic to it. Just reference the new model in the > functions you're executing in your driver. > > If you implemented some

Re: Create dataframe column from list

2016-07-22 Thread Inam Ur Rehman
Hello guys, I know it's irrelevant to this topic but I've been looking desperately for a solution. I am facing an exception: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-resolve-you-must-build-spark-with-hive-exception-td27390.html Please help me; I couldn't find any solution. On

Re: what contribute to Task Deserialization Time

2016-07-22 Thread Silvio Fiorito
Are you referencing member variables or other objects of your driver in your transformations? Those would have to be serialized and shipped to each executor when that job kicks off. On 7/22/16, 8:54 AM, "Jacek Laskowski" wrote: Hi, I can't specifically answer your question,

Re: what contribute to Task Deserialization Time

2016-07-22 Thread Jacek Laskowski
Hi, I can't specifically answer your question, but my understanding of Task Deserialization Time is that it's time to deserialize a serialized task from the driver before it gets run. See https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L236

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
Thanks Marco - I like the idea of sticking with DataFrames ;) > On Jul 22, 2016, at 7:07 AM, Marco Mistroni wrote: > > Hello Jean > you can take ur current DataFrame and send them to mllib (i was doing that > coz i dindt know the ml package),but the process is littlebit

Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread janardhan shetty
Is the Scala version also the culprit? 2.10 vs 2.11.8. Also, can you give the steps to run the sbt package command, just like maven install, from within IntelliJ to create the jar file in the target directory? On Jul 22, 2016 5:16 AM, "Jacek Laskowski" wrote: > Hi, > > There has never been

Re: Create dataframe column from list

2016-07-22 Thread Ashutosh Kumar
http://stackoverflow.com/questions/36382052/converting-list-to-column-in-spark On Fri, Jul 22, 2016 at 5:15 PM, Divya Gehlot wrote: > Hi, > Can somebody help me by creating the dataframe column from the scala list . > Would really appreciate the help . > > Thanks , >

Re: getting null when calculating time diff with unix_timestamp + spark 1.6

2016-07-22 Thread Jacek Laskowski
Hi, It appears that lag didn't work properly, right? I'm new to it, and remember that in Scala you'd need to define a WindowSpec. I don't see one in your SQL query. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark

Re: Create dataframe column from list

2016-07-22 Thread Jacek Laskowski
Hi, Doh, just rebuilding Spark so...writing off the top of my head. val cols = Seq("hello", "world") val columns = cols.map(Column.col) See http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/

Re: Unresolved dependencies while creating spark application Jar

2016-07-22 Thread Jacek Laskowski
Hi, There has never been 0.13.8 for sbt-assembly AFAIK. Use 0.14.3 and start over. See https://github.com/jaceklaskowski/spark-workshop/tree/master/solutions/spark-external-cluster-manager for a sample Scala/sbt project with Spark 2.0 RC5. Pozdrawiam, Jacek Laskowski

Re: ml models distribution

2016-07-22 Thread Jacek Laskowski
Hehe, Sean. I knew that (and I knew the answer), but meant to ask a co-question to help to find the answer *together* :) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski

Unresolved dependencies while creating spark application Jar

2016-07-22 Thread janardhan shetty
Hi, I was setting up my development environment: local Mac laptop, IntelliJ IDEA 14 CE, Scala, sbt (not Maven). Error: $ sbt package [warn] :: UNRESOLVED DEPENDENCIES :: [warn]

Create dataframe column from list

2016-07-22 Thread Divya Gehlot
Hi, Can somebody help me by creating the dataframe column from the scala list . Would really appreciate the help . Thanks , Divya

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Marco Mistroni
Hello Jean, you can take your current DataFrames and send them to MLlib (I was doing that because I didn't know the ml package), but the process is a little bit cumbersome: 1. go from DataFrame to an RDD of [LabeledPoint]; 2. run your ML model. I'd suggest you stick to DataFrame + the ml package :) hth

Re: ml models distribution

2016-07-22 Thread Sean Owen
Machine Learning If you mean, how do you distribute a new model in your application, then there's no magic to it. Just reference the new model in the functions you're executing in your driver. If you implemented some other manual way of deploying model info, just do that again. There's no

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
Thanks Bryan - I keep forgetting about the examples... This is almost it :) I can work with that :) > On Jul 22, 2016, at 1:39 AM, Bryan Cutler wrote: > > Hi JG, > > If you didn't know this, Spark MLlib has 2 APIs, one of which uses > DataFrames. Take a look at this

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
Hi Jules, thanks, but not really: I know what DataFrames are and I actually use them - especially as the RDD will slowly fade. A lot of the examples I see are focusing on cleaning / prepping the data, which is an important part, but not really on the "after"... Sorry if I am not completely clear. > On

Re: ml models distribution

2016-07-22 Thread Jacek Laskowski
Hi, What's a ML model? (I'm sure once we found out the answer you'd know the answer for your question :)) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri,

ml models distribution

2016-07-22 Thread Sergio Fernández
Hi, I have one question: how is ML model distribution done across all the nodes of a Spark cluster? I'm thinking about scenarios where the pipeline implementation does not necessarily need to change, but the models have been upgraded. Thanks in advance. Best regards, -- Sergio Fernández

Re: spark and plot data

2016-07-22 Thread Marco Colombo
Take a look at Zeppelin: http://zeppelin.apache.org On Thursday, 21 July 2016, Andy Davidson wrote: > Hi Pseudo > > Plotting, graphing, data visualization, report generation are common needs > in scientific and enterprise computing. > > Can you tell me more

Re: How spark decides whether to do BroadcastHashJoin or SortMergeJoin

2016-07-22 Thread Matthias Niehoff
Hi, there is a property you can set. Quoting the docs ( http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options ) spark.sql.autoBroadcastJoinThreshold 10485760 (10 MB) Configures the maximum size in bytes for a table that will be broadcast to all worker nodes
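
So steering Spark toward one join or the other comes down to that threshold and the estimated table size. A sketch of raising it (an illustrative 50MB value):

    // Raise the broadcast threshold to ~50MB; set it to -1 to disable
    // broadcast joins entirely and force SortMergeJoin.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)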

Re: GraphX performance and settings

2016-07-22 Thread B YL
Hi, we are also running a Connected Components test with GraphX. We ran experiments using Spark 1.6.1 on machines that have 16 cores (2-way), running only a single executor per machine. We got this result: for a Facebook-like graph with 2^24 edges, using 4 executors with 90GB each, it took 100

Re: MLlib, Java, and DataFrame

2016-07-22 Thread VG
Interesting, thanks for this information. On Fri, Jul 22, 2016 at 11:26 AM, Bryan Cutler wrote: > ML has a DataFrame based API, while MLlib is RDDs and will be deprecated > as of Spark 2.0. > > On Thu, Jul 21, 2016 at 10:41 PM, VG wrote: > >> Why do we

Re: Programmatic use of UDFs from Java

2016-07-22 Thread Bryan Cutler
Everett, I had the same question today and came across this old thread. Not sure if there has been any more recent work to support this. http://apache-spark-developers-list.1001551.n3.nabble.com/Using-UDFs-in-Java-without-registration-td12497.html On Thu, Jul 21, 2016 at 10:10 AM, Everett