Re: Mllib Logistic Regression performance relative to Mahout

2016-02-28 Thread Yashwanth Kumar
Hi, if your features are numeric, try feature scaling and feed the scaled features to Spark Logistic Regression; it might improve the accuracy rate.
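
A minimal sketch of that suggestion, assuming `training` is an existing RDD[LabeledPoint] (the names and two-class setup are placeholders, not from the thread):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.regression.LabeledPoint

    // Standardize each feature to zero mean and unit variance.
    // Note: withMean = true requires dense feature vectors.
    val scaler = new StandardScaler(withMean = true, withStd = true)
      .fit(training.map(_.features))
    val scaled = training.map(p => LabeledPoint(p.label, scaler.transform(p.features)))

    // Train logistic regression on the scaled features.
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(scaled)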

Re: Spark Integration Patterns

2016-02-28 Thread Yashwanth Kumar
Hi, to connect to Spark from a remote location and submit jobs, you can try Spark Job Server. It's been open sourced now.

Re: SparkML Using Pipeline API locally on driver

2016-02-28 Thread Yanbo Liang
Hi Jean, DataFrame is connected with SQLContext, which is connected with SparkContext, so I think it's impossible to run `model.transform` without touching Spark. I think what you need is for the model to support prediction on a single instance; then you can make predictions without Spark. You can track
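
As a stopgap while the Pipeline API lacks single-instance scoring, the older mllib models can score one vector locally, since their predict(Vector) is a plain method call on the driver. A minimal sketch, assuming `model` is an mllib LogisticRegressionModel (not an ML PipelineModel) trained on three numeric features:

    import org.apache.spark.mllib.linalg.Vectors

    // predict(Vector) runs locally on the driver; no Spark job is launched.
    val features = Vectors.dense(0.5, 1.2, -0.3)
    val prediction: Double = model.predict(features)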

Re: a basic question on first use of PySpark shell and example, which is failing

2016-02-28 Thread Jules Damji
Hello Ronald, Since you have placed the file under HDFS, you might want to change the path name to: val lines = sc.textFile("hdfs:///user/taylor/Spark/Warehouse.java") Sent from my iPhone Pardon the dumb thumb typos :) > On Feb 28, 2016, at 9:36 PM, Taylor, Ronald C
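
One note on the URI (an addition here, not from the thread): with only two slashes, as in "hdfs://user/...", Hadoop parses "user" as the namenode hostname. The triple-slash form above picks up the default filesystem from core-site.xml; alternatively the host can be qualified explicitly:

    // Fully qualified form; the namenode host and port are placeholders.
    val lines = sc.textFile("hdfs://namenode.example.com:8020/user/taylor/Spark/Warehouse.java")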

Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions

2016-02-28 Thread Hossein Vatani
Hi, Affects Version/s: 1.6.0; Component/s: PySpark. I faced the below exception when I tried to run the http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=filter#pyspark.sql.SQLContext.jsonRDD samples: Exception: Python in worker has different version 2.7 than that in driver 3.5,

a basic question on first use of PySpark shell and example, which is failing

2016-02-28 Thread Taylor, Ronald C
Hello folks, I am a newbie, and am running Spark on a small Cloudera CDH 5.5.1 cluster at our lab. I am trying to use the PySpark shell for the first time, and am attempting to duplicate the documentation example of creating an RDD, which I called "lines", from a text file. I placed a

RE: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Mohammed Guller
Hi Ashok, Another book recommendation (I am the author): “Big Data Analytics with Spark”. The first half of the book is specifically written for people just getting started with Big Data and Spark. Mohammed Author: Big Data Analytics with

Re: Saving and Loading Dataframes

2016-02-28 Thread Yanbo Liang
Hi Raj, If you choose JSON as the storage format, Spark SQL will store VectorUDT as an Array of Double, so when you load it back into memory it cannot be recognized as a Vector. One workaround is storing the DataFrame in Parquet format; it will then be loaded and recognized as expected.
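
A minimal sketch of that workaround, assuming `df` holds a VectorUDT column such as "features" (the path is a placeholder):

    // Parquet round-trips the UDT metadata, so the column comes back as a Vector.
    df.write.mode("overwrite").parquet("/tmp/features.parquet")
    val restored = sqlContext.read.parquet("/tmp/features.parquet")
    restored.printSchema()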

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Suhaas Lang
Thanks, Jules! On Feb 28, 2016 7:47 PM, "Jules Damji" wrote: > Suhass, > > When I referred to interactive shells, I was referring to the Scala & > Python interactive language shells. Both Python & Scala come with their > respective interactive shells. By just typing “python” or

Error when trying to overwrite a partition dynamically in Spark SQL

2016-02-28 Thread SRK
Hi, I am getting an error when trying to overwrite a partition dynamically. Following is the code and the error. Any idea as to why this is happening? test.write.partitionBy("dtPtn","idPtn").mode(SaveMode.Overwrite).format("parquet").save("/user/test/sessRecs") 16/02/28 18:02:55 ERROR
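
No fix appears in the preview, but one workaround sometimes used on 1.6 (an assumption here, not taken from this thread) is to overwrite a single partition's directory directly instead of the table root:

    import org.apache.spark.sql.SaveMode
    import sqlContext.implicits._

    // Write one partition's rows straight to its directory; the partition
    // columns are dropped because the path encodes them (values are examples).
    test.filter($"dtPtn" === "20160228" && $"idPtn" === "0")
      .drop("dtPtn").drop("idPtn")
      .write.mode(SaveMode.Overwrite)
      .format("parquet")
      .save("/user/test/sessRecs/dtPtn=20160228/idPtn=0")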

Re: Question about MEMORY_AND_DISK persistence

2016-02-28 Thread Vishnu Viswanath
Thank you Ashwin. On Sun, Feb 28, 2016 at 7:19 PM, Ashwin Giridharan wrote: > Hi Vishnu, > > A partition will either be in memory or on disk. > > -Ashwin > On Feb 28, 2016 15:09, "Vishnu Viswanath" > wrote: > >> Hi All, >> >> I have a

Re: Question about MEMORY_AND_DISK persistence

2016-02-28 Thread Ashwin Giridharan
Hi Vishnu, A partition will either be in memory or on disk. -Ashwin On Feb 28, 2016 15:09, "Vishnu Viswanath" wrote: > Hi All, > > I have a question regarding Persistence (MEMORY_AND_DISK) > > Suppose I am trying to persist an RDD which has 2 partitions and only 1
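
A minimal sketch of the storage level under discussion (the path is a placeholder); as noted above, a partition that does not fit in memory is spilled to disk whole rather than split between the two:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("/some/path").persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()  // materializes the RDD so the storage level takes effect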

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Jules Damji
Suhass, When I referred to interactive shells, I was referring the the Scala & Python interactive language shells. Both Python & Scala come with respective interacive shells. By just typing “python” or “scala” (assume the installation bin directory is in your $PATH), it will put fire up the

Pattern Matching over a Sequence of rows using Spark

2016-02-28 Thread Jerry Lam
Hi Spark users and developers, Does anyone have experience developing pattern matching over a sequence of rows using Spark? I'm talking about functionality similar to matchpath in Hive or match_recognize in Oracle DB. It is used for path analysis on clickstream data. If you know of any libraries that
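
There is no match_recognize equivalent built into Spark 1.6, but a hand-rolled sketch of the idea (the schema, sample data, and target path are all assumptions) could group events per session and scan the ordered pages with a sliding window:

    import org.apache.spark.rdd.RDD

    case class Event(sessionId: String, ts: Long, page: String)
    val events: RDD[Event] = sc.parallelize(Seq(
      Event("s1", 1L, "home"), Event("s1", 2L, "search"), Event("s1", 3L, "checkout"),
      Event("s2", 1L, "home"), Event("s2", 2L, "about")))

    // Sessions whose click path contains home -> search -> checkout in order.
    val matched = events
      .groupBy(_.sessionId)
      .mapValues(ev => ev.toSeq.sortBy(_.ts).map(_.page))
      .filter { case (_, pages) =>
        pages.sliding(3).contains(Seq("home", "search", "checkout")) }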

Re: Is spark.driver.maxResultSize used correctly ?

2016-02-28 Thread Jeff Zhang
Data skew is possible, but it's not the common case. I think we should design for the common case; for the skew case, we could add some fraction parameter to allow users to tune it. On Sat, Feb 27, 2016 at 4:51 PM, Reynold Xin wrote: > But sometimes you might have skew

Re: PySpark : couldn't pickle object of type class T

2016-02-28 Thread Jeff Zhang
Hi Anoop, I don't see the exception you mentioned in the link. I can use spark-avro to read the sample file users.avro in spark successfully. Do you have the details of the union issue ? On Sat, Feb 27, 2016 at 10:05 AM, Anoop Shiralige wrote: > Hi Jeff, > > Thank
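
For reference, a minimal sketch of reading that sample file with spark-avro (the package coordinates and version are an assumption for a Spark 1.x / Scala 2.10 build):

    // Launched with: spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
    import com.databricks.spark.avro._

    val users = sqlContext.read.avro("/tmp/users.avro")
    users.printSchema()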

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Suhaas Lang
Jules, Could you please post links to these interactive shells for Python and Scala? On Feb 28, 2016 5:32 PM, "Jules Damji" wrote: > Hello Ashoka, > > "Learning Spark," from O'Reilly, is certainly a good start, and all basic > video tutorials from Spark Summit Training,

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Chris Fregly
For hands-on, check out the end-to-end reference data pipeline available either from the GitHub or Docker repos described here: http://advancedspark.com/ I use these assets to train folks of all levels of Spark knowledge. Also, some relevant videos and slideshare presentations, but might be

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Mich Talebzadeh
In my opinion the best way to learn something is trying it on the spot. As suggested, if you have Hadoop, Hive and Spark installed and you are OK with SQL, then you will mostly need to focus on Scala and Spark. Your best bet is interactive work through the Spark shell with Scala, understanding

Question about MEMORY_AND_DISK persistence

2016-02-28 Thread Vishnu Viswanath
Hi All, I have a question regarding Persistence (MEMORY_AND_DISK). Suppose I am trying to persist an RDD which has 2 partitions and only 1 partition can fit in memory completely, but some part of partition 2 can also fit: will Spark keep the portion of partition 2 in memory and the rest on disk,

Re: Spark Integration Patterns

2016-02-28 Thread ayan guha
I believe you are looking for something like Spark Jobserver for running jobs, and the JDBC server for accessing data? I am curious to know more about it; any further discussion will be very helpful. On Mon, Feb 29, 2016 at 6:06 AM, Luciano Resende wrote: > One option we have

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Jules Damji
Hello Ashoka, "Learning Spark," from O'Reilly, is certainly a good start, and all basic video tutorials from Spark Summit Training, "Spark Essentials", are excellent supplementary materials. And the best (and most effective) way to teach yourself is really firing up the spark-shell or pyspark

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Nicos
I agree with suggestion to start with "Learning Spark" to further forge your knowledge of Spark fundamentals. "Advanced Analytics with Spark" has good practical reinforcement of what you learn from the previous book. Though it is a bit advanced, in my opinion some practical/real applications

Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-28 Thread Shixiong(Ryan) Zhu
This is because the Snappy library cannot load the native library. Did you forget to install the snappy native library in your new machines? On Fri, Feb 26, 2016 at 11:05 PM, Abhishek Anand wrote: > Any insights on this ? > > On Fri, Feb 26, 2016 at 1:21 PM, Abhishek

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-28 Thread Shixiong(Ryan) Zhu
Sorry, I forgot to tell you that you should call `rdd.count()` for "reduceByKey" as well. Could you try it and see if it works? On Sat, Feb 27, 2016 at 1:17 PM, Abhishek Anand wrote: > Hi Ryan, > > I am using mapWithState after doing reduceByKey. > > I am right

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Ted Yu
http://www.amazon.com/Scala-Spark-Alexy-Khrabrov/dp/1491929286/ref=sr_1_1?ie=UTF8&qid=1456696284&sr=8-1&keywords=spark+dataframe There is another one from Wiley (to be published on March 21): "Spark: Big Data Cluster Computing in Production," written by Ilya Ganelin, Brennon York, Kai Sasaki, and Ema Orhian On

Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Ashok Kumar
Hi Gurus, I would appreciate it if you could recommend a good book on Spark, or documentation, for beginner to moderate knowledge. I would very much like to skill myself on transformation and action methods. FYI, I have already looked at examples on the net; however, some of them are not clear, at least to me. Warmest

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-28 Thread Koert Kuipers
I find it particularly confusing that a new memory management module would change the locations. It's not like the hash partitioner got replaced. I can switch back and forth between legacy and "new" memory management and see the distribution change... fully reproducible On Sun, Feb 28, 2016 at

Re: Spark Integration Patterns

2016-02-28 Thread Luciano Resende
One option we have used in the past is to expose Spark application functionality via REST; this would enable Python or any other client capable of making an HTTP request to integrate with your Spark application. To get you started, this might be a useful reference
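
A minimal sketch of that pattern using only the JDK's built-in HTTP server (the endpoint, port, and input path are all placeholders): the handler runs inside the driver JVM, so it can use the live SparkContext.

    import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
    import java.net.InetSocketAddress

    // `sc` is the application's SparkContext, e.g. from the spark-shell.
    val server = HttpServer.create(new InetSocketAddress(8090), 0)
    server.createContext("/count", new HttpHandler {
      def handle(ex: HttpExchange): Unit = {
        val n = sc.textFile("/data/input.txt").count()  // hypothetical job
        val body = n.toString.getBytes("UTF-8")
        ex.sendResponseHeaders(200, body.length)
        ex.getResponseBody.write(body)
        ex.getResponseBody.close()
      }
    })
    server.start()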

Re: Spark Integration Patterns

2016-02-28 Thread Todd Nist
I'm not sure about Python; I'm not an expert in that area. Based on the PR https://github.com/apache/spark/pull/8318, I believe you are correct that Spark would need to be installed for you to be able to currently leverage the pyspark package. On Sun, Feb 28, 2016 at 1:38 PM, moshir mikael

Re: output the datas(txt)

2016-02-28 Thread Chandeep Singh
Here is what you can do. // Recreating your RDD val a = Array(Array(1, 2, 3), Array(2, 3, 4), Array(3, 4, 5), Array(6, 7, 8)) val b = sc.parallelize(a) val c = b.map(x => (x(0) + " " + x(1) + " " + x(2))) // Collect to inspect: c.collect() —> res3: Array[String] = Array(1 2 3, 2 3 4, 3 4 5, 6 7 8)
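
Since the original question was about writing the data out as text, the natural last step (the path is a placeholder) would be:

    // Each element becomes one line in the output part files.
    c.saveAsTextFile("/tmp/output_txt")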

Re: Spark Integration Patterns

2016-02-28 Thread Todd Nist
Define your SparkConf to set the master: val conf = new SparkConf().setAppName(AppName) .setMaster(SparkMaster) .set() Where SparkMaster = "spark://SparkServerHost:7077". So if your Spark server hostname is "RADTech" then it would be "spark://RADTech:7077". Then when you create
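
Put together, a minimal sketch (the app name and host are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("RemoteApp")
      .setMaster("spark://RADTech:7077")
    val sc = new SparkContext(conf)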

Re: output the datas(txt)

2016-02-28 Thread Don Drake
If you use the spark-csv package: $ spark-shell --packages com.databricks:spark-csv_2.11:1.3.0 scala> val df = sc.parallelize(Array(Array(1,2,3),Array(2,3,4),Array(3,4,6))).map(x => (x(0), x(1), x(2))).toDF() df: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3: int] scala>
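
The preview cuts off before the write step; a hedged sketch of how it would continue with the same package (the path is a placeholder):

    // Write the DataFrame out as CSV with a header row.
    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/tmp/output_csv")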

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-28 Thread Lior Chaga
Hi, I've experienced a similar problem upgrading from Spark 1.4 to Spark 1.6. The data is not evenly distributed across executors, but in my case it also reproduced with legacy mode. Also tried 1.6.1 RC1, with the same results. Still looking for a resolution. Lior On Fri, Feb 19, 2016 at 2:01 AM,

Spark Integration Patterns

2016-02-28 Thread mms
Hi, I cannot find a simple example showing how a typical application can 'connect' to a remote Spark cluster and interact with it. Let's say I have a Python web application hosted somewhere *outside* a Spark cluster, with just Python installed on it. How can I talk to Spark without using a notebook,

Java/Spark Library for interacting with Spark API

2016-02-28 Thread hbogert
Hi, Does anyone know of a Java/Scala library (not simply a HTTP library) for interacting with Spark through its REST/HTTP API? My “problem” is that interacting through REST induces a lot of work mapping the JSON to sensible Spark/Scala objects. So a simple example, I hope there is a library