How to handle categorical variables in Spark MLlib?

2015-12-22 Thread Hokam Singh Chauhan
Hi, We have a use case in which we need to handle categorical variables in the SVM, regression and logistic regression models (MLlib, not ML) for scoring. We are given the possible category values for each categorical variable. So how can the string value of a categorical variable be converted
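A minimal sketch of one common approach for the RDD-based MLlib models, assuming the possible category values are known up front (the variable and values below are made up): index each category value and one-hot encode it into the feature vector before building the LabeledPoints.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Hypothetical categorical variable with known values
    val colors = Array("red", "green", "blue")
    val colorIndex = colors.zipWithIndex.toMap        // "red" -> 0, "green" -> 1, ...

    // Turn (label, numeric feature, category string) into a LabeledPoint
    def toLabeledPoint(label: Double, x: Double, color: String): LabeledPoint = {
      val oneHot = Array.fill(colors.length)(0.0)
      oneHot(colorIndex(color)) = 1.0                 // one-hot encode the category
      LabeledPoint(label, Vectors.dense(Array(x) ++ oneHot))
    }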

Re: Extract SSerr SStot from Linear Regression using ml package

2015-12-22 Thread Arunkumar Pillai
Hi Yanbo, This is just to find the F statistic and adjusted R-squared. On Tue, Dec 22, 2015 at 4:25 PM, Yanbo Liang wrote: > Hi Arunkumar, > > Could you tell me the reason you need SSerr, SStot and SSreg? > Traditionally we use explainedVariance, meanAbsoluteError, meanSquaredError, >
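For reference, a sketch of how the F statistic and adjusted R-squared can be derived from the metrics Yanbo mentions; n, p, mse and r2 below are placeholders for the observation count, predictor count, meanSquaredError and r2 of the fitted model.

    val n = 1000.0     // number of observations (placeholder)
    val p = 5.0        // number of predictors (placeholder)
    val mse = 2.5      // meanSquaredError reported by the model (placeholder)
    val r2 = 0.8       // r2 reported by the model (placeholder)

    val ssErr = mse * n                                  // SSerr = MSE * n
    val ssTot = ssErr / (1.0 - r2)                       // since R^2 = 1 - SSerr / SStot
    val ssReg = ssTot - ssErr
    val adjustedR2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    val fStat = (ssReg / p) / (ssErr / (n - p - 1))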

Re: Classification model method not found

2015-12-22 Thread Ted Yu
Looks like you should define a constructor for ExtendedLR which accepts a String (the uid). Cheers On Tue, Dec 22, 2015 at 1:04 PM, njoshi wrote: > Hi, > > I have a custom extended LogisticRegression model which I want to test > against a parameter grid search. I am running as
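A sketch of the constructor pattern Ted describes, assuming ExtendedLR extends the ml LogisticRegression; the String-taking constructor is what the copy()/grid-search machinery instantiates by reflection.

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.util.Identifiable

    class ExtendedLR(override val uid: String) extends LogisticRegression(uid) {
      // No-arg constructor for convenience; the String ctor is used by copy() during grid search
      def this() = this(Identifiable.randomUID("extLR"))
      // custom behaviour goes here
    }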

How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-22 Thread raja kbv
Hi, I am new to Spark. I have a text file with the structure below. (employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName, Description, Duriation, Role}]}) Eg: (123456, Employee1, {“ProjectDetails”:[ {“ProjectName”: “Web
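One possible sketch (the path and the handling of the non-JSON prefix are assumptions): strip each line down to its JSON object, let sqlContext.read.json infer the nested schema, then explode the ProjectDetails array so each project becomes a row.

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.explode

    val sqlContext = new SQLContext(sc)              // sc: existing SparkContext

    // Keep only the JSON object from each line (everything from the first '{')
    val jsonStrings = sc.textFile("/path/to/employees.txt")
      .map(line => line.substring(line.indexOf('{')).stripSuffix(")"))

    val df = sqlContext.read.json(jsonStrings)       // nested schema inferred from the JSON
    val flat = df.select(explode(df("ProjectDetails")).as("project"))
      .select("project.ProjectName", "project.Description", "project.Role")
    flat.show()

The employeeID and Name could be carried along by rebuilding each line as one complete JSON document before parsing.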

building a distributed k-d tree with spark

2015-12-22 Thread Russ
I have developed algorithms to build and search a distributed k-d tree using Scala/Spark.  Erik Erlandson of Red Hat has encouraged me to share the URL for the pre-print of my manuscript and source code: http://arxiv.org/abs/1512.06389 I will be revising this manuscript and refining the

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Jan Holmberg
Works like a charm. Thanks a lot! -jan On 22 Dec 2015, at 23:08, Michael Armbrust wrote: You need to say .mode("append") if you want to append to existing data. On Tue, Dec 22, 2015 at 6:48 AM, Yash Sharma

Re: spark-submit is ignoring "--executor-cores"

2015-12-22 Thread Siva
Thanks a lot Saisai and Zhan, I see DefaultResourceCalculator currently being used for Capacity scheduler. We will change it to DominantResourceCalculator. Thanks, Sivakumar Bhavanari. On Mon, Dec 21, 2015 at 5:56 PM, Zhan Zhang wrote: > BTW: It is not only a Yarn-webui

Classification model method not found

2015-12-22 Thread njoshi
Hi, I have a custom extended LogisticRegression model which I want to test against a parameter grid search. I am running as follows: / val exLR = new ExtendedLR() .setMaxIter(100) .setFitIntercept(true) /* * Cross Validator parameter grid */ val paramGrid = new

Missing dependencies when submitting scala app

2015-12-22 Thread Daniel Valdivia
Hi, I'm trying to figure out how to bundle dependencies with a Scala application. So far my code was tested successfully on the spark-shell; however, now that I'm trying to run it as a standalone application, which I'm compiling with sbt, it is yielding the error: java.lang.NoSuchMethodError:

Re: Spark data frame

2015-12-22 Thread Dean Wampler
More specifically, you could have TBs of data across thousands of partitions for a single RDD. If you call collect(), BOOM! Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler

Do existing R packages work with SparkR data frames

2015-12-22 Thread Lan
Hello, Is it possible for existing R Machine Learning packages (which work with R data frames) such as bnlearn, to work with SparkR data frames? Or do I need to convert SparkR data frames to R data frames? Is "collect" the function to do the conversion, or how else to do that? Many Thanks, Lan

Re: should I file a bug? Re: trouble implementing Transformer and calling DataFrame.withColumn()

2015-12-22 Thread Jeff Zhang
>>> DataFrame transformerdDF = df.withColumn(fieldName, newCol); >>> org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS transformedByUDF#3]; I don't think it's a spark bug, it is your application bug. I
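A minimal sketch of the pattern Jeff is pointing at: the column handed to withColumn has to be derived from the same DataFrame, for example by applying the UDF directly to one of its existing columns (column names below are taken from the error message, the UDF body is a placeholder).

    import org.apache.spark.sql.functions.udf

    val myUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
    val transformedDF = df.withColumn("transformedByUDF", myUdf(df("labelStr")))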

Which Hive version should be used with Spark 1.5.2?

2015-12-22 Thread Arthur Chan
Hi, I plan to upgrade from 1.4.1 (+ Hive 1.1.0) to 1.5.2. Is there any upgrade document available, especially regarding which Hive version should be upgraded to? Regards

Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread b2k70
I see in the Spark SQL documentation that a temporary table can be created directly onto a remote PostgreSQL table. CREATE TEMPORARY TABLE USING org.apache.spark.sql.jdbc OPTIONS ( url "jdbc:postgresql:///", dbtable "impressions" ); When I run this against our PostgreSQL server, I get the
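The same read can also be expressed through the DataFrame API, which fails with a similar error if the PostgreSQL JDBC jar is not on the classpath; host, database and table names below are placeholders, and sqlContext is assumed to exist.

    val impressions = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:postgresql://dbhost:5432/addb",
      "dbtable" -> "impressions",
      "driver"  -> "org.postgresql.Driver"
    )).load()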

Can't read data correctly through beeline when data is save by HiveContext

2015-12-22 Thread licl
Hi, Here is my Java code: SparkConf sparkConf = Constance.getSparkConf(); JavaSparkContext sc = new JavaSparkContext(sparkConf); SQLContext sql = new SQLContext(sc); HiveContext sqlContext = new HiveContext(sc.sc()); List fields = new

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Stephen Boesch
The postgres jdbc driver needs to be added to the classpath of your spark workers. You can do a search for how to do that (multiple ways). 2015-12-22 17:22 GMT-08:00 b2k70 : > I see in the Spark SQL documentation that a temporary table can be created > directly onto a

Re: Can SqlContext be used inside mapPartitions

2015-12-22 Thread Zhan Zhang
SQLContext is in driver side, and I don’t think you can use it in executors. How to provide lookup functionality in executors really depends on how you would use them. Thanks. Zhan Zhang On Dec 22, 2015, at 4:44 PM, SRK wrote: > Hi, > > Can SQL Context be used
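A common alternative, sketched below with placeholder paths, column positions and input data: collect the lookup data on the driver (if it is small enough), broadcast it, and use the broadcast value inside mapPartitions instead of a SQLContext.

    // Build a small lookup map from the HDFS data and broadcast it to the executors
    val lookupMap = sqlContext.read.parquet("/data/lookup")
      .map(r => r.getString(0) -> r.getString(1))
      .collectAsMap()
    val lookupBc = sc.broadcast(lookupMap)

    val someRdd = sc.parallelize(Seq(("k1", 1), ("k2", 2)))   // placeholder input
    val enriched = someRdd.mapPartitions { iter =>
      val lookup = lookupBc.value                             // local, read-only copy on the executor
      iter.map { case (key, value) => (key, value, lookup.getOrElse(key, "unknown")) }
    }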

Re: Can SqlContext be used inside mapPartitions

2015-12-22 Thread Ted Yu
bq. be able to lookup from inside MapPartitions based on a key Please describe your use case in bit more detail. One possibility is to use NoSQL database such as HBase. There're several choices for Spark HBase connector. Cheers On Tue, Dec 22, 2015 at 4:51 PM, Zhan Zhang

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Benjamin Kim
Hi Stephen, I forgot to mention that I added these lines below to spark-defaults.conf on the node with the Spark SQL Thrift JDBC/ODBC Server running on it. Then, I restarted it. spark.driver.extraClassPath=/usr/share/java/postgresql-9.3-1104.jdbc41.jar

Re: Can't read data correctly through beeline when data is save by HiveContext

2015-12-22 Thread licl
I know the reason now. I changed the metastore with Java code, but the Thrift server caches the metastore in memory; it needs to be refreshed from MySQL. But how? -- View this message in context:

Re: Stand Alone Cluster - Strange issue

2015-12-22 Thread Madabhattula Rajesh Kumar
Hi Ted, Thank you. Yes. This issue is related to https://issues.apache.org/jira/browse/SPARK-4170 Regards, Rajesh On Wed, Dec 23, 2015 at 12:09 AM, Ted Yu wrote: > This should be related: > https://issues.apache.org/jira/browse/SPARK-4170 > > On Tue, Dec 22, 2015 at 9:34
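For reference, SPARK-4170 concerns code placed directly in the body of an object that extends App; a sketch of the usual workaround, assuming that is how the snippet was packaged, is to move the code into an explicit main method.

    import org.apache.spark.{SparkConf, SparkContext}

    object AccumulatorExample {          // instead of: object AccumulatorExample extends App
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("AccumulatorExample"))
        val accum = sc.accumulator(0)
        sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
        println(accum.value)
        sc.stop()
      }
    }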

Re: Which Hive version should be used with Spark 1.5.2?

2015-12-22 Thread Ted Yu
Please see SPARK-8064 On Tue, Dec 22, 2015 at 6:17 PM, Arthur Chan wrote: > Hi, > > I plan to upgrade from 1.4.1 (+ Hive 1.1.0) to 1.5.2, is there any > upgrade document available about the upgrade especially which Hive version > should be upgraded too? > > Regards >

Re: Classification model method not found

2015-12-22 Thread Nikhil Joshi
Hi Ted, Thanks. That fixed the issue :). Nikhil On Tue, Dec 22, 2015 at 1:14 PM, Ted Yu wrote: > Looks like you should define ctor for ExtendedLR which accepts String > (the uid). > > Cheers > > On Tue, Dec 22, 2015 at 1:04 PM, njoshi wrote: >

Can SqlContext be used inside mapPartitions

2015-12-22 Thread SRK
Hi, Can SQL Context be used inside mapPartitions? My requirement is to register a set of data from hdfs as a temp table and to be able to lookup from inside MapPartitions based on a key. If it is not supported, is there a different way of doing this? Thanks! -- View this message in context:

Re: Missing dependencies when submitting scala app

2015-12-22 Thread Jeff Zhang
It might be a jar conflict issue. Spark has a dependency on org.json4s.jackson; do you also specify org.json4s.jackson in your sbt dependencies but with a different version? On Wed, Dec 23, 2015 at 6:15 AM, Daniel Valdivia wrote: > Hi, > > I'm trying to figure out how to bundle
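A hypothetical build.sbt fragment illustrating Jeff's point: pin json4s to the version the Spark 1.5.x line itself uses (believed to be the 3.2.x series) so the assembly does not shade in an incompatible copy.

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"     % "1.5.2" % "provided",
      "org.json4s"       %% "json4s-jackson" % "3.2.10"   // align with Spark's own json4s version
    )
    dependencyOverrides += "org.json4s" %% "json4s-jackson" % "3.2.10"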

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Stephen Boesch
Hi Benjamin, yes, by adding it to the Thrift server the CREATE TABLE will work. But querying is performed by the workers, so you need to add it to the classpath of all nodes for reads to work. 2015-12-22 18:35 GMT-08:00 Benjamin Kim : > Hi Stephen, > > I forgot to mention

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Benjamin Kim
Stephen, Let me confirm. I just need to propagate these settings I put in spark-defaults.conf to all the worker nodes? Do I need to do the same with the PostgreSQL driver jar file too? If so, is there a way to have it read from HDFS rather than copying out to the cluster manually. Thanks for

Re: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list?

2015-12-22 Thread our...@cnsuning.com
So sorry, it should be Seq, not SQL. Thanks for your help. Ricky Ou(欧 锐) From: Dean Wampler Date: 2015-12-23 00:46 To: our...@cnsuning.com CC: user; t...@databricks.com Subject: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list? There are

Re: Spark data frame

2015-12-22 Thread Dean Wampler
You can call the collect() method to return a collection, but be careful. If your data is too big to fit in the driver's memory, it will crash. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe
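A few safer alternatives, as a sketch (the output path is a placeholder), for when the full result may not fit in the driver:

    val preview = df.take(20)                // bring back only the first n rows
    val localIt = df.rdd.toLocalIterator     // stream roughly one partition at a time
    df.write.parquet("/tmp/results")         // or keep the data distributed and write it out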

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Michael Armbrust
You need to say .mode("append") if you want to append to existing data. On Tue, Dec 22, 2015 at 6:48 AM, Yash Sharma wrote: > Well you are right. Having a quick glance at the source[1] I see that the > path creation does not consider the partitions. > > It tries to create
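Applied to the partitioned Avro example from this thread (df and the base path as in Jan's post), the write for each batch would look roughly like this:

    import com.databricks.spark.avro._

    // SaveMode "append" lets later batches add new partition directories under the
    // same base path instead of failing with "path already exists"
    df.write.mode("append").partitionBy("year").avro("/tmp/data")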

Re: Spark data frame

2015-12-22 Thread Silvio Fiorito
Michael, collect will bring down the results to the driver JVM, whereas the RDD or DataFrame would be cached on the executors (if it is cached). So, as Dean said, the driver JVM needs to have enough memory to store the results of collect. Thanks, Silvio From: Michael Segel

Do existing R packages work with SparkR data frames

2015-12-22 Thread Duy Lan Nguyen
Hello, Is it possible for existing R Machine Learning packages (which work with R data frames) such as bnlearn, to work with SparkR data frames? Or do I need to convert SparkR data frames to R data frames? Is "collect" the function to do the conversion, or how else to do that? Many Thanks, Lan

Re: Stand Alone Cluster - Strange issue

2015-12-22 Thread Ted Yu
This should be related: https://issues.apache.org/jira/browse/SPARK-4170 On Tue, Dec 22, 2015 at 9:34 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I have a standalone cluster. One Master + One Slave. I'm getting below > "NULL POINTER" exception. > > Could you please help

Spark data frame

2015-12-22 Thread Gaurav Agarwal
We are able to retrieve a data frame by filtering the RDD object. I need to convert that data frame into a Java POJO. Any idea how to do that?

Re: val listRDD =ssc.socketTextStream(localhost,9999) on Yarn

2015-12-22 Thread Shixiong Zhu
Just replace `localhost` with a host name that can be accessed by Yarn containers. Best Regards, Shixiong Zhu 2015-12-22 0:11 GMT-08:00 prasadreddy : > How do we achieve this on yarn-cluster mode > > Please advice. > > Thanks > Prasad > > > > -- > View this message in
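As a sketch (hostname and port are placeholders, ssc is the existing StreamingContext), the receiver should point at an externally reachable host rather than localhost:

    // "localhost" would refer to whichever YARN container happens to run the receiver
    val listRDD = ssc.socketTextStream("edge-node.example.com", 9999)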

How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-22 Thread raja kbv
Hi, I am new to Spark. I have a text file with the structure below. (employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName, Description, Duriation, Role}]}) Eg: (123456, Employee1, {“ProjectDetails”:[ {“ProjectName”: “Web

Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-22 Thread superbee84
Hi All, I'm new to Spark. Before I describe the problem, I'd like to let you know the role of the machines that make up the cluster and the purpose of my work. By reading and following the instructions and tutorials, I successfully built up a cluster with 7 CentOS-6.5 machines. I installed

running lda in spark throws exception

2015-12-22 Thread Li Li
I ran my LDA example on a YARN 2.6.2 cluster with Spark 1.5.2. It throws an exception at the line: Matrix topics = ldaModel.topicsMatrix(); But in the YARN job history UI, it's successful. What's wrong with it? I submit the job with ./bin/spark-submit --class Myclass \ --master yarn-client \

running spark application encouter an error (maven relative)

2015-12-22 Thread zml张明磊
Hi, I am trying to figure out how Maven works. When I add a dependency to my existing pom.xml and rebuild my Spark application project, I get BUILD SUCCESS from the console. However, when I run the Spark application, the spark-shell is not happy and directly gives me a message

Re: Numpy and dynamic loading

2015-12-22 Thread Akhil Das
I guess you will have to install numpy on all the machines for this to work. Try reinstalling on all the machines: sudo apt-get purge python-numpy sudo pip uninstall numpy sudo pip install numpy Thanks Best Regards On Sun, Dec 20, 2015 at 11:19 PM, Abhinav M Kulkarni <

Re: Regarding spark in memory

2015-12-22 Thread Ted Yu
If I understand your question correctly, the answer is yes. You can retrieve rows of the rdd which are distributed across the nodes. > On Dec 22, 2015, at 7:34 PM, Gaurav Agarwal wrote: > > If I have 3 more cluster and spark is running there .if I load the records >

Re: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list?

2015-12-22 Thread our...@cnsuning.com
as the following code modified form StateflNetwork in exampile package if (args.length < 2) { System.err.println("Usage: StatefulNetworkWordBiggest3Vaules ") System.exit(1) } /** * state is min(max(3)) */ val updateFunc = (key:String,values: Seq[Seq[Int]], state: Seq[Int]) => {

Re: I coded an example to use Twitter stream as a data source for Spark

2015-12-22 Thread Akhil Das
Why not create a custom dstream and generate the data from there itself instead of spark connecting to a socket server which will be fed by another twitter client? Thanks Best Regards On Sat, Dec 19, 2015 at 5:47 PM, Amir
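A skeleton of what Akhil is suggesting, assuming a hypothetical Twitter client; the receiver pushes data straight into Spark Streaming, so no intermediate socket server is needed.

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class TwitterFeedReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      def onStart(): Unit = {
        new Thread("twitter-feed-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              store("tweet text ...")   // placeholder for e.g. client.nextStatus().getText
              Thread.sleep(100)
            }
          }
        }.start()
      }
      def onStop(): Unit = {}           // the polling loop exits once isStopped() is true
    }

    // Usage: val tweets = ssc.receiverStream(new TwitterFeedReceiver)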

Re: Numpy and dynamic loading

2015-12-22 Thread Akhil Das
I guess you will have to install numpy on all the machines for this to work. Try reinstalling on all the machines: sudo apt-get purge python-numpy sudo pip uninstall numpy sudo pip install numpy Thanks Best Regards On Sun, Dec 20, 2015 at 8:33 AM, Abhinav M Kulkarni <

Re: Memory allocation for Broadcast values

2015-12-22 Thread Akhil Das
If you are creating a huge map on the driver, then spark.driver.memory should be set to a higher value to hold your map. Since you are going to broadcast this map, your spark executors must have enough memory to hold this map as well which can be set using the spark.executor.memory, and
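A small sketch of the pattern; the map contents and input RDD below are placeholders, and the memory settings themselves are passed to spark-submit, e.g. --driver-memory 8g --executor-memory 8g.

    // The map lives on the driver first (spark.driver.memory must hold it) and a read-only
    // copy is shipped to each executor (spark.executor.memory must hold it too)
    val bigMap: Map[String, String] = Map("k1" -> "v1")        // placeholder for the real data
    val bigMapBc = sc.broadcast(bigMap)
    val resolved = sc.parallelize(Seq("k1", "k2"))
      .map(k => k -> bigMapBc.value.getOrElse(k, "missing"))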

error while defining custom schema in Spark 1.5.0

2015-12-22 Thread Divya Gehlot
Hi, I am new to Apache Spark, using the CDH 5.5 QuickStart VM with Spark 1.5.0. I am working on a custom schema and getting an error import org.apache.spark.sql.hive.HiveContext >> >> scala> import org.apache.spark.sql.hive.orc._ >> import org.apache.spark.sql.hive.orc._ >> >> scala> import
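For comparison, a minimal custom-schema sketch; the field names/types and the spark-csv source are assumptions, since the original data source is not shown, and sqlContext is the one provided by the shell.

    import org.apache.spark.sql.types._

    val customSchema = StructType(Seq(
      StructField("id",     IntegerType, nullable = false),
      StructField("name",   StringType,  nullable = true),
      StructField("amount", DoubleType,  nullable = true)
    ))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(customSchema)
      .load("/path/to/file.csv")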

UnsupportedOperationException Schema for type String => Int is not supported

2015-12-22 Thread zml张明磊
Hi, Spark version: 1.4.1. Running the code I get the following error; how can I fix the code so it runs correctly? I don't know why the schema doesn't support this type. If I use callUDF instead of udf, everything is good. Thanks, Minglei. val index:(String => (String => Int)) =
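Catalyst has no schema for a function type such as String => Int, so a UDF cannot return a function. A sketch of the usual workaround is to take both inputs in a single UDF; the body and column names below are placeholders for the original curried logic, and df is assumed to exist.

    import org.apache.spark.sql.functions.udf

    val index = udf((category: String, value: String) => if (value == category) 1 else 0)
    val result = df.withColumn("indexed", index(df("categoryCol"), df("valueCol")))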

Apache spark certification pass percentage ?

2015-12-22 Thread kali.tumm...@gmail.com
Hi All, Does anyone know the pass percentage for the Apache Spark certification exam? Thanks Sri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-spark-certification-pass-percentage-tp25761.html Sent from the Apache Spark User List mailing list archive

driver OOM due to io.netty.buffer items not getting finalized

2015-12-22 Thread Antony Mayi
I have a streaming app (pyspark 1.5.2 on yarn) that's crashing due to a driver (the JVM part, not Python) OOM (no matter how big a heap is assigned, it eventually runs out). When checking the heap, it is all taken by "byte" items of io.netty.buffer.PoolThreadCache. The number of

Re: driver OOM due to io.netty.buffer items not getting finalized

2015-12-22 Thread Ted Yu
This might be related but the jmap output there looks different: http://stackoverflow.com/questions/32537965/huge-number-of-io-netty-buffer-poolthreadcachememoryregioncacheentry-instances On Tue, Dec 22, 2015 at 2:59 AM, Antony Mayi wrote: > I have streaming app

Re: Fat jar can't find jdbc

2015-12-22 Thread David Yerrington
Igor, I think it's available. After I extract the jar file, I see a directory with class files that look very relevant in "/com/mysql/jdbc". After reading this, I started to wonder if MySQL connector was really the problem. Perhaps it's something to do with SQLcontext? I just wired a test

Re: Extract SSerr SStot from Linear Regression using ml package

2015-12-22 Thread Yanbo Liang
Hi Arunkumar, Could you tell me the reason you need SSerr, SStot and SSreg? Traditionally we use explainedVariance, meanAbsoluteError, meanSquaredError, rootMeanSquaredError and r2 as metrics for LinearRegression, although you can actually get SSerr, SStot and SSreg from a composition of the above

Re: Fat jar can't find jdbc

2015-12-22 Thread Vijay Kiran
Can you paste your libraryDependencies from build.sbt ? ./Vijay > On 22 Dec 2015, at 06:12, David Yerrington wrote: > > Hi Everyone, > > I'm building a prototype that fundamentally grabs data from a MySQL instance, > crunches some numbers, and then moves it on down the

Re: Fat jar can't find jdbc

2015-12-22 Thread David Yerrington
Sure here it is: import AssemblyKeys._ assemblySettings // assemblyJarName in assembly := "recommender.jar" test in assembly := {} organization := "com.example" version := "0.1" scalaVersion := "2.11.6" scalacOptions := Seq("-unchecked", "-deprecation", "-encoding", "utf8")

Writing partitioned Avro data to HDFS

2015-12-22 Thread Jan Holmberg
Hi, I'm stuck with writing partitioned data to HDFS. The example below ends up with an 'already exists' error. I'm wondering how to handle the streaming use case. What is the intended way to write streaming data to HDFS? What am I missing? cheers, -jan import com.databricks.spark.avro._ import

val listRDD =ssc.socketTextStream(localhost,9999) on Yarn

2015-12-22 Thread prasadreddy
How do we achieve this on yarn-cluster mode Please advice. Thanks Prasad -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/val-listRDD-ssc-socketTextStream-localhost--on-Yarn-tp25760.html Sent from the Apache Spark User List mailing list archive at

Re: Fat jar can't find jdbc

2015-12-22 Thread Igor Berman
David, can you verify that the mysql connector classes are indeed in your single jar? Open it with the zip tool available on your platform. Another option that might be a problem - if there is some dependency in the MANIFEST (not sure, though, whether this is the case for the mysql connector) then it might be broken after

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Yash Sharma
Hi Jan, Is the error because a past run of the job has already written to the location? In that case you can add more granularity with 'time' along with year and month. That should give you a distinct path for every run. Let us know if it helps or if i missed anything. Goodluck - Thanks, via

Client session timed out, have not heard from server in

2015-12-22 Thread yaoxiaohua
Hi, I encountered a similar issue with Spark 1.4: Master 2 runs for some days, then gives a timeout exception and shuts down. I found a bug: https://issues.apache.org/jira/browse/SPARK-9629 INFO ClientCnxn: Client session timed out, have not heard from server in 40015ms for sessionid

Re: Apache spark certification pass percentage ?

2015-12-22 Thread Yash Sharma
Hi Sri, That would depend on the organization with which you are taking the certification. This place is more helpful for questions and information about using Spark and/or contributing to Spark. Goodluck - Thanks, via mobile, excuse brevity. On Dec 22, 2015 3:56

Re: Client session timed out, have not heard from server in

2015-12-22 Thread Yash Sharma
Hi Evan, SPARK-9629 referred to connection issues with ZooKeeper. Could you check if it's working fine in your setup? Also please share other error logs you might be getting. - Thanks, via mobile, excuse brevity. On Dec 22, 2015 5:00 PM, "yaoxiaohua" wrote: > Hi, > >

Re: Client session timed out, have not heard from server in

2015-12-22 Thread Dirceu Semighini Filho
Hi Yash, I've experienced this behavior here when the process freezes in a worker. This mainly happens, in my case, when the worker memory is full and the Java GC isn't able to free memory for the process. Try to search for OutOfMemory errors in your worker logs. Regards, Dirceu 2015-12-22 10:26

Re: Client session timed out, have not heard from server in

2015-12-22 Thread Yash Sharma
Evan could you also share more logs on the error. Probably paste here or in pastebin. Also check zookeeper logs in case you find anything. - Thanks, via mobile, excuse brevity. On Dec 22, 2015 6:01 PM, "Dirceu Semighini Filho" < dirceu.semigh...@gmail.com> wrote: > Hi Yash, > I've experienced

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Jan Holmberg
Hi Yash, the error is caused by the fact that the first run creates the base directory, i.e. "/tmp/data", and the second batch stumbles on the existing base directory. I understand that the existing base directory is a challenge, but I do not understand how to make this work with the streaming example

RE: Client session timed out, have not heard from server in

2015-12-22 Thread yaoxiaohua
Thanks for your reply. I find spark-env.sh : SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.akka.askTimeout=300 -Dspark.ui.retainedStages=1000 -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs://sparkcluster/user/spark_history_logs -Dspark.shuffle.spill=false -Dspark.shuffle.manager=hash

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-22 Thread Priya Ch
Jakob, Increased the settings like fs.file-max in /etc/sysctl.conf and also increased user limit in /etc/security/limits.conf. But still see the same issue. On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky wrote: > It might be a good idea to see how many files are open

Re: Fat jar can't find jdbc

2015-12-22 Thread Igor Berman
imho, if you succeeded in fetching something from your MySQL with the same jar on the classpath, then the Manifest is OK and you should indeed look at your Spark SQL JDBC configs On 22 December 2015 at 12:21, David Yerrington wrote: > Igor, I think it's available. After I extract the

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Yash Sharma
Well, this will indeed hit the error if the next run has the same year and month, and writing would not be possible. You can try working around it by introducing a runCount in the partition or in the output path. Something like- /tmp/data/year/month/01 /tmp/data/year/month/02 Or,

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Yash Sharma
Well you are right. Having a quick glance at the source[1] I see that the path creation does not consider the partitions. It tries to create the path before looking for partitions columns. Not sure what would be the best way to incorporate it. Probably you can file a jira and experienced

Tips for Spark's Random Forest slow performance

2015-12-22 Thread Alexander Ratnikov
Hi All, It would be good to get some tips on tuning Apache Spark for Random Forest classification. Currently, we have a model that looks like: featureSubsetStrategy all, impurity gini, maxBins 32, maxDepth 11, numberOfClasses 2, numberOfTrees 100. We are running Spark 1.5.1 as a standalone cluster.
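For reference, those settings expressed as an MLlib call (trainingData is a placeholder RDD[LabeledPoint]); note that featureSubsetStrategy "all" with 100 trees is at the expensive end of the spectrum, and "sqrt" is a common, much cheaper choice for classification.

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.model.RandomForestModel
    import org.apache.spark.rdd.RDD

    def trainForest(trainingData: RDD[LabeledPoint]): RandomForestModel =
      RandomForest.trainClassifier(
        trainingData,
        2,                   // numberOfClasses
        Map[Int, Int](),     // categoricalFeaturesInfo: empty if all features are continuous
        100,                 // numberOfTrees
        "all",               // featureSubsetStrategy ("sqrt" is usually much faster)
        "gini",              // impurity
        11,                  // maxDepth
        32)                  // maxBins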

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Jan Holmberg
In my example the directories were distinct. So if I would like to have two distinct directories, e.g. /tmp/data/year=2012 /tmp/data/year=2013, it does not work with val df = Seq((2012, "Batman")).toDF("year","title") df.write.partitionBy("year").avro("/tmp/data") val df2 = Seq((2013,

Getting EOFException when using cloudera built spark 1.5.0.

2015-12-22 Thread hokam chauhan
Hi, I am getting an EOFException while loading MLlib models when running the Spark app on a Cloudera-built Spark cluster. This error goes away when I use Apache Hadoop-built Spark and run the app on it. Please help me understand why it is failing on the Cloudera Spark bundle. Thanks, Hokam -- View this message in

How to handle categorical variables in Spark MLlib?

2015-12-22 Thread hokam chauhan
Hi, We have a use case in which we need to handle categorical variables in the SVM, regression and logistic regression models (MLlib, not ML) for scoring. We are given the possible category values for each categorical variable. So how can the string value of a categorical variable be

Re: driver OOM due to io.netty.buffer items not getting finalized

2015-12-22 Thread Antony Mayi
I narrowed it down to problem described for example here:  https://bugs.openjdk.java.net/browse/JDK-6293787 It is the mass finalization of zip Inflater/Deflater objects which can't keep up with the rate of these instances being garbage collected. as the jdk bug report (not being accepted as a

Re: driver OOM due to io.netty.buffer items not getting finalized

2015-12-22 Thread Ted Yu
I searched code briefly. The following uses ZipEntry, ZipOutputStream : core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala core/src/main/scala/org/apache/spark/deploy/RPackageUtils.scala FYI On Tue, Dec 22, 2015 at 9:16 AM, Antony Mayi

Stand Alone Cluster - Strange issue

2015-12-22 Thread Madabhattula Rajesh Kumar
Hi, I have a standalone cluster. One Master + One Slave. I'm getting below "NULL POINTER" exception. Could you please help me on this issue. *Code Block :-* val accum = sc.accumulator(0) sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) *==> This line giving exception.* Exception :-

Regarding spark in memory

2015-12-22 Thread Gaurav Agarwal
If I have a cluster of 3 nodes and Spark is running there, and I load the records from Phoenix into a Spark RDD and fetch the records from Spark through a data frame: now I want to know, since Spark is distributed, if I fetch the records from any of the nodes, will records present on any node

Re: Can't read data correctly through beeline when data is save by HiveContext

2015-12-22 Thread licl
I solved this now; just run 'REFRESH TABLE shop.id' in beeline. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-read-data-correctly-through-beeline-when-data-is-save-by-HiveContext-tp25774p25779.html Sent from the Apache Spark User List mailing list

Re: Stand Alone Cluster - Strange issue

2015-12-22 Thread Ted Yu
Which Spark release are you using ? Cheers On Tue, Dec 22, 2015 at 9:34 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I have a standalone cluster. One Master + One Slave. I'm getting below > "NULL POINTER" exception. > > Could you please help me on this issue. > > >

RE: Do existing R packages work with SparkR data frames

2015-12-22 Thread Sun, Rui
Hi, Lan, Generally, it is hard to use existing R packages working with R data frames to work with SparkR data frames transparently. Typically the algorithms have to be re-written to use SparkR DataFrame API. Collect is for collecting the data from a SparkR DataFrame into a local data.frame.

Re: configure spark for hive context

2015-12-22 Thread Akhil Das
Looks like you have a wrong configuration file, which caused Spark to crash while parsing the configuration values from it. Thanks Best Regards On Mon, Dec 21, 2015 at 3:35 PM, Divya Gehlot wrote: > Hi, > I am trying to configure spark for hive context (Please dont get mistaken >

Re: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list?

2015-12-22 Thread our...@cnsuning.com
Dean, the following code passed the test. Thanks again. if (args.length < 2) { System.err.println("Usage: StatefulNetworkWordBiggest3Vaules ") System.exit(1) } val updateFunc = (key:String,values: Seq[Seq[Int]], state: Option[Seq[Int]]) => { if(values.length>0){
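A trimmed-down sketch of the same idea, keeping the three largest values per key in a Seq[Int] state (the input DStream is a placeholder); any state type with a ClassTag, including Seq, works with updateStateByKey.

    val updateFunc = (values: Seq[Seq[Int]], state: Option[Seq[Int]]) => {
      val merged = state.getOrElse(Seq.empty[Int]) ++ values.flatten
      Some(merged.sorted(Ordering[Int].reverse).take(3))   // keep the three largest values
    }
    // pairs: DStream[(String, Seq[Int])] (placeholder)
    // val biggest3 = pairs.updateStateByKey[Seq[Int]](updateFunc)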