How to handle categorical variables in Spark MLlib?

2015-12-22 Thread Hokam Singh Chauhan
Hi, We have a use case in which we need to handle categorical variables in the SVM, regression and logistic regression models (MLlib, not ML) for scoring. We are given the possible category values for each categorical variable. So how can the string value of a categorical variable be converted
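A minimal sketch of one common approach for the RDD-based MLlib models, assuming the possible category values are known up front (the variable and values below are made up): index each category value and one-hot encode it into the feature vector before building the LabeledPoints.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Hypothetical categorical variable with known values
    val colors = Array("red", "green", "blue")
    val colorIndex = colors.zipWithIndex.toMap        // "red" -> 0, "green" -> 1, ...

    // Turn (label, numeric feature, category string) into a LabeledPoint
    def toLabeledPoint(label: Double, x: Double, color: String): LabeledPoint = {
      val oneHot = Array.fill(colors.length)(0.0)
      oneHot(colorIndex(color)) = 1.0                 // one-hot encode the category
      LabeledPoint(label, Vectors.dense(Array(x) ++ oneHot))
    }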

Re: Extract SSerr SStot from Linear Regression using ml package

2015-12-22 Thread Arunkumar Pillai
Hi Yanbo, This is just to find the F statistic and adjusted R-squared. On Tue, Dec 22, 2015 at 4:25 PM, Yanbo Liang wrote: > Hi Arunkumar, > > Could you tell me the reason you need SSerr, SStot and SSreg? > Traditionally we use explainedVariance, meanAbsoluteError, meanSquaredError, >
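For reference, a sketch of how the F statistic and adjusted R-squared can be derived from the metrics Yanbo mentions; n, p, mse and r2 below are placeholders for the observation count, predictor count, meanSquaredError and r2 of the fitted model.

    val n = 1000.0     // number of observations (placeholder)
    val p = 5.0        // number of predictors (placeholder)
    val mse = 2.5      // meanSquaredError reported by the model (placeholder)
    val r2 = 0.8       // r2 reported by the model (placeholder)

    val ssErr = mse * n                                  // SSerr = MSE * n
    val ssTot = ssErr / (1.0 - r2)                       // since R^2 = 1 - SSerr / SStot
    val ssReg = ssTot - ssErr
    val adjustedR2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    val fStat = (ssReg / p) / (ssErr / (n - p - 1))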

Re: Classification model method not found

2015-12-22 Thread Ted Yu
Looks like you should define a constructor for ExtendedLR which accepts a String (the uid). Cheers On Tue, Dec 22, 2015 at 1:04 PM, njoshi wrote: > Hi, > > I have a custom extended LogisticRegression model which I want to test > against a parameter grid search. I am running as
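A sketch of the constructor pattern Ted describes, assuming ExtendedLR extends the ml LogisticRegression; the String-taking constructor is what the copy()/grid-search machinery instantiates by reflection.

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.util.Identifiable

    class ExtendedLR(override val uid: String) extends LogisticRegression(uid) {
      // No-arg constructor for convenience; the String ctor is used by copy() during grid search
      def this() = this(Identifiable.randomUID("extLR"))
      // custom behaviour goes here
    }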

How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-22 Thread raja kbv
Hi, I am new to Spark. I have a text file with the structure below. (employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName, Description, Duriation, Role}]}) Eg: (123456, Employee1, {“ProjectDetails”:[ {“ProjectName”: “Web
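One possible sketch (the path and the handling of the non-JSON prefix are assumptions): strip each line down to its JSON object, let sqlContext.read.json infer the nested schema, then explode the ProjectDetails array so each project becomes a row.

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.explode

    val sqlContext = new SQLContext(sc)              // sc: existing SparkContext

    // Keep only the JSON object from each line (everything from the first '{')
    val jsonStrings = sc.textFile("/path/to/employees.txt")
      .map(line => line.substring(line.indexOf('{')).stripSuffix(")"))

    val df = sqlContext.read.json(jsonStrings)       // nested schema inferred from the JSON
    val flat = df.select(explode(df("ProjectDetails")).as("project"))
      .select("project.ProjectName", "project.Description", "project.Role")
    flat.show()

The employeeID and Name could be carried along by rebuilding each line as one complete JSON document before parsing.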

building a distributed k-d tree with spark

2015-12-22 Thread Russ
I have developed algorithms to build and search a distributed k-d tree using Scala/Spark.  Erik Erlandson of Red Hat has encouraged me to share the URL for the pre-print of my manuscript and source code: http://arxiv.org/abs/1512.06389 I will be revising this manuscript and refining the

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Jan Holmberg
Works like a charm. Thanks a lot! -jan On 22 Dec 2015, at 23:08, Michael Armbrust wrote: You need to say .mode("append") if you want to append to existing data. On Tue, Dec 22, 2015 at 6:48 AM, Yash Sharma

Re: spark-submit is ignoring "--executor-cores"

2015-12-22 Thread Siva
Thanks a lot Saisai and Zhan, I see DefaultResourceCalculator currently being used for Capacity scheduler. We will change it to DominantResourceCalculator. Thanks, Sivakumar Bhavanari. On Mon, Dec 21, 2015 at 5:56 PM, Zhan Zhang wrote: > BTW: It is not only a Yarn-webui

Classification model method not found

2015-12-22 Thread njoshi
Hi, I have a custom extended LogisticRegression model which I want to test against a parameter grid search. I am running as follows: / val exLR = new ExtendedLR() .setMaxIter(100) .setFitIntercept(true) /* * Cross Validator parameter grid */ val paramGrid = new

Missing dependencies when submitting scala app

2015-12-22 Thread Daniel Valdivia
Hi, I'm trying to figure out how to bundle dependencies with a Scala application. So far my code was tested successfully on the spark-shell; however, now that I'm trying to run it as a standalone application, which I'm compiling with sbt, it is yielding the error: java.lang.NoSuchMethodError:

Re: Spark data frame

2015-12-22 Thread Dean Wampler
More specifically, you could have TBs of data across thousands of partitions for a single RDD. If you call collect(), BOOM! Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler

Do existing R packages work with SparkR data frames

2015-12-22 Thread Lan
Hello, Is it possible for existing R Machine Learning packages (which work with R data frames) such as bnlearn, to work with SparkR data frames? Or do I need to convert SparkR data frames to R data frames? Is "collect" the function to do the conversion, or how else to do that? Many Thanks, Lan

Re: should I file a bug? Re: trouble implementing Transformer and calling DataFrame.withColumn()

2015-12-22 Thread Jeff Zhang
>>> DataFrame transformerdDF = df.withColumn(fieldName, newCol); >>> org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS transformedByUDF#3]; I don't think it's a spark bug, it is your application bug. I
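A minimal sketch of the pattern Jeff is pointing at: the column handed to withColumn has to be derived from the same DataFrame, for example by applying the UDF directly to one of its existing columns (column names below are taken from the error message, the UDF body is a placeholder).

    import org.apache.spark.sql.functions.udf

    val myUdf = udf((s: String) => if (s == null) null else s.toUpperCase)
    val transformedDF = df.withColumn("transformedByUDF", myUdf(df("labelStr")))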

Which Hive version should be used with Spark 1.5.2?

2015-12-22 Thread Arthur Chan
Hi, I plan to upgrade from 1.4.1 (+ Hive 1.1.0) to 1.5.2. Is there any upgrade document available, especially regarding which Hive version should be upgraded to? Regards

Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread b2k70
I see in the Spark SQL documentation that a temporary table can be created directly onto a remote PostgreSQL table. CREATE TEMPORARY TABLE USING org.apache.spark.sql.jdbc OPTIONS ( url "jdbc:postgresql:///", dbtable "impressions" ); When I run this against our PostgreSQL server, I get the
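The same read can also be expressed through the DataFrame API, which fails with a similar error if the PostgreSQL JDBC jar is not on the classpath; host, database and table names below are placeholders, and sqlContext is assumed to exist.

    val impressions = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:postgresql://dbhost:5432/addb",
      "dbtable" -> "impressions",
      "driver"  -> "org.postgresql.Driver"
    )).load()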

Can't read data correctly through beeline when data is save by HiveContext

2015-12-22 Thread licl
Hi, Here is my Java code: SparkConf sparkConf = Constance.getSparkConf(); JavaSparkContext sc = new JavaSparkContext(sparkConf); SQLContext sql = new SQLContext(sc); HiveContext sqlContext = new HiveContext(sc.sc()); List fields = new

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Stephen Boesch
The postgres jdbc driver needs to be added to the classpath of your spark workers. You can do a search for how to do that (multiple ways). 2015-12-22 17:22 GMT-08:00 b2k70 : > I see in the Spark SQL documentation that a temporary table can be created > directly onto a

Re: Can SqlContext be used inside mapPartitions

2015-12-22 Thread Zhan Zhang
SQLContext is in driver side, and I don’t think you can use it in executors. How to provide lookup functionality in executors really depends on how you would use them. Thanks. Zhan Zhang On Dec 22, 2015, at 4:44 PM, SRK wrote: > Hi, > > Can SQL Context be used
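A common alternative, sketched below with placeholder paths, column positions and input data: collect the lookup data on the driver (if it is small enough), broadcast it, and use the broadcast value inside mapPartitions instead of a SQLContext.

    // Build a small lookup map from the HDFS data and broadcast it to the executors
    val lookupMap = sqlContext.read.parquet("/data/lookup")
      .map(r => r.getString(0) -> r.getString(1))
      .collectAsMap()
    val lookupBc = sc.broadcast(lookupMap)

    val someRdd = sc.parallelize(Seq(("k1", 1), ("k2", 2)))   // placeholder input
    val enriched = someRdd.mapPartitions { iter =>
      val lookup = lookupBc.value                             // local, read-only copy on the executor
      iter.map { case (key, value) => (key, value, lookup.getOrElse(key, "unknown")) }
    }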

Re: Can SqlContext be used inside mapPartitions

2015-12-22 Thread Ted Yu
bq. be able to lookup from inside MapPartitions based on a key Please describe your use case in bit more detail. One possibility is to use NoSQL database such as HBase. There're several choices for Spark HBase connector. Cheers On Tue, Dec 22, 2015 at 4:51 PM, Zhan Zhang

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Benjamin Kim
Hi Stephen, I forgot to mention that I added these lines below to spark-defaults.conf on the node with the Spark SQL Thrift JDBC/ODBC Server running on it. Then, I restarted it. spark.driver.extraClassPath=/usr/share/java/postgresql-9.3-1104.jdbc41.jar

Re: Can't read data correctly through beeline when data is save by HiveContext

2015-12-22 Thread licl
I know the reason now. I changed the metastore with Java code, but the Thrift server caches the metastore in memory; it needs to be refreshed from MySQL. But how? -- View this message in context:

Re: Stand Alone Cluster - Strange issue

2015-12-22 Thread Madabhattula Rajesh Kumar
Hi Ted, Thank you. Yes. This issue is related to https://issues.apache.org/jira/browse/SPARK-4170 Regards, Rajesh On Wed, Dec 23, 2015 at 12:09 AM, Ted Yu wrote: > This should be related: > https://issues.apache.org/jira/browse/SPARK-4170 > > On Tue, Dec 22, 2015 at 9:34
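For reference, SPARK-4170 concerns code placed directly in the body of an object that extends App; a sketch of the usual workaround, assuming that is how the snippet was packaged, is to move the code into an explicit main method.

    import org.apache.spark.{SparkConf, SparkContext}

    object AccumulatorExample {          // instead of: object AccumulatorExample extends App
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("AccumulatorExample"))
        val accum = sc.accumulator(0)
        sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
        println(accum.value)
        sc.stop()
      }
    }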

Re: Which Hive version should be used with Spark 1.5.2?

2015-12-22 Thread Ted Yu
Please see SPARK-8064 On Tue, Dec 22, 2015 at 6:17 PM, Arthur Chan wrote: > Hi, > > I plan to upgrade from 1.4.1 (+ Hive 1.1.0) to 1.5.2, is there any > upgrade document available about the upgrade especially which Hive version > should be upgraded too? > > Regards >

Re: Classification model method not found

2015-12-22 Thread Nikhil Joshi
Hi Ted, Thanks. That fixed the issue :). Nikhil On Tue, Dec 22, 2015 at 1:14 PM, Ted Yu wrote: > Looks like you should define ctor for ExtendedLR which accepts String > (the uid). > > Cheers > > On Tue, Dec 22, 2015 at 1:04 PM, njoshi wrote: >

Can SqlContext be used inside mapPartitions

2015-12-22 Thread SRK
Hi, Can SQL Context be used inside mapPartitions? My requirement is to register a set of data from hdfs as a temp table and to be able to lookup from inside MapPartitions based on a key. If it is not supported, is there a different way of doing this? Thanks! -- View this message in context:

Re: Missing dependencies when submitting scala app

2015-12-22 Thread Jeff Zhang
It might be a jar conflict issue. Spark has a dependency on org.json4s.jackson; do you also specify org.json4s.jackson in your sbt dependencies but with a different version? On Wed, Dec 23, 2015 at 6:15 AM, Daniel Valdivia wrote: > Hi, > > I'm trying to figure out how to bundle
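A hypothetical build.sbt fragment illustrating Jeff's point: pin json4s to the version the Spark 1.5.x line itself uses (believed to be the 3.2.x series) so the assembly does not shade in an incompatible copy.

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"     % "1.5.2" % "provided",
      "org.json4s"       %% "json4s-jackson" % "3.2.10"   // align with Spark's own json4s version
    )
    dependencyOverrides += "org.json4s" %% "json4s-jackson" % "3.2.10"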

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Stephen Boesch
Hi Benjamin, yes, by adding it to the Thrift server the CREATE TABLE will work. But querying is performed by the workers, so you need to add it to the classpath of all nodes for reads to work. 2015-12-22 18:35 GMT-08:00 Benjamin Kim : > Hi Stephen, > > I forgot to mention

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-22 Thread Benjamin Kim
Stephen, Let me confirm. I just need to propagate these settings I put in spark-defaults.conf to all the worker nodes? Do I need to do the same with the PostgreSQL driver jar file too? If so, is there a way to have it read from HDFS rather than copying out to the cluster manually. Thanks for

Re: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list?

2015-12-22 Thread our...@cnsuning.com
So sorry, it should be Seq, not SQL. Thanks for your help. Ricky Ou(欧 锐) From: Dean Wampler Date: 2015-12-23 00:46 To: our...@cnsuning.com CC: user; t...@databricks.com Subject: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list? There are

Re: Spark data frame

2015-12-22 Thread Dean Wampler
You can call the collect() method to return a collection, but be careful. If your data is too big to fit in the driver's memory, it will crash. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe
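A few safer alternatives, as a sketch (the output path is a placeholder), for when the full result may not fit in the driver:

    val preview = df.take(20)                // bring back only the first n rows
    val localIt = df.rdd.toLocalIterator     // stream roughly one partition at a time
    df.write.parquet("/tmp/results")         // or keep the data distributed and write it out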

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Michael Armbrust
You need to say .mode("append") if you want to append to existing data. On Tue, Dec 22, 2015 at 6:48 AM, Yash Sharma wrote: > Well you are right. Having a quick glance at the source[1] I see that the > path creation does not consider the partitions. > > It tries to create
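Applied to the partitioned Avro example from this thread (df and the base path as in Jan's post), the write for each batch would look roughly like this:

    import com.databricks.spark.avro._

    // SaveMode "append" lets later batches add new partition directories under the
    // same base path instead of failing with "path already exists"
    df.write.mode("append").partitionBy("year").avro("/tmp/data")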

Re: Spark data frame

2015-12-22 Thread Silvio Fiorito
Michael, collect will bring down the results to the driver JVM, whereas the RDD or DataFrame would be cached on the executors (if it is cached). So, as Dean said, the driver JVM needs to have enough memory to store the results of collect. Thanks, Silvio From: Michael Segel

Do existing R packages work with SparkR data frames

2015-12-22 Thread Duy Lan Nguyen
Hello, Is it possible for existing R Machine Learning packages (which work with R data frames) such as bnlearn, to work with SparkR data frames? Or do I need to convert SparkR data frames to R data frames? Is "collect" the function to do the conversion, or how else to do that? Many Thanks, Lan

Re: Stand Alone Cluster - Strange issue

2015-12-22 Thread Ted Yu
This should be related: https://issues.apache.org/jira/browse/SPARK-4170 On Tue, Dec 22, 2015 at 9:34 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I have a standalone cluster. One Master + One Slave. I'm getting below > "NULL POINTER" exception. > > Could you please help

Spark data frame

2015-12-22 Thread Gaurav Agarwal
We are able to retrieve a data frame by filtering the RDD object. I need to convert that data frame into a Java POJO. Any idea how to do that?

Re: val listRDD =ssc.socketTextStream(localhost,9999) on Yarn

2015-12-22 Thread Shixiong Zhu
Just replace `localhost` with a host name that can be accessed by Yarn containers. Best Regards, Shixiong Zhu 2015-12-22 0:11 GMT-08:00 prasadreddy : > How do we achieve this on yarn-cluster mode > > Please advice. > > Thanks > Prasad > > > > -- > View this message in
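As a sketch (hostname and port are placeholders, ssc is the existing StreamingContext), the receiver should point at an externally reachable host rather than localhost:

    // "localhost" would refer to whichever YARN container happens to run the receiver
    val listRDD = ssc.socketTextStream("edge-node.example.com", 9999)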

How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-22 Thread raja kbv
Hi, I am new to Spark. I have a text file with the structure below. (employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName, Description, Duriation, Role}]}) Eg: (123456, Employee1, {“ProjectDetails”:[ {“ProjectName”: “Web

Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-22 Thread superbee84
Hi All, I'm new to Spark. Before I describe the problem, I'd like to let you know the role of the machines that make up the cluster and the purpose of my work. By reading and following the instructions and tutorials, I successfully built up a cluster with 7 CentOS-6.5 machines. I installed

running lda in spark throws exception

2015-12-22 Thread Li Li
I ran my LDA example on a YARN 2.6.2 cluster with Spark 1.5.2. It throws an exception at the line: Matrix topics = ldaModel.topicsMatrix(); But in the YARN job history UI, it's successful. What's wrong with it? I submit the job with ./bin/spark-submit --class Myclass \ --master yarn-client \

running spark application encouter an error (maven relative)

2015-12-22 Thread zml张明磊
Hi, I am trying to figure out how Maven works. When I add a dependency to my existing pom.xml and rebuild my Spark application project, I get BUILD SUCCESS from the console. However, when I run the Spark application, the spark-shell is not happy and directly gives me a message

Re: Numpy and dynamic loading

2015-12-22 Thread Akhil Das
I guess you will have to install numpy on all the machines for this to work. Try reinstalling on all the machines: sudo apt-get purge python-numpy sudo pip uninstall numpy sudo pip install numpy Thanks Best Regards On Sun, Dec 20, 2015 at 11:19 PM, Abhinav M Kulkarni <

Re: Regarding spark in memory

2015-12-22 Thread Ted Yu
If I understand your question correctly, the answer is yes. You can retrieve rows of the rdd which are distributed across the nodes. > On Dec 22, 2015, at 7:34 PM, Gaurav Agarwal wrote: > > If I have 3 more cluster and spark is running there .if I load the records >

Re: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list?

2015-12-22 Thread our...@cnsuning.com
as the following code modified form StateflNetwork in exampile package if (args.length < 2) { System.err.println("Usage: StatefulNetworkWordBiggest3Vaules ") System.exit(1) } /** * state is min(max(3)) */ val updateFunc = (key:String,values: Seq[Seq[Int]], state: Seq[Int]) => {

Re: I coded an example to use Twitter stream as a data source for Spark

2015-12-22 Thread Akhil Das
Why not create a custom dstream and generate the data from there itself instead of spark connecting to a socket server which will be fed by another twitter client? Thanks Best Regards On Sat, Dec 19, 2015 at 5:47 PM, Amir
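A skeleton of what Akhil is suggesting, assuming a hypothetical Twitter client; the receiver pushes data straight into Spark Streaming, so no intermediate socket server is needed.

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class TwitterFeedReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      def onStart(): Unit = {
        new Thread("twitter-feed-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              store("tweet text ...")   // placeholder for e.g. client.nextStatus().getText
              Thread.sleep(100)
            }
          }
        }.start()
      }
      def onStop(): Unit = {}           // the polling loop exits once isStopped() is true
    }

    // Usage: val tweets = ssc.receiverStream(new TwitterFeedReceiver)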

Re: Numpy and dynamic loading

2015-12-22 Thread Akhil Das
I guess you will have to install numpy on all the machines for this to work. Try reinstalling on all the machines: sudo apt-get purge python-numpy sudo pip uninstall numpy sudo pip install numpy Thanks Best Regards On Sun, Dec 20, 2015 at 8:33 AM, Abhinav M Kulkarni <

Re: Memory allocation for Broadcast values

2015-12-22 Thread Akhil Das
If you are creating a huge map on the driver, then spark.driver.memory should be set to a higher value to hold your map. Since you are going to broadcast this map, your spark executors must have enough memory to hold this map as well which can be set using the spark.executor.memory, and
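A small sketch of the pattern; the map contents and input RDD below are placeholders, and the memory settings themselves are passed to spark-submit, e.g. --driver-memory 8g --executor-memory 8g.

    // The map lives on the driver first (spark.driver.memory must hold it) and a read-only
    // copy is shipped to each executor (spark.executor.memory must hold it too)
    val bigMap: Map[String, String] = Map("k1" -> "v1")        // placeholder for the real data
    val bigMapBc = sc.broadcast(bigMap)
    val resolved = sc.parallelize(Seq("k1", "k2"))
      .map(k => k -> bigMapBc.value.getOrElse(k, "missing"))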

error while defining custom schema in Spark 1.5.0

2015-12-22 Thread Divya Gehlot
Hi, I am new to Apache Spark, using the CDH 5.5 QuickStart VM with Spark 1.5.0. I am working on a custom schema and getting an error import org.apache.spark.sql.hive.HiveContext >> >> scala> import org.apache.spark.sql.hive.orc._ >> import org.apache.spark.sql.hive.orc._ >> >> scala> import
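For comparison, a minimal custom-schema sketch; the field names/types and the spark-csv source are assumptions, since the original data source is not shown, and sqlContext is the one provided by the shell.

    import org.apache.spark.sql.types._

    val customSchema = StructType(Seq(
      StructField("id",     IntegerType, nullable = false),
      StructField("name",   StringType,  nullable = true),
      StructField("amount", DoubleType,  nullable = true)
    ))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(customSchema)
      .load("/path/to/file.csv")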

UnsupportedOperationException Schema for type String => Int is not supported

2015-12-22 Thread zml张明磊
Hi, Spark version: 1.4.1. Running the code I get the following error; how can I fix the code so it runs correctly? I don't know why the schema doesn't support this type. If I use callUDF instead of udf, everything is good. Thanks, Minglei. val index:(String => (String => Int)) =
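Catalyst has no schema for a function type such as String => Int, so a UDF cannot return a function. A sketch of the usual workaround is to take both inputs in a single UDF; the body and column names below are placeholders for the original curried logic, and df is assumed to exist.

    import org.apache.spark.sql.functions.udf

    val index = udf((category: String, value: String) => if (value == category) 1 else 0)
    val result = df.withColumn("indexed", index(df("categoryCol"), df("valueCol")))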

Apache spark certification pass percentage ?

2015-12-22 Thread kali.tumm...@gmail.com
Hi All, Does anyone know the pass percentage for the Apache Spark certification exam? Thanks Sri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-spark-certification-pass-percentage-tp25761.html Sent from the Apache Spark User List mailing list archive

driver OOM due to io.netty.buffer items not getting finalized

2015-12-22 Thread Antony Mayi
I have a streaming app (pyspark 1.5.2 on yarn) that's crashing due to a driver (the JVM part, not Python) OOM (no matter how big a heap is assigned, it eventually runs out). When checking the heap, it is all taken by "byte" items of io.netty.buffer.PoolThreadCache. The number of

Re: driver OOM due to io.netty.buffer items not getting finalized

2015-12-22 Thread Ted Yu
This might be related but the jmap output there looks different: http://stackoverflow.com/questions/32537965/huge-number-of-io-netty-buffer-poolthreadcachememoryregioncacheentry-instances On Tue, Dec 22, 2015 at 2:59 AM, Antony Mayi wrote: > I have streaming app

Re: Fat jar can't find jdbc

2015-12-22 Thread David Yerrington
Igor, I think it's available. After I extract the jar file, I see a directory with class files that look very relevant in "/com/mysql/jdbc". After reading this, I started to wonder if MySQL connector was really the problem. Perhaps it's something to do with SQLcontext? I just wired a test

Re: Extract SSerr SStot from Linear Regression using ml package

2015-12-22 Thread Yanbo Liang
Hi Arunkumar, Could you tell me the reason you need SSerr, SStot and SSreg? Traditionally we use explainedVariance, meanAbsoluteError, meanSquaredError, rootMeanSquaredError and r2 as metrics for LinearRegression, although you can actually get SSerr, SStot and SSreg from a composition of the above

Re: Fat jar can't find jdbc

2015-12-22 Thread Vijay Kiran
Can you paste your libraryDependencies from build.sbt ? ./Vijay > On 22 Dec 2015, at 06:12, David Yerrington wrote: > > Hi Everyone, > > I'm building a prototype that fundamentally grabs data from a MySQL instance, > crunches some numbers, and then moves it on down the

Re: Fat jar can't find jdbc

2015-12-22 Thread David Yerrington
Sure here it is: import AssemblyKeys._ assemblySettings // assemblyJarName in assembly := "recommender.jar" test in assembly := {} organization := "com.example" version := "0.1" scalaVersion := "2.11.6" scalacOptions := Seq("-unchecked", "-deprecation", "-encoding", "utf8")

Writing partitioned Avro data to HDFS

2015-12-22 Thread Jan Holmberg
Hi, I'm stuck with writing partitioned data to HDFS. The example below ends up with an 'already exists' error. I'm wondering how to handle the streaming use case. What is the intended way to write streaming data to HDFS? What am I missing? cheers, -jan import com.databricks.spark.avro._ import

val listRDD =ssc.socketTextStream(localhost,9999) on Yarn

2015-12-22 Thread prasadreddy
How do we achieve this on yarn-cluster mode Please advice. Thanks Prasad -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/val-listRDD-ssc-socketTextStream-localhost--on-Yarn-tp25760.html Sent from the Apache Spark User List mailing list archive at

Re: Fat jar can't find jdbc

2015-12-22 Thread Igor Berman
David, can you verify that the mysql connector classes are indeed in your single jar? Open it with the zip tool available on your platform. Another option that might be a problem - if there is some dependency in the MANIFEST (not sure, though, whether this is the case for the mysql connector) then it might be broken after

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Yash Sharma
Hi Jan, Is the error because a past run of the job has already written to the location? In that case you can add more granularity with 'time' along with year and month. That should give you a distinct path for every run. Let us know if it helps or if i missed anything. Goodluck - Thanks, via

Client session timed out, have not heard from server in

2015-12-22 Thread yaoxiaohua
Hi, I encountered a similar issue with Spark 1.4: Master 2 runs for some days, then gives a timeout exception and shuts down. I found a bug: https://issues.apache.org/jira/browse/SPARK-9629 INFO ClientCnxn: Client session timed out, have not heard from server in 40015ms for sessionid

Re: Apache spark certification pass percentage ?

2015-12-22 Thread Yash Sharma
Hi Sri, That would depend on the organization with which you are taking the certification. This place is more helpful for questions and information about using Spark and/or contributing to Spark. Goodluck - Thanks, via mobile, excuse brevity. On Dec 22, 2015 3:56

Re: Client session timed out, have not heard from server in

2015-12-22 Thread Yash Sharma
Hi Evan, SPARK-9629 referred to connection issues with ZooKeeper. Could you check if it's working fine in your setup? Also please share other error logs you might be getting. - Thanks, via mobile, excuse brevity. On Dec 22, 2015 5:00 PM, "yaoxiaohua" wrote: > Hi, > >

Re: Client session timed out, have not heard from server in

2015-12-22 Thread Dirceu Semighini Filho
Hi Yash, I've experienced this behavior here when the process freezes in a worker. This mainly happens, in my case, when the worker memory is full and the Java GC isn't able to free memory for the process. Try to search for OutOfMemory errors in your worker logs. Regards, Dirceu 2015-12-22 10:26

Re: Client session timed out, have not heard from server in

2015-12-22 Thread Yash Sharma
Evan could you also share more logs on the error. Probably paste here or in pastebin. Also check zookeeper logs in case you find anything. - Thanks, via mobile, excuse brevity. On Dec 22, 2015 6:01 PM, "Dirceu Semighini Filho" < dirceu.semigh...@gmail.com> wrote: > Hi Yash, > I've experienced

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Jan Holmberg
Hi Yash, the error is caused by the fact that the first run creates the base directory, i.e. "/tmp/data", and the second batch stumbles on the existing base directory. I understand that the existing base directory is a challenge, but I do not understand how to make this work with the streaming example

RE: Client session timed out, have not heard from server in

2015-12-22 Thread yaoxiaohua
Thanks for your reply. I find spark-env.sh : SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.akka.askTimeout=300 -Dspark.ui.retainedStages=1000 -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs://sparkcluster/user/spark_history_logs -Dspark.shuffle.spill=false -Dspark.shuffle.manager=hash

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-22 Thread Priya Ch
Jakob, Increased the settings like fs.file-max in /etc/sysctl.conf and also increased user limit in /etc/security/limits.conf. But still see the same issue. On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky wrote: > It might be a good idea to see how many files are open

Re: Fat jar can't find jdbc

2015-12-22 Thread Igor Berman
imho, if you succeeded in fetching something from your MySQL with the same jar on the classpath, then the Manifest is OK and you should indeed look at your Spark SQL JDBC configs On 22 December 2015 at 12:21, David Yerrington wrote: > Igor, I think it's available. After I extract the

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Yash Sharma
Well, this will indeed hit the error if the next run has the same year and month, and writing would not be possible. You can try working around it by introducing a runCount in the partition or in the output path. Something like- /tmp/data/year/month/01 /tmp/data/year/month/02 Or,

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Yash Sharma
Well you are right. Having a quick glance at the source[1] I see that the path creation does not consider the partitions. It tries to create the path before looking for partitions columns. Not sure what would be the best way to incorporate it. Probably you can file a jira and experienced

Tips for Spark's Random Forest slow performance

2015-12-22 Thread Alexander Ratnikov
Hi All, It would be good to get some tips on tuning Apache Spark for Random Forest classification. Currently, we have a model that looks like: featureSubsetStrategy all, impurity gini, maxBins 32, maxDepth 11, numberOfClasses 2, numberOfTrees 100. We are running Spark 1.5.1 as a standalone cluster.
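For reference, those settings expressed as an MLlib call (trainingData is a placeholder RDD[LabeledPoint]); note that featureSubsetStrategy "all" with 100 trees is at the expensive end of the spectrum, and "sqrt" is a common, much cheaper choice for classification.

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.model.RandomForestModel
    import org.apache.spark.rdd.RDD

    def trainForest(trainingData: RDD[LabeledPoint]): RandomForestModel =
      RandomForest.trainClassifier(
        trainingData,
        2,                   // numberOfClasses
        Map[Int, Int](),     // categoricalFeaturesInfo: empty if all features are continuous
        100,                 // numberOfTrees
        "all",               // featureSubsetStrategy ("sqrt" is usually much faster)
        "gini",              // impurity
        11,                  // maxDepth
        32)                  // maxBins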

Re: Writing partitioned Avro data to HDFS

2015-12-22 Thread Jan Holmberg
In my example the directories were distinct. So if I would like to have two distinct directories, e.g. /tmp/data/year=2012 /tmp/data/year=2013, it does not work with val df = Seq((2012, "Batman")).toDF("year","title") df.write.partitionBy("year").avro("/tmp/data") val df2 = Seq((2013,

Getting EOFException when using cloudera built spark 1.5.0.

2015-12-22 Thread hokam chauhan
Hi, I am getting an EOFException while loading MLlib models when running the Spark app on a Cloudera-built Spark cluster. This error goes away when I use Apache Hadoop-built Spark and run the app on it. Please help me understand why it is failing on the Cloudera Spark bundle. Thanks, Hokam -- View this message in

How to handle categorical variables in Spark MLlib?

2015-12-22 Thread hokam chauhan
Hi, We have a use case in which we need to handle categorical variables in the SVM, regression and logistic regression models (MLlib, not ML) for scoring. We are given the possible category values for each categorical variable. So how can the string value of a categorical variable be

Re: driver OOM due to io.netty.buffer items not getting finalized

2015-12-22 Thread Antony Mayi
I narrowed it down to problem described for example here:  https://bugs.openjdk.java.net/browse/JDK-6293787 It is the mass finalization of zip Inflater/Deflater objects which can't keep up with the rate of these instances being garbage collected. as the jdk bug report (not being accepted as a

Re: driver OOM due to io.netty.buffer items not getting finalized

2015-12-22 Thread Ted Yu
I searched code briefly. The following uses ZipEntry, ZipOutputStream : core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala core/src/main/scala/org/apache/spark/deploy/RPackageUtils.scala FYI On Tue, Dec 22, 2015 at 9:16 AM, Antony Mayi

Stand Alone Cluster - Strange issue

2015-12-22 Thread Madabhattula Rajesh Kumar
Hi, I have a standalone cluster. One Master + One Slave. I'm getting below "NULL POINTER" exception. Could you please help me on this issue. *Code Block :-* val accum = sc.accumulator(0) sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) *==> This line giving exception.* Exception :-

Regarding spark in memory

2015-12-22 Thread Gaurav Agarwal
If I have a cluster of 3 nodes and Spark is running there, and I load the records from Phoenix into a Spark RDD and fetch the records from Spark through a data frame: now I want to know, since Spark is distributed, if I fetch the records from any of the nodes, will records present on any node

Re: Can't read data correctly through beeline when data is save by HiveContext

2015-12-22 Thread licl
I solved this now; just run 'REFRESH TABLE shop.id' in beeline. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-read-data-correctly-through-beeline-when-data-is-save-by-HiveContext-tp25774p25779.html Sent from the Apache Spark User List mailing list

Re: Stand Alone Cluster - Strange issue

2015-12-22 Thread Ted Yu
Which Spark release are you using ? Cheers On Tue, Dec 22, 2015 at 9:34 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I have a standalone cluster. One Master + One Slave. I'm getting below > "NULL POINTER" exception. > > Could you please help me on this issue. > > >

RE: Do existing R packages work with SparkR data frames

2015-12-22 Thread Sun, Rui
Hi, Lan, Generally, it is hard to use existing R packages working with R data frames to work with SparkR data frames transparently. Typically the algorithms have to be re-written to use SparkR DataFrame API. Collect is for collecting the data from a SparkR DataFrame into a local data.frame.

Re: configure spark for hive context

2015-12-22 Thread Akhil Das
Looks like you have a wrong configuration file, which caused Spark to crash while parsing the configuration values from it. Thanks Best Regards On Mon, Dec 21, 2015 at 3:35 PM, Divya Gehlot wrote: > Hi, > I am trying to configure spark for hive context (Please dont get mistaken >

Re: Re: spark streaming updateStateByKey state is nonsupport other type except ClassTag such as list?

2015-12-22 Thread our...@cnsuning.com
Dean, the following code passed the test. Thanks again. if (args.length < 2) { System.err.println("Usage: StatefulNetworkWordBiggest3Vaules ") System.exit(1) } val updateFunc = (key:String,values: Seq[Seq[Int]], state: Option[Seq[Int]]) => { if(values.length>0){
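A trimmed-down sketch of the same idea, keeping the three largest values per key in a Seq[Int] state (the input DStream is a placeholder); any state type with a ClassTag, including Seq, works with updateStateByKey.

    val updateFunc = (values: Seq[Seq[Int]], state: Option[Seq[Int]]) => {
      val merged = state.getOrElse(Seq.empty[Int]) ++ values.flatten
      Some(merged.sorted(Ordering[Int].reverse).take(3))   // keep the three largest values
    }
    // pairs: DStream[(String, Seq[Int])] (placeholder)
    // val biggest3 = pairs.updateStateByKey[Seq[Int]](updateFunc)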