RE: Spark 1.2.0 with Play/Activator

2015-04-06 Thread Manish Gupta 8
Thanks for the information Andy. I will go through the versions mentioned in Dependencies.scala to identify the compatibility. Regards, Manish From: andy petrella [mailto:andy.petre...@gmail.com] Sent: Tuesday, April 07, 2015 11:04 AM To: Manish Gupta 8; user@spark.apache.org Subject: Re: Spark

Re: Strategy regarding maximum number of executor failures for long running jobs / spark streaming jobs

2015-04-06 Thread twinkle sachdeva
Hi, One rationale behind killing the app can be to avoid skewness in data. I have created this issue (https://issues.apache.org/jira/browse/SPARK-6735) to provide options for disabling this behaviour, as well as making the number of executor failures relative to a window

Re: Microsoft SQL jdbc support from spark sql

2015-04-06 Thread Bipin Nag
Thanks for the information. Hopefully this will happen in the near future. For now my best bet would be to export the data and import it into Spark SQL. On 7 April 2015 at 11:28, Denny Lee wrote: > At this time, the JDBC Data source is not extensible so it cannot support > SQL Server. There was some tho

Re: Microsoft SQL jdbc support from spark sql

2015-04-06 Thread Denny Lee
At this time, the JDBC Data source is not extensible so it cannot support SQL Server. There were some thoughts - credit to Cheng Lian for this - about making the JDBC data source extensible for third-party support, possibly via Slick. On Mon, Apr 6, 2015 at 10:41 PM bipin wrote: > Hi, I am try

Microsoft SQL jdbc support from spark sql

2015-04-06 Thread bipin
Hi, I am trying to pull data from ms-sql server. I have tried using the spark.sql.jdbc CREATE TEMPORARY TABLE c USING org.apache.spark.sql.jdbc OPTIONS ( url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;", dbtable "Customer" ); But it shows java.sql.SQLException: No suitable driver fou
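A first thing to rule out with "No suitable driver" is the Microsoft JDBC driver jar not being on the driver/executor classpath. A minimal sketch, assuming the standard Microsoft driver (jar path and name are illustrative; whether the source fully supports the SQL Server dialect is what this thread goes on to discuss):

    // start the shell/app with the driver jar visible, e.g.
    //   spark-shell --jars /path/to/sqljdbc4.jar --driver-class-path /path/to/sqljdbc4.jar
    Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")   // register the driver explicitly
    val customers = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:sqlserver://10.1.0.12:1433;databaseName=dbname",
      "dbtable" -> "Customer"))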

graphx running time

2015-04-06 Thread daze5112
Hi, I'm currently using GraphX for some analysis and have come to a bit of a hurdle. If I use my test dataset of 20 nodes and about 30 links it runs really quickly. I have two other data sets I use, one of 10 million links and one of 20 million. When I create my graphs it seems to work okay and I can get

Re: Spark 1.2.0 with Play/Activator

2015-04-06 Thread andy petrella
Hello Manish, you can take a look at the spark-notebook build, it's a bit tricky to get rid of some clashes but at least you can refer to this build to have ideas. LSS, I have stripped out akka from play deps. ref: https://github.com/andypetrella/spark-notebook/blob/master/build.sbt https://githu
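For readers unfamiliar with the mechanism, excluding a transitive akka module in sbt looks roughly like the sketch below; which artifacts actually need excluding, and from which dependency, is exactly what the referenced spark-notebook build works out, so treat this as illustrative only:

    libraryDependencies += ("com.typesafe.play" %% "play" % "2.2.6")
      .exclude("com.typesafe.akka", "akka-actor_2.10")
      .exclude("com.typesafe.akka", "akka-slf4j_2.10")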

Re: A problem with Spark 1.3 artifacts

2015-04-06 Thread Josh Rosen
My hunch is that this behavior was introduced by a patch to start shading Jetty in Spark 1.3: https://issues.apache.org/jira/browse/SPARK-3996. Note that Spark's *MetricsSystem* class is marked as *private[spark]* and thus isn't intended to be interacted with directly by users. It's not super lik

Spark 1.2.0 with Play/Activator

2015-04-06 Thread Manish Gupta 8
Hi, We are trying to build a Play framework based web application integrated with Apache Spark. We are running Apache Spark 1.2.0 on CDH 5.3.0, but struggling with akka version conflicts (errors like java.lang.NoSuchMethodError in akka). We have tried Play 2.2.6 as well as Activator 1.3.2. If any

Seeing message about receiver not being de-registered on invoking Streaming context stop

2015-04-06 Thread Hari Polisetty
 My application is running Spark in local mode and  I have a Spark Streaming Listener as well as a Custom Receiver. When the receiver is done fetching all documents, it invokes “stop” on itself. I see the StreamingListener  getting a callback on “onReceiverStopped” where I stop the streaming co

Re: Super slow caching in 1.3?

2015-04-06 Thread Michael Armbrust
Do you think you are seeing a regression from 1.2? Also, are you caching nested data or flat rows? The in-memory caching is not really designed for nested data and so performs pretty slowly here (it's just falling back to Kryo and even then there are some locking issues). If so, would it be possi

Processing Large Images in Spark?

2015-04-06 Thread patrick.mckendree.young
Hi all, I'm new to Spark and wondering if it's appropriate to use for some image processing tasks on pretty sizable (~1 GB) images. Here is an example use case. Amazon recently put the entire Landsat8 archive in S3: http://aws.amazon.com/public-data-sets/landsat/ I have a bunch of GDAL based (

RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-06 Thread Mohammed Guller
Sure, will do. I may not be able to get to it until next week, but will let you know if I am able to crack the code. Mohammed From: Todd Nist [mailto:tsind...@gmail.com] Sent: Friday, April 3, 2015 5:52 PM To: Mohammed Guller Cc: pawan kumar; user@spark.apache.org Subject: Re: Tableau + Spar

Super slow caching in 1.3?

2015-04-06 Thread Christian Perez
Hi all, Has anyone else noticed very slow time to cache a Parquet file? It takes 14 s per 235 MB (1 block) uncompressed node local Parquet file on M2 EC2 instances. Or are my expectations way off... Cheers, Christian -- Christian Perez Silicon Valley Data Science Data Analyst christ...@svds.co

Re: task not serialize

2015-04-06 Thread Jeetendra Gangele
On 7 April 2015 at 04:03, Dean Wampler wrote: > > On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele > wrote: > >> Thanks a lot.That means Spark does not support the nested RDD? >> if I pass the javaSparkContext that also wont work. I mean passing >> SparkContext not possible since its not serial

Re: task not serialize

2015-04-06 Thread Dean Wampler
On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele wrote: > Thanks a lot.That means Spark does not support the nested RDD? > if I pass the javaSparkContext that also wont work. I mean passing > SparkContext not possible since its not serializable > > That's right. RDDs don't nest and SparkContexts

Re: task not serialize

2015-04-06 Thread Jeetendra Gangele
Thanks a lot. That means Spark does not support nested RDDs? If I pass the javaSparkContext that also won't work. I mean passing the SparkContext is not possible since it's not serializable. I have a requirement where I will get a JavaRDD matchRdd and I need to return the potential matches for this record
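Since RDDs cannot be nested and a SparkContext cannot be shipped inside a closure, the usual alternatives are to broadcast a small reference dataset or to turn the matching into a join. A sketch in Scala (isMatch, matchKey and refKey are hypothetical helpers):

    // Option 1: broadcast a small reference dataset and scan it per record
    val refBc = sc.broadcast(referenceRdd.collect())   // must be small enough for the driver/executors
    val candidates = matchRdd.map(r => (r, refBc.value.filter(ref => isMatch(r, ref))))

    // Option 2: key both RDDs and join them when the reference data is large
    val joined = matchRdd.keyBy(matchKey).join(referenceRdd.keyBy(refKey))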

Re: Spark SQL code generation

2015-04-06 Thread Michael Armbrust
The compilation happens in parallel on all of the machines, so it's not really clear that there is a win to generating it on the driver and shipping it from a latency perspective. However, really I just took the easiest path that didn't require more bytecode extracting / shipping machinery. On Mon

Re: Spark SQL code generation

2015-04-06 Thread Akshat Aranya
Thanks for the info, Michael. Is there a reason to do so, as opposed to shipping out the bytecode and loading it via the classloader? Is it more complex? I can imagine caching to be effective for repeated queries, but when the subsequent queries are different. On Mon, Apr 6, 2015 at 2:41 PM, Mi

Re: task not serialize

2015-04-06 Thread Dean Wampler
The "log" instance won't be serializable, because it will have a file handle to write to. Try defining another static method outside matchAndMerge that encapsulates the call to log.error. CompanyMatcherHelper might not be serializable either, but you didn't provide it. If it holds a database connec

Re: Spark Druid integration

2015-04-06 Thread Michael Armbrust
You could certainly build a connector, but it seems like you would want support for pushing down aggregations to get the benefits of Druid. There are only experimental interfaces for doing so today, but it sounds like a pretty cool project. On Mon, Apr 6, 2015 at 2:23 PM, Paolo Platter wrote: >

Re: Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-06 Thread Michael Armbrust
> > Which caused some issues with the BI tool I'm using, Tableau. It would > show the table under tables, but when selected would throw an exception. > Removing the "TEMPORARY" appears to address that problem. > Hmm, can you provide the exception? We made some changes in 1.3 specifically to allow

Re: Spark SQL code generation

2015-04-06 Thread Michael Armbrust
It is generated and cached on each of the executors. On Mon, Apr 6, 2015 at 2:32 PM, Akshat Aranya wrote: > Hi, > > I'm curious as to how Spark does code generation for SQL queries. > > Following through the code, I saw that an expression is parsed and > compiled into a class using Scala reflect

Broadcast value return empty after turn to org.apache.spark.serializer.KryoSerializer

2015-04-06 Thread Shuai Zheng
Hi All, I have tested my code without problems on EMR YARN (Spark 1.3.0) with the default serializer (Java). But when I switch to org.apache.spark.serializer.KryoSerializer, the broadcast value doesn't give me the right result (it actually returns an empty custom class for the inner object). Basically I bro
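One thing worth checking when results look empty only under Kryo is that all custom classes (including nested ones) are registered and have no-arg constructors. A configuration sketch (class names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("kryo-broadcast")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")   // fail fast on unregistered classes
      .registerKryoClasses(Array(classOf[MyOuterClass], classOf[MyInnerClass]))
    val sc = new SparkContext(conf)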

Spark SQL code generation

2015-04-06 Thread Akshat Aranya
Hi, I'm curious as to how Spark does code generation for SQL queries. Following through the code, I saw that an expression is parsed and compiled into a class using Scala reflection toolbox. However, it's unclear to me whether the actual byte code is generated on the master or on each of the exe

Re: Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-06 Thread Todd Nist
Hi Michael, Yes that was a typo. Thanks for the pointer, it works now. I had tried that but had one thing off. I had CREATE TEMPORARY TABLE tableName USING parquet OPTIONS ( path '/path/to/file' ) Which caused some issues with the BI tool I'm using, Tableau. It would show the table under t
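For reference, dropping TEMPORARY registers the table persistently in the metastore so external tools can see it; a sketch against a HiveContext in 1.3 (path is illustrative):

    sqlContext.sql(
      """CREATE TABLE myTable
         USING parquet
         OPTIONS (path '/path/to/file')""")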

Spark Druid integration

2015-04-06 Thread Paolo Platter
Hi, Do you think it is possible to build an integration between Druid and Spark, using the Datasource API? Is someone investigating this kind of solution? I think that Spark SQL could fill the lack of a complete SQL layer in Druid. It could be a great OLAP solution. WDYT? Paolo Platter AgileLab
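A connector along these lines would start from the Data Source API that shipped in 1.3; a minimal skeleton, with all of the Druid-specific parts (the open question here) left as stubs:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.StructType

    class DefaultSource extends RelationProvider {
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation =
        new DruidRelation(parameters)(sqlContext)
    }

    class DruidRelation(params: Map[String, String])(@transient val sqlContext: SQLContext)
      extends BaseRelation with TableScan {
      def schema: StructType = ???     // would be derived from the Druid datasource
      def buildScan(): RDD[Row] = ???  // would issue Druid queries and map results to Rows
    }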

SparkSQL + Parquet performance

2015-04-06 Thread Paolo Platter
Hi all, has anyone using SparkSQL + Parquet benchmarked storing Parquet files on HDFS versus CFS (Cassandra File System)? Which storage can improve the performance of SparkSQL + Parquet? Thanks Paolo

Re: DataFrame groupBy MapType

2015-04-06 Thread Michael Armbrust
I'll add that I don't think there is a convenient way to do this in the Column API ATM, but would welcome a JIRA for adding it :) On Mon, Apr 6, 2015 at 1:45 PM, Michael Armbrust wrote: > In HiveQL, you should be able to express this as: > > SELECT ... FROM table GROUP BY m['SomeKey'] > > On Sat

Re: DataFrame groupBy MapType

2015-04-06 Thread Michael Armbrust
In HiveQL, you should be able to express this as: SELECT ... FROM table GROUP BY m['SomeKey'] On Sat, Apr 4, 2015 at 5:25 PM, Justin Yip wrote: > Hello, > > I have a case class like this: > > case class A( > m: Map[Long, Long], > ... > ) > > and constructed a DataFrame from Seq[A]. > > I wo
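A sketch of that from Scala, using a HiveContext so the map-access syntax is available (table and key names are illustrative):

    caseClassDF.registerTempTable("table")
    val grouped = hiveContext.sql(
      "SELECT m['SomeKey'] AS someKey, count(*) AS cnt FROM table GROUP BY m['SomeKey']")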

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Hari Polisetty
Thanks. I’ll look into it. But the JSON string I push via receiver goes through a series of transformations, before it ends up in the final RDD. I need to take care to ensure that this magic value propagates all the way down to the last one that I’m iterating on. Currently, I’m calling “stop" f

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Jay Katukuri
Here is the command that I have used: spark-submit --class packagename.ALSNew --num-executors 100 --master yarn ALSNew.jar -jar spark-sql_2.11-1.3.0.jar hdfs://input_path Btw - I could run the old ALS in mllib package. On Apr 6, 2015, at 12:32 PM, Xiangrui Meng wrote: > So ALSNew.scala

Re: Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-06 Thread Michael Armbrust
Hey Todd, In migrating to 1.3.x I see that the spark.sql.hive.convertMetastoreParquet > is no longer public, so the above no longer works. This was probably just a typo, but to be clear, spark.sql.hive.convertMetastoreParquet is still a supported option and should work. You are correct that th

Re: Strategy regarding maximum number of executor failures for long running jobs / spark streaming jobs

2015-04-06 Thread Sandy Ryza
What's the advantage of killing an application for lack of resources? I think the rationale behind killing an app based on executor failures is that, if we see a lot of them in a short span of time, it means there's probably something going wrong in the app or on the cluster. On Wed, Apr 1, 2015

task not serialize

2015-04-06 Thread Jeetendra Gangele
In the foreach in this code I am getting a task not serializable exception: @SuppressWarnings("serial") public static void matchAndMerge(JavaRDD matchRdd, final JavaSparkContext jsc) throws IOException{ log.info("Company matcher started"); //final JavaSparkContext jsc = getSparkContext(); matchRdd.fo

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Michael Malak
You could have your receiver send a "magic value" when it is done. I discuss this Spark Streaming pattern in my presentation "Spark Gotchas and Anti-Patterns". In the PDF version, it's slides 34-36. http://www.datascienceassn.org/content/2014-11-05-spark-gotchas-and-anti-patterns-julia-language

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Hari Polisetty
Yes, I’m using updateStateByKey and it works. But then I need to perform further computation on this Stateful RDD (see code snippet below). I perform forEach on the final RDD and get the top 10 records. I just don’t want the foreach to be performed every time a new batch is received. Only when t

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Tathagata Das
So you want to sort based on the total count of the all the records received through receiver? In that case, you have to combine all the counts using updateStateByKey ( https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.s
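A minimal sketch of keeping running counts with updateStateByKey, along the lines of the StatefulNetworkWordCount example (names are illustrative):

    // pairs: DStream[(String, Int)]; requires ssc.checkpoint(...) to be set
    val updateCounts = (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))
    val runningTotals = pairs.updateStateByKey[Int](updateCounts)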

Re: Spark + Kinesis

2015-04-06 Thread Tathagata Das
Cc'ing Chris Fregly, who wrote the Kinesis integration. Maybe he can help. On Mon, Apr 6, 2015 at 9:23 AM, Vadim Bichutskiy wrote: > Hi all, > > I am wondering, has anyone on this list been able to successfully > implement Spark on top of Kinesis? > > Best, > Vadim > > On Sun, Apr 5, 2015 at

Re: WordCount example

2015-04-06 Thread Tathagata Das
There are no workers registered with the Spark Standalone master! That is the crux of the problem. :) Follow the instructions properly - https://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts Especially make sure the conf/slaves file has the intended workers listed. TD On Mon, A

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Xiangrui Meng
So ALSNew.scala is your own application, did you add it with spark-submit or spark-shell? The correct command should look like spark-submit --class your.package.name.ALSNew ALSNew.jar [options] Please check the documentation: http://spark.apache.org/docs/latest/submitting-applications.html -Xiangrui

How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Hari Polisetty
I have created a Custom Receiver to fetch records pertaining to a specific query from Elastic Search and have implemented Streaming RDD transformations to process the data generated by the receiver. The final RDD is a sorted list of name value pairs and I want to read the top 20 results progra

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Jay Katukuri
Hi, Here is the stack trace: Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror; at ALSNew$.main(ALSNew.scala:35) at ALSNew.main(ALSNew.scala) at sun.refl

Re: How to work with sparse data in Python?

2015-04-06 Thread Xiangrui Meng
We support sparse vectors in MLlib, which recognizes MLlib's sparse vector and SciPy's csc_matrix with a single column. You can create an RDD of sparse vectors for your data and save/load them to/from Parquet format using DataFrames. Sparse matrix support will be added in 1.4. -Xiangrui On Mon, Apr
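The thread is about Python, but for reference the MLlib sparse-vector API has the same shape in Scala; a minimal sketch:

    import org.apache.spark.mllib.linalg.Vectors
    // a vector of length 10,000 with non-zeros only at indices 3 and 42
    val v = Vectors.sparse(10000, Seq((3, 1.0), (42, 2.0)))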

Re: java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.regression.LabeledPoint

2015-04-06 Thread Xiangrui Meng
Did you try to treat RDD[(Double, Vector)] as RDD[LabeledPoint]? If that is the case, you need to cast them explicitly: rdd.map { case (label, features) => LabeledPoint(label, features) } -Xiangrui On Mon, Apr 6, 2015 at 11:59 AM, Joanne Contact wrote: > Hello Sparkers, > > I kept getting this
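Spelled out with imports, the suggested mapping is a sketch like:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint

    // rdd: RDD[(Double, Vector)] -- a Tuple2 is not a LabeledPoint, so convert explicitly
    val labeled = rdd.map { case (label, features) => LabeledPoint(label, features) }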

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Xiangrui Meng
Please attach the full stack trace. -Xiangrui On Mon, Apr 6, 2015 at 12:06 PM, Jay Katukuri wrote: > > Hi all, > > I got a runtime error while running the ALS. > > Exception in thread "main" java.lang.NoSuchMethodError: > scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala

Re: DataFrame -- help with encoding factor variables

2015-04-06 Thread Xiangrui Meng
Before OneHotEncoder or LabelIndexer is merged, you can define a UDF to do the mapping. val labelToIndex = udf { ... } featureDF.withColumn("f3_dummy", labelToIndex(col("f3"))) See instructions here http://spark.apache.org/docs/latest/sql-programming-guide.html#udf-registration-moved-to-sqlconte
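Filled in, the suggestion looks roughly like the following (the gender mapping is illustrative):

    import org.apache.spark.sql.functions.{col, udf}

    val labelToIndex = udf((gender: String) => if (gender == "male") 0.0 else 1.0)
    val encoded = featureDF.withColumn("f3_dummy", labelToIndex(col("f3")))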

org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Jay Katukuri
Hi all, I got a runtime error while running the ALS. Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror; The error that I am getting is at the following code: val ratings = pu

java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.regression.LabeledPoint

2015-04-06 Thread Joanne Contact
Hello Sparkers, I kept getting this error: java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.regression.LabeledPoint I have tried the following to convert v._1 to double: Method 1: (if(v._1>0) 1d else 0d) Method 2: def bool2Double(b:Boolean): Double = { if

Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-06 Thread Todd Nist
In 1.2.1 I was persisting a set of parquet files as a table for use by the spark-sql CLI later on. There was a post here by Michael Armbrust that provided a nice little helper method for dealing

Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-04-06 Thread Jeetendra Gangele
I hit the same issue again. This time I tried to return the object and it failed with task not serializable. Below is the code; here the vendor record is serializable. private static JavaRDD getVendorDataToProcess(JavaSparkContext sc) throws IOException { return sc .newAPIHadoopRDD(getVendorDataRowKeyScannerCo

Re: Spark unit test fails

2015-04-06 Thread Manas Kar
Trying to bump up the rank of the question. Can someone point to an example on GitHub? ..Manas On Fri, Apr 3, 2015 at 9:39 AM, manasdebashiskar wrote: > Hi experts, > I am trying to write unit tests for my spark application which fails with > javax.servlet.FilterRegistration error. > > I am u

Re: Spark Average

2015-04-06 Thread baris akgun
Thanks for your replies. I solved the problem with this code: val weathersRDD = sc.textFile(csvfilePath).map { line => val Array(dayOfdate, minDeg, maxDeg, meanDeg) = line.replaceAll("\"","").trim.split(",") Tuple2(dayOfdate.substring(0,7), (minDeg.toInt, maxDeg.toInt, meanDeg.toInt)) }.ma

RE: Spark Average

2015-04-06 Thread Cheng, Hao
The DataFrame API should be perfectly helpful in this case. https://spark.apache.org/docs/1.3.0/sql-programming-guide.html A code snippet would look like: val sqlContext = new org.apache.spark.sql.SQLContext(sc) // this is used to implicitly convert an RDD to a DataFrame. import sqlContext.implici
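A sketch of the DataFrame route, assuming the weatherCond case class from this thread and a dayOfdate column already truncated to the month:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.avg

    val weatherDF = weathersRDD.toDF()   // weathersRDD: RDD[weatherCond]
    val monthly = weatherDF.groupBy("dayOfdate")
      .agg(avg("minDeg"), avg("maxDeg"), avg("meanDeg"))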

Re: From DataFrame to LabeledPoint

2015-04-06 Thread Joseph Bradley
I'd make sure you're selecting the correct columns. If not that, then your input data might be corrupt. CCing user to keep it on the user list. On Mon, Apr 6, 2015 at 6:53 AM, Sergio Jiménez Barrio wrote: > Hi!, > > I had tried your solution, and I saw that the first row is null. This is > imp

Re: Using DIMSUM with ids

2015-04-06 Thread Reza Zadeh
Right now dimsum is meant to be used for tall and skinny matrices, and so columnSimilarities() returns similar columns, not rows. We are working on adding an efficient row similarity as well, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823 Reza On Mon, Apr 6, 2015 at 6:08 AM
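A minimal sketch of the intended usage on a tall-and-skinny matrix (the threshold values are illustrative):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat  = new RowMatrix(rows)            // rows: RDD[Vector], one row per record
    val sims = mat.columnSimilarities(0.1)    // DIMSUM; returns a CoordinateMatrix of column pairs
    val similarPairs = sims.entries.filter(_.value >= 0.5)   // MatrixEntry(i, j, cosine)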

Re: WordCount example

2015-04-06 Thread Mohit Anchlia
Interesting, I see 0 cores in the UI? - *Cores:* 0 Total, 0 Used On Fri, Apr 3, 2015 at 2:55 PM, Tathagata Das wrote: > What does the Spark Standalone UI at port 8080 say about number of cores? > > On Fri, Apr 3, 2015 at 2:53 PM, Mohit Anchlia > wrote: > >> [ec2-user@ip-10-241-251-232 s_l

Re: Spark Average

2015-04-06 Thread Yana Kadiyska
If you're going to do it this way, I would output dayOfdate.substring(0,7), i.e. the month part, and instead of weatherCond, you can use (month,(minDeg,maxDeg,meanDeg)) --i.e. PairRDD. So weathersRDD: RDD[(String,(Double,Double,Double))]. Then use a reduceByKey as shown in multiple Spark examples..Y
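A sketch of that approach, carrying a count through the reduce so per-month means can be taken at the end:

    // weathersRDD: RDD[(String, (Double, Double, Double))], keyed by month
    val monthlyMeans = weathersRDD
      .mapValues { case (mn, mx, mean) => (mn, mx, mean, 1) }
      .reduceByKey { case ((a1, b1, c1, n1), (a2, b2, c2, n2)) =>
        (a1 + a2, b1 + b2, c1 + c2, n1 + n2) }
      .mapValues { case (a, b, c, n) => (a / n, b / n, c / n) }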

Re: Spark + Kinesis

2015-04-06 Thread Vadim Bichutskiy
Hi all, I am wondering, has anyone on this list been able to successfully implement Spark on top of Kinesis? Best, Vadim On Sun, Apr 5, 2015 at 1:50 PM, Vadim Bichutskiy wrote: > Hi all, > > Below is the output that I am getting. My Kinesis stream has 1 shard, and > my Spark cluster on E

Re: Spark Streaming 1.3 & Kafka Direct Streams

2015-04-06 Thread Neelesh
Somewhat agree on subclassing and its issues. It looks like the alternative in Spark 1.3.0 is to create a custom build. Is there an enhancement filed for this? If not, I'll file one. Thanks! -neelesh On Wed, Apr 1, 2015 at 12:46 PM, Tathagata Das wrote: > The challenge of opening up these internal

Re: What happened to the Row class in 1.3.0?

2015-04-06 Thread Ted Yu
Thanks Nan. I was searching for RowFactory.scala Cheers On Mon, Apr 6, 2015 at 7:52 AM, Nan Zhu wrote: > Hi, Ted > > It’s here: > https://github.com/apache/spark/blob/61b427d4b1c4934bd70ed4da844b64f0e9a377aa/sql/catalyst/src/main/java/org/apache/spark/sql/RowFactory.java > > Best, > > -- > Na

How to work with sparse data in Python?

2015-04-06 Thread SecondDatke
I'm trying to apply Spark to an NLP problem that I'm working on. I have nearly 4 million tweets and I have converted them into word vectors. They're pretty sparse because each message just has dozens of words but the vocabulary has tens of thousands of words. These vectors should be loaded each t

Spark Average

2015-04-06 Thread barisak
Hi, I have the case class described below. case class weatherCond(dayOfdate: String, minDeg: Int, maxDeg: Int, meanDeg: Int) I am reading the data from a csv file and I put this data into the weatherCond class with this code: val weathersRDD = sc.textFile("weather.csv").map { line => val Array(d

Re: What happened to the Row class in 1.3.0?

2015-04-06 Thread Nan Zhu
Hi, Ted It’s here: https://github.com/apache/spark/blob/61b427d4b1c4934bd70ed4da844b64f0e9a377aa/sql/catalyst/src/main/java/org/apache/spark/sql/RowFactory.java Best, -- Nan Zhu http://codingcat.me On Monday, April 6, 2015 at 10:44 AM, Ted Yu wrote: > I searched code base but didn't fi

Spark 1.3.0: Running Pi example on YARN fails

2015-04-06 Thread Zork
I have `Hadoop 2.6.0.2.2.0.0-2041` with `Hive 0.14.0.2.2.0.0-2041 ` After building Spark with command: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests package I try to run Pi example on YARN with the following command: export HADOOP_CONF_DIR=/etc/had

Re: What happened to the Row class in 1.3.0?

2015-04-06 Thread Ted Yu
I searched the code base but didn't find the RowFactory class. Pardon me. On Mon, Apr 6, 2015 at 7:39 AM, Ted Yu wrote: > From scaladoc > of sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala : > > * To create a new Row, use [[RowFactory.create()]] in Java or > [[Row.apply()]] in Scala. > * >

Re: What happened to the Row class in 1.3.0?

2015-04-06 Thread Nan Zhu
The Row class was mistakenly left undocumented in 1.3.0; you can check the 1.3.1 API doc: http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/api/scala/index.html#org.apache.spark.sql.Row Best, -- Nan Zhu http://codingcat.me On Monday, April 6, 2015 at 10:23 AM, ARose wrote: > I am trying to ca

Re: What happened to the Row class in 1.3.0?

2015-04-06 Thread Ted Yu
From the scaladoc of sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala: * To create a new Row, use [[RowFactory.create()]] in Java or [[Row.apply()]] in Scala. * Cheers On Mon, Apr 6, 2015 at 7:23 AM, ARose wrote: > I am trying to call Row.create(object[]) similarly to what's shown in
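For reference, a Row can be constructed directly in Scala (RowFactory.create is the Java-side equivalent); a minimal sketch:

    import org.apache.spark.sql.Row

    val row = Row(1, "foo", 3.14)   // Row.apply(values: Any*)
    val id  = row.getInt(0)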

DataFrame -- help with encoding factor variables

2015-04-06 Thread Yana Kadiyska
Hi folks, I currently have a DF that has a factor variable -- say gender. I am hoping to use the RandomForest algorithm on this data and it appears that this needs to be converted to RDD[LabeledPoint] first -- i.e. all features need to be double-encoded. I see https://issues.apache.org/jira/browse/S

What happened to the Row class in 1.3.0?

2015-04-06 Thread ARose
I am trying to call Row.create(object[]) similarly to what's shown in this programming guide, but the create() method is no longer recognized. I tried to look up the documentation for the Ro

Re: Learning Spark

2015-04-06 Thread Ted Yu
bq. I need to know on what all databases You can access HBase using Spark. Cheers On Mon, Apr 6, 2015 at 5:59 AM, Akhil Das wrote: > We had few sessions at Sigmoid, you could go through the meetup page for > details: > > http://www.meetup.com/Real-Time-Data-Processing-and-Cloud-Computing/ > On

Re: (send this email to subscribe)

2015-04-06 Thread Ted Yu
Please send email to user-subscr...@spark.apache.org On Mon, Apr 6, 2015 at 6:52 AM, 林晨 wrote: > >

(send this email to subscribe)

2015-04-06 Thread 林晨

Using DIMSUM with ids

2015-04-06 Thread James
The example below illustrates how to use the DIMSUM algorithm to calculate the similarity between each pair of rows and output row pairs with cosine similarity that is not less than a threshold. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSi

Re: Learning Spark

2015-04-06 Thread Akhil Das
We had a few sessions at Sigmoid; you could go through the meetup page for details: http://www.meetup.com/Real-Time-Data-Processing-and-Cloud-Computing/ On 6 Apr 2015 18:01, "Abhideep Chakravarty" < abhideep.chakrava...@mindtree.com> wrote: > Hi all, > > > > We are here planning to setup a Spark

RDD generated on every query

2015-04-06 Thread Siddharth Ubale
Hi, In my Spark web application the RDD is generated every time a client sends a query request. Is there any way to build the RDD once and then run queries against it again and again on an active SparkContext? Thanks, Siddharth Ubale, Synchronized Communications #43, Velankani Tech Park, Block No. II, 3r
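If every query runs against the same base data, the usual pattern is to build and cache the RDD once when the web application starts, then reuse it for each request on the long-lived SparkContext. A sketch (parse, matches and Query are hypothetical):

    // built once at application startup
    val baseRdd = sc.textFile("hdfs:///data/events").map(parse).cache()

    // each client request reuses the already-materialized RDD
    def handleQuery(q: Query) = baseRdd.filter(r => matches(q, r)).take(100)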

Learning Spark

2015-04-06 Thread Abhideep Chakravarty
Hi all, We are planning to set up a Spark learning session series. I need your input to create a TOC for this program, i.e. what to cover if we start from the basics, and how far we should go to cover all the aspects of Spark in detail. Also, I need to know on what all dat

Re: Cannot build "learning spark" project

2015-04-06 Thread Sean Owen
(This mailing list concerns Spark itself rather than the book about Spark. Your question is about building code that isn't part of Spark, so, the right place to ask is https://github.com/databricks/learning-spark You have a typo in "pachage" but I assume that's just your typo in this email.) On M

Cannot build "learning spark" project

2015-04-06 Thread Adamantios Corais
Hi, I am trying to build this project https://github.com/databricks/learning-spark with mvn package. This should work out of the box but unfortunately it doesn't. In fact, I get the following error: mvn pachage -X > Apache Maven 3.0.5 > Maven home: /usr/share/maven > Java version: 1.7.0_76, vendor

Re: Write to Parquet File in Python

2015-04-06 Thread Akriti23
Thank you so much for your reply. We would like to provide a tool that lets the user convert a binary file to a file in Avro/Parquet format on their own computer. The tool will parse the binary file in Python and convert the data to Parquet. (BTW, can we append to a Parquet file?) The issue is that we do not

Re: Low resource when upgrading from 1.1.0 to 1.3.0

2015-04-06 Thread Roy.Wang
I also hit the same problem. I deploy and run Spark (version 1.3.0) in local mode. When I run a simple app that counts the lines of a file, the console prints "TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient

Re: Sending RDD object over the network

2015-04-06 Thread Raghav Shankar
Hey Akhil, Thanks for your response! No, I am not expecting to receive the values themselves. I am just trying to receive the RDD object on my second Spark application. However, I get a NPE when I try to use the object within my second program. Would you know how I can properly send the RDD objec

Re: Sending RDD object over the network

2015-04-06 Thread Akhil Das
Are you expecting to receive the 1 to 100 values in your second program? An RDD is just an abstraction; you would need to do something like: num.foreach(x => send(x)) Thanks Best Regards On Mon, Apr 6, 2015 at 1:56 AM, raggy wrote: > For a class project, I am trying to utilize 2 spark Applications > communic