Re: org/I0Itec/zkclient/serialize/ZkSerializer ClassNotFound

2014-10-21 Thread Akhil Das
You can add this jar to the classpath to get rid of this. If you are hitting further exceptions like ClassNotFound for metrics* etc., then make sure you have all these jars in the classpath: SPARK_CLASSPATH=SPARK_CLASSPATH:

Re: Does start-slave.sh use the values in conf/slaves to launch a worker in Spark standalone cluster mode

2014-10-21 Thread Akhil Das
What about start-all.sh or start-slaves.sh? Thanks Best Regards On Tue, Oct 21, 2014 at 10:25 AM, Soumya Simanta wrote: > I'm working a cluster where I need to start the workers separately and > connect them to a master. > > I'm following the instructions here and using branch-1.1 > > http://sp

Re: default parallelism bug?

2014-10-21 Thread Olivier Girardot
Hi, what do you mean by pretty small? How big is your file? Regards, Olivier. 2014-10-21 6:01 GMT+02:00 Kevin Jung : > I use Spark 1.1.0 and set these options to spark-defaults.conf > spark.scheduler.mode FAIR > spark.cores.max 48 > spark.default.parallelism 72 > > Thanks, > Kevin > > > > --

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread Olivier Girardot
Could you please provide some of your code and the sample JSON files you use? Regards, Olivier. 2014-10-21 5:45 GMT+02:00 tridib : > Hello Experts, > I have two tables built using jsonFile(). I can successfully run join queries > on these tables. But once I cacheTable(), all join queries fail? >

Re: Convert Iterable to RDD

2014-10-21 Thread Olivier Girardot
I don't think this is provided out of the box, but you can use toSeq on your Iterable, and if the Iterable is lazy, it should stay that way for the Seq. Then you can use sc.parallelize(myIterable.toSeq) to get your RDD. For an Iterable[Iterable[T]] you can flatten it and then create y
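
A minimal sketch of this approach in Scala (names are illustrative):

    // Iterable -> Seq -> RDD
    val it: Iterable[Int] = 1 to 5
    val rdd = sc.parallelize(it.toSeq)

    // Iterable[Iterable[T]]: flatten first, then parallelize
    val nested: Iterable[Iterable[Int]] = Seq(Seq(1, 2), Seq(3, 4))
    val flat = sc.parallelize(nested.flatten.toSeq)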

Re: RDD to Multiple Tables SparkSQL

2014-10-21 Thread Olivier Girardot
If you already know your keys, the best way would be to "extract" one RDD per key (it would not bring the content back to the master and you can take advantage of the caching features) and then execute a registerTempTable per key, as sketched below. But I'm guessing you don't know the keys in advance, and in this cas
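
A sketch of the known-keys case, assuming a Spark 1.1 SQLContext in scope and an invented Record case class:

    import org.apache.spark.SparkContext._
    import sqlContext.createSchemaRDD   // implicit RDD[Record] -> SchemaRDD

    case class Record(name: String, value: Int)
    val byKey = sc.parallelize(Seq("a" -> Record("x", 1), "b" -> Record("y", 2)))
    Seq("a", "b").foreach { k =>
      val perKey = byKey.filter(_._1 == k).values.cache()  // stays distributed
      perKey.registerTempTable(s"table_$k")                // one table per key
    }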

Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-21 Thread lokeshkumar
Hi All, I am trying to run the Spark example JavaDecisionTree code using an external data set. It works for certain datasets only, with specific maxBins and maxDepth settings. Even for a working dataset, if I add a new data item I get an ArrayIndexOutOfBoundsException; I get the same exception fo

Re: What does KryoException: java.lang.NegativeArraySizeException mean?

2014-10-21 Thread Fengyun RAO
Thanks, Guillaume. Below is when the exception happens; nothing has spilled to disk yet. And there isn't a join, but a partitionBy and groupBy action. Actually, if numPartitions is small it succeeds, while if it's large it fails. Partitioning was simply done by override def getPartition(key: A
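
The getPartition above is cut off; presumably it is a plain modulo on the key, along the lines of this sketch:

    import org.apache.spark.Partitioner

    // hash-modulo partitioner; the negative-hash guard keeps the result in range
    class ModPartitioner(parts: Int) extends Partitioner {
      def numPartitions: Int = parts
      def getPartition(key: Any): Int = {
        val h = key.hashCode % parts
        if (h < 0) h + parts else h
      }
    }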

[SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Pierre B
Hi! The RANK function is available in hive since version 0.11. When trying to use it in SparkSQL, I'm getting the following exception (full stacktrace below): java.lang.ClassCastException: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be cast to org.apache.hadoop.hive.ql.

Getting Spark SQL talking to Sql Server

2014-10-21 Thread Ashic Mahtab
Hi, Is there a simple way to run Spark SQL queries against SQL Server databases? Or are we limited to running SQL and doing sc.parallelize()? Being able to query small amounts of lookup info directly from Spark can save a bunch of annoying ETL, and I'd expect Spark SQL to have some way of doing

Custom s3 endpoint

2014-10-21 Thread bobrik
I have an s3-compatible service and I'd like to have access to it in Spark. From what I have gathered, I need to add "s3service.s3-endpoint=" to the file jets3t.properties in the classpath. I'm not a Java programmer and I'm not sure where to put it in a hello-world example. I managed to make it work with "loca

Re: Getting Spark SQL talking to Sql Server

2014-10-21 Thread Cheng Lian
Instead of using Spark SQL, you can use JdbcRDD to extract data from SQL Server. Currently Spark SQL can't run queries against SQL Server. The foreign data source API planned in Spark 1.2 can make this possible. On 10/21/14 6:26 PM, Ashic Mahtab wrote: Hi, Is there a simple way to run spark sq
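
A hedged sketch of the JdbcRDD route (the connection string, query, and bounds are placeholders; the SQL Server JDBC driver must be on the classpath):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val lookup = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(
        "jdbc:sqlserver://myhost;databaseName=mydb;user=me;password=secret"),
      "SELECT id, name FROM lookup WHERE id >= ? AND id <= ?",  // both ?s are required
      1L, 1000L, 4,  // lower bound, upper bound, number of partitions
      (r: ResultSet) => (r.getInt(1), r.getString(2)))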

create a Row Matrix

2014-10-21 Thread viola
Hi, I am VERY new to Spark and MLlib and ran into a couple of problems while trying to reproduce some examples. I am aware that this is a very simple question, but could somebody please give me an example of how to create a RowMatrix in Scala with the following entries: [1 2 3 4]? I would like to

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
val sqlContext = new org.apache.spark.sql.SQLContext(sc) val personPath = "/hdd/spark/person.json" val person = sqlContext.jsonFile(personPath) person.printSchema() person.registerTempTable("person") val addressPath = "/hdd/spark/address.json" val address = sqlContext.jsonFile(addressPath) address.

Re: why fetch failed

2014-10-21 Thread marylucy
Thank you, it works! The Akka timeout may be a bottleneck in my system. > On Oct 20, 2014, at 17:07, "Akhil Das" wrote: > > I used to hit this issue when my data size was too large and the number of > partitions was too large ( > 1200 ), I got rid of it by > > - Reducing the number of partitions > - Setting

Re: why fetch failed

2014-10-21 Thread marylucy
Thanks, I need to check whether Spark 1.1.0 contains it. > On Oct 21, 2014, at 0:01, "DB Tsai" wrote: > > I ran into the same issue when the dataset is very big. > > Marcelo from Cloudera found that it may be caused by SPARK-2711, so their > Spark 1.1 release reverted SPARK-2711, and the issue is gone. See > htt

RE: Getting Spark SQL talking to Sql Server

2014-10-21 Thread Ashic Mahtab
Thanks. Didn't know about JdbcRDD... should do nicely for now. The foreign data source API looks interesting... Date: Tue, 21 Oct 2014 20:33:03 +0800 From: lian.cs@gmail.com To: as...@live.com; user@spark.apache.org Subject: Re: Getting Spark SQL talking to Sql Server Inst

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-21 Thread Arian Pasquali
That's true, Guillaume. I'm currently aggregating documents using a week as the time range. I will have to make it daily and aggregate the results later. Thanks for your hints anyway. Arian Pasquali http://about.me/arianpasquali 2014-10-20 13:53 GMT+01:00 Guillaume Pitel : > Hi, > > The arr

Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
Hi Spark! I found out why my RDDs weren't coming through in my Spark stream. It turns out that onStart() needs to return, it seems; that is, you need to launch the worker part of your start process in a thread. For example def onStartMock():Unit ={ val future = new Thread(new
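
A sketch of that pattern against the Spark 1.1 receiver API (the string payload is a stand-in):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class MockReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      def onStart(): Unit = {
        // do the real work on a separate thread so onStart() returns promptly
        new Thread("mock-receiver") {
          override def run(): Unit = {
            while (!isStopped) { store("tweet"); Thread.sleep(100) }
          }
        }.start()
      }
      def onStop(): Unit = {}  // the worker thread exits once isStopped is true
    }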

Re: How do you write a JavaRDD into a single file

2014-10-21 Thread Steve Lewis
Collect will store the entire output in a List in memory. This solution is acceptable for "Little Data" problems although if the entire problem fits in the memory of a single machine there is less motivation to use Spark. Most problems which benefit from Spark are large enough that even the data a

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread Rishi Yadav
Hi Tridib, I changed SQLContext to HiveContext and it started working. These are steps I used. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val person = sqlContext.jsonFile("json/person.json") person.printSchema() person.registerTempTable("person") val address = sqlContext.jsonF
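
The preview is cut off; a hedged reconstruction of the steps (the join's column names are invented for illustration):

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    val person = hc.jsonFile("json/person.json")
    person.registerTempTable("person")
    val address = hc.jsonFile("json/address.json")
    address.registerTempTable("address")
    hc.cacheTable("person")
    hc.cacheTable("address")
    val joined = hc.sql(
      "SELECT p.name, a.city FROM person p JOIN address a ON p.id = a.id")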

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
Hmm... I thought HiveContext will only work if Hive is present. I am curious to know when to use HiveContext and when to use SQLContext. Thanks & Regards Tridib -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTabl

Spark Cassandra connector issue

2014-10-21 Thread Ankur Srivastava
Hi, I am creating a Cassandra Java RDD and transforming it using the where clause. It works fine when I run it outside the mapValues, but when I put the code in mapValues I get an error while creating the transformation. Below is my sample code: CassandraJavaRDD cassandraRefTable = javaFuncti

Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Michael Armbrust
No, analytic and window functions do not work yet. On Tue, Oct 21, 2014 at 3:00 AM, Pierre B < pierre.borckm...@realimpactanalytics.com> wrote: > Hi! > > The RANK function is available in hive since version 0.11. > When trying to use it in SparkSQL, I'm getting the following exception > (full > s

disk-backing pyspark rdds?

2014-10-21 Thread Eric Jonas
Hi All! I'm getting my feet wet with PySpark for the fairly boring case of doing parameter sweeps for Monte Carlo runs. Each of my functions runs for a very long time (2h+) and returns NumPy arrays on the order of ~100 MB. That is, my Spark applications look like def foo(x): np.random.seed(

stage failure: Task 0 in stage 0.0 failed 4 times

2014-10-21 Thread freedafeng
What could cause this type of 'stage failure'? Thanks! This is a simple PySpark script to list data in HBase. Command line: ./spark-submit --driver-class-path ~/spark-examples-1.1.0-hadoop2.3.0.jar /root/workspace/test/sparkhbase.py 14/10/21 17:53:50 INFO BlockManagerInfo: Added broadcast_2_pi

Re: stage failure: Task 0 in stage 0.0 failed 4 times

2014-10-21 Thread freedafeng
Maybe set up an hbase.jar in the conf? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/stage-failure-Task-0-in-stage-0-0-failed-4-times-tp16928p16929.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-21 Thread Terry Siu
Just to follow up, the queries worked against master and I got my whole flow rolling. Thanks for the suggestion! Now if only Spark 1.2 will come out with the next release of CDH5 :P -Terry From: Terry Siu mailto:terry@smartfocus.com>> Date: Monday, October 20, 2014 at 12:22 PM To: Michael

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread Michael Armbrust
> Hmm... I thought HiveContext will only worki if Hive is present. I am > curious > to know when to use HiveContext and when to use SqlContext. > http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started TLDR; Always use HiveContext if your application does not have a depende

How to set hadoop native library path in spark-1.1

2014-10-21 Thread Pradeep Ch
Hi all, Can anyone tell me how to set the native library path in Spark? Right now I am setting it using the "SPARK_LIBRARY_PATH" environment variable in spark-env.sh, but still no success. I am still seeing this in spark-shell: NativeCodeLoader: Unable to load native-hadoop library for your platf

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
Thanks for pointing that out. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16933.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
Oh, and one other note on this, which appears to be the case: if, in your stream's foreachRDD implementation, you do something stupid (like calling rdd.count()) tweetStream.foreachRDD((rdd,lent)=> { tweetStream.repartition(1) numTweetsCollected+=1; //val count = rdd.count() DON

Class not found

2014-10-21 Thread Pat Ferrel
Not sure if this has been clearly explained here but since I took a day to track it down… Several people have experienced a class not found error on Spark when the class referenced is supposed to be in the Spark jars. One thing that can cause this is if you are building Spark for your cluster

How to calculate percentiles with Spark?

2014-10-21 Thread sparkuser
Hi, What would be the best way to get percentiles from a Spark RDD? I can see JavaDoubleRDD or MLlib's MultivariateStatisticalSummary provide the mean() but not percentiles. Thank you! Horace -- View this message in context: htt

Spark-Submit Python along with JAR

2014-10-21 Thread TJ Klein
Hi, I'd like to run my Python script using "spark-submit" together with a JAR file containing Java specifications for a Hadoop file system. How can I do that? It seems I can provide either a JAR file or a Python file to spark-submit. So far I have been running my code in ipython with IPYTHON_OPTS

spark sql: sqlContext.jsonFile date type detection and performance

2014-10-21 Thread tridib
Any help? or comments? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16939.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Primitive arrays in Spark

2014-10-21 Thread Akshat Aranya
This is as much a Scala question as a Spark question. I have an RDD: val rdd1: RDD[(Long, Array[Long])] This RDD has duplicate keys that I can collapse like so: val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a,b) => a++b) If I start with an Array of primitive longs in rdd1, will rdd2 als

MLLib libsvm format

2014-10-21 Thread Sameer Tilak
Hi All, I have a question regarding the ordering of indices. The document says that the indices are one-based and in ascending order. However, do the indices within a row need to be sorted in ascending order? Sparse data: It is very common in practice to have sparse training data. MLlib s

Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Pierre B
Ok, thanks Michael. In general, what's the easy way to figure out what's already implemented? The exception I was getting was not really helpful here. Also, is there a roadmap document somewhere? Thanks! P. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.co

Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-21 Thread freedafeng
Thanks for the help! Hadoop version: 2.3.0. HBase version: 0.98.1. Using Python to read/write data from/to HBase. The only change over the official Spark 1.1.0 is the pom file under examples. Compilation: spark: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package spark/examples:mv

Re: Class not found

2014-10-21 Thread Pat Ferrel
Doesn't this seem like a dangerous, error-prone hack? It will build different bits on different machines. It doesn't even work on my Linux box because the mvn install doesn't cache the same as on the Mac. If Spark is going to be supported in the Maven repos, shouldn't it be addressed by different

Re: How to calculate percentiles with Spark?

2014-10-21 Thread lordjoe
A rather more general question: assume I have a JavaRDD which is sorted. How can I convert this into a JavaPairRDD where the Integer is the index 0...N-1? Easy to do on one machine: JavaRDD values = ... // create here JavaRDD positions = values.mapToPair(new PairFunction() {
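
Spark has had RDD.zipWithIndex since 1.0, which assigns exactly these 0...N-1 positions in sort order without collecting to one machine. A sketch in Scala, including a nearest-rank percentile lookup for the question above:

    // sort, index, then read a value off by rank
    val values = sc.parallelize(Seq(9.0, 1.0, 5.0, 3.0)).sortBy(identity)
    val indexed = values.zipWithIndex()              // RDD[(Double, Long)]
    val n = values.count()
    val rank = math.ceil(0.95 * n).toLong - 1        // nearest-rank 95th percentile
    val p95 = indexed.filter(_._2 == rank).map(_._1).first()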

Re: Class not found

2014-10-21 Thread Pat Ferrel
The Maven cache is laid out differently, but it does work on Linux and BSD/Mac. Still looks like a hack to me. On Oct 21, 2014, at 1:28 PM, Pat Ferrel wrote: Doesn't this seem like a dangerous, error-prone hack? It will build different bits on different machines. It doesn't even work on my Linux box

com.esotericsoftware.kryo.KryoException: Buffer overflow.

2014-10-21 Thread nitinkak001
I am running a simple rdd filter command. What does it mean? Here is the full stack trace(and code below it): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 133 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftw

SchemaRDD.where clause error

2014-10-21 Thread Kevin Paul
Hi all, I tried to use the function SchemaRDD.where() but got some error: val people = sqlCtx.sql("select * from people") people.where('age === 10) :27: error: value === is not a member of Symbol where did I go wrong? Thanks, Kevin Paul

buffer overflow when running Kmeans

2014-10-21 Thread Yang
This is the stack trace I got with yarn logs -applicationId. Really no idea where to dig further. Thanks! Yang 14/10/21 14:36:43 INFO ConnectionManager: Accepted connection from [ phxaishdc9dn1262.stratus.phx.ebay.com/10.115.58.21] 14/10/21 14:36:47 ERROR Executor: Exception in task ID 98 com.eso

Re: SchemaRDD.where clause error

2014-10-21 Thread Michael Armbrust
You need to "import sqlCtx._" to get access to the implicit conversion. On Tue, Oct 21, 2014 at 2:40 PM, Kevin Paul wrote: > Hi all, I tried to use the function SchemaRDD.where() but got some error: > > val people = sqlCtx.sql("select * from people") > people.where('age === 10) > > :27: erro

Spark - HiveContext - Unstructured Json

2014-10-21 Thread Harivardan Jayaraman
Hi, I have unstructured JSON as my input, which may have extra columns from row to row. I want to store these JSON rows using HiveContext so that they can be accessed from the JDBC Thrift Server. I notice there are primarily only two methods available on the SchemaRDD for data: saveAsTable and insertInto.

Re: buffer overflow when running Kmeans

2014-10-21 Thread Ted Yu
Just posted this below for a similar question. Have you seen this thread? http://search-hadoop.com/m/JW1q5ezXPH/KryoException%253A+Buffer+overflow&subj=RE+spark+nbsp+kryo+serilizable+nbsp+exception On Tue, Oct 21, 2014 at 2:44 PM, Yang wrote: > this is the stack trace I got with yarn logs -applica

How to read BZ2 XML file in Spark?

2014-10-21 Thread John Roberts
Hi, I want to ingest OpenStreetMap. It's 43GB of (compressed) XML in BZIP2 format. What's your advice for reading it into an RDD? BTW, the Spark training at UMD is awesome! I'm having a blast learning Spark. I wish I could go to the MeetUp tonight, but I have kid activities... http://wiki.openst

Re: spark sql: sqlContext.jsonFile date type detection and perforormance

2014-10-21 Thread Yin Huai
Is there any specific issues you are facing? Thanks, Yin On Tue, Oct 21, 2014 at 4:00 PM, tridib wrote: > Any help? or comments? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-

Re: MLLib libsvm format

2014-10-21 Thread Xiangrui Meng
Yes. "where the indices are one-based and **in ascending order**". -Xiangrui On Tue, Oct 21, 2014 at 1:10 PM, Sameer Tilak wrote: > Hi All, > > I have a question regarding the ordering of indices. The document says that > the indices indices are one-based and in ascending order. However, do the >

spark ui redirecting to port 8100

2014-10-21 Thread sadhan
I set the Spark port to a different one and the connection seems successful, but I get a 302 to /proxy on port 8100? Nothing is listening on that port either. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-ui-redirecting-to-port-8100-tp16956.html Sent

Re: create a Row Matrix

2014-10-21 Thread Xiangrui Meng
Please check out the example code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/TallSkinnySVD.scala -Xiangrui On Tue, Oct 21, 2014 at 5:34 AM, viola wrote: > Hi, > > I am VERY new to spark and mllib and ran into a couple of problems while > t

RE: MLLib libsvm format

2014-10-21 Thread Sameer Tilak
Great, I will sort them. Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone Original message From: Xiangrui Meng Date:10/21/2014 3:29 PM (GMT-08:00) To: Sameer Tilak Cc: user@spark.apache.org Subject: Re: MLLib libsvm format Yes. "where the indices are one-based

Re: How to read BZ2 XML file in Spark?

2014-10-21 Thread sameerf
Hi John, Glad you're enjoying the Spark training at UMD. Is the 43 GB XML data in a single file or split across multiple BZIP2 files? Is the file in a HDFS cluster or on a single linux machine? If you're using BZIP2 with splittable compression (in HDFS), you'll need at least Hadoop 1.1: https://

Re: spark ui redirecting to port 8100

2014-10-21 Thread Sameer Farooqui
Hi Sadhan, Which port are you specifically trying to redirect? The driver program has a web UI, typically on port 4040, and the Spark Standalone Master has a web UI exposed on port 8080 (7077 is the master's RPC port, not a UI). Which setting did you update in which file to make this change? And finally, which version of Spark are

Spark Streaming - How to write RDDs in the same directory?

2014-10-21 Thread Shailesh Birari
Hello, Spark 1.1.0, Hadoop 2.4.1. I have written a Spark Streaming application and I am getting a FileAlreadyExistsException from rdd.saveAsTextFile(outputFolderPath). Here is briefly what I am trying to do. My application creates a text file stream using the Java streaming context. The input file is on

Re: Usage of spark-ec2: how to deploy a revised version of spark 1.1.0?

2014-10-21 Thread sameerf
Hi, Can you post what the error looks like? Sameer F. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Usage-of-spark-ec2-how-to-deploy-a-revised-version-of-spark-1-1-0-tp16943p16963.html Sent from the Apache Spark User List mailing list archive at Nabble.

Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-21 Thread Joseph Bradley
Hi, this sounds like a bug which has been fixed in the current master. What version of Spark are you using? Would it be possible to update to the current master? If not, it would be helpful to know some more of the problem dimensions (num examples, num features, feature types, label type). Thanks,

Re: Spark Streaming - How to write RDDs in the same directory?

2014-10-21 Thread Sameer Farooqui
Hi Shailesh, Spark just leverages the Hadoop File Output Format to write out the RDD you are saving. This is really a Hadoop OutputFormat limitation which requires the directory it is writing into to not exist. The idea is that a Hadoop job should not be able to overwrite the results from a previ
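
A common workaround, sketched here assuming stream and outputFolderPath are as in the question: give every batch its own directory, derived from the batch time, so the OutputFormat never sees an existing one.

    // one sub-directory per batch; foreachRDD also hands you the batch Time
    stream.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile(outputFolderPath + "/batch-" + time.milliseconds)
    }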

Using the DataStax Cassandra Connector from PySpark

2014-10-21 Thread Mike Sukmanowsky
Hi there, I'm using Spark 1.1.0 and experimenting with trying to use the DataStax Cassandra Connector (https://github.com/datastax/spark-cassandra-connector) from within PySpark. As a baby step, I'm simply trying to validate that I have access to classes that I'd need via Py4J. Sample python prog

Re: Asynchronous Broadcast from driver to workers, is it possible?

2014-10-21 Thread Vipul Pandey
Any word on this one? I would like to get this done as well. Although my real use case is to do something on each executor right at the beginning, and I was trying to hack it using broadcasts, by broadcasting an object of my own and doing whatever I want in the readObject method. Any other way

Re: Primitive arrays in Spark

2014-10-21 Thread Matei Zaharia
It seems that ++ does the right thing on arrays of longs, and gives you another one: scala> val a = Array[Long](1,2,3) a: Array[Long] = Array(1, 2, 3) scala> val b = Array[Long](1,2,3) b: Array[Long] = Array(1, 2, 3) scala> a ++ b res0: Array[Long] = Array(1, 2, 3, 1, 2, 3) scala> res0.getClas

Re: Spark Streaming - How to write RDDs in the same directory?

2014-10-21 Thread Shailesh Birari
Thanks Sameer for quick reply. I will try to implement it. Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962p16970.html Sent from the Apache Spark User List mailing list archive at Nabb

Re: spark sql not able to find classes with --jars option

2014-10-21 Thread sadhan
It was mainly because Spark was setting the jar's classes in a thread-local context classloader. The quick fix was to make our SerDe use the context classloader first. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-not-able-to-find-classes-with-jars

Re: Strategies for reading large numbers of files

2014-10-21 Thread Landon Kuhn
Thanks to folks here for the suggestions. I ended up settling on what seems to be a simple and scalable approach. I am no longer using sparkContext.textFiles with wildcards (it is too slow when working with a large number of files). Instead, I have implemented directory traversal as a Spark job, wh
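
The message is cut off; one hedged way to implement such a traversal is with the Hadoop FileSystem API on the driver, then a union of per-file RDDs (the input path is a placeholder):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    def walk(p: Path): Seq[Path] =
      fs.listStatus(p).toSeq.flatMap { s =>
        if (s.isDirectory) walk(s.getPath) else Seq(s.getPath)
      }

    val files = walk(new Path("/data/input"))
    val lines = sc.union(files.map(f => sc.textFile(f.toString)))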

Re: spark sql: sqlContext.jsonFile date type detection and perforormance

2014-10-21 Thread tridib
Yes, I am unable to get jsonFile() to detect the date type automatically from JSON data. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16974.html Sent from the Apache Spark

Spark Streaming Applications

2014-10-21 Thread Saiph Kappa
Hi, I have been trying to find a fairly complex application that makes use of the Spark Streaming framework. I checked public GitHub repos, but the examples I found were too simple, only comprising simple operations like counters and sums. On the Spark Summit website, I could find very interesting

spark 1.1.0 RDD and Calliope 1.1.0-CTP-U2-H2

2014-10-21 Thread Tian Zhang
Hi, I am using the latest Calliope library from tuplejump.com to create an RDD for a Cassandra table. I am on a 3-node Spark 1.1.0 cluster with YARN. My Cassandra table is defined as below and I have about 2000 rows of data inserted. CREATE TABLE top_shows ( program_id varchar, view_minute timestamp, vi

Re: spark-ec2 script with VPC

2014-10-21 Thread Mike Jennings
You can give this patch a try. Let me know if you find any problems. https://github.com/apache/spark/pull/2872 Thanks, Mike -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-script-with-VPC-tp11482p16978.html Sent from the Apache Spark User List m

Re: Spark Cassandra connector issue

2014-10-21 Thread Ankur Srivastava
Is this because I am calling a transformation function on an RDD from inside another transformation function? Is that not allowed? Thanks Ankur On Oct 21, 2014 1:59 PM, "Ankur Srivastava" wrote: > Hi Gerard, > > this is the code that may be helpful. > > public class ReferenceDataJoin implements S

Re: Spark SQL: sqlContext.jsonFile date type detection and performance

2014-10-21 Thread Yin Huai
Add one more thing about question 1. Once you get the SchemaRDD from jsonFile/jsonRDD, you can use CAST(columnName AS DATE) in your query to cast the column type from StringType to DateType (the string format should be "yyyy-[m]m-[d]d" and you need to use HiveContext). Here is the code snippet
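
The snippet is cut off; a hedged reconstruction (table and column names are assumptions):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // dob arrives as a StringType column like "2014-10-21"
    val withDates = hiveContext.sql(
      "SELECT name, CAST(dob AS DATE) AS dob FROM person")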

Re: Spark - HiveContext - Unstructured Json

2014-10-21 Thread Cheng Lian
You can resort to SQLContext.jsonFile(path: String, samplingRate: Double) and set samplingRate to 1.0, so that all the columns can be inferred. You can also use SQLContext.applySchema to specify your own schema (which is a StructType). On 10/22/14 5:56 AM, Harivardan Jayaraman wrote:
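
Both options in one sketch against the Spark 1.1 API (field names are illustrative; for JSON input, jsonRDD accepts a schema directly, which is the applySchema idea applied to JSON):

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)

    // option 1: scan every record while inferring the schema
    val people = sqlContext.jsonFile("people.json", samplingRate = 1.0)

    // option 2: supply the schema yourself
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val typed = sqlContext.jsonRDD(sc.textFile("people.json"), schema)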

Re: Asynchronous Broadcast from driver to workers, is it possible?

2014-10-21 Thread Peng Cheng
Looks like the only way is to implement that feature. There is no way of hacking it into working. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758p16985.html Sent from the Apache Spark User L

Re: com.esotericsoftware.kryo.KryoException: Buffer overflow.

2014-10-21 Thread Koert Kuipers
You ran out of Kryo buffer space. Are you using Spark 1.1 (which supports buffer resizing) or Spark 1.0 (which has a fixed-size buffer)? On Oct 21, 2014 5:30 PM, "nitinkak001" wrote: > I am running a simple rdd filter command. What does it mean? > Here is the full stack trace(and code below it): > > co
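
The relevant knobs, sketched (the values are arbitrary; the max setting only exists where resizing does, i.e. Spark 1.1+):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "4")        // initial per-task buffer
      .set("spark.kryoserializer.buffer.max.mb", "128")  // resize ceiling, Spark 1.1+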

Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-21 Thread lokeshkumar
Hi Joseph, I am using Spark 1.1.0, the latest version; I will try to update to the current master and check. The example I am running is JavaDecisionTree. The dataset is in libsvm format, containing 1. 45 training instances, 2. 5 features, 3. I am not sure what the feature type is, but there

Num-executors and executor-cores overwritten by defaults

2014-10-21 Thread Ilya Ganelin
Hi all. Just upgraded our cluster to CDH 5.2 (with Spark 1.1) but now I can no longer set the number of executors or executor-cores. No matter what values I pass on the command line to spark they are overwritten by the defaults. Does anyone have any idea what could have happened here? Running on Sp

spark sql query optimization , and decision tree building

2014-10-21 Thread sanath kumar
Hi all, I have large data in text files (1,000,000 lines). Each line has 128 columns; here each line is a feature and each column is a dimension. I have converted the txt files to JSON format and am able to run SQL queries on the JSON files using Spark. Now I am trying to build a k-dimension decis

Re: create a Row Matrix

2014-10-21 Thread viola
Thanks for the quick response. However, I still only get error messages. I am able to load a .txt file with entries in it and use it in Spark, but I am not able to create a simple matrix, for instance a 2x2 row matrix [1 2 3 4]. I tried variations such as val RowMatrix = Matrix(2,2,array(1,3,2,4))
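
Two hedged ways to get that 2x2 matrix, depending on whether it should be local or distributed; note that Matrix(...) is not a callable constructor, which explains the errors:

    import org.apache.spark.mllib.linalg.{Matrices, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // local dense matrix, values given column-major: [1 2; 3 4]
    val local = Matrices.dense(2, 2, Array(1.0, 3.0, 2.0, 4.0))

    // distributed RowMatrix: one Vector per row
    val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
    val mat = new RowMatrix(rows)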

Subscription request

2014-10-21 Thread Sathya
Hi, Kindly subscribe me to the user group. Regards, Sathyanarayanan

Re: How to save ReceiverInputDStream to Hadoop using saveAsNewAPIHadoopFile

2014-10-21 Thread Akhil Das
Hi Buntu, You could do something similar to the following: val receiver_stream = new ReceiverInputDStream(ssc) { > override def getReceiver(): Receiver[Nothing] = ??? //Whatever > }.map((x : String) => (null, x)) > val config = new Configuration() > config.set("mongo.output.uri",