Re: ClassNotFoundException was thrown while trying to save rdd

2014-10-13 Thread Akhil Das
Adding your application jar to the SparkContext will resolve this issue. E.g.: sparkContext.addJar("./target/scala-2.10/myTestApp_2.10-1.0.jar") Thanks Best Regards On Mon, Oct 13, 2014 at 8:42 AM, Tao Xiao xiaotao.cs@gmail.com wrote: In the beginning I tried to read HBase and found that
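For readers skimming the thread, a minimal sketch of the suggestion above (the jar path is the one from the example and will differ for your build):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("MyTestApp")
  val sc = new SparkContext(conf)
  // Ship the application jar to the executors so that closures
  // (anonymous functions) defined in the driver can be deserialized there.
  sc.addJar("./target/scala-2.10/myTestApp_2.10-1.0.jar")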

Re: setting heap space

2014-10-13 Thread Akhil Das
A few things to keep in mind: - I believe driver memory should not exceed executor memory - Set spark.storage.memoryFraction (default is 0.6) - Set spark.rdd.compress (default is false) - Always specify the level of parallelism while doing a groupBy, reduceBy, join, sortBy etc. - If you don't
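A rough sketch of those settings (the values shown are the defaults mentioned above, and the partition count is a placeholder, not tuning advice):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("MyApp")
    .set("spark.storage.memoryFraction", "0.6")   // default is 0.6
    .set("spark.rdd.compress", "true")            // default is false
  // Pass an explicit level of parallelism to wide operations, e.g.:
  //   rdd.reduceByKey(_ + _, 200)  or  rdd.groupByKey(200)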

Re: setting heap space

2014-10-13 Thread Chengi Liu
Hi Akhil, Thanks for the response.. Another query... do you know how to use the spark.executor.extraJavaOptions option? SparkConf.set("spark.executor.extraJavaOptions", "what value should go in here")? I am trying to find an example but cannot seem to find one.. On Mon, Oct 13, 2014 at 12:03 AM,

Re: setting heap space

2014-10-13 Thread Akhil Das
You can set it like this: sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+AggressiveOpts -XX:FreqInlineSize=300 -XX:MaxInlineSize=300") Here's a benchmark example

Re: setting heap space

2014-10-13 Thread Chengi Liu
Cool.. Thanks.. And one last final question.. conf = SparkConf.set().set(...) matrix = get_data(..) rdd = sc.parallelize(matrix) # heap error here... How and where do I set the storage level.. seems like conf is the wrong place to set this thing up..?? as I get this error:

Re: Spark SQL - Exception only when using cacheTable

2014-10-13 Thread poiuytrez
This is how the table was created: transactions = parts.map(lambda p: Row(customer_id=long(p[0]), chain=int(p[1]), dept=int(p[2]), category=int(p[3]), company=int(p[4]), brand=int(p[5]), date=str(p[6]), productsize=float(p[7]), productmeasure=str(p[8]), purchasequantity=int(p[9]),

Re: setting heap space

2014-10-13 Thread Akhil Das
Like this: import org.apache.spark.storage.StorageLevel val rdd = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK_SER) Thanks Best Regards On Mon, Oct 13, 2014 at 12:50 PM, Chengi Liu chengi.liu...@gmail.com wrote: Cool.. Thanks.. And one last final question.. conf =

SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
https://issues.apache.org/jira/browse/SPARK-3106 I'm having the same errors described in SPARK-3106 (no other types of errors confirmed), running a bunch of SQL queries on Spark 1.2.0 built from the latest master HEAD. Any updates to this issue? My main task is to join a huge fact table with a dozen

Re: SparkSQL on Hive error

2014-10-13 Thread Kevin Paul
Thanks Michael, your patch works for me :) Regards, Kelvin Paul On Fri, Oct 3, 2014 at 3:52 PM, Michael Armbrust mich...@databricks.com wrote: Are you running master? There was briefly a regression here that is hopefully fixed by spark#2635 https://github.com/apache/spark/pull/2635. On Fri,

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
Hmm... it failed again, just lasted a little bit longer. Jianshi On Mon, Oct 13, 2014 at 4:15 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: https://issues.apache.org/jira/browse/SPARK-3106 I'm having the same errors described in SPARK-3106 (no other types of errors confirmed), running

Setting SparkSQL configuration

2014-10-13 Thread Kevin Paul
Hi all, I tried to set the configurations spark.sql.inMemoryColumnarStorage.compressed and spark.sql.inMemoryColumnarStorage.batchSize in spark.executor.extraJavaOptions but it does not work; my spark.executor.extraJavaOptions contains -Dspark.sql.inMemoryColumnarStorage.compressed=true

Issue with Spark Twitter Streaming

2014-10-13 Thread Jahagirdar, Madhu
All, We are using Spark Streaming to receive data from the Twitter stream. This is running behind a proxy. We have done the following configuration inside Spark Streaming for twitter4j to work behind the proxy. def main(args: Array[String]) { val filters = Array("Modi")
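For context, the twitter4j proxy settings are typically passed as JVM system properties before the Twitter stream is created; a sketch with placeholder host and port:

  // Set before TwitterUtils.createStream(...) is called.
  System.setProperty("twitter4j.http.proxyHost", "proxy.example.com") // placeholder
  System.setProperty("twitter4j.http.proxyPort", "8080")              // placeholder
  // Optional, if the proxy requires authentication:
  // System.setProperty("twitter4j.http.proxyUser", "...")
  // System.setProperty("twitter4j.http.proxyPassword", "...")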

Is Array Of Struct supported in json RDDs? is it possible to query this?

2014-10-13 Thread shahab
Hello, Given the following structure, is it possible to query, e.g. session[0].id ? In general, is it possible to query Array Of Struct in json RDDs? root |-- createdAt: long (nullable = true) |-- id: string (nullable = true) |-- sessions: array (nullable = true) ||-- element:

Adding Supersonic to the Powered by Spark list

2014-10-13 Thread Maya Bercovitch
Hi, We are using Spark for several months now and will be glad to join the Spark family officially :) *Company Name*: Supersonic Supersonic is a mobile advertising company. *URL*: http://www.supersonicads.com/ *Components and use-cases*: Using Spark core for big data crunching, MLlib for

Configuration is not effective or configuration errors?

2014-10-13 Thread pol
Hi ALL, I set spark.storage.blockManagerSlaveTimeoutMs=10 in the spark-defaults.conf file, but the setting does not show up under the “Environment” tab of the http://driver:4040 page. Why? PS: I use new SparkConf() in code and Spark standalone mode. Thanks, chepoo

Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-13 Thread arthur.hk.c...@gmail.com
Hi, Thank you so much! By the way, what is the DATEADD function in Scala/Spark? Or how do I implement DATEADD(MONTH, 3, '2013-07-01') and DATEADD(YEAR, 1, '2014-01-01') in Spark or Hive? Regards Arthur On 12 Oct, 2014, at 12:03 pm, Ilya Ganelin ilgan...@gmail.com wrote: Because of how
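One workaround, sketched below under the assumption that the date literal is known on the driver side, is to compute the shifted date with java.util.Calendar and interpolate it into the query string (the table and column names here are illustrative):

  import java.text.SimpleDateFormat
  import java.util.Calendar

  def addMonths(date: String, months: Int): String = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    val cal = Calendar.getInstance()
    cal.setTime(fmt.parse(date))
    cal.add(Calendar.MONTH, months)   // use Calendar.YEAR for DATEADD(YEAR, ...)
    fmt.format(cal.getTime)
  }

  // DATEADD(MONTH, 3, '2013-07-01')  ->  addMonths("2013-07-01", 3) == "2013-10-01"
  val query = s"SELECT * FROM orders WHERE o_date < '${addMonths("2013-07-01", 3)}'"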

Re: Setting SparkSQL configuration

2014-10-13 Thread Cheng Lian
Currently Spark SQL doesn’t support reading SQL-specific configurations via system properties. But for HiveContext, you can put them in hive-site.xml. On 10/13/14 4:28 PM, Kevin Paul wrote: Hi all, I tried to set the configuration spark.sql.inMemoryColumnarStorage.compressed, and
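If editing hive-site.xml is not convenient, the SQLContext/HiveContext setConf method (and the SET command) should also cover these properties; a small sketch, assuming a HiveContext named hiveContext:

  // Programmatically:
  hiveContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
  hiveContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
  // Or as a SQL statement:
  hiveContext.sql("SET spark.sql.inMemoryColumnarStorage.compressed=true")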

Issue on running spark application in Yarn-cluster mode

2014-10-13 Thread vishnu86
When I execute the following in yarn-client mode it works fine and gives the result properly, but when I try to run it in yarn-cluster mode I get an error: spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client

Re: Spark SQL - custom aggregation function (UDAF)

2014-10-13 Thread Pierre B
Is it planned for the near future?

Re: SparkSQL LEFT JOIN problem

2014-10-13 Thread invkrh
Hi, Thank you Liquan. I just missed some information in my previous post. I just solved the problem. Actually, I use the first line (schema header) of the CSV file to generate StructType and StructField. However, the input file is UTF-8 Unicode (*with* BOM), so the first char of the file is

Inconsistency of implementing accumulator in Java

2014-10-13 Thread WonderfullDay
Hello! I'm new to Spark, and wrote a Java program with it. But when I implemented the Accumulator, something confused me. Here is my implementation: class WeightAccumulatorParam implements AccumulatorParam<Weight> { @Override public Weight zero(Weight initialValue) { return new

Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-13 Thread Yin Huai
Question 1: Please check http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#hive-tables. Question 2: One workaround is to re-write it. You can use LEFT SEMI JOIN to implement the subquery with EXISTS and use LEFT OUTER JOIN + IS NULL to implement the subquery with NOT EXISTS. SELECT
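A sketch of the rewrite described above, with illustrative table and column names and assuming a HiveContext named hiveContext:

  // EXISTS (correlated on column a) becomes a LEFT SEMI JOIN:
  hiveContext.sql("SELECT t1.* FROM t1 LEFT SEMI JOIN t2 ON t1.a = t2.a")
  // NOT EXISTS becomes LEFT OUTER JOIN + IS NULL:
  hiveContext.sql("SELECT t1.* FROM t1 LEFT OUTER JOIN t2 ON t1.a = t2.a WHERE t2.a IS NULL")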

Re: Inconsistency of implementing accumulator in Java

2014-10-13 Thread Sean Owen
Are you sure you aren't actually trying to extend AccumulableParam instead of AccumulatorParam? The example you cite does the latter. I do not get a compile error from this example. You also didn't say what version of Spark. (Although there are a few things about the example that could be

RE: Spark SQL parser bug?

2014-10-13 Thread Mohammed Guller
Hi Cheng, I am using version 1.1.0. Looks like that bug was fixed sometime after 1.1.0 was released. Interestingly, I tried your code on 1.1.0 and it gives me a different (incorrect) result: case class T(a:String, ts:java.sql.Timestamp) val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Re: Is Array Of Struct supported in json RDDs? is it possible to query this?

2014-10-13 Thread Yin Huai
If you are using HiveContext, it should work in 1.1. Thanks, Yin On Mon, Oct 13, 2014 at 5:08 AM, shahab shahab.mok...@gmail.com wrote: Hello, Given the following structure, is it possible to query, e.g. session[0].id ? In general, is it possible to query Array Of Struct in json RDDs?
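A sketch of what such a query might look like with a HiveContext (the JSON path is a placeholder; the field names follow the schema in the original post):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  val json = hiveContext.jsonFile("hdfs:///path/to/data.json") // placeholder path
  json.registerTempTable("events")
  // HiveQL supports indexing into arrays and dotting into structs:
  hiveContext.sql("SELECT sessions[0].id FROM events").collect()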

Re: Nested Query using SparkSQL 1.1.0

2014-10-13 Thread Yin Huai
Hi Shahab, Can you try to use HiveContext? It should work in 1.1. For SQLContext, this issue was not fixed in 1.1 and you need to use the master branch at the moment. Thanks, Yin On Sun, Oct 12, 2014 at 5:20 PM, shahab shahab.mok...@gmail.com wrote: Hi, Apparently is it is possible to query

Question about SVM mlllib...

2014-10-13 Thread Alfonso Muñoz Muñoz
Dear friends, Is there any way to know what the “predicted label” is for each “input label”? [CODE] … val model = SVMWithSGD.train(training, numIterations) model.clearThreshold() val scoreAndLabels = test.map { point => val score = model.predict(point.features)

Re: Spark SQL parser bug?

2014-10-13 Thread Yin Huai
Seems the reason that you got wrong results was caused by timezone. The time in java.sql.Timestamp(long time) means milliseconds since January 1, 1970, 00:00:00 *GMT*. A negative number is the number of milliseconds before January 1, 1970, 00:00:00 *GMT*. However, in ts='1970-01-01 00:00:00',
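A quick way to see the effect described, as a small sketch:

  // java.sql.Timestamp(long) counts milliseconds since 1970-01-01 00:00:00 GMT,
  // while the string '1970-01-01 00:00:00' is parsed in the JVM's default timezone.
  val ts = java.sql.Timestamp.valueOf("1970-01-01 00:00:00")
  println(ts.getTime) // 0 only when the default timezone is GMT; e.g. 28800000 in PST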

Regarding java version requirement in spark 1.2.0 or upcoming releases

2014-10-13 Thread twinkle sachdeva
Hi, Can somebody please share the plans regarding Java version support for Apache Spark 1.2.0 or near-future releases. Will Java 8 become the fully supported version in Apache Spark 1.2, or will Java 7 suffice? Thanks,

Re: Question about SVM mlllib...

2014-10-13 Thread Sean Owen
Aside from some syntax errors, this looks like exactly how you do it, right? Except that you call clearThreshold(), which causes it to return the margin, not a 0/1 prediction. Don't call that. It will default to the behavior you want. On Mon, Oct 13, 2014 at 3:03 PM, Alfonso Muñoz Muñoz
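In other words, something like this sketch (variable names follow the original post):

  import org.apache.spark.mllib.classification.SVMWithSGD

  val model = SVMWithSGD.train(training, numIterations)
  // Do NOT call model.clearThreshold(): with the threshold in place,
  // predict returns the 0/1 class label instead of the raw margin.
  val predictionAndLabel = test.map { point =>
    (model.predict(point.features), point.label)
  }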

Re: Regarding java version requirement in spark 1.2.0 or upcoming releases

2014-10-13 Thread Sean Owen
I have not heard any plans to even drop support for Java 6. I imagine it will remain that way for a while. Java 6 is sufficient. On Mon, Oct 13, 2014 at 3:37 PM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, Can somebody please share the plans regarding java version's support for

Re: Nested Query using SparkSQL 1.1.0

2014-10-13 Thread shahab
Thanks Yin. I tried HiveQL and it solved that problem. But now I have a second query requirement: since you are the main developer behind the JSON-Spark integration (I saw your presentation on YouTube, "Easy JSON Data Manipulation in Spark"), is it possible to perform aggregation-type queries, for

Re: Inconsistency of implementing accumulator in Java

2014-10-13 Thread WonderfullDay
I am sorry that I forgot to mention the version of Spark. The version is spark-1.1.0. I am pretty sure I am not trying to extend AccumulableParam; you can see the code of the implementation of WeightAccumulatorParam in my post, it implements the interface AccumulatorParam<Weight>. And I am not

RE: Spark SQL parser bug?

2014-10-13 Thread Mohammed Guller
Good guess, but that is not the reason. Look at this code: scala> val data = sc.parallelize(132554880L :: 133554880L :: Nil).map(i => T(i.toString, new java.sql.Timestamp(i))) data: org.apache.spark.rdd.RDD[T] = MappedRDD[17] at map at <console>:17 scala> data.collect res3: Array[T] =

Re: Nested Query using SparkSQL 1.1.0

2014-10-13 Thread Yin Huai
Hi Shahab, Do you mean queries with group by and aggregation functions? Once you register the json dataset as a table, you can write queries like querying a regular table. You can join it with other tables and do aggregations. Is it what you were asking for? If not, can you give me a more

Re: Inconsistency of implementing accumulator in Java

2014-10-13 Thread WonderfullDay
Furthermore, the second build passes, but an exception is thrown at runtime: NoClassDefFoundError. The situation is the same as the online sample, VectorAccumulatorParam. So to me, if I don't implement the addAccumulator function, I cannot customize the

Re: Spark SQL parser bug?

2014-10-13 Thread Yin Huai
Yeah, it is not related to timezone. I think you hit this issue https://issues.apache.org/jira/browse/SPARK-3173 and it was fixed after 1.1 release. On Mon, Oct 13, 2014 at 11:24 AM, Mohammed Guller moham...@glassbeam.com wrote: Good guess, but that is not the reason. Look at this code:

SparkSQL: StringType for numeric comparison

2014-10-13 Thread invkrh
Hi, I am using SparkSQL 1.1.0. Actually, I have a table as follows: root |-- account_id: string (nullable = false) |-- Birthday: string (nullable = true) |-- preferstore: string (nullable = true) |-- registstore: string (nullable = true) |-- gender: string (nullable = true) |--

read all parquet files in a directory in spark-sql

2014-10-13 Thread Sadhan Sood
How can we read all parquet files in a directory in spark-sql? We are following this example, which shows a way to read one file: // Read in the parquet file created above. Parquet files are self-describing so the schema is preserved. // The result of loading a Parquet file is also a SchemaRDD. val

SparkSQL: select syntax

2014-10-13 Thread invkrh
Hi all, A quick question on the SparkSQL SELECT syntax. Does it support queries like: SELECT t1.*, t2.d, t2.e FROM t1 LEFT JOIN t2 ON t1.a = t2.a? It always ends with the exception: Exception in thread "main" java.lang.RuntimeException: [2.12] failure: string literal expected SELECT t1.*,

Re: read all parquet files in a directory in spark-sql

2014-10-13 Thread Nicholas Chammas
Right now I believe the only supported option is to pass a comma-delimited list of paths. I've opened SPARK-3928: Support wildcard matches on Parquet files https://issues.apache.org/jira/browse/SPARK-3928 to request this feature. Nick On Mon, Oct 13, 2014 at 12:21 PM, Sadhan Sood

Re: parquetFile and wilcards

2014-10-13 Thread Nicholas Chammas
SPARK-3928: Support wildcard matches on Parquet files https://issues.apache.org/jira/browse/SPARK-3928 On Wed, Sep 24, 2014 at 2:14 PM, Michael Armbrust mich...@databricks.com wrote: We could certainly do this. The comma separated support is something I added. On Wed, Sep 24, 2014 at 10:20

RE: Spark SQL parser bug?

2014-10-13 Thread Mohammed Guller
That explains it. Thanks! Mohammed From: Yin Huai [mailto:huaiyin@gmail.com] Sent: Monday, October 13, 2014 8:47 AM To: Mohammed Guller Cc: Cheng, Hao; Cheng Lian; user@spark.apache.org Subject: Re: Spark SQL parser bug? Yeah, it is not related to timezone. I think you hit this

S3 Bucket Access

2014-10-13 Thread Ranga
Hi I am trying to access files/buckets in S3 and encountering a permissions issue. The buckets are configured to authenticate using an IAMRole provider. I have set the KeyId and Secret using environment variables ( AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable to access

RowMatrix PCA out of heap space error

2014-10-13 Thread Yang
I got this error when trying to perform PCA on a sparse matrix, each row has a nominal length of 8000, and there are 36k rows. each row has on average 3 elements being non-zero. I guess the total size is not that big. Exception in thread main java.lang.OutOfMemoryError: Java heap space at

Re: Hung spark executors don't count toward worker memory limit

2014-10-13 Thread Keith Simmons
Maybe I should put this another way. If spark has two jobs, A and B, both of which consume the entire allocated memory pool, is it expected that spark can launch B before the executor processes tied to A are completely terminated? On Thu, Oct 9, 2014 at 6:57 PM, Keith Simmons ke...@pulse.io

Re: RowMatrix PCA out of heap space error

2014-10-13 Thread Sean Owen
The Gramian is 8000 x 8000, dense, and full of 8-byte doubles. It's symmetric so can get away with storing it in ~256MB. The catch is that it's going to send around copies of this 256MB array. You may easily be running your driver out of memory given all the overheads and copies, or your
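For reference, a sketch of the computation in question and the usual remedy (the driver memory value is illustrative, not a recommendation):

  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // rows: RDD[org.apache.spark.mllib.linalg.Vector] with 36k sparse rows of size 8000
  val mat = new RowMatrix(rows)
  // The 8000 x 8000 Gramian is materialized on the driver, so give the driver
  // plenty of memory, e.g.  spark-submit --driver-memory 4g ...
  val pc = mat.computePrincipalComponents(50)  // keep k small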

Re: Spark SQL HiveContext Projection Pushdown

2014-10-13 Thread Michael Armbrust
Is there any plan to support windowing queries? I know that Shark supported it in its last release and expected it to be already included. Someone from redhat is working on this. Unclear if it will make the 1.2 release.

Re: Spark SQL - custom aggregation function (UDAF)

2014-10-13 Thread Michael Armbrust
It's not on the roadmap for 1.2. I'd suggest opening a JIRA. On Mon, Oct 13, 2014 at 4:28 AM, Pierre B pierre.borckm...@realimpactanalytics.com wrote: Is it planned for the near future? -- View this message in context:

Re: SparkSQL: StringType for numeric comparison

2014-10-13 Thread Michael Armbrust
This conversion is done implicitly anytime you use a string column in an operation with a numeric column. If you run explain on your query you should see the cast that is inserted. This is intentional and based on the type semantics of Apache Hive. On Mon, Oct 13, 2014 at 9:03 AM, invkrh
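For illustration, one way to see the inserted cast, assuming the table above is registered as "users" (the exact plan text will vary):

  val q = sqlContext.sql("SELECT account_id FROM users WHERE preferstore > 10")
  // The physical plan shows the CAST inserted on the string column:
  println(q.queryExecution.executedPlan)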

Re: persist table schema in spark-sql

2014-10-13 Thread Michael Armbrust
If you are running version 1.1 you can create external parquet tables. I'd recommend setting spark.sql.hive.convertMetastoreParquet=true. Here's a helper function to do it automatically: /** * Sugar for creating a Hive external table from a parquet path. */ def createParquetTable(name:

Why is parsing a CSV incredibly wasteful with Java Heap memory?

2014-10-13 Thread Aris
Hi guys, I am trying to just parse out values from a CSV; everything is a numeric (Double) value, and the input text CSV data is about 1.3 GB in size. When I inspect the Java heap space used by SparkSubmit using JVisualVM, I end up eating up 8 GB of memory! Moreover, by inspecting the

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
Turned out it was caused by this issue: https://issues.apache.org/jira/browse/SPARK-3923 Setting spark.akka.heartbeat.interval to 100 solved it. Jianshi On Mon, Oct 13, 2014 at 4:24 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hmm... it failed again, just lasted a little bit longer. Jianshi

Re: Why is parsing a CSV incredibly wasteful with Java Heap memory?

2014-10-13 Thread Sean Owen
A CSV element like 3.2 takes 4 bytes as text on disk, but as a Double it will always take 8 bytes. Is your input like this? That could explain it. You can map to Float in this case to halve the memory, if that works for your use case. This is just kind of how Strings and floating-point work in the

Multipart uploads to Amazon S3 from Apache Spark

2014-10-13 Thread Nick Chammas
Cross posting an interesting question on Stack Overflow http://stackoverflow.com/questions/26321947/multipart-uploads-to-amazon-s3-from-apache-spark . Nick

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
One thing that made me very confused during debugging is the error message. The important one, WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@xxx:50278] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]., is at log level WARN. Jianshi

Re: mlib model viewing and saving

2014-10-13 Thread Joseph Bradley
Currently, printing (toString) gives a human-readable version of the tree, but it is not a format which is easy to save and load. That sort of serialization is in the works, but not available for trees right now. (Note that the current master actually has toString (for a short summary of the

Re: Multipart uploads to Amazon S3 from Apache Spark

2014-10-13 Thread Daniil Osipov
Not directly related, but FWIW, EMR seems to back away from s3n usage: Previously, Amazon EMR used the S3 Native FileSystem with the URI scheme, s3n. While this still works, we recommend that you use the s3 URI scheme for the best performance, security, and reliability.

Re: Multipart uploads to Amazon S3 from Apache Spark

2014-10-13 Thread Nicholas Chammas
Oh, that's a straight reversal from their position up until earlier this year http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-a-multipart-s3-file-tp5463p5485.html . Was there an announcement explaining the change in recommendation? Nick On Mon, Oct 13, 2014 at 4:54 PM, Daniil

distributing Scala Map datatypes to RDD

2014-10-13 Thread jon.g.massey
Hi guys, Just starting out with Spark and following through a few tutorials, it seems the easiest way to get one's source data into an RDD is using the sc.parallelize function. Unfortunately, my local data is in multiple instances of Map[K,V] types, and the parallelize function only works on objects

Re: S3 Bucket Access

2014-10-13 Thread Ranga
Is there a way to specify a request header during the sparkContext.textFile call? - Ranga On Mon, Oct 13, 2014 at 11:03 AM, Ranga sra...@gmail.com wrote: Hi I am trying to access files/buckets in S3 and encountering a permissions issue. The buckets are configured to authenticate using an

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Matei Zaharia
The biggest scaling issue was supporting a large number of reduce tasks efficiently, which the JIRAs in that post handle. In particular, our current default shuffle (the hash-based one) has each map task open a separate file output stream for each reduce task, which wastes a lot of memory

Re: Why is parsing a CSV incredibly wasteful with Java Heap memory?

2014-10-13 Thread Aris
Thank you Sean. Moving over my data types from Double to Float was an (obvious) big win, and I discovered one more good optimization from the Tuning section -- I modified my original code to call .persist(MEMORY_ONLY_SER) from the FIRST import of the data, and I pass in --conf
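For readers following along, a rough sketch of that combination (the path and the comma delimiter are placeholders):

  import org.apache.spark.storage.StorageLevel

  val data = sc.textFile("hdfs:///path/to/data.csv")   // placeholder path
    .map(_.split(",").map(_.toFloat))                  // Float: half the footprint of Double
    .persist(StorageLevel.MEMORY_ONLY_SER)             // store serialized, not as Java objects
  data.count()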

Problems building Spark for Hadoop 1.0.3

2014-10-13 Thread mildebrandt
Hello, After reading the following pages: https://spark.apache.org/docs/latest/building-with-maven.html http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html I came up with this build command: mvn -Dhadoop.version=1.0.3 -DskipTests -Pdeb -Phadoop-1.0.3 -Phive

Re: Problems building Spark for Hadoop 1.0.3

2014-10-13 Thread Sean Owen
Yes, there is no hadoop-1.0.3 profile. You can look at the parent pom.xml to see the profiles that exist. It doesn't mean 1.0.3 doesn't work, just that there is nothing specific to activate for this version range. I don't think the docs suggest this profile exists. You don't need any extra

Re: read all parquet files in a directory in spark-sql

2014-10-13 Thread DB Tsai
For now, with the SPARK-3462 (parquet pushdown for unionAll) PR, you can do the following to unionAll the SchemaRDDs: val files = Array("hdfs://file1.parquet", "hdfs://file2.parquet", "hdfs://file3.parquet") val rdds = files.map(hc.parquetFile(_)) val unionedRDD = { var temp = rdds(0) for (i
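The same union can be written a bit more compactly with reduce (assuming hc is the HiveContext/SQLContext from the snippet above and the paths are placeholders):

  val files = Seq("hdfs://file1.parquet", "hdfs://file2.parquet", "hdfs://file3.parquet")
  val unioned = files.map(hc.parquetFile).reduce(_ unionAll _)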

Re: Problems building Spark for Hadoop 1.0.3

2014-10-13 Thread mildebrandt
Hi Sean, Thanks for the quick response. I'll give that a try. I'm still a little concerned that yarn support will be built into my target assembly. Are you aware of something I can check after the build completes to be sure that Spark doesn't look for Yarn during runtime? Thanks, -Chris --

SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-13 Thread Terry Siu
I am currently using Spark 1.1.0 that has been compiled against Hadoop 2.3. Our cluster is CDH5.1.2, which runs Hive 0.12. I have two external Hive tables that point to Parquet (compressed with Snappy), which were converted over from Avro if that matters. I am trying to perform a join with

Re: distributing Scala Map datatypes to RDD

2014-10-13 Thread Stephen Boesch
Is the following what you are looking for? scala> sc.parallelize(myMap.map { case (k, v) => (k, v) }.toSeq) res2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:21 2014-10-13 14:02 GMT-07:00 jon.g.massey jon.g.mas...@gmail.com: Hi guys, Just

Spark Cluster health check

2014-10-13 Thread Tarun Garg
Hi All, I am doing a POC and have written a job in Java, so the architecture has Kafka and Spark. Now I want a process to notify me whenever system performance is degrading or resources like CPU or RAM are running short. I understand org.apache.spark.streaming.scheduler.StreamingListener, but it has
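As a starting point, a minimal StreamingListener sketch that watches the processing delay of each batch (the threshold and the notification hook are placeholders):

  import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

  class BatchHealthListener extends StreamingListener {
    override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
      val delayMs = batch.batchInfo.processingDelay.getOrElse(0L)
      if (delayMs > 10000) {   // placeholder threshold
        // hook in your own notification (email, pager, ...) here
        println(s"Batch is falling behind: processing delay = $delayMs ms")
      }
    }
  }

  // ssc.addStreamingListener(new BatchHealthListener)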

Re: distributing Scala Map datatypes to RDD

2014-10-13 Thread Sean Owen
Map.toSeq already does that even. You can skip the map. You can put together Maps with ++ too. You should have an RDD of pairs then, but to get the special RDD functions you're looking for remember to import SparkContext._ On Mon, Oct 13, 2014 at 10:58 PM, Stephen Boesch java...@gmail.com wrote:
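Putting the two suggestions together, a small sketch (the map contents are illustrative, and sc is assumed to be an existing SparkContext):

  import org.apache.spark.SparkContext._   // brings in the pair-RDD functions

  val map1 = Map("a" -> 1, "b" -> 2)
  val map2 = Map("c" -> 3)
  val rdd = sc.parallelize((map1 ++ map2).toSeq)   // RDD[(String, Int)]
  rdd.reduceByKey(_ + _)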

Does SparkSQL work with custom defined SerDe?

2014-10-13 Thread Chen Song
In Hive, the table was created with a custom SerDe, in the following way: ROW FORMAT SERDE 'abc.ProtobufSerDe' WITH SERDEPROPERTIES ('serialization.class' = 'abc.protobuf.generated.LogA$log_a'). When I start the spark-sql shell, I always get the following exception, even for a simple query: select user from
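If the (truncated) exception is a ClassNotFoundException for the SerDe class, the usual first step is to make sure the SerDe jar is on the driver and executor classpath; a sketch with a placeholder jar path, assuming a HiveContext named hiveContext:

  // Either start the shell with  --jars /path/to/protobuf-serde.jar  (placeholder path),
  // or add it from the Hive side:
  hiveContext.sql("ADD JAR /path/to/protobuf-serde.jar")
  hiveContext.sql("SELECT user FROM log_a LIMIT 10").collect()   // table name is a placeholder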

Re: S3 Bucket Access

2014-10-13 Thread Daniil Osipov
(Copying the user list) You should use spark_ec2 script to configure the cluster. If you use trunk version you can use the new --copy-aws-credentials option to configure the S3 parameters automatically, otherwise either include them in your SparkConf variable or add them to
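For the SparkConf/Hadoop-configuration route, a sketch using the s3n Hadoop properties (the key values and bucket are obviously placeholders):

  // Either export AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY before launching,
  // or set the Hadoop properties directly on the context:
  sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
  sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
  val lines = sc.textFile("s3n://your-bucket/path/")   // placeholder bucket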

Re: read all parquet files in a directory in spark-sql

2014-10-13 Thread Sadhan Sood
Thanks Nick, DB - that was helpful. On Mon, Oct 13, 2014 at 5:44 PM, DB Tsai dbt...@dbtsai.com wrote: For now, with SparkSPARK-3462 parquet pushdown for unionAll PR, you can do the following for unionAll schemaRDD. val files = Array(hdfs://file1.parquet, hdfs://file2.parquet,

How to construct graph in graphx

2014-10-13 Thread Soumitra Siddharth Johri
Hi, I am new to scala/graphx and am having problems converting a tsv file to a graph. I have a flat tab-separated file like below: n1 P1 n2 n3 P1 n4 n2 P2 n3 n3 P2 n1 n1 P3 n4 n3 P3 n2 where n1,n2,n3,n4 are the nodes of the graph and P1,P2,P3 are the properties which should form the edges

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Ilya Ganelin
Thank you for the details! Would you mind speaking to what tools proved most useful as far as identifying bottlenecks or bugs? Thanks again. On Oct 13, 2014 5:36 PM, Matei Zaharia matei.zaha...@gmail.com wrote: The biggest scaling issue was supporting a large number of reduce tasks efficiently,

Re: S3 Bucket Access

2014-10-13 Thread Ranga
Hi Daniil Could you provide some more details on how the cluster should be launched/configured? The EC2 instance that I am dealing with uses the concept of IAMRoles. I don't have any keyfile to specify to the spark-ec2 script. Thanks for your help. - Ranga On Mon, Oct 13, 2014 at 3:04 PM,

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Krishna Sankar
Well done guys. MapReduce sort at that time was a good feat and Spark now has raised the bar with the ability to sort a PB. Like some of the folks in the list, a summary of what worked (and didn't) as well as the monitoring practices would be good. Cheers k/ P.S: What are you folks planning next ?

Re: How to construct graph in graphx

2014-10-13 Thread Ankur Dave
At 2014-10-13 18:22:44 -0400, Soumitra Siddharth Johri soumitra.siddha...@gmail.com wrote: I have a flat tab-separated file like below: [...] where n1,n2,n3,n4 are the nodes of the graph and P1,P2,P3 are the properties which should form the edges between the nodes. How can I construct a

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-13 Thread Michael Armbrust
There are some known bugs with the parquet SerDe and Spark 1.1. You can try setting spark.sql.hive.convertMetastoreParquet=true to cause Spark SQL to use the built-in parquet support when the SerDe looks like parquet. On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu terry@smartfocus.com wrote: I am

Spark can't find jars

2014-10-13 Thread Andy Srine
Hi Guys, Spark rookie here. I am getting a file not found exception on the --jars. This is on the yarn cluster mode and I am running the following command on our recently upgraded Spark 1.1.1 environment. ./bin/spark-submit --verbose --master yarn --deploy-mode cluster --class myEngine

Re: Spark can't find jars

2014-10-13 Thread Jimmy
Having the exact same error with the exact same jar... Do you work for Altiscale? :) J Sent from my iPhone On Oct 13, 2014, at 5:33 PM, Andy Srine andy.sr...@gmail.com wrote: Hi Guys, Spark rookie here. I am getting a file not found exception on the --jars. This is on the yarn

Re: Spark can't find jars

2014-10-13 Thread Jimmy McErlain
BTW this has always worked for me before, until we upgraded the cluster to Spark 1.1.1... J

Re: Spark can't find jars

2014-10-13 Thread HARIPRIYA AYYALASOMAYAJULA
Well, in the cluster, can you try copying the entire folder and then running? For example, my home folder, say helloWorld, consists of src, target, etc. Can you copy the entire folder onto the cluster? I suspect it is looking for some dependencies and is missing them when it runs your jar file. Or if you

Re: Spark can't find jars

2014-10-13 Thread HARIPRIYA AYYALASOMAYAJULA
Or, if it has something to do with the way you package your files, try an alternative method and see if it works. On Monday, October 13, 2014, HARIPRIYA AYYALASOMAYAJULA aharipriy...@gmail.com wrote: Well, in the cluster, can you try copying the entire folder and then running? For example, my

Re: ClassNotFoundException was thrown while trying to save rdd

2014-10-13 Thread Tao Xiao
Thanks Akhil. Both ways work for me, but I'd like to know why that exception was thrown. The class HBaseApp and related classes were all contained in my application jar, so why was com.xt.scala.HBaseApp$$anonfun$testHBase$1 not found? 2014-10-13 14:53 GMT+08:00 Akhil Das

Re: Processing order in Spark

2014-10-13 Thread Tobias Pfeiffer
Sean, thanks, I didn't know about repartitionAndSortWithinPartitions, that seems very helpful! Tobias

Can't create Kafka stream in spark shell

2014-10-13 Thread Gary Zhao
Hello, I'm trying to connect to Kafka in the spark shell, but failed as below. Could you take a look at what I missed? scala> val kafkaStream = KafkaUtils.createStream(ssc, "test-vip.snc1:2181", "test_spark", Map("user-test" -> 1)) error: bad symbolic reference. A signature in KafkaUtils.class refers to term
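The "bad symbolic reference" usually means the spark-streaming-kafka (and Kafka) jars are not on the shell's classpath; a sketch of the call once they are (the jar names, topic name, and thread count are placeholders that depend on your build):

  // Start the shell with the Kafka artifacts on the classpath, e.g.
  //   ./bin/spark-shell --jars spark-streaming-kafka_2.10-1.1.0.jar,kafka_2.10-0.8.1.1.jar,...
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val ssc = new StreamingContext(sc, Seconds(10))
  val kafkaStream = KafkaUtils.createStream(
    ssc, "test-vip.snc1:2181", "test_spark", Map("user-test" -> 1))  // topic -> #threads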

some more heap space error

2014-10-13 Thread Chengi Liu
Hi, I posted a query yesterday and have tried out all the options given in the responses. Basically, I am reading a very fat matrix (2000 by 50 dimension matrix) and am trying to run kmeans on it. I keep on getting heap errors. Now, I am even using the persist(StorageLevel.DISK_ONLY_2) option..

Re: How to construct graph in graphx

2014-10-13 Thread Ankur Dave
At 2014-10-13 21:08:15 -0400, Soumitra Johri soumitra.siddha...@gmail.com wrote: There is no 'long' field in my file. So when I form the edge I get a type mismatch error. Is it mandatory for GraphX that every vertex should have a distinct id? In my case n1,n2,n3,n4 are all strings. (+user
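A common approach, sketched below, is to assign each distinct string name a Long id and build the edges from those ids (the input path is a placeholder, and the collected map is only appropriate for small graphs):

  import org.apache.spark.graphx.{Edge, Graph}

  // Each line: srcName <tab> property <tab> dstName
  val triples = sc.textFile("hdfs:///path/to/edges.tsv").map { line =>   // placeholder path
    val Array(src, prop, dst) = line.split("\t")
    (src, prop, dst)
  }
  // Give every distinct node name a unique Long id.
  val vertices = triples.flatMap(t => Seq(t._1, t._3)).distinct().zipWithIndex()
  val idMap = vertices.collectAsMap()   // fine for small graphs; broadcast or join for large ones
  val edges = triples.map { case (src, prop, dst) => Edge(idMap(src), idMap(dst), prop) }
  val graph = Graph(vertices.map(_.swap), edges)   // RDD[(VertexId, String)] + RDD[Edge[String]]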