Re: Spark Cluster health check

2014-10-14 Thread Akhil Das
Hi Tarun, You can use Ganglia for monitoring the entire cluster, and if you want some more custom functionality like sending emails etc, then you can go after nagios. Thanks Best Regards On Tue, Oct 14, 2014 at 3:31 AM, Tarun Garg bigdat...@live.com wrote: Hi All, I am doing a POC and

Re: Can't create Kafka stream in spark shell

2014-10-14 Thread Akhil Das
Just make sure you have the same version of spark-streaming-kafka http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka_2.10 jar and spark in your classpath. Thanks Best Regards On Tue, Oct 14, 2014 at 9:02 AM, Gary Zhao garyz...@gmail.com wrote: Hello I'm trying to
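For reference, a minimal sketch of what creating the stream in spark-shell can look like once the matching spark-streaming-kafka jar is on the classpath (the ZooKeeper quorum, group id and topic name below are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    // "zk-host:2181", "my-group" and "my-topic" are placeholders for your own setup
    val lines = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1)).map(_._2)
    lines.print()
    ssc.start()
    ssc.awaitTermination()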

Re: Application failure in yarn-cluster mode

2014-10-14 Thread Christophe Préaud
Hi, Sorry to insist, but I really feel like the problem described below is a bug in Spark. Can anybody confirm if it is a bug, or a (configuration?) problem on my side? Thanks, Christophe. On 10/10/2014 18:24, Christophe Préaud wrote: Hi, After updating from spark-1.0.0 to spark-1.1.0, my

Re: SparkSQL: StringType for numeric comparison

2014-10-14 Thread invkrh
Thank you, Michael. In Spark SQL DataType, we have a lot of types, for example, ByteType, ShortType, StringType, etc. These types are used to form the table schema. As for me, StringType is enough, why do we need others ? Hao -- View this message in context:

Re: SparkSQL: select syntax

2014-10-14 Thread Hao Ren
Update: This syntax is mainly for avoiding retyping column names. Let's take the example in my previous post, where *a* is a table of 15 columns, *b* has 5 columns, after a join, I have a table of (15 + 5 - 1(key in b)) = 19 columns and register the table in sqlContext. I don't want to actually

Re: S3 Bucket Access

2014-10-14 Thread Gen
Hi, Are you sure that the id/key you used can access S3? You can try using the same id/key through the Python boto package to test it. I have almost the same situation as yours, but I can access S3. Best -- View this message in context:

Re: SparkSQL: select syntax

2014-10-14 Thread Gen
Hi, I met the same problem before, and according to Matei Zaharia: /The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how

Steps to connect BI Tools with Spark SQL using Thrift JDBC server

2014-10-14 Thread Neeraj Garg02
Hi Everybody, I'm looking for information on possible Thrift JDBC/ODBC clients and Thrift JDBC/ODBC servers. For example, how to connect Tableu or other BI/ analytics tools to Spark using ODBC/JDBC servers as mentioned over the Spark SQL page. Thanks and Regards, Neeraj Garg

Re: SparkSQL: select syntax

2014-10-14 Thread Hao Ren
Thank you, Gen. I will give hiveContext a try. =) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-tp16299p16368.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: S3 Bucket Access

2014-10-14 Thread Akhil Das
Try the following: 1. Set the access key and secret key in the sparkContext: sparkContext.set(AWS_ACCESS_KEY_ID, yourAccessKey) sparkContext.set(AWS_SECRET_ACCESS_KEY, yourSecretKey) 2. Set the access key and secret key in the environment before starting your application: export
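A common alternative, sketched below under the assumption that the bucket is accessed through the s3n:// filesystem, is to set the keys on the underlying Hadoop configuration (the key values and bucket path are placeholders):

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "yourAccessKey")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "yourSecretKey")
    val data = sc.textFile("s3n://your-bucket/some/path")
    data.count()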

Re: S3 Bucket Access

2014-10-14 Thread Rafal Kwasny
Hi, keep in mind that you're going to have a bad time if your secret key contains a / This is due to old and stupid hadoop bug: https://issues.apache.org/jira/browse/HADOOP-3733 Best way is to regenerate the key so it does not include a / /Raf Akhil Das wrote: Try the following: 1. Set the

Re: Spark SQL - custom aggregation function (UDAF)

2014-10-14 Thread Pei-Lun Lee
I created https://issues.apache.org/jira/browse/SPARK-3947 On Tue, Oct 14, 2014 at 3:54 AM, Michael Armbrust mich...@databricks.com wrote: Its not on the roadmap for 1.2. I'd suggest opening a JIRA. On Mon, Oct 13, 2014 at 4:28 AM, Pierre B pierre.borckm...@realimpactanalytics.com wrote:

YARN deployment of Spark and Thrift JDBC server

2014-10-14 Thread Neeraj Garg02
Hi All, I've downloaded and installed Apache Spark 1.1.0 pre-built for Hadoop 2.4. Now, I want to test two features of Spark: 1. YARN deployment : As per my understanding, I need to modify spark-defaults.conf file with the settings mentioned at URL

Default spark.deploy.recoveryMode

2014-10-14 Thread Priya Ch
Hi Spark users/experts, In Spark source code (Master.scala Worker.scala), when registering the worker with master, I see the usage of *persistenceEngine*. When we don't specify spark.deploy.recovery mode explicitly, what is the default value used ? This recovery mode is used to persists and

Re: a hivectx insertinto issue

2014-10-14 Thread valgrind_girl
It seems that the second problem is a dependency issue, and it works exactly like the first one. *this is the complete code:* JavaSchemaRDD schemas = ctx.jsonRDD(arg0); schemas.insertInto("test", true); JavaSchemaRDD teeagers = ctx.hql("SELECT a,b FROM test"); List<String>

Re: Spark can't find jars

2014-10-14 Thread Christophe Préaud
Hello, I have already posted a message with the exact same problem, and proposed a patch (the subject is Application failure in yarn-cluster mode). Can you test it, and see if it works for you? I would be glad too if someone can confirm that it is a bug in Spark 1.1.0. Regards, Christophe. On

Re: a hivectx insertinto issue

2014-10-14 Thread valgrind_girl
sorry, a mistake by me. the above code generates a result exactly like the one seen from hive. NOW my question is: can the insertInto function be applied to a hive table? why do I keep getting 111,NULL instead of 111,222 -- View this message in context:

Initial job has not accepted any resources when launching SparkPi example on a worker.

2014-10-14 Thread Theodore Si
Hi all, I have two nodes, one as master(*host1*) and the other as worker(*host2*). I am using the standalone mode. After starting the master on host1, I run $ export MASTER=spark://host1:7077 $ bin/run-example SparkPi 10 on host2, but I get this: 14/10/14 21:54:23 WARN TaskSchedulerImpl:

RE: Running Spark/YARN on AWS EMR - Issues finding file on hdfs?

2014-10-14 Thread neeraj
I'm trying to get some workaround for this issue. Thanks and Regards, Neeraj Garg From: H4ml3t [via Apache Spark User List] [mailto:ml-node+s1001560n16379...@n3.nabble.com] Sent: Tuesday, October 14, 2014 6:53 PM To: Neeraj Garg02 Subject: Re: Running Spark/YARN on AWS EMR - Issues finding file

Re: Steps to connect BI Tools with Spark SQL using Thrift JDBC server

2014-10-14 Thread Cheng Lian
Denny Lee wrote an awesome article on how to connect Tableau to Spark SQL recently: https://www.concur.com/blog/en-us/connect-tableau-to-sparksql On 10/14/14 6:10 PM, Neeraj Garg02 wrote: Hi Everybody, I’m looking for information on possible Thrift JDBC/ODBC clients and Thrift JDBC/ODBC

Re: YARN deployment of Spark and Thrift JDBC server

2014-10-14 Thread Cheng Lian
On 10/14/14 7:31 PM, Neeraj Garg02 wrote: Hi All, I’ve downloaded and installed Apache Spark 1.1.0 pre-built for Hadoop 2.4. Now, I want to test two features of Spark: 1. *YARN deployment*: As per my understanding, I need to modify “spark-defaults.conf” file with the settings mentioned

Re: persist table schema in spark-sql

2014-10-14 Thread sadhan
Thanks Michael. We are running 1.1 and I believe that is the latest release? I am getting this error when I tried doing what you suggested: org.apache.spark.sql.parquet.ParquetTypesConverter$ (ParquetTypes.scala) java.lang.RuntimeException: [2.1] failure: ``UNCACHE'' expected but identifier CREATE

Re: persist table schema in spark-sql

2014-10-14 Thread sadhan
I realized my mistake of not using hiveContext. So that error is gone but now I am getting this error: == FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: Unknown field info: binary -- View this message

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Terry Siu
Hi Michael, That worked for me. At least I’m now further than I was. Thanks for the tip! -Terry From: Michael Armbrust mich...@databricks.commailto:mich...@databricks.com Date: Monday, October 13, 2014 at 5:05 PM To: Terry Siu terry@smartfocus.commailto:terry@smartfocus.com Cc:

TF-IDF in Spark 1.1.0

2014-10-14 Thread Burke Webster
I'm following the Mllib example for TF-IDF and ran into a problem due to my lack of knowledge of Scala and spark. Any help would be greatly appreciated. Following the Mllib example I could do something like this: import org.apache.spark.rdd.RDD import org.apache.spark.SparkContext import
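The MLlib 1.1 TF-IDF example the poster is following looks roughly like the sketch below, which hashes each tokenised document with HashingTF and then rescales with IDF (the input path is a placeholder):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector

    // one whitespace-tokenised document per line
    val documents: RDD[Seq[String]] = sc.textFile("/path/to/docs").map(_.split(" ").toSeq)
    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    tf.cache()
    val idf = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)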

foreachPartition and task status

2014-10-14 Thread Salman Haq
Hi, In my application, I am successfully using foreachPartition to write large amounts of data into a Cassandra database. What is the recommended practice if the application wants to know that the tasks have completed for all partitions? Thanks, Salman

Re: S3 Bucket Access

2014-10-14 Thread Ranga
Thanks for the pointers. I verified that the access key-id/secret used are valid. However, the secret may contain / at times. The issues I am facing are as follows: - The EC2 instances are set up with an IAM role () and don't have a static key-id/secret - All of the EC2 instances have

Re: foreachPartition and task status

2014-10-14 Thread Sean McNamara
Are you using spark streaming? On Oct 14, 2014, at 10:35 AM, Salman Haq sal...@revmetrix.com wrote: Hi, In my application, I am successfully using foreachPartition to write large amounts of data into a Cassandra database. What is the recommended practice if the application wants to know

RE: Spark Cluster health check

2014-10-14 Thread Tarun Garg
Thanks for your response. It is not about infrastructure: I am using EC2 machines and Amazon CloudWatch can provide the EC2 nodes' CPU usage and memory usage details, but I need to send notifications in situations like processing delay, total delay, maximum rate being low, etc. Tarun Date: Tue, 14

Re: Spark can't find jars

2014-10-14 Thread Jimmy McErlain
So the only way that I could make this work was to build a fat jar file as suggested earlier. To me (and I am no expert) it seems like this is a bug. Everything was working for me prior to our upgrade to Spark 1.1 on Hadoop 2.2 but now it seems to not... ie packaging my jars locally then

Re: SparkSQL: StringType for numeric comparison

2014-10-14 Thread Michael Armbrust
Its much more efficient to store and compute on numeric types than string types. On Tue, Oct 14, 2014 at 1:25 AM, invkrh inv...@gmail.com wrote: Thank you, Michael. In Spark SQL DataType, we have a lot of types, for example, ByteType, ShortType, StringType, etc. These types are used to

spark 1.1.0/yarn hang

2014-10-14 Thread tian zhang
Hi, I have a spark 1.1.0 yarn installation. I am using spark-submit to run a simple application. From the console output, I have 769 partitions and after task 768 in stage 0 (count) finished, it hangs. I used jstack to dump the stack and it shows it is waiting ... Any suggestion what might go

Re: S3 Bucket Access

2014-10-14 Thread Gen
Hi, If I remember well, spark cannot use IAM role credentials to access s3. It first uses the id/key in the environment; if that is null, it uses the value in the file core-site.xml. So, an IAM role is not useful for spark. The same problem happens if you want to use distcp

Re: foreachPartition and task status

2014-10-14 Thread Salman Haq
On Tue, Oct 14, 2014 at 12:42 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Are you using spark streaming? No, not at this time.

Re: S3 Bucket Access

2014-10-14 Thread Ranga
Thanks for the input. Yes, I did use the temporary access credentials provided by the IAM role (also detailed in the link you provided). The session token needs to be specified and I was looking for a way to set that in the header (which doesn't seem possible). Looks like a static key/secret is

Re: Spark Cluster health check

2014-10-14 Thread Akhil Das
Yes, for that you can tweak nagios a bit, or you can write a custom monitoring application which will check the processing delay etc. Thanks Best Regards On Tue, Oct 14, 2014 at 10:16 PM, Tarun Garg bigdat...@live.com wrote: Thanks for your response, it is not about infrastructure because I am

RE: Spark Cluster health check

2014-10-14 Thread Tarun Garg
I am planning to write a custom monitoring application; for that I am analysing org.apache.spark.streaming.scheduler.StreamingListener. Is there another spark streaming api which can give me insight into the cluster, like total processing time, delay, etc.? Tarun Date: Tue, 14 Oct 2014 23:19:47 +0530

Re: SparkStreaming program does not start

2014-10-14 Thread spr
Thanks Abraham Jacob, Tobias Pfeiffer, Akhil Das-2, and Sean Owen for your helpful comments. Cockpit error on my part in just putting the .scala file as an argument rather than redirecting stdin from it. -- View this message in context:

Re: S3 Bucket Access

2014-10-14 Thread Ranga
One related question. Could I specify the com.amazonaws.services.s3.AmazonS3Client implementation for the fs.s3.impl parameter? Let me try that and update this thread with my findings. On Tue, Oct 14, 2014 at 10:48 AM, Ranga sra...@gmail.com wrote: Thanks for the input. Yes, I did use the

RDD Indexes and how to fetch all edges with a given label

2014-10-14 Thread Soumitra Johri
Hi All, I have a Graph with millions of edges. Edges are represented by org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = MappedRDD[4] . I have two questions : 1)How can I fetch all the nodes with a given edge label ( all edges with a given property ) 2) Is it possible to create

spark sql union all is slow

2014-10-14 Thread shuluster
I have many tables of the same schema, partitioned by time. For example, one id could be in many of those tables. I would like to find aggregations of such ids. Originally these tables are located on HDFS as files. Once a table schemaRDD is loaded, I cacheTable on them. Each table is around 30m -

Re: Spark Streaming

2014-10-14 Thread st553
Have you solved this issue? im also wondering how to stream an existing file. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-tp14306p16406.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

SPARK_SUBMIT_CLASSPATH question

2014-10-14 Thread Greg Hill
It seems to me that SPARK_SUBMIT_CLASSPATH does not support wildcards in the paths you add, the way other tools do. For some reason it doesn't pick up the classpath information from yarn-site.xml either, it seems, when running on YARN. I'm having to manually add every single

Re: graphx - mutable?

2014-10-14 Thread ll
hi again. just want to check in again to see if anyone could advise on how to implement a mutable, growing graph with graphx? we're building a graph that grows over time. it adds more vertices and edges every iteration of our algorithm. it doesn't look like there is an obvious way to add a

Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-14 Thread SK
Hi, I am trying to implement simple sentiment analysis of Twitter streams in Spark/Scala. I am getting an exception and it appears when I combine SparkContext with StreamingContext in the same program. When I read the positive and negative words using only SparkContext.textFile (without creating

mllib CoordinateMatrix

2014-10-14 Thread ll
after creating a coordinate matrix from my rdd[matrixentry]... 1. how can i get/query the value at coordinate (i, j)? 2. how can i set/update the value at coordinate (i, j)? 3. how can i get all the values on a specific row i, ideally as a vector? thanks! -- View this message in context:

Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi guys, I am new to Spark. When I run Spark Kmeans (org.apache.spark.mllib.clustering.KMeans) on a small dataset, it works great. However, when using a large dataset with 1.5 million vectors, it just hangs there at some reduceByKey/collectAsMap stages (attached image shows the corresponding UI).

Re: graphx - mutable?

2014-10-14 Thread Ankur Dave
On Tue, Oct 14, 2014 at 12:36 PM, ll duy.huynh@gmail.com wrote: hi again. just want to check in again to see if anyone could advise on how to implement a mutable, growing graph with graphx? we're building a graph is growing over time. it adds more vertices and edges every iteration of

A question about streaming throughput

2014-10-14 Thread danilopds
Hi, I'm learning about Apache Spark Streaming and I'm doing some tests. Now, I have a modified version of the app NetworkWordCount that performs a /reduceByKeyAndWindow/ with a window of 10 seconds in intervals of 5 seconds. I'm also using the function to measure the rate of records/second like

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi guys, An interesting thing: for the input dataset which has 1.5 million vectors, if I set the KMeans k_value = 100 or k_value = 50, it hangs as mentioned above. However, if I decrease k_value to 10, the same error still appears in the log but the application finishes successfully, without

Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
thanks ankur. indexedrdd sounds super helpful! a related question, what is the best way to update the values of existing vertices and edges? On Tue, Oct 14, 2014 at 4:30 PM, Ankur Dave ankurd...@gmail.com wrote: On Tue, Oct 14, 2014 at 12:36 PM, ll duy.huynh@gmail.com wrote: hi again.

Re: TF-IDF in Spark 1.1.0

2014-10-14 Thread Xiangrui Meng
You cannot recover the document from the TF-IDF vector, because HashingTF is not reversible. You can assign each document a unique ID, and join back the result after training. HashingTF can transform individual records: val docs: RDD[(String, Seq[String])] = ... val tf = new HashingTF() val
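A minimal sketch of the ID-keyed approach Xiangrui describes, with a toy corpus standing in for real data: transform each document's tokens with HashingTF while keeping its ID, so results can be joined back later.

    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // toy corpus keyed by a document id
    val docs: RDD[(String, Seq[String])] =
      sc.parallelize(Seq("doc1" -> Seq("spark", "tf", "idf"), "doc2" -> Seq("hashing", "tf")))
    val hashingTF = new HashingTF()
    // the id travels with each TF vector, so it can be joined back after any further processing
    val tfById: RDD[(String, Vector)] = docs.mapValues(tokens => hashingTF.transform(tokens))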

Re: mllib CoordinateMatrix

2014-10-14 Thread Reza Zadeh
Hello, CoordinateMatrix is in its infancy, and right now is only a placeholder. To get/set the value at (i,j), you should map the entries rdd using the usual rdd map operation, and change the relevant entries. To get the values on a specific row, you can call toIndexedRowMatrix(), which returns
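A minimal sketch of the entry-level operations Reza describes, using a toy matrix (the coordinates and values below are arbitrary):

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    val entries = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 2.0), MatrixEntry(1, 1, 3.0)))
    val mat = new CoordinateMatrix(entries)

    // "get" the value at (i, j) by filtering the entries RDD
    val (i, j) = (0L, 1L)
    val value = mat.entries.filter(e => e.i == i && e.j == j).map(_.value).collect()

    // "set" the value at (i, j) by mapping the entries into a new matrix
    val updated = new CoordinateMatrix(
      mat.entries.map(e => if (e.i == i && e.j == j) MatrixEntry(i, j, 42.0) else e))

    // values of a specific row via toIndexedRowMatrix, as suggested above
    val row0 = mat.toIndexedRowMatrix().rows.filter(_.index == 0L).collect()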

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
What is the feature dimension? I saw you used 100 partitions. How many cores does your cluster have? -Xiangrui On Tue, Oct 14, 2014 at 1:51 PM, Ray ray-w...@outlook.com wrote: Hi guys, An interesting thing, for the input dataset which has 1.5 million vectors, if set the KMeans's k_value = 100

Re: graphx - mutable?

2014-10-14 Thread Ankur Dave
On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh duy.huynh@gmail.com wrote: a related question, what is the best way to update the values of existing vertices and edges? Many of the Graph methods deal with updating the existing values in bulk, including mapVertices, mapEdges, mapTriplets,

Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
great, thanks! On Tue, Oct 14, 2014 at 5:08 PM, Ankur Dave ankurd...@gmail.com wrote: On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh duy.huynh@gmail.com wrote: a related question, what is the best way to update the values of existing vertices and edges? Many of the Graph methods deal

Re: mllib CoordinateMatrix

2014-10-14 Thread Duy Huynh
thanks reza! On Tue, Oct 14, 2014 at 5:02 PM, Reza Zadeh r...@databricks.com wrote: Hello, CoordinateMatrix is in its infancy, and right now is only a placeholder. To get/set the value at (i,j), you should map the entries rdd using the usual rdd map operation, and change the relevant

submitted uber-jar not seeing spark-assembly.jar at worker

2014-10-14 Thread Tamas Sandor
Hi, I'm a rookie in spark, but I hope someone can help me out. I'm writing an app that I'm submitting to my spark-master that has a worker on a separate node. It uses spark-cassandra-connector, and since it depends on guava-v16 and it conflicts with the default spark-1.1.0-assembly's guava-v14.1 I

Re: S3 Bucket Access

2014-10-14 Thread Rishi Pidva
As per EMR documentation: http://docs.amazonaws.cn/en_us/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html Access AWS Resources Using IAM Roles If you've launched your cluster with an IAM role, applications running on the EC2 instances of that cluster can use the IAM role to obtain

Re: S3 Bucket Access

2014-10-14 Thread Ranga
Thanks Rishi. That is exactly what I am trying to do now :) On Tue, Oct 14, 2014 at 2:41 PM, Rishi Pidva rpi...@pivotal.io wrote: As per EMR documentation: http://docs.amazonaws.cn/en_us/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html Access AWS Resources Using IAM Roles If

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Xiangrui, The input dataset has 1.5 million sparse vectors. Each sparse vector has a dimension(cardinality) of 9153 and has less than 15 nonzero elements. Yes, if I set num-executors = 200, from the hadoop cluster scheduler, I can see the application got 201 vCores. From the spark UI, I can

Submission to cluster fails (Spark SQL; NoSuchMethodError on SchemaRDD)

2014-10-14 Thread Michael Campbell
Hey all, I'm trying a very basic spark SQL job and apologies as I'm new to a lot of this, but I'm getting this failure: Exception in thread main java.lang.NoSuchMethodError: org.apache.spark.sql.SchemaRDD.take(I)[Lorg/apache/spark/sql/catalyst/expressions/Row; I've tried a variety of uber-jar

MLlib - Does LogisticRegressionModel.clearThreshold() no longer work?

2014-10-14 Thread Aris
Hi folks, When I am predicting Binary 1/0 responses with LogisticRegressionWithSGD, it returns a LogisticRegressionModel. In Spark 1.0.X I was using the clearThreshold method on the model to get the raw predicted probabilities when I ran the predict() method... It appears now that rather than

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Burak Yavuz
Hi Ray, The reduceByKey / collectAsMap does a lot of calculations. Therefore it can take a very long time if: 1) The parameter number of runs is set very high 2) k is set high (you have observed this already) 3) data is not properly repartitioned It seems that it is hanging, but there is a lot

Re: MLlib - Does LogisticRegressionModel.clearThreshold() no longer work?

2014-10-14 Thread Aris
Wow...I just tried LogisticRegressionWithLBFGS, and using clearThreshold() DOES IN FACT work. It appears that LogisticRegressionWithSGD returns a model whose method is broken!! On Tue, Oct 14, 2014 at 3:14 PM, Aris arisofala...@gmail.com wrote: Hi folks, When I am predicting Binary 1/0
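A minimal sketch of the working path Aris describes, assuming Spark 1.1's LogisticRegressionWithLBFGS and a LIBSVM-format file at a placeholder path:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.util.MLUtils

    val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    val model = new LogisticRegressionWithLBFGS().run(training)
    model.clearThreshold()  // predict() now returns raw probabilities instead of 0/1 labels
    val scored = training.map(p => (model.predict(p.features), p.label))
    scored.take(5)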

rule engine based on spark

2014-10-14 Thread salemi
hi, is there a rule engine based on spark? i'd like to allow business users to define their rules in a language, and the execution of the rules should be done in spark. Thanks, Ali -- View this message in context:

Re: MLlib - Does LogisticRegressionModel.clearThreshold() no longer work?

2014-10-14 Thread Xiangrui Meng
LBFGS is better. If your data is easily separable, LR might return values very close or equal to either 0.0 or 1.0. It is rare but it may happen. -Xiangrui On Tue, Oct 14, 2014 at 3:18 PM, Aris arisofala...@gmail.com wrote: Wow...I just tried LogisticRegressionWithLBFGS, and using

Re: Spark Streaming

2014-10-14 Thread hutashan
Yes.. I solved this problem using fileStream instead of textfile StreamingContext scc = new StreamingContext(conf, new Duration(1)); ClassTag<LongWritable> k = ClassTag$.MODULE$.apply(LongWritable.class); ClassTag<Text> v = ClassTag$.MODULE$.apply(Text.class);

Re: RDD Indexes and how to fetch all edges with a given label

2014-10-14 Thread Soumitra Johri
Hi, With respect to the first issue, one possible way is to filter the graph via 'graph.subgraph(epred = e => e.attr == edgeLabel)', but I am still curious if we can index RDDs. Warm Regards Soumitra On Tue, Oct 14, 2014 at 2:46 PM, Soumitra Johri soumitra.siddha...@gmail.com wrote: Hi
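A self-contained sketch of the subgraph approach on a toy graph whose edge attribute is a String label ("follows" and "likes" are made-up labels):

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "likes")))
    val graph = Graph(vertices, edges)

    // keep only edges carrying the label of interest; the vertex set is left untouched
    val follows = graph.subgraph(epred = t => t.attr == "follows")
    follows.edges.collect()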

Re: Does SparkSQL work with custom defined SerDe?

2014-10-14 Thread Chen Song
Looks like it may be related to https://issues.apache.org/jira/browse/SPARK-3807. I will build from branch 1.1 to see if the issue is resolved. Chen On Tue, Oct 14, 2014 at 10:33 AM, Chen Song chen.song...@gmail.com wrote: Sorry for bringing this out again, as I have no clue what could have

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Burak, In Kmeans, I used k_value = 100, num_iteration = 2, and num_run = 1. In the current test, I increase num-executors = 200. In the storage info 2 (as shown below), 11 executors are used (I think the data is kind of balanced) and others have zero memory usage.

Re: rule engine based on spark

2014-10-14 Thread Mayur Rustagi
We are developing something similar on top of Streaming. Could you detail some rule functionality you are looking for. We are developing a dsl for data processing on top of streaming as well as static data enabled on Spark. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257

Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-14 Thread Abraham Jacob
Hi All, I am trying to understand what is going on in my simple WordCount Spark Streaming application. Here is the setup - I have a Kafka producer that is streaming words (lines of text). On the flip side, I have a spark streaming application that uses the high-level Kafka/Spark connector to

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread DB Tsai
I saw a similar bottleneck in the reduceByKey operation. Maybe we can implement treeReduceByKey to reduce the pressure on the single executor reducing a particular key. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn:

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
Just ran a test on mnist8m (8m x 784) with k = 100 and numIter = 50. It worked fine. Ray, the error log you posted is after cluster termination, which is not the root cause. Could you search your log and find the real cause? On the executor tab screenshot, I saw only 200MB is used. Did you cache

Re: Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-14 Thread Abraham Jacob
Yeah... it totally should be... There is nothing fancy in there - import org.apache.spark.api.java.function.Function2; public class ReduceWords implements Function2<Integer, Integer, Integer> { private static final long serialVersionUID = -6076139388549335886L; public Integer call(Integer

How to I get at a SparkContext or better a JavaSparkContext from the middle of a function

2014-10-14 Thread Steve Lewis
I am running a couple of functions on an RDD which require access to data on the file system known to the context. If I create a class with a context as a member variable, I get a serialization error. So I am running my function on some slave and I want to read in data from a Path defined by a

Re: Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-14 Thread Abraham Jacob
The results are no different - import org.apache.spark.api.java.function.Function2; import java.io.Serializable; public class ReduceWords implements Serializable, Function2<Integer, Integer, Integer> { private static final long serialVersionUID = -6076139388549335886L; public Integer

Spark output to s3 extremely slow

2014-10-14 Thread anny9699
Hi, I found writing output back to s3 using rdd.saveAsTextFile() is extremely slow, much slower than reading from s3. Is there a way to make it faster? The rdd has 150 partitions so parallelism is enough I assume. Thanks a lot! Anny -- View this message in context:

How to create Track per vehicle using spark RDD

2014-10-14 Thread Manas Kar
Hi, I have an RDD containing Vehicle Number, timestamp, Position. I want to get the equivalent of a lag function over my RDD, to be able to create the track segments of each vehicle. Any help? PS: I have tried reduceByKey and then splitting the list of positions into tuples. For me it runs out of memory
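One lag-like sketch under made-up field names: group the points per vehicle, sort each group by timestamp, and zip each position with the next one to form segments. Note that, like the reduceByKey attempt above, this still materialises all of a vehicle's points on one executor, so vehicles with very long histories can still exhaust memory.

    // hypothetical record type for illustration
    case class Ping(vehicle: String, ts: Long, lat: Double, lon: Double)

    val pings = sc.parallelize(Seq(
      Ping("v1", 1L, 0.0, 0.0), Ping("v1", 2L, 0.1, 0.1), Ping("v2", 1L, 5.0, 5.0)))

    val segments = pings
      .map(p => (p.vehicle, p))
      .groupByKey()                  // all points of one vehicle land on one executor
      .flatMapValues { ps =>
        val sorted = ps.toSeq.sortBy(_.ts)
        sorted.zip(sorted.drop(1))   // consecutive (from, to) pairs are the track segments
      }
    segments.collect()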

Re: How to patch sparkSQL on EC2?

2014-10-14 Thread Christos Kozanitis
ahhh never mind… I didn’t notice that a spark-assembly jar file gets produced after compiling the whole spark suite… So no more manual editing of the jar file of the AMI for now! Christos On Oct 10, 2014, at 12:15 AM, Christos Kozanitis wrote: Hi I have written a few extensions for

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Xiangrui, Thanks for the guidance. I read the log carefully and found the root cause. KMeans, by default, uses KMeans++ as the initialization mode. According to the log file, the 70-minute hanging is actually the computing time of Kmeans++, as pasted below: 14/10/14 14:48:18 INFO

something about rdd.collect

2014-10-14 Thread randylu
My code is as follows: *documents.flatMap { case words => words.map(w => (w, 1)) }.reduceByKey(_ + _).collect()* In the driver's log, reduceByKey() is finished, but collect() seems to keep running and never finishes. In addition, there are about 200,000,000 words that need to be collected. Is it

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Yin Huai
Hello Terry, How many columns does pqt_rdt_snappy have? Thanks, Yin On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu terry@smartfocus.com wrote: Hi Michael, That worked for me. At least I’m now further than I was. Thanks for the tip! -Terry From: Michael Armbrust

Re: something about rdd.collect

2014-10-14 Thread Reynold Xin
Hi Randy, collect essentially transfers all the data to the driver node. You definitely wouldn’t want to collect 200 million words. It is a pretty large number and you can run out of memory on your driver with that much data. --  Reynold Xin On October 14, 2014 at 9:26:13 PM, randylu
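Two common alternatives to collect(), reusing the poster's documents RDD (the output path is a placeholder): inspect only a small sample on the driver, or write the full result out in parallel.

    val counts = documents.flatMap(words => words.map(w => (w, 1))).reduceByKey(_ + _)

    counts.take(20).foreach(println)                  // bring back only a small sample
    counts.saveAsTextFile("hdfs:///tmp/word_counts")  // or write everything out without touching the driver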

Kryo serializer

2014-10-14 Thread dizzy5112
Hi all, how can I tell if my Kryo serializer is actually working? I have a class which extends Serializable and I have included the following imports: import com.esotericsoftware.kryo.Kryo import org.apache.spark.serializer.KryoRegistrator I also have a class MyRegistrator extends
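For reference, a minimal sketch of the usual Kryo wiring (MyClass stands in for the poster's own class; if MyRegistrator lives in a package, spark.kryo.registrator needs its fully qualified name). One crude sanity check is to point spark.kryo.registrator at a deliberately wrong class name; if nothing fails, the Kryo settings are probably not being picked up.

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.serializer.KryoRegistrator

    class MyClass(val x: Int) extends Serializable

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyClass])
      }
    }

    val conf = new SparkConf()
      .setAppName("kryo-check")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")
    val sc = new SparkContext(conf)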

How to write data into Hive partitioned Parquet table?

2014-10-14 Thread Banias H
Hi, I am still new to Spark. Sorry if similar questions are asked here before. I am trying to read a Hive table; then run a query and save the result into a Hive partitioned Parquet table. For example, I was able to run the following in Hive: INSERT INTO TABLE target_table PARTITION
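Under the assumption that a HiveContext is used (so the HiveQL INSERT ... PARTITION syntax is available) and that dynamic partitioning is enabled the same way it would be in plain Hive, a sketch might look like this (column names are placeholders, table names follow the example above):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SET hive.exec.dynamic.partition = true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
    hiveContext.sql(
      "INSERT INTO TABLE target_table PARTITION (part_col) SELECT col1, col2, part_col FROM source_table")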

pyspark - extract 1 field from string

2014-10-14 Thread Chop
I'm stumped with how to take 1 RDD that has lines like: 4,01012009,00:00,1289,4 5,01012009,00:00,1326,4 6,01012009,00:00,1497,7 and produce a new RDD with just the 4th field from each line (1289, 1326, 1497) I don't want to apply a conditional, I just want to grab that one field from each

Re: spark sql union all is slow

2014-10-14 Thread Pei-Lun Lee
Hi, You can merge them into one table by: sqlContext.unionAll(sqlContext.unionAll(sqlContext.table("table_1"), sqlContext.table("table_2")), sqlContext.table("table_3")).registerTempTable("table_all") Or load them in one call by:
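A slightly more readable chained form, assuming sqlContext.table as used above and SchemaRDD's unionAll (a sketch, not a drop-in):

    val all = sqlContext.table("table_1")
      .unionAll(sqlContext.table("table_2"))
      .unionAll(sqlContext.table("table_3"))
    all.registerTempTable("table_all")
    sqlContext.cacheTable("table_all")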

Re: something about rdd.collect

2014-10-14 Thread randylu
Thanks rxin, I still have a doubt about collect(). The number of words before reduceByKey() is about 200 million, and after reduceByKey() it decreases to 18 million. Memory for the driver is initialized to 15GB; I print out runtime.freeMemory() before reduceByKey(), and it indicates 13GB of free memory. I

Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-14 Thread Ilya Ganelin
Hello all . Does anyone else have any suggestions? Even understanding what this error is from would help a lot. On Oct 11, 2014 12:56 AM, Ilya Ganelin ilgan...@gmail.com wrote: Hi Akhil - I tried your suggestions and tried varying my partition sizes. Reducing the number of partitions led to

Re: something about rdd.collect

2014-10-14 Thread randylu
If memory is not enough, OutOfMemory exception should occur, but nothing in driver's log. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/something-about-rdd-collect-tp16451p16461.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: pyspark - extract 1 field from string

2014-10-14 Thread Davies Liu
rdd.map(lambda line: int(line.split(',')[3])) On Tue, Oct 14, 2014 at 6:58 PM, Chop thomrog...@att.net wrote: I'm stumped with how to take 1 RDD that has lines like: 4,01012009,00:00,1289,4 5,01012009,00:00,1326,4 6,01012009,00:00,1497,7 and produce a new RDD with just the 4th field

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
I used k-means||, which is the default. And it took less than 1 minute to finish. 50 iterations took less than 25 minutes on a cluster of 9 m3.2xlarge EC2 nodes. Which deploy mode did you use? Is it yarn-client? -Xiangrui On Tue, Oct 14, 2014 at 6:03 PM, Ray ray-w...@outlook.com wrote: Hi

Re: How to create Track per vehicle using spark RDD

2014-10-14 Thread Mohit Singh
Perhaps it's just me, but the lag function isn't familiar to me. Have you tried configuring Spark appropriately: http://spark.apache.org/docs/latest/configuration.html On Tue, Oct 14, 2014 at 5:37 PM, Manas Kar manasdebashis...@gmail.com wrote: Hi, I have an RDD containing Vehicle Number