Re: Can't create Kafka stream in spark shell

2014-10-14 Thread Akhil Das
Just make sure you have the same version of spark-streaming-kafka jar and spark in your classpath. Thanks Best Regards On Tue, Oct 14, 2014 at 9:02 AM, Gary Zhao wrote: > Hello > > I'm trying to connect kafka in spa

Re: Application failure in yarn-cluster mode

2014-10-14 Thread Christophe Préaud
Hi, Sorry to insist, but I really feel like the problem described below is a bug in Spark. Can anybody confirm if it is a bug, or a (configuration?) problem on my side? Thanks, Christophe. On 10/10/2014 18:24, Christophe Préaud wrote: Hi, After updating from spark-1.0.0 to spark-1.1.0, my spar

Re: SparkSQL: StringType for numeric comparison

2014-10-14 Thread invkrh
Thank you, Michael. In Spark SQL DataType, we have a lot of types, for example, ByteType, ShortType, StringType, etc. These types are used to form the table schema. As for me, StringType is enough, so why do we need the others? Hao -- View this message in context: http://apache-spark-user-list.10

Re: SparkSQL: select syntax

2014-10-14 Thread Hao Ren
Update: This syntax is mainly for avoiding retyping column names. Let's take the example in my previous post, where *a* is a table of 15 columns, *b* has 5 columns, after a join, I have a table of (15 + 5 - 1(key in b)) = 19 columns and register the table in sqlContext. I don't want to actually

Re: S3 Bucket Access

2014-10-14 Thread Gen
Hi, Are you sure that the id/key you used can access S3? You can try the same id/key through the Python boto package to test it, because I have almost the same situation as yours and I can access S3. Best -- View this message in context: http://apache-spark-user-list.1001560.

Re: SparkSQL: select syntax

2014-10-14 Thread Gen
Hi, I met the same problem before, and according to Matei Zaharia: /The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd
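
For reference, a minimal sketch of the HiveContext switch suggested above, assuming Spark 1.1, an existing SparkContext named sc, and illustrative table/column names (untested here):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // HiveContext accepts the richer HiveQL syntax that the plain SQLContext parser rejects
    hiveContext.sql("SELECT a.*, b.value FROM a JOIN b ON a.key = b.key").registerTempTable("joined")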

Steps to connect BI Tools with Spark SQL using Thrift JDBC server

2014-10-14 Thread Neeraj Garg02
Hi Everybody, I'm looking for information on possible Thrift JDBC/ODBC clients and Thrift JDBC/ODBC servers. For example, how to connect Tableau or other BI/analytics tools to Spark using ODBC/JDBC servers as mentioned on the Spark SQL page. Thanks and Regards, Neeraj Garg

Re: SparkSQL: select syntax

2014-10-14 Thread Hao Ren
Thank you, Gen. I will give hiveContext a try. =) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-select-syntax-tp16299p16368.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: S3 Bucket Access

2014-10-14 Thread Akhil Das
Try the following: 1. Set the access key and secret key in the sparkContext: sparkContext.set("AWS_ACCESS_KEY_ID", yourAccessKey) sparkContext.set("AWS_SECRET_ACCESS_KEY", yourSecretKey) 2. Set the access key and secret key in the environment before starting your application:
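
A related approach that is often used instead (a sketch, not from the original message; it assumes the s3n filesystem and an existing SparkContext named sc, with bucket and path as placeholders) is to put the credentials on the Hadoop configuration:

    // set the S3 credentials on the Hadoop configuration used by sc
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", yourAccessKey)
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", yourSecretKey)
    val lines = sc.textFile("s3n://my-bucket/path/to/data")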

Re: S3 Bucket Access

2014-10-14 Thread Rafal Kwasny
Hi, keep in mind that you're going to have a bad time if your secret key contains a "/". This is due to an old and stupid Hadoop bug: https://issues.apache.org/jira/browse/HADOOP-3733 The best way is to regenerate the key so it does not include a "/". /Raf Akhil Das wrote: > Try the following: > > 1. Set

Re: Spark SQL - custom aggregation function (UDAF)

2014-10-14 Thread Pei-Lun Lee
I created https://issues.apache.org/jira/browse/SPARK-3947 On Tue, Oct 14, 2014 at 3:54 AM, Michael Armbrust wrote: > Its not on the roadmap for 1.2. I'd suggest opening a JIRA. > > On Mon, Oct 13, 2014 at 4:28 AM, Pierre B < > pierre.borckm...@realimpactanalytics.com> wrote: > >> Is it planned

YARN deployment of Spark and Thrift JDBC server

2014-10-14 Thread Neeraj Garg02
Hi All, I've downloaded and installed Apache Spark 1.1.0 pre-built for Hadoop 2.4. Now, I want to test two features of Spark: 1. YARN deployment : As per my understanding, I need to modify "spark-defaults.conf" file with the settings mentioned at URL http://spark.apache.org/docs/1.1.0/ru

Default spark.deploy.recoveryMode

2014-10-14 Thread Priya Ch
Hi Spark users/experts, In Spark source code (Master.scala & Worker.scala), when registering the worker with the master, I see the usage of *persistenceEngine*. When we don't specify spark.deploy.recovery mode explicitly, what is the default value used? This recovery mode is used to persist and re

Re: a hivectx insertinto issue

2014-10-14 Thread valgrind_girl
It seems that the second problem is dependency issue. and it works exactly like the first one. *this is the complete code:* JavaSchemaRDD schemas=ctx.jsonRDD(arg0); schemas.insertInto("test", true); JavaSchemaRDD teeagers=ctx.hql("SELECT a,b FROM test"); List teeagerNames1=teeagers.

Re: Spark can't find jars

2014-10-14 Thread Christophe Préaud
Hello, I have already posted a message with the exact same problem, and proposed a patch (the subject is "Application failure in yarn-cluster mode"). Can you test it, and see if it works for you? I would be glad too if someone can confirm that it is a bug in Spark 1.1.0. Regards, Christophe. On

Re: a hivectx insertinto issue

2014-10-14 Thread valgrind_girl
Sorry, a mistake by me. The above code generates a result exactly like the one seen from Hive. Now my question is: can a Hive table be used with the insertInto function? Why do I keep getting 111,NULL instead of 111,222? -- View this message in context: http://apache-spark-user-list.1001560.n3.na

"Initial job has not accepted any resources" when launching SparkPi example on a worker.

2014-10-14 Thread Theodore Si
Hi all, I have two nodes, one as master(*host1*) and the other as worker(*host2*). I am using the standalone mode. After starting the master on host1, I run $ export MASTER=spark://host1:7077 $ bin/run-example SparkPi 10 on host2, but I get this: 14/10/14 21:54:23 WARN TaskSchedulerImpl: Initi

RE: Running Spark/YARN on AWS EMR - Issues finding file on hdfs?

2014-10-14 Thread neeraj
I'm trying to get some workaround for this issue. Thanks and Regards, Neeraj Garg From: H4ml3t [via Apache Spark User List] [mailto:ml-node+s1001560n16379...@n3.nabble.com] Sent: Tuesday, October 14, 2014 6:53 PM To: Neeraj Garg02 Subject: Re: Running Spark/YARN on AWS EMR - Issues finding file

Re: Does SparkSQL work with custom defined SerDe?

2014-10-14 Thread Chen Song
Sorry for bringing this out again, as I have no clue what could have caused this. I turned on DEBUG logging and did see the jar containing the SerDe class was scanned. More interestingly, I saw the same exception (org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attribut

Re: Steps to connect BI Tools with Spark SQL using Thrift JDBC server

2014-10-14 Thread Cheng Lian
Denny Lee wrote an awesome article on how to connect Tableau to Spark SQL recently: https://www.concur.com/blog/en-us/connect-tableau-to-sparksql On 10/14/14 6:10 PM, Neeraj Garg02 wrote: Hi Everybody, I’m looking for information on possible Thrift JDBC/ODBC clients and Thrift JDBC/ODBC

Re: YARN deployment of Spark and Thrift JDBC server

2014-10-14 Thread Cheng Lian
On 10/14/14 7:31 PM, Neeraj Garg02 wrote: Hi All, I’ve downloaded and installed Apache Spark 1.1.0 pre-built for Hadoop 2.4. Now, I want to test two features of Spark: 1. *YARN deployment*: As per my understanding, I need to modify the “spark-defaults.conf” file with the settings mentioned a

Re: persist table schema in spark-sql

2014-10-14 Thread sadhan
Thanks Michael. We are running 1.1 and I believe that is the latest release? I am getting this error when I tried doing what you suggested: org.apache.spark.sql.parquet.ParquetTypesConverter$ (ParquetTypes.scala) java.lang.RuntimeException: [2.1] failure: ``UNCACHE'' expected but identifier CREATE fo

Re: persist table schema in spark-sql

2014-10-14 Thread sadhan
I realized my mistake of not using hiveContext. So that error is gone but now I am getting this error: == FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: Unknown field info: binary -- View this message i

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Terry Siu
Hi Michael, That worked for me. At least I’m now further than I was. Thanks for the tip! -Terry From: Michael Armbrust mailto:mich...@databricks.com>> Date: Monday, October 13, 2014 at 5:05 PM To: Terry Siu mailto:terry@smartfocus.com>> Cc: "user@spark.apache.org

TF-IDF in Spark 1.1.0

2014-10-14 Thread Burke Webster
I'm following the Mllib example for TF-IDF and ran into a problem due to my lack of knowledge of Scala and spark. Any help would be greatly appreciated. Following the Mllib example I could do something like this: import org.apache.spark.rdd.RDD import org.apache.spark.SparkContext import org.apa

foreachPartition and task status

2014-10-14 Thread Salman Haq
Hi, In my application, I am successfully using foreachPartition to write large amounts of data into a Cassandra database. What is the recommended practice if the application wants to know that the tasks have completed for all partitions? Thanks, Salman

Re: S3 Bucket Access

2014-10-14 Thread Ranga
Thanks for the pointers. I verified that the access key-id/secret used are valid. However, the secret may contain "/" at times. The issues I am facing are as follows: - The EC2 instances are set up with an IAMRole () and don't have a static key-id/secret - All of the EC2 instances have acc

Re: foreachPartition and task status

2014-10-14 Thread Sean McNamara
Are you using spark streaming? On Oct 14, 2014, at 10:35 AM, Salman Haq wrote: > Hi, > > In my application, I am successfully using foreachPartition to write large > amounts of data into a Cassandra database. > > What is the recommended practice if the application wants to know that the > ta

RE: Spark Cluster health check

2014-10-14 Thread Tarun Garg
Thanks for your response, it is not about infrastructure because I am using EC2 machines and Amazon CloudWatch can provide EC2 nodes' CPU usage and memory usage details, but I need to send notifications in situations like processing delay, total delay, maximum rate being low, etc. Tarun Date: Tue, 14 Oc

Re: Spark can't find jars

2014-10-14 Thread Jimmy McErlain
So the only way that I could make this work was to build a fat jar file as suggested earlier. To me (and I am no expert) it seems like this is a bug. Everything was working for me prior to our upgrade to Spark 1.1 on Hadoop 2.2 but now it seems to not... ie packaging my jars locally then pushing

Re: SparkSQL: StringType for numeric comparison

2014-10-14 Thread Michael Armbrust
It's much more efficient to store and compute on numeric types than string types. On Tue, Oct 14, 2014 at 1:25 AM, invkrh wrote: > Thank you, Michael. > > In Spark SQL DataType, we have a lot of types, for example, ByteType, > ShortType, StringType, etc. > > These types are used to form the table
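
A small sketch of what that means when building a schema programmatically (Spark 1.1 API; an existing SparkContext sc and SQLContext sqlContext are assumed, and the table, file, and column names are made up):

    import org.apache.spark.sql._

    // declare age as IntegerType rather than StringType so comparisons stay numeric
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val rows = sc.textFile("people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
    sqlContext.applySchema(rows, schema).registerTempTable("people")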

spark 1.1.0/yarn hang

2014-10-14 Thread tian zhang
Hi, I have a spark 1.1.0 YARN installation. I am using spark-submit to run a simple application. From the console output, I have 769 partitions and after task 768 in stage 0 (count) finished, it hangs. I used jstack to dump the stack and it shows it is waiting ... Any suggestion what might go

Re: S3 Bucket Access

2014-10-14 Thread Gen
Hi, If I remember well, Spark cannot use the IAM role credentials to access S3. It first uses the id/key in the environment; if that is null, it uses the value in the core-site.xml file. So, the IAM role is not useful for Spark. The same problem happens if you want to use distcp co

Re: foreachPartition and task status

2014-10-14 Thread Salman Haq
On Tue, Oct 14, 2014 at 12:42 PM, Sean McNamara wrote: > Are you using spark streaming? > > No, not at this time.

Re: S3 Bucket Access

2014-10-14 Thread Ranga
Thanks for the input. Yes, I did use the "temporary" access credentials provided by the IAM role (also detailed in the link you provided). The session token needs to be specified and I was looking for a way to set that in the header (which doesn't seem possible). Looks like a static key/secret is t

Re: Spark Cluster health check

2014-10-14 Thread Akhil Das
Yes, for that you can tweak Nagios a bit or you can write a custom monitoring application which will check the processing delay etc. Thanks Best Regards On Tue, Oct 14, 2014 at 10:16 PM, Tarun Garg wrote: > Thanks for your response, it is not about infrastructure because I am > using EC2 machine

RE: Spark Cluster health check

2014-10-14 Thread Tarun Garg
I am planning to write a custom monitoring application; for that I am analyzing org.apache.spark.streaming.scheduler.StreamingListener. Is there another Spark Streaming API which can give me insight into the cluster, like total processing time, delay, etc.? Tarun Date: Tue, 14 Oct 2014 23:19:47 +0530 S

Re: SparkStreaming program does not start

2014-10-14 Thread spr
Thanks Abraham Jacob, Tobias Pfeiffer, Akhil Das-2, and Sean Owen for your helpful comments. Cockpit error on my part in just putting the .scala file as an argument rather than redirecting stdin from it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spa

Re: S3 Bucket Access

2014-10-14 Thread Ranga
One related question. Could I specify the " com.amazonaws.services.s3.AmazonS3Client" implementation for the " fs.s3.impl" parameter? Let me try that and update this thread with my findings. On Tue, Oct 14, 2014 at 10:48 AM, Ranga wrote: > Thanks for the input. > Yes, I did use the "temporary"

RDD Indexes and how to fetch all edges with a given label

2014-10-14 Thread Soumitra Johri
Hi All, I have a Graph with millions of edges. Edges are represented by org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = MappedRDD[4] . I have two questions : 1)How can I fetch all the nodes with a given edge label ( all edges with a given property ) 2) Is it possible to create i

spark sql union all is slow

2014-10-14 Thread shuluster
I have many tables of the same schema; they are partitioned by time. For example, one id could be in many of those tables. I would like to find aggregations of such ids. Originally these tables are located on HDFS as files. Once a table schemaRDD is loaded, I cacheTable on it. Each table is around 30m - 1

Re: Spark Streaming

2014-10-14 Thread st553
Have you solved this issue? I'm also wondering how to stream an existing file. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-tp14306p16406.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

SPARK_SUBMIT_CLASSPATH question

2014-10-14 Thread Greg Hill
It seems to me that SPARK_SUBMIT_CLASSPATH does not support wildcards in the paths you add, the way other tools do. For some reason it doesn't pick up the classpath information from yarn-site.xml either, it seems, when running on YARN. I'm having to manually add every single depe

Re: graphx - mutable?

2014-10-14 Thread ll
hi again. just want to check in again to see if anyone could advise on how to implement a "mutable, growing graph" with graphx? we're building a graph that grows over time. it adds more vertices and edges every iteration of our algorithm. it doesn't look like there is an obvious way to add a

Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-14 Thread SK
Hi, I am trying to implement simple sentiment analysis of Twitter streams in Spark/Scala. I am getting an exception and it appears when I combine SparkContext with StreamingContext in the same program. When I read the positive and negative words using only SparkContext.textFile (without creating

mllib CoordinateMatrix

2014-10-14 Thread ll
after creating a coordinate matrix from my rdd[matrixentry]... 1. how can i get/query the value at coordinate (i, j)? 2. how can i set/update the value at coordinate (i, j)? 3. how can i get all the values on a specific row i, ideally as a vector? thanks! -- View this message in context:

Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi guys, I am new to Spark. When I run Spark KMeans (org.apache.spark.mllib.clustering.KMeans) on a small dataset, it works great. However, when using a large dataset with 1.5 million vectors, it just hangs there at some reduceByKey/collectAsMap stages (attached image shows the corresponding UI).

Re: graphx - mutable?

2014-10-14 Thread Ankur Dave
On Tue, Oct 14, 2014 at 12:36 PM, ll wrote: > hi again. just want to check in again to see if anyone could advise on how > to implement a "mutable, growing graph" with graphx? > > we're building a graph is growing over time. it adds more vertices and > edges every iteration of our algorithm. >

A question about streaming throughput

2014-10-14 Thread danilopds
Hi, I'm learning about Apache Spark Streaming and I'm doing some tests. Now, I have a modified version of the app NetworkWordCount that performs a reduceByKeyAndWindow with a window of 10 seconds at intervals of 5 seconds. I'm also using the function to measure the rate of records/second like thi

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi guys, An interesting thing: for the input dataset which has 1.5 million vectors, if I set the KMeans k_value = 100 or k_value = 50, it hangs as mentioned above. However, if I decrease k_value = 10, the same error still appears in the log but the application finished successfully, without observa

Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
thanks ankur. indexedrdd sounds super helpful! a related question, what is the best way to update the values of existing vertices and edges? On Tue, Oct 14, 2014 at 4:30 PM, Ankur Dave wrote: > On Tue, Oct 14, 2014 at 12:36 PM, ll wrote: > >> hi again. just want to check in again to see if a

Re: TF-IDF in Spark 1.1.0

2014-10-14 Thread Xiangrui Meng
You cannot recover the document from the TF-IDF vector, because HashingTF is not reversible. You can assign each document a unique ID, and join back the result after training. HashingTF can transform individual records: val docs: RDD[(String, Seq[String])] = ... val tf = new HashingTF() val tfWithI
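
The snippet above is cut off in the archive; a sketch of the full idea under the same assumptions (document IDs carried alongside the vectors so results can be joined back; variable names and sample data are invented, and an existing SparkContext sc is assumed):

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val docs: RDD[(String, Seq[String])] =
      sc.parallelize(Seq(("doc1", Seq("spark", "mllib")), ("doc2", Seq("spark", "sql"))))
    val tf = new HashingTF()
    val tfWithIds: RDD[(String, Vector)] = docs.mapValues(doc => tf.transform(doc))
    val idf = new IDF().fit(tfWithIds.values)
    val tfidf: RDD[Vector] = idf.transform(tfWithIds.values)
    // transform is a per-record map, so zipping the ids back keeps the pairing intact
    val tfidfWithIds: RDD[(String, Vector)] = tfWithIds.keys.zip(tfidf)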

Re: mllib CoordinateMatrix

2014-10-14 Thread Reza Zadeh
Hello, CoordinateMatrix is in its infancy, and right now is only a placeholder. To get/set the value at (i,j), you should map the entries rdd using the usual rdd map operation, and change the relevant entries. To get the values on a specific row, you can call toIndexedRowMatrix(), which returns
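
To make the suggestions above concrete, a rough sketch (Spark 1.1 MLlib; entries is the RDD[MatrixEntry] from the original question, and i, j, newValue are assumed to be defined; illustrative only):

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    val mat = new CoordinateMatrix(entries)
    // 1. get the value at (i, j) by filtering the entries RDD
    val vij = mat.entries.filter(e => e.i == i && e.j == j).map(_.value).collect().headOption
    // 2. "update" (i, j) by mapping into a new matrix; the underlying RDD is immutable
    val updated = new CoordinateMatrix(mat.entries.map {
      case MatrixEntry(r, c, _) if r == i && c == j => MatrixEntry(r, c, newValue)
      case e => e
    })
    // 3. all values on row i, via the indexed-row representation
    val rowI = mat.toIndexedRowMatrix().rows.filter(_.index == i).map(_.vector)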

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
What is the feature dimension? I saw you used 100 partitions. How many cores does your cluster have? -Xiangrui On Tue, Oct 14, 2014 at 1:51 PM, Ray wrote: > Hi guys, > > An interesting thing, for the input dataset which has 1.5 million vectors, > if set the KMeans's k_value = 100 or k_value = 50,

Re: graphx - mutable?

2014-10-14 Thread Ankur Dave
On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh wrote: > a related question, what is the best way to update the values of existing > vertices and edges? > Many of the Graph methods deal with updating the existing values in bulk, including mapVertices, mapEdges, mapTriplets, mapReduceTriplets, and out
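
A tiny sketch of those bulk-update operators (assuming a Graph[Int, String] named graph and an existing SparkContext sc; purely illustrative):

    import org.apache.spark.graphx._

    // rewrite every vertex attribute in bulk
    val g2 = graph.mapVertices { case (id, attr) => attr + 1 }
    // rewrite edge attributes with the neighboring vertex attributes visible
    val g3 = g2.mapTriplets(t => s"${t.srcAttr}->${t.dstAttr}")
    // join in new values for a subset of vertices, keeping the old value elsewhere
    val updates = sc.parallelize(Seq((1L, 42), (2L, 7)))
    val g4 = g2.joinVertices(updates)((id, oldAttr, newAttr) => newAttr)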

Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
great, thanks! On Tue, Oct 14, 2014 at 5:08 PM, Ankur Dave wrote: > On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh > wrote: > >> a related question, what is the best way to update the values of existing >> vertices and edges? >> > > Many of the Graph methods deal with updating the existing values

Re: mllib CoordinateMatrix

2014-10-14 Thread Duy Huynh
thanks reza! On Tue, Oct 14, 2014 at 5:02 PM, Reza Zadeh wrote: > Hello, > > CoordinateMatrix is in its infancy, and right now is only a placeholder. > > To get/set the value at (i,j), you should map the entries rdd using the > usual rdd map operation, and change the relevant entries. > > To get

submitted uber-jar not seeing spark-assembly.jar at worker

2014-10-14 Thread Tamas Sandor
Hi, I'm a rookie in Spark, but I hope someone can help me out. I'm writing an app that I'm submitting to my spark-master that has a worker on a separate node. It uses spark-cassandra-connector, and since it depends on guava-v16 and it conflicts with the default spark-1.1.0-assembly's guava-v14.1 I bui

Re: S3 Bucket Access

2014-10-14 Thread Rishi Pidva
As per EMR documentation: http://docs.amazonaws.cn/en_us/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html Access AWS Resources Using IAM Roles If you've launched your cluster with an IAM role, applications running on the EC2 instances of that cluster can use the IAM role to obtain temp

Re: S3 Bucket Access

2014-10-14 Thread Ranga
Thanks Rishi. That is exactly what I am trying to do now :) On Tue, Oct 14, 2014 at 2:41 PM, Rishi Pidva wrote: > > As per EMR documentation: > http://docs.amazonaws.cn/en_us/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html > Access AWS Resources Using IAM Roles > > If you've launched y

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Xiangrui, The input dataset has 1.5 million sparse vectors. Each sparse vector has a dimension(cardinality) of 9153 and has less than 15 nonzero elements. Yes, if I set num-executors = 200, from the hadoop cluster scheduler, I can see the application got 201 vCores. From the spark UI, I can

Submission to cluster fails (Spark SQL; NoSuchMethodError on SchemaRDD)

2014-10-14 Thread Michael Campbell
Hey all, I'm trying a very basic spark SQL job and apologies as I'm new to a lot of this, but I'm getting this failure: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.SchemaRDD.take(I)[Lorg/apache/spark/sql/catalyst/expressions/Row; I've tried a variety of uber-jar c

MLlib - Does LogisticRegressionModel.clearThreshold() no longer work?

2014-10-14 Thread Aris
Hi folks, When I am predicting binary 1/0 responses with LogisticRegressionWithSGD, it returns a LogisticRegressionModel. In Spark 1.0.X I was using the clearThreshold method on the model to get the raw predicted probabilities when I ran the predict() method... It appears now that rather than gett

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Burak Yavuz
Hi Ray, The reduceByKey / collectAsMap does a lot of calculations. Therefore it can take a very long time if: 1) The parameter number of runs is set very high 2) k is set high (you have observed this already) 3) data is not properly repartitioned It seems that it is hanging, but there is a lot of

Re: MLlib - Does LogisticRegressionModel.clearThreshold() no longer work?

2014-10-14 Thread Aris
Wow... I just tried LogisticRegressionWithLBFGS, and using clearThreshold() DOES IN FACT work. It appears that the LogisticRegressionWithSGD returns a model whose method is broken!! On Tue, Oct 14, 2014 at 3:14 PM, Aris wrote: > Hi folks, > > When I am predicting Binary 1/0 responses with LogsticRe

rule engine based on spark

2014-10-14 Thread salemi
hi, is there a rule engine based on spark? i'd like to allow the business users to define their rules in a language and the execution of the rules should be done in spark. Thanks, Ali -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/rule-engine-based-on-spark-t

Re: MLlib - Does LogisticRegressionModel.clearThreshold() no longer work?

2014-10-14 Thread Xiangrui Meng
LBFGS is better. If your data is easily separable, LR might return values very close or equal to either 0.0 or 1.0. It is rare but it may happen. -Xiangrui On Tue, Oct 14, 2014 at 3:18 PM, Aris wrote: > Wow...I just tried LogisticRegressionWithLBFGS, and using clearThreshold() > DOES IN FACT work.
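
For completeness, a sketch of the working path described above (Spark 1.1; training: RDD[LabeledPoint] and testFeatures: RDD[Vector] are assumed to exist):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    val model = new LogisticRegressionWithLBFGS().run(training)
    model.clearThreshold()                         // predict() now returns raw probabilities
    val probabilities = model.predict(testFeatures)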

Re: Spark Streaming

2014-10-14 Thread hutashan
Yes.. I solved this problem using fileStream instead of textfile StreamingContext scc = new StreamingContext(conf, new Duration(1)); ClassTag k = ClassTag$.MODULE$.apply(LongWritable.class); ClassTag v = ClassTag$.MODULE$.apply(Text.class);

Re: RDD Indexes and how to fetch all edges with a given label

2014-10-14 Thread Soumitra Johri
Hi, With respect to the first issue, one possible way is to filter the graph via 'graph.subgraph(epred = e => e.attr == "edgeLabel")' , but I am still curious if we can index RDDs. Warm Regards Soumitra On Tue, Oct 14, 2014 at 2:46 PM, Soumitra Johri < soumitra.siddha...@gmail.com> wrote: >

Re: Does SparkSQL work with custom defined SerDe?

2014-10-14 Thread Chen Song
Looks like it may be related to https://issues.apache.org/jira/browse/SPARK-3807. I will build from branch 1.1 to see if the issue is resolved. Chen On Tue, Oct 14, 2014 at 10:33 AM, Chen Song wrote: > Sorry for bringing this out again, as I have no clue what could have > caused this. > > I tu

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Burak, In Kmeans, I used k_value = 100, num_iteration = 2, and num_run = 1. In the current test, I increase num-executors = 200. In the storage info 2 (as shown below), 11 executors are used (I think the data is kind of balanced) and others have zero memory usage.

Re: rule engine based on spark

2014-10-14 Thread Mayur Rustagi
We are developing something similar on top of Streaming. Could you detail some rule functionality you are looking for. We are developing a dsl for data processing on top of streaming as well as static data enabled on Spark. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanal

Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-14 Thread Abraham Jacob
Hi All, I am trying to understand what is going on in my simple WordCount Spark Streaming application. Here is the setup - I have a Kafka producer that is streaming words (lines of text). On the flip side, I have a spark streaming application that uses the high-level Kafka/Spark connector to read

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread DB Tsai
I saw similar bottleneck in reduceByKey operation. Maybe we can implement treeReduceByKey to reduce the pressure on single executor reducing the particular key. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedi

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
Just ran a test on mnist8m (8m x 784) with k = 100 and numIter = 50. It worked fine. Ray, the error log you posted is after cluster termination, which is not the root cause. Could you search your log and find the real cause? On the executor tab screenshot, I saw only 200MB is used. Did you cache th

Re: Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-14 Thread Abraham Jacob
Yeah... it totally should be... There is nothing fancy in there - import org.apache.spark.api.java.function.Function2; public class ReduceWords implements Function2 { private static final long serialVersionUID = -6076139388549335886L; public Integer call(Integer first, Integer second){ return

How do I get at a SparkContext, or better a JavaSparkContext, from the middle of a function

2014-10-14 Thread Steve Lewis
I am running a couple of functions on an RDD which require access to data on the file system known to the context. If I create a class with a context as a member variable I get a serialization error, so I am running my function on some slave and I want to read in data from a Path defined by a stri

Re: Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-14 Thread Abraham Jacob
The results are no different - import org.apache.spark.api.java.function.Function2; import java.io.Serializable; public class ReduceWords implements Serializable, Function2 { private static final long serialVersionUID = -6076139388549335886L; public Integer call(Integer first, Integer second){

Spark output to s3 extremely slow

2014-10-14 Thread anny9699
Hi, I found writing output back to s3 using rdd.saveAsTextFile() is extremely slow, much slower than reading from s3. Is there a way to make it faster? The rdd has 150 partitions so parallelism is enough I assume. Thanks a lot! Anny -- View this message in context: http://apache-spark-user-li

How to create Track per vehicle using spark RDD

2014-10-14 Thread Manas Kar
Hi, I have an RDD containing Vehicle Number, timestamp, Position. I want the equivalent of the "lag" function for my RDD, to be able to create a track segment for each Vehicle. Any help? PS: I have tried reduceByKey and then splitting the List of positions into tuples. For me it runs out of memory eve

Re: How to patch sparkSQL on EC2?

2014-10-14 Thread Christos Kozanitis
ahhh never mind… I didn’t notice that a spark-assembly jar file gets produced after compiling the whole spark suite… So no more manual editing of the jar file of the AMI for now! Christos On Oct 10, 2014, at 12:15 AM, Christos Kozanitis wrote: > Hi > > I have written a few extensions for sp

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Xiangrui, Thanks for the guidance. I read the log carefully and found the root cause. KMeans, by default, uses KMeans++ as the initialization mode. According to the log file, the 70-minute hanging is actually the computing time of Kmeans++, as pasted below: 14/10/14 14:48:18 INFO DAGSchedule
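
If the k-means|| initialization itself is the dominant cost, one thing to try (a sketch, not verified on this dataset; data: RDD[Vector] is assumed to exist and be cached) is random initialization:

    import org.apache.spark.mllib.clustering.KMeans

    val model = new KMeans()
      .setK(100)
      .setMaxIterations(2)
      .setRuns(1)
      .setInitializationMode(KMeans.RANDOM)   // default is KMeans.K_MEANS_PARALLEL ("k-means||")
      .run(data)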

something about rdd.collect

2014-10-14 Thread randylu
My code is as follows: *documents.flatMap(case words => words.map(w => (w, 1))).reduceByKey(_ + _).collect()* In the driver's log, reduceByKey() has finished, but collect() seems to keep running and just can't finish. In addition, there are about 200,000,000 words that need to be collected. Is it t

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Yin Huai
Hello Terry, How many columns does pqt_rdt_snappy have? Thanks, Yin On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu wrote: > Hi Michael, > > That worked for me. At least I’m now further than I was. Thanks for the > tip! > > -Terry > > From: Michael Armbrust > Date: Monday, October 13, 2014

Re: something about rdd.collect

2014-10-14 Thread Reynold Xin
Hi Randy, collect essentially transfers all the data to the driver node. You definitely wouldn’t want to collect 200 million words. It is a pretty large number and you can run out of memory on your driver with that much data. --  Reynold Xin On October 14, 2014 at 9:26:13 PM, randylu (randyl.
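
A couple of alternatives when the result is too large for the driver (a sketch; counts is assumed to be the RDD[(String, Int)] produced by the reduceByKey above, and the output path is a placeholder):

    // look at only a small sample on the driver
    counts.take(20).foreach(println)
    // or keep the result distributed and write it out instead of collecting
    counts.saveAsTextFile("hdfs:///tmp/word-counts")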

Kryo serializer

2014-10-14 Thread dizzy5112
Hi all, how can i tell if my Kryo serializer is actually working? I have a class which extends Serializable and i have included the following imports: import com.esotericsoftware.kryo.Kryo import org.apache.spark.serializer.KryoRegistrator i also have included a class MyRegistrator extends KryoReg

How to write data into Hive partitioned Parquet table?

2014-10-14 Thread Banias H
Hi, I am still new to Spark. Sorry if similar questions are asked here before. I am trying to read a Hive table; then run a query and save the result into a Hive partitioned Parquet table. For example, I was able to run the following in Hive: INSERT INTO TABLE target_table PARTITION (partition_fi

pyspark - extract 1 field from string

2014-10-14 Thread Chop
I'm stumped with how to take 1 RDD that has lines like: 4,01012009,00:00,1289,4 5,01012009,00:00,1326,4 6,01012009,00:00,1497,7 and produce a new RDD with just the 4th field from each line (1289, 1326, 1497) I don't want to apply a conditional, I just want to grab that one field from each lin

Re: spark sql union all is slow

2014-10-14 Thread Pei-Lun Lee
Hi, You can merge them into one table by: sqlContext.table("table_1").unionAll(sqlContext.table("table_2")).unionAll(sqlContext.table("table_3")).registerTempTable("table_all") Or load them in one call by: sqlContext.parquetFile("table_1.parquet,table_2.parquet,table_3.p

Re: something about rdd.collect

2014-10-14 Thread randylu
Thanks rxin, I still have a doubt about collect(). The number of words before reduceByKey() is about 200 million, and after reduceByKey() it decreases to 18 million. Memory for the driver is initialized to 15GB; I print out "runtime.freeMemory()" before reduceByKey(), and it indicates 13GB of free memory.

Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-14 Thread Ilya Ganelin
Hello all . Does anyone else have any suggestions? Even understanding what this error is from would help a lot. On Oct 11, 2014 12:56 AM, "Ilya Ganelin" wrote: > Hi Akhil - I tried your suggestions and tried varying my partition sizes. > Reducing the number of partitions led to memory errors (pre

Re: something about rdd.collect

2014-10-14 Thread randylu
If memory is not enough, an OutOfMemory exception should occur, but there is nothing in the driver's log. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/something-about-rdd-collect-tp16451p16461.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: pyspark - extract 1 field from string

2014-10-14 Thread Davies Liu
rdd.map(lambda line: int(line.split(',')[3])) On Tue, Oct 14, 2014 at 6:58 PM, Chop wrote: > I'm stumped with how to take 1 RDD that has lines like: > > 4,01012009,00:00,1289,4 > 5,01012009,00:00,1326,4 > 6,01012009,00:00,1497,7 > > and produce a new RDD with just the 4th field from each line

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
I used "k-means||", which is the default. And it took less than 1 minute to finish. 50 iterations took less than 25 minutes on a cluster of 9 m3.2xlarge EC2 nodes. Which deploy mode did you use? Is it yarn-client? -Xiangrui On Tue, Oct 14, 2014 at 6:03 PM, Ray wrote: > Hi Xiangrui, > > Thanks for

Re: How to create Track per vehicle using spark RDD

2014-10-14 Thread Mohit Singh
Perhaps it's just me, but the "lag" function isn't familiar to me... But have you tried configuring Spark appropriately? http://spark.apache.org/docs/latest/configuration.html On Tue, Oct 14, 2014 at 5:37 PM, Manas Kar wrote: > Hi, > I have an RDD containing Vehicle Number , timestamp, Position.

Re: A question about streaming throughput

2014-10-14 Thread Sean Owen
Hm, is this not just showing that you're rate-limited by how fast you can get events to the cluster? you have more network bottleneck between the data source and cluster in the cloud than your local cluster. On Tue, Oct 14, 2014 at 9:44 PM, danilopds wrote: > Hi, > I'm learning about Apache Spark

Re: Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-14 Thread Sean Owen
The problem is not ReduceWords, since it is already Serializable by implementing Function2. Indeed the error tells you just what is unserializable: KafkaStreamingWordCount, your driver class. Something is causing a reference to the containing class to be serialized in the closure. The best fix is

adding element into MutableList throws an error type mismatch

2014-10-14 Thread Henry Hung
Hi All, Could someone shed some light on why adding an element into a MutableList can result in a type mismatch, even if I’m sure that the class type is right? Below is the sample code I ran in the spark 1.0.2 console; at the end of the output, there is a type mismatch error: Welcome to

Re: adding element into MutableList throws an error type mismatch

2014-10-14 Thread Sean Owen
Another instance of https://issues.apache.org/jira/browse/SPARK-1199 , fixed in subsequent versions. On Wed, Oct 15, 2014 at 7:40 AM, Henry Hung wrote: > Hi All, > > > > Could someone shed a light to why when adding element into MutableList can > result in type mismatch, even if I’m sure that th

Re: "Initial job has not accepted any resources" when launching SparkPi example on a worker.

2014-10-14 Thread Theodore Si
Can anyone help me, please? On 10/14/2014 9:58 PM, Theodore Si wrote: Hi all, I have two nodes, one as master(*host1*) and the other as worker(*host2*). I am using the standalone mode. After starting the master on host1, I run $ export MASTER=spark://host1:7077 $ bin/run-example SparkPi 10 on hos
