Re: Can a Spark Driver Program be a REST Service by itself?
You can try using Spark JobServer: https://github.com/spark-jobserver/spark-jobserver

On Wed, Jul 1, 2015 at 4:32 PM, Spark Enthusiast sparkenthusi...@yahoo.in wrote:

Folks,

My use case is as follows: my driver program will be aggregating a bunch of event streams and acting on them. The action on the aggregated events is configurable and can change dynamically.

One way I can think of is to run the Spark driver as a service, where a config push can be caught via an API that the driver exports. Can I have a Spark driver program run as a REST service by itself? Is this a common use case? Is there a better way to solve my problem?

Thanks

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
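If you do want the driver itself to expose an endpoint, the sketch below embeds a tiny HTTP server (the JDK's built-in com.sun.net.httpserver) inside the driver JVM. The port, the /config path, and the currentAction variable are illustrative assumptions, not JobServer's API:

    import java.net.InetSocketAddress
    import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
    import org.apache.spark.{SparkConf, SparkContext}

    object RestDriver {
      // Hypothetical action name pushed via the endpoint; read by the streaming logic.
      @volatile var currentAction: String = "default"

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RestDriver"))

        // A small HTTP endpoint living inside the driver JVM (port 8090 is arbitrary).
        val server = HttpServer.create(new InetSocketAddress(8090), 0)
        server.createContext("/config", new HttpHandler {
          override def handle(exchange: HttpExchange): Unit = {
            currentAction = scala.io.Source.fromInputStream(exchange.getRequestBody).mkString.trim
            val resp = s"action set to $currentAction".getBytes("UTF-8")
            exchange.sendResponseHeaders(200, resp.length)
            exchange.getResponseBody.write(resp)
            exchange.close()
          }
        })
        server.start()

        // ... set up the stream aggregation here, consulting currentAction ...
      }
    }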
Re: How many executors can I acquire in standalone mode ?
I believe you would be restricted by the number of cores you have in your cluster. Having a worker running without a core is useless.

On Tue, May 26, 2015 at 3:04 PM, canan chen ccn...@gmail.com wrote:

In Spark standalone mode, there will be one executor per worker. I am wondering how many executors I can acquire when I submit an app. Is it greedy mode (as many as I can acquire)?

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
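If you don't want the greedy default in standalone mode, you can cap the total cores an application claims; a minimal sketch (the value 16 is just an example):

    // Cap the total cores this app takes across the standalone cluster;
    // without spark.cores.max an app grabs all available cores by default.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("capped-app")
      .set("spark.cores.max", "16")
    val sc = new org.apache.spark.SparkContext(conf)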
Re: How does spark manage the memory of executor with multiple tasks
Hi Evo,

The worker is the JVM, and an executor runs on the JVM. After Spark 1.4 you would be able to run multiple executors on the same JVM/worker: https://issues.apache.org/jira/browse/SPARK-1706

Thanks
Arush

On Tue, May 26, 2015 at 2:54 PM, canan chen ccn...@gmail.com wrote:

I think the concept of task in Spark should be on the same level as task in MR. Usually in MR, we need to specify the memory for each mapper/reducer task. And I believe executor is not a user-facing concept; it's a Spark-internal concept. Spark users don't need to know the concept of executor, but do need to know the concept of task.

On Tue, May 26, 2015 at 5:09 PM, Evo Eftimov evo.efti...@isecc.com wrote:

This is the first time I hear that "one can specify the RAM per task" – the RAM is granted per Executor (JVM). On the other hand, each Task operates on ONE RDD Partition – so you can say that this is "the RAM allocated to the Task to process" – but it is still within the boundaries allocated to the Executor (JVM) within which the Task is running. Also, while running, any Task, like any JVM Thread, can request as much additional RAM (e.g. for new Object instances) as there is available in the Executor, aka JVM Heap.

From: canan chen [mailto:ccn...@gmail.com]
Sent: Tuesday, May 26, 2015 9:30 AM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: How does spark manage the memory of executor with multiple tasks

Yes, I know that one task represents a JVM thread. This is what confused me. Usually users want to specify the memory at the task level, so how can I do that if a task is thread-level and multiple tasks run in the same executor? And I don't even know how many threads there will be. Besides that, if one task causes OOM, it would cause the other tasks in the same executor to fail too. There's no isolation between tasks.

On Tue, May 26, 2015 at 4:15 PM, Evo Eftimov evo.efti...@isecc.com wrote:

An Executor is a JVM instance spawned and running on a Cluster Node (Server machine). A Task is essentially a JVM Thread – you can have as many Threads as you want per JVM. You will also hear about "Executor Slots" – these are essentially the CPU Cores available on the machine and granted for use to the Executor.

Ps: what creates ongoing confusion here is that the Spark folks have "invented" their own terms to describe the design of what is essentially a Distributed OO Framework facilitating Parallel Programming and Data Management in a Distributed Environment, BUT have not provided a clear dictionary/explanation linking these "inventions" with standard concepts familiar to every Java, Scala etc. developer.

From: canan chen [mailto:ccn...@gmail.com]
Sent: Tuesday, May 26, 2015 9:02 AM
To: user@spark.apache.org
Subject: How does spark manage the memory of executor with multiple tasks

Since Spark can run multiple tasks in one executor, I am curious to know how Spark manages memory across these tasks. Say one executor takes 1GB of memory; if this executor can run 10 tasks simultaneously, then each task can consume 100MB on average. Do I understand that correctly? Otherwise it doesn't make sense to me that Spark runs multiple tasks in one executor.

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
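For reference, a sketch of the configuration knobs behind this discussion (the values are illustrative; there is no hard per-task memory limit, only per-executor):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("memory-demo")
      .set("spark.executor.memory", "4g") // heap shared by all tasks in one executor JVM
      .set("spark.executor.cores", "4")   // 4 concurrent task slots per executor
      .set("spark.task.cpus", "1")        // cores claimed by each task
    // Roughly: up to 4 concurrent tasks compete for the same 4g heap;
    // a task that OOMs takes the whole executor (and its sibling tasks) down.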
Re: Websphere MQ as a data source for Apache Spark Streaming
Hi Umesh,

You can connect to Spark Streaming with MQTT; refer to the example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala

Thanks
Arush

On Mon, May 25, 2015 at 3:43 PM, umesh9794 umesh.chaudh...@searshc.com wrote:

I was digging into the possibilities for WebSphere MQ as a data source for spark-streaming because it is needed in one of our use cases. I got to know that MQTT (http://mqtt.org/) is the protocol that supports communication with MQ data structures, but since I am a newbie to Spark Streaming I need some working examples for the same. Did anyone try to connect MQ with Spark Streaming? Please advise the best way of doing so.

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
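The core of that example boils down to the sketch below; the broker URL and topic are placeholders for your WebSphere MQ MQTT endpoint:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.mqtt.MQTTUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    // Receive each MQTT message as a line in a String DStream.
    val lines = MQTTUtils.createStream(ssc, "tcp://mq-host:1883", "some/topic")
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()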
Re: Spark SQL performance issue.
Hi,

Can you share your Web UI view depicting your task-level breakup? I can see many things that can be improved:

1. JavaRDD<Person> rdds = ...; rdds.cache(); — this caching is not needed, as you are not reusing the RDD for any other action.
2. Instead of collecting as a list, save the result as a text file if you can. That would avoid moving the results to the driver.

Thanks
Arush

On Thu, Apr 23, 2015 at 2:47 PM, Nikolay Tikhonov tikhonovnico...@gmail.com wrote:

> why are you caching both rdd and table?
I try to cache all the data to avoid the bad performance for the first query. Is that right?
> Which stage of the job is slow?
The query is run many times on one sqlContext and each query execution takes 1 second.

2015-04-23 11:33 GMT+03:00 ayan guha guha.a...@gmail.com:

Quick questions: why are you caching both the RDD and the table? Which stage of the job is slow?

On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com wrote:

Hi, I have a Spark SQL performance issue. My code contains a simple JavaBean:

    public class Person implements Externalizable {
        private int id;
        private String name;
        private double salary;
    }

Apply a schema to an RDD and register the table:

    JavaRDD<Person> rdds = ...
    rdds.cache();
    DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
    dataFrame.registerTempTable("person");
    sqlContext.cacheTable("person");

Run the SQL query:

    sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY AND salary <= XXX").collectAsList()

I launch a standalone cluster which contains 4 workers. Each node runs on a machine with 8 CPUs and 15 GB of memory. When I run the query over an RDD which contains 1 million persons it takes 1 minute. Can somebody tell me how to tune the performance?

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
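A sketch of point 2, writing the result from the executors instead of collecting it to the driver (the output path and salary bounds are examples):

    // Write matching rows straight to HDFS; nothing is funneled through the driver.
    val result = sqlContext.sql(
      "SELECT id, name, salary FROM person WHERE salary >= 1000 AND salary <= 2000")
    result.rdd.map(_.mkString(",")).saveAsTextFile("hdfs:///output/salaries")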
Re: problem with spark thrift server
Hi,

What do you mean by "disable the driver"? What are you trying to achieve?

Thanks
Arush

On Thu, Apr 23, 2015 at 12:29 PM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote:

Hi, I have a question about the Spark thrift server. I deployed Spark on YARN and found that if the Spark driver goes down, the Spark application crashes on YARN. I'd appreciate any suggestions and ideas. Thank you!

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
[SQL] DROP TABLE should also uncache table
Hi,

As per JIRA this issue is resolved, but I am still facing it: SPARK-2734 - DROP TABLE should also uncache table.

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: [SQL] DROP TABLE should also uncache table
Yes, I am able to reproduce the problem. Do you need the scripts to create the tables?

On Thu, Apr 16, 2015 at 10:50 PM, Yin Huai yh...@databricks.com wrote:

Can you share the code that reproduces the problem?

On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote:

Hi,

As per JIRA this issue is resolved, but I am still facing it: SPARK-2734 - DROP TABLE should also uncache table.

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
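For reference, a minimal sketch of the reported behavior (the table names are examples):

    // Cache a table, then drop it; per SPARK-2734 the DROP should also
    // evict the cached entry instead of leaving stale blocks behind.
    sqlContext.sql("CREATE TABLE t AS SELECT * FROM src")
    sqlContext.cacheTable("t")
    sqlContext.sql("DROP TABLE t")
    // Before the fix, cached data for "t" could linger in executor memory.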
Re: Can spark sql read existing tables created in hive
It seems Spark SQL accesses some more columns apart from those created by Hive. You can always recreate the tables; you would need to execute the table-creation scripts, but it would be good to avoid recreation.

On Fri, Mar 27, 2015 at 3:20 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

I did copy hive-conf.xml from the Hive installation into spark-home/conf. It does have all the metastore connection details: host, username, password, driver and others.

Snippet:

    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://host.vip.company.com:3306/HDB</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hiveuser</value>
        <description>username to use against metastore database</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>some-password</value>
        <description>password to use against metastore database</description>
      </property>
      <property>
        <name>hive.metastore.local</name>
        <value>false</value>
        <description>controls whether to connect to remove metastore server or open a new metastore server in Hive Client JVM</description>
      </property>
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
        <description>location of default database for the warehouse</description>
      </property>
    </configuration>

..

When I attempt to read a Hive table, it does not work: dw_bid does not exist. I am sure there is a way to read tables stored in HDFS (Hive) from Spark SQL. Otherwise how would anyone do analytics, since the source tables are always persisted either directly on HDFS or through Hive?

On Fri, Mar 27, 2015 at 1:15 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote:

Since Hive and Spark SQL internally use HDFS and the Hive metastore, the only thing you want to change is the processing engine. You can try to bring your hive-site.xml to %SPARK_HOME%/conf/hive-site.xml (ensure that the hive-site.xml captures the metastore connection details). It's a hack; I haven't tried it myself, but I have played around with the metastore and it should work.

On Fri, Mar 27, 2015 at 12:04 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

I have a few tables that are created in Hive. I want to transform the data stored in these Hive tables using Spark SQL. Is this even possible? So far I have seen that I can create new tables using the Spark SQL dialect. However, when I run "show tables" or do "desc hive_table" it says table not found. I am now wondering whether this support is present in Spark SQL or not.

-- Deepak

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Can spark sql read existing tables created in hive
Since Hive and Spark SQL internally use HDFS and the Hive metastore, the only thing you want to change is the processing engine. You can try to bring your hive-site.xml to %SPARK_HOME%/conf/hive-site.xml (ensure that the hive-site.xml captures the metastore connection details). It's a hack; I haven't tried it myself, but I have played around with the metastore and it should work.

On Fri, Mar 27, 2015 at 12:04 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

I have a few tables that are created in Hive. I want to transform the data stored in these Hive tables using Spark SQL. Is this even possible? So far I have seen that I can create new tables using the Spark SQL dialect. However, when I run "show tables" or do "desc hive_table" it says table not found. I am now wondering whether this support is present in Spark SQL or not.

-- Deepak

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
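Once hive-site.xml is in place, reading an existing Hive table goes through HiveContext; a sketch (the table name is an example from the thread above):

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    hiveCtx.sql("SHOW TABLES").collect().foreach(println)
    val rows = hiveCtx.sql("SELECT * FROM dw_bid LIMIT 10").collect()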
Re: Checking Data Integrity in Spark
It's not possible to configure Spark to do checks based on XMLs. You would need to write jobs to do the validations you need.

On Fri, Mar 27, 2015 at 5:13 PM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote:

Hello,

I want to check if there is any way to check the data integrity of data files. The use case is to perform data integrity checks on large files with 100+ columns and reject records (write them to another file) that do not meet criteria (such as NOT NULL, date format, etc.). Since there are a lot of columns/integrity rules, we should be able to drive the data integrity check through configuration (like XML, JSON, etc.). Please share your thoughts.

Thanks
Sathish

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
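A sketch of what such a hand-written validation job could look like; the column positions, rules, and paths are illustrative assumptions:

    import java.text.SimpleDateFormat

    // Two example rules: NOT NULL on column 0, date format on column 3.
    def isValid(cols: Array[String]): Boolean = {
      val notNullOk = cols.nonEmpty && cols(0).nonEmpty
      val dateOk = cols.length > 3 && {
        try { new SimpleDateFormat("yyyy-MM-dd").parse(cols(3)); true }
        catch { case _: java.text.ParseException => false }
      }
      notNullOk && dateOk
    }

    val records = sc.textFile("hdfs:///input/data.csv").map(_.split(","))
    records.filter(isValid).map(_.mkString(",")).saveAsTextFile("hdfs:///output/accepted")
    records.filter(r => !isValid(r)).map(_.mkString(",")).saveAsTextFile("hdfs:///output/rejected")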
Re: Windowing and Analytics Functions in Spark SQL
You can look at the Spark SQL programming guide (http://spark.apache.org/docs/1.3.0/sql-programming-guide.html) and the Spark API (http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.package).

On Thu, Mar 26, 2015 at 5:21 PM, Masf masfwo...@gmail.com wrote:

Ok, thanks. Is there some web resource where I can check the functionality supported by Spark SQL? Thanks!!! Regards, Miguel Ángel.

On Thu, Mar 26, 2015 at 12:31 PM, Cheng Lian lian.cs@gmail.com wrote:

We're working together with AsiaInfo on this. We will possibly deliver an initial version of window function support in 1.4.0, but it's not a promise yet. Cheng

On 3/26/15 7:27 PM, Arush Kharbanda wrote:

It's not yet implemented: https://issues.apache.org/jira/browse/SPARK-1442

On Thu, Mar 26, 2015 at 4:39 PM, Masf masfwo...@gmail.com wrote:

Hi. Are the windowing and analytics functions supported in Spark SQL (with HiveContext or not)? For example, they are supported in Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Is there some tutorial or documentation where I can see all the features supported by Spark SQL? Thanks!!! -- Regards, Miguel Ángel

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Windowing and Analytics Functions in Spark SQL
It's not yet implemented: https://issues.apache.org/jira/browse/SPARK-1442

On Thu, Mar 26, 2015 at 4:39 PM, Masf masfwo...@gmail.com wrote:

Hi. Are the windowing and analytics functions supported in Spark SQL (with HiveContext or not)? For example, they are supported in Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Is there some tutorial or documentation where I can see all the features supported by Spark SQL? Thanks!!! -- Regards, Miguel Ángel

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: JavaKinesisWordCountASLYARN Example not working on EMR
with 265.4 MB RAM, BlockManagerId(, ip-10-80-175-92.ec2.internal, 49982)
15/03/25 11:52:51 INFO storage.BlockManagerMaster: Registered BlockManager

Exception in thread "main" java.lang.NullPointerException
    at org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:581)
    at org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:541)
    at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:642)
    at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:75)
    at org.apache.spark.streaming.api.java.JavaStreamingContext.<init>(JavaStreamingContext.scala:132)
    at org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN.main(JavaKinesisWordCountASLYARN.java:127)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Spark SQL: Day of month from Timestamp
Hi,

You can use functions like year(date) and month(date).

Thanks
Arush

On Tue, Mar 24, 2015 at 12:46 PM, Harut Martirosyan harut.martiros...@gmail.com wrote:

Hi guys. Basically, we had to define a UDF that does that; is there a built-in function that we can use for it?

-- RGRDZ Harut

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
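With a HiveContext, the Hive date UDFs are usable directly in SQL; a sketch (the column and table names are examples):

    val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
    // day() returns the day of month in HiveQL; year()/month() work the same way.
    hiveCtx.sql("SELECT year(ts), month(ts), day(ts) FROM events").collect()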
Re: Spark sql thrift server slower than hive
A basic change needed by Spark is setting the executor memory, which defaults to 512 MB.

On Mon, Mar 23, 2015 at 10:16 AM, Denny Lee denny.g@gmail.com wrote:

How are you running your Spark instance, out of curiosity? Via YARN or standalone mode? When connecting the Spark thrift server to the Spark service, have you allocated enough memory and CPU when executing with Spark?

On Sun, Mar 22, 2015 at 3:39 AM fanooos dev.fano...@gmail.com wrote:

We have Cloudera CDH 5.3 installed on one machine. We are trying to use the Spark SQL thrift server to execute some analysis queries against a Hive table. Without any changes in the configuration, we run the following query on both Hive and the Spark SQL thrift server:

    select * from tableName;

The time taken by Spark is larger than the time taken by Hive, which is not supposed to be like that. The Hive table is mapped to JSON files stored in an HDFS directory and we are using org.openx.data.jsonserde.JsonSerDe for serialization/deserialization. Why does Spark take much more time to execute the query than Hive?

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
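A sketch of raising it (the 4g value is an example; for the thrift server the same key typically goes into conf/spark-defaults.conf as "spark.executor.memory 4g"):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "4g") // default is only 512m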
Re: Idempotent count
Hi Binh,

It stores the state as well as the unprocessed data. The state is a subset of the records that you have aggregated so far. This provides a good reference for checkpointing: http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#checkpointing

On Wed, Mar 18, 2015 at 12:52 PM, Binh Nguyen Van binhn...@gmail.com wrote:

Hi Arush,

Thank you for answering! When you say checkpoints hold metadata and data, what is the data? Is it the data that is pulled from the input source, or is it the state? If it is state, is it the same number of records that I have aggregated since the beginning, or only a subset of it? How can I limit the size of the state that is kept in the checkpoint?

Thank you
-Binh

On Tue, Mar 17, 2015 at 11:47 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote:

Hi,

Yes, Spark Streaming is capable of stateful stream processing. With or without state is a way of classifying stream processing. Checkpoints hold metadata and data.

Thanks

On Wed, Mar 18, 2015 at 4:00 AM, Binh Nguyen Van binhn...@gmail.com wrote:

Hi all,

I am new to Spark, so please forgive me if my questions are naive. I am trying to use Spark Streaming in an application that reads data from a queue (Kafka), does some aggregation (sum, count...), and then persists the result to an external storage system (MySQL, VoltDB...).

From my understanding of Spark Streaming, I can have two ways of doing aggregation:

- Stateless: I don't have to keep state and just apply new delta values to the external system. From my understanding, doing it this way I may end up with over-counting when there is a failure and replay.
- Stateful: Use checkpoints to keep state and blindly save the new state to the external system. Doing it this way I have correct aggregation results, but I have to keep data in two places (state and external system).

My questions are:

- Is my understanding of stateless and stateful aggregation correct? If not, please correct me!
- For stateful aggregation, what does Spark Streaming keep when it saves a checkpoint?

Please kindly help!

Thanks
-Binh

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
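A sketch of the stateful pattern under discussion; kafkaPairs stands in for a DStream of (key, 1L) pairs read from Kafka, and the paths/intervals are examples:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/app") // metadata and state RDDs are saved here

    // Only the running total per key is kept as state, not the raw records.
    val counts = kafkaPairs.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newValues.sum)
    }
    counts.print()
    ssc.start()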
Re: sparksql native jdbc driver
Yes. I have been using Spark SQL from the onset and haven't found any server other than the Hive thrift server for Spark SQL JDBC connectivity.

On Wed, Mar 18, 2015 at 5:50 PM, sequoiadb mailing-list-r...@sequoiadb.com wrote:

Hey guys, in my understanding Spark SQL only supports JDBC connections through the Hive thrift server. Is this correct?

Thanks

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Idempotent count
Hi,

Yes, Spark Streaming is capable of stateful stream processing. With or without state is a way of classifying stream processing. Checkpoints hold metadata and data.

Thanks

On Wed, Mar 18, 2015 at 4:00 AM, Binh Nguyen Van binhn...@gmail.com wrote:

Hi all,

I am new to Spark, so please forgive me if my questions are naive. I am trying to use Spark Streaming in an application that reads data from a queue (Kafka), does some aggregation (sum, count...), and then persists the result to an external storage system (MySQL, VoltDB...).

From my understanding of Spark Streaming, I can have two ways of doing aggregation:

- Stateless: I don't have to keep state and just apply new delta values to the external system. From my understanding, doing it this way I may end up with over-counting when there is a failure and replay.
- Stateful: Use checkpoints to keep state and blindly save the new state to the external system. Doing it this way I have correct aggregation results, but I have to keep data in two places (state and external system).

My questions are:

- Is my understanding of stateless and stateful aggregation correct? If not, please correct me!
- For stateful aggregation, what does Spark Streaming keep when it saves a checkpoint?

Please kindly help!

Thanks
-Binh

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Hive on Spark with Spark as a service on CDH5.2
Hive on Spark and accessing HiveContext from the shell are separate things.

Hive on Spark: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

To access Hive from Spark you need a build with -Phive: http://spark.apache.org/docs/1.2.1/building-spark.html#building-with-hive-and-jdbc-support

On Tue, Mar 17, 2015 at 11:35 AM, anu anamika.guo...@gmail.com wrote:

I am not clear whether Spark SQL supports Hive on Spark when Spark is run as a service in CDH 5.2. Can someone please clarify this? If it is possible, what configuration changes do I have to make to import a hive context in spark-shell, as well as to be able to do a spark-submit for the job to be run on the entire cluster?

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Does spark-1.3.0 support the analytic functions defined in Hive, such as row_number, rank
You can track the issue here: https://issues.apache.org/jira/browse/SPARK-1442

It's currently not supported; I guess the test cases are work in progress.

On Mon, Mar 16, 2015 at 12:44 PM, hseagle hsxup...@gmail.com wrote:

Hi all,

I'm wondering whether the latest spark-1.3.0 supports the windowing and analytic functions in Hive, such as row_number, rank, etc. Indeed, I've done some testing using spark-shell and found that row_number is not supported yet. But I still found that there are some test cases related to row_number and other analytics functions. These test cases are defined in sql/hive/target/scala-2.10/test-classes/ql/src/test/queries/clientpositive/windowing_multipartitioning.q

So my question is divided into two parts: one is whether the analytics functions are supported or not; the other is, if they are not supported, why are there still test cases?

hseagle

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: can not submit job to spark in windows
-56b32155-2779-4345-9597-2bfa6a87a51d\pi.py

    Traceback (most recent call last):
      File "C:/spark-1.2.1-bin-hadoop2.4/bin/pi.py", line 29, in <module>
        sc = SparkContext(appName="PythonPi")
      File "C:\spark-1.2.1-bin-hadoop2.4\python\pyspark\context.py", line 105, in __init__
        conf, jsc)
      File "C:\spark-1.2.1-bin-hadoop2.4\python\pyspark\context.py", line 153, in _do_init
        self._jsc = jsc or self._initialize_context(self._conf._jconf)
      File "C:\spark-1.2.1-bin-hadoop2.4\python\pyspark\context.py", line 202, in _initialize_context
        return self._jvm.JavaSparkContext(jconf)
      File "C:\spark-1.2.1-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 701, in __call__
      File "C:\spark-1.2.1-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
    : java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(Unknown Source)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:445)
        at org.apache.spark.SparkContext.addFile(SparkContext.scala:1004)
        at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:288)
        at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:288)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:288)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:214)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Unknown Source)

What is wrong on my side? Should I run some scripts before spark-submit.cmd?

Regards, Sergey.

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: How to read from hdfs using spark-shell in Intel hadoop?
You can add resolvers in SBT using:

    resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

On Thu, Feb 26, 2015 at 4:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi,

I am not able to read from HDFS (Intel distribution Hadoop, Hadoop version 1.0.3) from spark-shell (Spark version 1.2.1). I built Spark using the command

    mvn -Dhadoop.version=1.0.3 clean package

and started spark-shell and read an HDFS file using sc.textFile(), and the exception is:

    WARN hdfs.DFSClient: Failed to connect to /10.88.6.133:50010, add to deadNodes and continue
    java.net.SocketTimeoutException: 12 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.88.6.131:44264 remote=/10.88.6.133:50010]

The same problem is asked in this mail: RE: Spark is unable to read from HDFS http://mail-archives.us.apache.org/mod_mbox/spark-user/201309.mbox/%3cf97adee4fba8f6478453e148fe9e2e8d3cca3...@hasmsx106.ger.corp.intel.com%3E

As suggested in the above mail: "In addition to specifying HADOOP_VERSION=1.0.3 in the ./project/SparkBuild.scala file, you will need to specify the libraryDependencies and name spark-core resolvers. Otherwise, sbt will fetch version 1.0.3 of hadoop-core from Apache instead of Intel. You can set up your own local or remote repository that you specify."

Now HADOOP_VERSION is deprecated and -Dhadoop.version should be used. Can anybody please elaborate on how to specify that SBT should fetch hadoop-core from Intel, which is in our internal repository?

Thanks & Regards,
Meethu M

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
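A build.sbt sketch along those lines, pointing SBT at an internal repository; the repository URL and the Intel artifact version are placeholders for your internal setup:

    // build.sbt
    resolvers += "Intel internal repo" at "http://repo.example.internal/maven"

    libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.0.3-Intel"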
Re: Connecting a PHP/Java applications to Spark SQL Thrift Server
For Java you can use the hive-jdbc connectivity jars to connect to Spark SQL. The driver is inside the hive-jdbc jar: http://hive.apache.org/javadocs/r0.11.0/api/org/apache/hadoop/hive/jdbc/HiveDriver.html

On Wed, Mar 4, 2015 at 1:26 PM, n...@reactor8.com wrote:

Spark SQL supports JDBC/ODBC connectivity, so if that's the route you needed/wanted to connect through, you could do so via Java/PHP apps. I haven't used either so I can't speak to the developer experience; I assume it's pretty good, as it would be the preferred method for lots of third-party enterprise apps/tooling.

If you prefer using the thrift server/interface, and if client libs don't already exist in open-source land, you can use the thrift definitions to generate client libs in any supported thrift language and use that for connectivity.

It seems one issue with the thrift server is when running in cluster mode. The issue seems to still exist, but the UX of the error has been cleaned up in 1.3: https://issues.apache.org/jira/browse/SPARK-5176

-----Original Message-----
From: fanooos [mailto:dev.fano...@gmail.com]
Sent: Tuesday, March 3, 2015 11:15 PM
To: user@spark.apache.org
Subject: Connecting a PHP/Java applications to Spark SQL Thrift Server

We have installed a Hadoop cluster with Hive and Spark, and the Spark SQL thrift server is up and running without any problem. Now we have a set of applications that need to use the Spark SQL thrift server to query some data. Some of these applications are Java applications and the others are PHP applications.

As an old-fashioned Java developer, I used to connect Java applications to DB servers like MySQL using a JDBC driver. Is there a corresponding driver for connecting to the Spark SQL thrift server? Or what is the library I need to use to connect to it?

For PHP, what are the ways we can use to connect PHP applications to the Spark SQL thrift server?

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
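A sketch of the JDBC route from JVM code; the host, port, and table are examples (port 10000 is the usual thrift server default), and for the HiveServer2-style thrift server the driver class is org.apache.hive.jdbc.HiveDriver:

    import java.sql.DriverManager

    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    val rs = conn.createStatement().executeQuery("SELECT * FROM some_table LIMIT 10")
    while (rs.next()) println(rs.getString(1))
    conn.close()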
Re: TreeNodeException: Unresolved attributes
Why don't you formulate the query string before you pass it in (by appending strings)? Also, the hql function is deprecated; you should use sql: http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext

On Wed, Mar 4, 2015 at 6:15 AM, Anusha Shamanur anushas...@gmail.com wrote:

Hi,

I am trying to run a simple select query on a table:

    val restaurants = hiveCtx.hql("select * from TableName where column like '%SomeString%'")

This gives an error as below:

    org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:

How do I solve this?

-- Regards, Anusha

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
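A sketch of building the query string first (the table name and pattern are the ones from the question):

    val tableName = "TableName"
    val pattern = "SomeString"
    // Interpolate into one string, then hand the finished SQL to sql().
    val restaurants = hiveCtx.sql(s"SELECT * FROM $tableName WHERE column LIKE '%$pattern%'")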
Re: Unable to submit spark job to mesos cluster
    ) at com.algofusion.reconciliation.execution.ReconExecutionController.main(ReconExecutionController.java:105)
    Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at akka.remote.Remoting.start(Remoting.scala:173)
        ... 18 more
    ]
    Exception in thread "main" java.lang.ExceptionInInitializerError
        at com.algofusion.reconciliation.execution.ReconExecutionController.initialize(ReconExecutionController.java:257)
        at com.algofusion.reconciliation.execution.ReconExecutionController.main(ReconExecutionController.java:105)
    Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at akka.remote.Remoting.start(Remoting.scala:173)
        at akka.remote.RemoteActorRefProvider.<init>(RemoteActorRefProvider.scala:184)
        at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
        at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
        at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
        at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
        at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
        at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
        at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1454)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
        at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1450)
        at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:156)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:203)
        at com.algofusion.reconciliation.execution.utils.ExecutionUtils.<clinit>(ExecutionUtils.java:130)
        ... 2 more

Regards, Sarath.

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: job keeps failing with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
Can you share what error you are getting when the job fails?

On Thu, Feb 26, 2015 at 4:32 AM, Darin McBeath ddmcbe...@yahoo.com.invalid wrote:

I'm using Spark 1.2, stand-alone cluster on EC2. I have a cluster of 8 r3.8xlarge machines but limit the job to only 128 cores. I have also tried other things, such as setting 4 workers per r3.8xlarge with 67 GB each, but this made no difference.

The job frequently fails at the end in this step (saveAsHadoopFile). It will sometimes work. finalNewBaselinePairRDD is hash-partitioned with 1024 partitions and a total size around 1 TB. There are about 13.5M records in finalNewBaselinePairRDD. finalNewBaselinePairRDD is <String, String>.

    JavaPairRDD<Text, Text> finalBaselineRDDWritable =
        finalNewBaselinePairRDD.mapToPair(new ConvertToWritableTypes()).persist(StorageLevel.MEMORY_AND_DISK_SER());

    // Save to hdfs (gzip)
    finalBaselineRDDWritable.saveAsHadoopFile("hdfs:///sparksync/", Text.class, Text.class,
        SequenceFileOutputFormat.class, org.apache.hadoop.io.compress.GzipCodec.class);

If anyone has any tips for what I should look into, it would be appreciated.

Thanks.
Darin.

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: How to pass a org.apache.spark.rdd.RDD in a recursive function
Passing RDDs around is not a good idea. RDDs are immutable and can't be changed inside functions. Have you considered taking a different approach?

On Thu, Feb 26, 2015 at 3:42 AM, dritanbleco dritan.bl...@gmail.com wrote:

Hello,

I am trying to pass an org.apache.spark.rdd.RDD table as a parameter to a recursive function. This table should be changed at each step of the recursion and cannot just be a global var. Need help :)

Thank you

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
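A sketch of the usual alternative: have each recursive step return a new RDD rather than mutating the one passed in; the transformation and stop condition here are illustrative:

    import org.apache.spark.rdd.RDD

    @annotation.tailrec
    def refine(table: RDD[(String, Double)], depth: Int): RDD[(String, Double)] =
      if (depth == 0) table
      else refine(table.mapValues(_ * 0.5), depth - 1) // derive a new immutable RDD

    val result = refine(input, 10) // `input` stands in for your starting RDD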
Re: updateStateByKey and invFunction
You can use reduceByKeyAndWindow with your specific time window. You can specify the inverse function in reduceByKeyAndWindow.

On Tue, Feb 24, 2015 at 1:36 PM, Ashish Sharma ashishonl...@gmail.com wrote:

So say I want to calculate the top K users visiting a page in the past 2 hours, updated every 5 minutes. Here I want to maintain something like this:

    Page_01 = {user_01: 32, user_02: 3, user_03: 7, ...}
    ...

Basically a count of the number of times a user visited a page. Here my key is the page name/id and the state is the hashmap.

Now in updateStateByKey I get the previous state and the new events coming *in* the window. Is there a way to also get the events going *out* of the window? This way I can incrementally update the state over a rolling window.

What is the efficient way to do this in Spark Streaming?

Thanks
Ashish

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
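A sketch for the 2-hour window sliding every 5 minutes; `visits` stands in for a DStream of ((page, user), 1L) pairs, and checkpointing must be enabled for the inverse-function variant:

    import org.apache.spark.streaming.Minutes

    ssc.checkpoint("hdfs:///checkpoints/topk") // required by the inverse form
    val counts = visits.reduceByKeyAndWindow(
      _ + _,        // add counts entering the window
      _ - _,        // subtract counts leaving the window (the inverse function)
      Minutes(120), // window length
      Minutes(5)    // slide interval
    )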
Re: On app upgrade, restore sliding window data.
I think this could be of some help to you: https://issues.apache.org/jira/browse/SPARK-3660

On Tue, Feb 24, 2015 at 2:18 AM, Matus Faro matus.f...@kik.com wrote:

Hi,

Our application is being designed to operate at all times on a large sliding window (day+) of data. The operations performed on the window of data will change fairly frequently, and I need a way to save and restore the sliding window after an app upgrade without having to wait the duration of the sliding window to warm up. Because it's an app upgrade, checkpointing will not work, unfortunately. I can potentially dump the window to outside storage periodically or on app shutdown, but I don't have an ideal way of restoring it.

I thought about two non-ideal solutions:

1. Load the previous data all at once into the sliding window on app startup. The problem is, at one point I will have double the data in the sliding window until the initial batch of data goes out of scope.
2. Broadcast the previous state of the window separately from the window. Perform the operations on both sets of data until it comes out of scope. The problem is, the data will not fit into memory.

Solutions that would solve my problem:

1. The ability to pre-populate the sliding window.
2. Having control over batch slicing. It would be nice for a Receiver to dictate the current batch timestamp in order to slow down or fast-forward time.

Any feedback would be greatly appreciated!

Thank you,
Matus

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Cannot access Spark web UI
    ) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:767)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

    Powered by Jetty://

--
Thanks & Regards,
Mukesh Jha me.mukesh@gmail.com

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Spark and Spark Streaming code sharing best practice.
I find monoids pretty useful in this respect: basically, separating out the logic into a monoid and then applying the logic to either a stream or a batch. A list of such practices could be really useful.

On Thu, Feb 19, 2015 at 12:26 AM, Jean-Pascal Billaud j...@tellapart.com wrote:

Hey,

It seems pretty clear that one of the strengths of Spark is being able to share your code between your batch and streaming layers. Though, given that Spark Streaming uses DStream, being a set of RDDs, and Spark uses a single RDD, there might be some complexity associated with it.

Of course, since a DStream is a superset of RDDs, one can just run the same code at the RDD granularity using DStream::foreachRDD. While this should work for map, I am not sure how it can work when it comes to the reduce phase, given that a group of keys spans multiple RDDs.

One option is to change the dataset object a job works on. For example, instead of passing an RDD to a class method, one passes a higher-level object (MetaRDD) that wraps around either an RDD or a DStream depending on the context. At this point the job calls its regular maps, reduces and so on, and the MetaRDD wrapper would delegate accordingly.

I would just like to know the official best practice from the Spark community, though.

Thanks,

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
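A sketch of that separation: the combining logic lives in one object (an associative operation with an identity, i.e. a monoid) and is applied unchanged to an RDD or a DStream; `lines` and `dstream` stand in for your batch and streaming inputs:

    object WordCountLogic {
      val zero: Long = 0L
      def combine(a: Long, b: Long): Long = a + b // the shared, associative operation
      def toPairs(line: String): Seq[(String, Long)] =
        line.split("\\s+").map(w => (w, 1L))
    }

    // Batch path:
    val batchCounts = lines.flatMap(WordCountLogic.toPairs).reduceByKey(WordCountLogic.combine)
    // Streaming path, same functions:
    val streamCounts = dstream.flatMap(WordCountLogic.toPairs).reduceByKey(WordCountLogic.combine)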
Re: Spark and Spark Streaming code sharing best practice.
Monoids are useful in aggregations. Also try to avoid anonymous functions; factoring named functions out of the Spark code allows the functions to be reused (possibly between Spark and Spark Streaming).

On Thu, Feb 19, 2015 at 6:56 AM, Jean-Pascal Billaud j...@tellapart.com wrote:

Thanks Arush. I will check that out.

On Wed, Feb 18, 2015 at 11:06 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote:

I find monoids pretty useful in this respect: basically, separating out the logic into a monoid and then applying the logic to either a stream or a batch. A list of such practices could be really useful.

On Thu, Feb 19, 2015 at 12:26 AM, Jean-Pascal Billaud j...@tellapart.com wrote:

Hey,

It seems pretty clear that one of the strengths of Spark is being able to share your code between your batch and streaming layers. Though, given that Spark Streaming uses DStream, being a set of RDDs, and Spark uses a single RDD, there might be some complexity associated with it.

Of course, since a DStream is a superset of RDDs, one can just run the same code at the RDD granularity using DStream::foreachRDD. While this should work for map, I am not sure how it can work when it comes to the reduce phase, given that a group of keys spans multiple RDDs.

One option is to change the dataset object a job works on. For example, instead of passing an RDD to a class method, one passes a higher-level object (MetaRDD) that wraps around either an RDD or a DStream depending on the context. At this point the job calls its regular maps, reduces and so on, and the MetaRDD wrapper would delegate accordingly.

I would just like to know the official best practice from the Spark community, though.

Thanks,

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: How to integrate hive on spark
Hi,

Did you try these steps? https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

Thanks
Arush

On Wed, Feb 18, 2015 at 7:20 PM, sandeepvura sandeepv...@gmail.com wrote:

Hi,

I am new to Spark. I have installed Spark on a 3-node cluster. I would like to integrate Hive on Spark. Can anyone please help me with this?

Regards,
Sandeep.v

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: How to connect a mobile app (Android/iOS) with a Spark backend?
I am running Spark jobs behind Tomcat, and we didn't face any issues, but for us the user base is very small. The possible blockers could be:

1. If there are many users of the system, jobs might have to wait; you might want to think about the kind of scheduling you want to do.
2. Again, if the number of users is a bit high, Tomcat doesn't scale really well (not sure how much of a blocker that is).

Thanks
Arush

On Wed, Feb 18, 2015 at 6:41 PM, Ralph Bergmann | the4thFloor.eu ra...@the4thfloor.eu wrote:

Hi,

I have dependency problems using spark-core inside of an HttpServlet (see my other mail). Maybe I'm wrong?!

What I want to do: I am developing a mobile app (Android and iOS) and want to connect it with Spark on the backend side. To do this I want to use Tomcat. The app uses HTTPS to ask Tomcat for the needed data, and Tomcat asks Spark.

Is this the right way? Or is there a better way to connect my mobile apps with the Spark backend? I hope that I'm not the first one who wants to do this.

Ralph

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: JsonRDD to parquet -- data loss
I am not sure if this is the easiest way to solve your problem, but you can connect to the Hive metastore (through Derby) and find the HDFS path from there.

On Wed, Feb 18, 2015 at 9:31 AM, Vasu C vasuc.bigd...@gmail.com wrote:

Hi,

I am running a Spark batch-processing job using the spark-submit command, and below is my code snippet. Basically it converts a JsonRDD to parquet and stores it in an HDFS location. The problem I am facing is that if multiple jobs are triggered in parallel, even though the jobs execute properly (as I can see in the Spark web UI), no parquet file is created in the HDFS path for some of them. If 5 jobs are executed in parallel, only 3 parquet files get created. Is this a data-loss scenario, or am I missing something here? Please help me with this.

Here tableName is unique, with a timestamp appended to it.

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val jsonRdd = sqlContext.jsonRDD(results)
    val parquetTable = sqlContext.parquetFile(parquetFilePath)
    parquetTable.registerTempTable(tableName)
    jsonRdd.insertInto(tableName)

Regards,
Vasu C

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
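If the temp-table indirection isn't essential, each job can write its parquet output directly; a sketch under that assumption (reusing the names from the snippet above):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val jsonRdd = sqlContext.jsonRDD(results)
    // One parquet directory per (timestamped, hence unique) tableName.
    jsonRdd.saveAsParquetFile(s"hdfs:///warehouse/$tableName")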
Re: spark-core in a servlet
I am not sure if this could be causing the issue, but Spark is compatible with Scala 2.10. Instead of spark-core_2.11 you might want to try spark-core_2.10.

On Wed, Feb 18, 2015 at 5:44 AM, Ralph Bergmann | the4thFloor.eu ra...@the4thfloor.eu wrote:

Hi,

I want to use spark-core inside of an HttpServlet. I use Maven for the build task, but I have a dependency problem :-( I get this error message:

    ClassCastException: com.sun.jersey.server.impl.container.servlet.JerseyServletContainerInitializer cannot be cast to javax.servlet.ServletContainerInitializer

When I add these exclusions it builds, but then other classes are not found at runtime:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>1.2.1</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

What can I do? Thanks a lot!

Ralph

--
Ralph Bergmann
iOS and Android app developer
www: http://www.the4thFloor.eu | mail: ra...@the4thfloor.eu | skype: dasralph
google+: https://plus.google.com/+RalphBergmann | xing: https://www.xing.com/profile/Ralph_Bergmann3 | linkedin: https://www.linkedin.com/in/ralphbergmann | gulp: https://www.gulp.de/Profil/RalphBergmann.html | github: https://github.com/the4thfloor
pgp key id 0x421F9B78, pgp fingerprint CEE3 7AE9 07BE 98DF CD5A E69C F131 4A8E 421F 9B78

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working
Hi,

Did you try to make Maven pick the latest version? http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Management That way solrj won't cause any issue. You can try this and check whether the part of your code where you access HDFS works fine.

On Wed, Feb 18, 2015 at 10:23 AM, dgoldenberg dgoldenberg...@gmail.com wrote:

I'm getting the below error when running spark-submit on my class. This class has a transitive dependency on HttpClient v4.3.1, since I'm calling SolrJ 4.10.3 from within the class. This is in conflict with the older version, HttpClient 3.1, that's a dependency of Hadoop 2.4 (I'm running Spark 1.2.1 built for Hadoop 2.4).

I've tried setting spark.files.userClassPathFirst to true in SparkConf in my program, and also setting it to true in $SPARK_HOME/conf/spark-defaults.conf as

    spark.files.userClassPathFirst true

No go; I'm still getting the error, as below. Is there anything else I can try? Are there any plans in Spark to support multiple class loaders?

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/conn/scheme/SchemeRegistry;
        at org.apache.http.impl.client.SystemDefaultHttpClient.createClientConnectionManager(SystemDefaultHttpClient.java:121)
        at org.apache.http.impl.client.AbstractHttpClient.getConnectionManager(AbstractHttpClient.java:445)
        at org.apache.solr.client.solrj.impl.HttpClientUtil.setMaxConnections(HttpClientUtil.java:206)
        at org.apache.solr.client.solrj.impl.HttpClientConfigurer.configure(HttpClientConfigurer.java:35)
        at org.apache.solr.client.solrj.impl.HttpClientUtil.configureClient(HttpClientUtil.java:142)
        at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:118)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.<init>(HttpSolrServer.java:168)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.<init>(HttpSolrServer.java:141)
        ...

--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Configuration Problem? (need help to get Spark job executed)
.devpeng.xx:38773) with 8 cores 15/02/14 10:30:13 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150214103013-0001/1 on hostPort devpeng-db-cassandra-3.devpeng.xe:38773 with 8 cores, 512.0 MB RAM 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor updated: app-20150214103013-0001/0 is now LOADING 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor updated: app-20150214103013-0001/1 is now LOADING 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor updated: app-20150214103013-0001/0 is now RUNNING 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor updated: app-20150214103013-0001/1 is now RUNNING 15/02/14 10:30:13 INFO NettyBlockTransferService: Server created on 58200 15/02/14 10:30:13 INFO BlockManagerMaster: Trying to register BlockManager 15/02/14 10:30:13 INFO BlockManagerMasterActor: Registering block manager 192.168.2.103:58200 with 530.3 MB RAM, BlockManagerId(driver, , 58200) 15/02/14 10:30:13 INFO BlockManagerMaster: Registered BlockManager 15/02/14 10:30:13 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 15/02/14 10:30:14 INFO Cluster: New Cassandra host devpeng-db-cassandra-1.devpeng.gkh-setu.de/:9042 added 15/02/14 10:30:14 INFO Cluster: New Cassandra host /xxx:9042 added 15/02/14 10:30:14 INFO Cluster: New Cassandra host :9042 added 15/02/14 10:30:14 INFO CassandraConnector: Connected to Cassandra cluster: GKHDevPeng 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host xxx (DC1) 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host xxx (DC1) 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host (DC1) 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host x (DC1) 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host x (DC1) 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host x (DC1) 15/02/14 10:30:15 INFO CassandraConnector: Disconnected from Cassandra cluster: GKHDevPeng 15/02/14 10:30:16 INFO SparkContext: Starting job: count at TestApp.scala:23 15/02/14 10:30:16 INFO DAGScheduler: Got job 0 (count at TestApp.scala:23) with 3 output partitions (allowLocal=false) 15/02/14 10:30:16 INFO DAGScheduler: Final stage: Stage 0(count at TestApp.scala:23) 15/02/14 10:30:16 INFO DAGScheduler: Parents of final stage: List() 15/02/14 10:30:16 INFO DAGScheduler: Missing parents: List() 15/02/14 10:30:16 INFO DAGScheduler: Submitting Stage 0 (CassandraRDD[0] at RDD at CassandraRDD.scala:49), which has no missing parents 15/02/14 10:30:16 INFO MemoryStore: ensureFreeSpace(4472) called with curMem=0, maxMem=556038881 15/02/14 10:30:16 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.4 KB, free 530.3 MB) 15/02/14 10:30:16 INFO MemoryStore: ensureFreeSpace(3082) called with curMem=4472, maxMem=556038881 15/02/14 10:30:16 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.0 KB, free 530.3 MB) 15/02/14 10:30:16 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on x (size: 3.0 KB, free: 530.3 MB) 15/02/14 10:30:16 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 15/02/14 10:30:16 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:838 15/02/14 10:30:16 INFO DAGScheduler: Submitting 3 missing tasks from Stage 0 (CassandraRDD[0] at RDD at CassandraRDD.scala:49) 15/02/14 10:30:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks 15/02/14 10:30:31 WARN TaskSchedulerImpl: 
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
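The warning at the end of the log usually means the application is asking for more memory or cores than any registered worker can currently offer. A minimal sketch of capping the request (master URL, class and jar names are hypothetical):

   ./bin/spark-submit --master spark://<master-host>:7077 \
     --executor-memory 512M --total-executor-cores 2 \
     --class TestApp test-app.jar

The master's web UI on port 8080 shows how much memory and how many cores each worker actually has free.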
Re: Identify the performance bottleneck from hardware perspective
Hi, How big is your dataset? Thanks Arush

On Tue, Feb 17, 2015 at 4:06 PM, Julaiti Alafate jalaf...@eng.ucsd.edu wrote: Thank you very much for your reply! My task is to count the number of word pairs in a document. If w1 and w2 occur together in one sentence, the number of occurrences of the word pair (w1, w2) increases by 1. So the computational part of this algorithm is simply a two-level for-loop. Since the cluster is monitored by Ganglia, I can easily see that neither CPU nor network I/O is under pressure. The only parameter left is memory. In the executor tab of the Spark Web UI, I can see a column named "memory used". It shows that only 6GB of 20GB memory is used. I understand this is measuring the size of the RDDs that persist in memory. So can I at least assume the data/objects used in my program are not exceeding the memory limit? My confusion here is: why can't my program run faster while there is still sufficient memory, CPU time and network bandwidth it could utilize? Best regards, Julaiti

On Tue, Feb 17, 2015 at 12:53 AM, Akhil Das ak...@sigmoidanalytics.com wrote: What application are you running? Here are a few things:
- You will hit a bottleneck on CPU if you are doing some complex computation (like parsing JSON etc.)
- You will hit a bottleneck on memory if the data/objects used in the program are large (like playing with HashMaps etc. inside your map* operations). Here you can set spark.executor.memory to a higher number, and you can also change spark.storage.memoryFraction, whose default value is 0.6 of your executor memory.
- Network will be a bottleneck if data is not available locally on one of the workers and it hence has to collect it from others, which means a lot of serialization and data transfer across your cluster.
Thanks Best Regards

On Tue, Feb 17, 2015 at 11:20 AM, Julaiti Alafate jalaf...@eng.ucsd.edu wrote: Hi there, I am trying to scale up the data size that my application is handling. This application is running on a cluster with 16 slave nodes. Each slave node has 60GB memory. It is running in standalone mode. The data is coming from HDFS, which is also on the same local network. In order to understand how my program is running, I also have Ganglia installed on the cluster. From previous runs, I know the stage taking the longest time to run is counting word pairs (my RDD consists of sentences from a corpus). My goal is to identify the bottleneck of my application, then modify my program or hardware configuration accordingly. Unfortunately, I didn't find much information on Spark monitoring and optimization topics. Reynold Xin gave a great talk at Spark Summit 2014 on application tuning from a task perspective; basically, his focus is on tasks that are oddly slower than the average. However, it didn't solve my problem, because there are no such tasks that run way slower than others in my case. So I tried to identify the bottleneck from a hardware perspective. I want to know what the limitation of the cluster is. I think if the executors are working hard, either CPU, memory or network bandwidth (or maybe a combination of them) is hitting the roof. But Ganglia reports that the CPU utilization of the cluster is no more than 50%, and network utilization is high for several seconds at the beginning, then drops close to 0. From the Spark UI, I can see that the nodes with maximum memory usage are consuming around 6GB, while spark.executor.memory is set to 20GB. I am very confused that the program is not running fast enough, while hardware resources are not in shortage.
Could you please give me some hints about what decides the performance of a Spark application from a hardware perspective? Thanks! Julaiti -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
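For reference, a minimal sketch of the word-pair counting workload described above (the input path and tokenization are hypothetical); shuffle-heavy reduceByKey stages like this are often bound by serialization, shuffle I/O and GC rather than raw CPU or network, which would match the Ganglia readings:

   import org.apache.spark.{SparkConf, SparkContext}

   object WordPairCount {
     def main(args: Array[String]) {
       val sc = new SparkContext(new SparkConf().setAppName("WordPairCount"))
       val sentences = sc.textFile("hdfs:///corpus/sentences.txt") // hypothetical path
       val pairCounts = sentences
         .flatMap { line =>
           val words = line.split("\\s+")
           // the two-level loop: every ordered pair of words co-occurring in a sentence
           for (w1 <- words; w2 <- words if w1 < w2) yield ((w1, w2), 1)
         }
         .reduceByKey(_ + _) // combines counts map-side before the shuffle
       pairCounts.take(10).foreach(println)
       sc.stop()
     }
   }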
Re: Question about spark streaming+Flume
Hi, Can you try this:

   val lines = FlumeUtils.createStream(ssc, "localhost", <port>) // port number elided in the archive
   // Print out the count of events received from this server in each batch
   val counts = lines.count().map(cnt => "Received " + cnt + " flume events at " + System.currentTimeMillis())
   counts.foreachRDD(_.foreach(println))

Thanks Arush

On Tue, Feb 17, 2015 at 11:17 AM, bit1...@163.com bit1...@163.com wrote: Hi, I am trying the Spark Streaming + Flume example: 1. Code

   object SparkFlumeNGExample {
     def main(args: Array[String]) {
       val conf = new SparkConf().setAppName("SparkFlumeNGExample")
       val ssc = new StreamingContext(conf, Seconds(10))
       val lines = FlumeUtils.createStream(ssc, "localhost", <port>) // port number elided in the archive
       // Print out the count of events received from this server in each batch
       lines.count().map(cnt => "Received " + cnt + " flume events at " + System.currentTimeMillis()).print()
       ssc.start()
       ssc.awaitTermination()
     }
   }

2. I submit the application with the following sh: ./spark-submit --deploy-mode client --name SparkFlumeEventCount --master spark://hadoop.master:7077 --executor-memory 512M --total-executor-cores 2 --class spark.examples.streaming.SparkFlumeNGWordCount spark-streaming-flume.jar

When I write data to flume, I only notice the following console information showing that input is added: storage.BlockManagerInfo: Added input-0-1424151807400 in memory on localhost:39338 (size: 1095.0 B, free: 267.2 MB) 15/02/17 00:43:30 INFO scheduler.JobScheduler: Added jobs for time 142415181 ms 15/02/17 00:43:40 INFO scheduler.JobScheduler: Added jobs for time 142415182 ms 15/02/17 00:43:40 INFO scheduler.JobScheduler: Added jobs for time 142415187 ms

But I didn't see the output from the code: "Received X flume events". I have no idea where the problem is, any ideas? Thanks -- -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: hive-thriftserver maven artifact
You can build your own Spark with the -Phive-thriftserver option and publish the jars locally. I hope that solves your problem. On Mon, Feb 16, 2015 at 8:54 PM, Marco marco@gmail.com wrote: Ok, so will it only be available for the next version (1.3.0)? 2015-02-16 15:24 GMT+01:00 Ted Yu yuzhih...@gmail.com: I searched for 'spark-hive-thriftserver_2.10' on this page: http://mvnrepository.com/artifact/org.apache.spark Looks like it is not published. On Mon, Feb 16, 2015 at 5:44 AM, Marco marco@gmail.com wrote: Hi, I am referring to https://issues.apache.org/jira/browse/SPARK-4925 (Hive Thriftserver Maven Artifact). Can somebody point me (URL) to the artifact in a public repository? I have not found it @Maven Central. Thanks, Marco -- Best regards, Marco -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
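A minimal sketch of that build, assuming a Spark 1.2.x source checkout; mvn install publishes the resulting jars, including the thrift-server module, into the local Maven repository (~/.m2):

   # from the root of the Spark source tree
   mvn -Phive -Phive-thriftserver -DskipTests clean install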
Re: Master dies after program finishes normally
How many nodes do you have in your cluster, how many cores, what is the size of the memory? On Fri, Feb 13, 2015 at 12:42 AM, Manas Kar manasdebashis...@gmail.com wrote: Hi Arush, Mine is a CDH5.3 with Spark 1.2. The only change to my spark programs are -Dspark.driver.maxResultSize=3g -Dspark.akka.frameSize=1000. ..Manas On Thu, Feb 12, 2015 at 2:05 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: What is your cluster configuration? Did you try looking at the Web UI? There are many tips here http://spark.apache.org/docs/1.2.0/tuning.html Did you try these? On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar manasdebashis...@gmail.com wrote: Hi, I have a Hidden Markov Model running with 200MB data. Once the program finishes (i.e. all stages/jobs are done) the program hangs for 20 minutes or so before killing master. In the spark master the following log appears. 2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-akka.actor.default-dispatcher-31] shutting down ActorSystem [sparkMaster] java.lang.OutOfMemoryError: GC overhead limit exceeded at scala.collection.immutable.List$.newBuilder(List.scala:396) at scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:69) at scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:105) at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:58) at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:53) at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239) at scala.collection.TraversableLike$class.map(TraversableLike.scala:243) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:26) at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:22) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.json4s.MonadicJValue.org $json4s$MonadicJValue$$findDirectByName(MonadicJValue.scala:22) at org.json4s.MonadicJValue.$bslash(MonadicJValue.scala:16) at org.apache.spark.util.JsonProtocol$.taskStartFromJson(JsonProtocol.scala:450) at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:423) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55) at 
org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:726) at org.apache.spark.deploy.master.Master.removeApplication(Master.scala:675) at org.apache.spark.deploy.master.Master.finishApplication(Master.scala:653) at org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$29.apply(Master.scala:399) Can anyone help? ..Manas -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
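For reference, the stack trace shows the master running out of heap while replaying an application's event log to rebuild its web UI after the application finishes. Two hedged workarounds, with example values:

   # conf/spark-env.sh: give the standalone master daemon a larger heap
   export SPARK_DAEMON_MEMORY=2g

   # conf/spark-defaults.conf: or stop writing the event log that gets replayed
   spark.eventLog.enabled false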
Re: Master dies after program finishes normally
What is your cluster configuration? Did you try looking at the Web UI? There are many tips here http://spark.apache.org/docs/1.2.0/tuning.html Did you try these? On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar manasdebashis...@gmail.com wrote: Hi, I have a Hidden Markov Model running with 200MB data. Once the program finishes (i.e. all stages/jobs are done) the program hangs for 20 minutes or so before killing master. In the spark master the following log appears. 2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-akka.actor.default-dispatcher-31] shutting down ActorSystem [sparkMaster] java.lang.OutOfMemoryError: GC overhead limit exceeded at scala.collection.immutable.List$.newBuilder(List.scala:396) at scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:69) at scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:105) at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:58) at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:53) at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239) at scala.collection.TraversableLike$class.map(TraversableLike.scala:243) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:26) at org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:22) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.json4s.MonadicJValue.org $json4s$MonadicJValue$$findDirectByName(MonadicJValue.scala:22) at org.json4s.MonadicJValue.$bslash(MonadicJValue.scala:16) at org.apache.spark.util.JsonProtocol$.taskStartFromJson(JsonProtocol.scala:450) at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:423) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55) at org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:726) at org.apache.spark.deploy.master.Master.removeApplication(Master.scala:675) at org.apache.spark.deploy.master.Master.finishApplication(Master.scala:653) at org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$29.apply(Master.scala:399) Can anyone help? 
..Manas -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Spark Streaming distributed batch locking
* We have an inbound stream of sensor data for millions of devices (which have unique identifiers). - Spark Streaming can handle events in the ballpark of 100-500K records/sec/node, so you need to size your cluster accordingly. And it's scalable.
* We need to perform aggregation of this stream on a per device level. The aggregation will read data that has already been processed (and persisted) in previous batches. - You need to do stateful stream processing; Spark Streaming allows you to do that, check out updateStateByKey: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html
* Key point: When we process data for a particular device we need to ensure that no other processes are processing data for that particular device. This is because the outcome of our processing will affect the downstream processing for that device. Effectively we need a distributed lock. - You can make the source device the key and then use updateStateByKey in Spark on that key (see the sketch after this message).
* In addition the event device data needs to be processed in the order that the events occurred. - You would need to implement this in your code, adding a timestamp as a data item. Spark Streaming doesn't ensure in-order delivery of your events.

On Thu, Feb 12, 2015 at 4:51 PM, Legg John john.l...@axonvibe.com wrote: Hi After doing lots of reading and building a POC for our use case we are still unsure as to whether Spark Streaming can handle our use case: * We have an inbound stream of sensor data for millions of devices (which have unique identifiers). * We need to perform aggregation of this stream on a per device level. The aggregation will read data that has already been processed (and persisted) in previous batches. * Key point: When we process data for a particular device we need to ensure that no other processes are processing data for that particular device. This is because the outcome of our processing will affect the downstream processing for that device. Effectively we need a distributed lock. * In addition the event device data needs to be processed in the order that the events occurred. Essentially we can't have two batches for the same device being processed at the same time. Can Spark handle our use case? Any advice appreciated. Regards John - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
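A minimal sketch of the updateStateByKey approach, keyed by device id; DeviceEvent, the running-sum aggregate and the checkpoint path are hypothetical. Because each key's state is updated by exactly one task per batch, the per-device exclusivity asked for above falls out of the model, and ordering within a batch is imposed here by sorting on the event timestamp:

   import org.apache.spark.streaming.StreamingContext
   import org.apache.spark.streaming.StreamingContext._
   import org.apache.spark.streaming.dstream.DStream

   case class DeviceEvent(deviceId: String, ts: Long, value: Double)

   def aggregatePerDevice(ssc: StreamingContext, events: DStream[DeviceEvent]): DStream[(String, Double)] = {
     ssc.checkpoint("hdfs:///checkpoints/devices") // updateStateByKey requires a checkpoint directory
     events
       .map(e => (e.deviceId, e))
       .updateStateByKey[Double] { (newEvents: Seq[DeviceEvent], state: Option[Double]) =>
         // all events for one device in this batch arrive here together;
         // impose per-device ordering before folding them into the state
         val ordered = newEvents.sortBy(_.ts)
         Some(ordered.foldLeft(state.getOrElse(0.0))(_ + _.value)) // hypothetical running-sum aggregate
       }
   }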
Re: 8080 port password protection
You could apply a password using a servlet filter on the server. This doesn't look like the right group for the question, but the same can be done for Spark, for the Spark UI. On Thu, Feb 12, 2015 at 10:19 PM, MASTER_ZION (Jairo Linux) master.z...@gmail.com wrote: Hi everyone, I'm creating a development machine in AWS and I would like to protect port 8080 using a password. Is it possible? Best Regards *Jairo Moreno* -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
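For the Spark UI side, a hedged sketch using the spark.ui.filters setting in conf/spark-defaults.conf; the filter class here is hypothetical and would be a standard javax.servlet.Filter implementing the password check:

   spark.ui.filters                          com.example.BasicAuthFilter
   spark.com.example.BasicAuthFilter.params  user=admin,password=secret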
Re: SparkSQL + Tableau Connector
I am a little confused here: why do you want to create the tables in Hive? You want to create the tables in Spark SQL, right? If you are not able to find the same tables through Tableau, then Thrift is connecting to a different metastore than your spark-shell. One way to specify a metastore to Thrift is to provide the path to hive-site.xml while starting Thrift, using --files hive-site.xml. Similarly, you can specify the same metastore to your spark-submit or spark-shell using the same option.

On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist tsind...@gmail.com wrote: Arush, As for #2 do you mean something like this from the docs:

   // sc is an existing SparkContext.
   val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
   sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
   sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
   // Queries are expressed in HiveQL
   sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Or did you have something else in mind? -Todd

On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist tsind...@gmail.com wrote: Arush, Thank you, will take a look at that approach in the morning. I sort of figured the answer to #1 was NO and that I would need to do 2 and 3, thanks for clarifying it for me. -Todd

On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: 1. Can the connector fetch or query schemaRDD's saved to Parquet or JSON files? NO. 2. Do I need to do something to expose these via hive / metastore other than creating a table in hive? Create a table in Spark SQL to expose it via Spark SQL. 3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2? You would need to configure Thrift to read from the metastore you expect it to read from - by default it reads from the metastore_db directory present in the directory used to launch the Thrift server.

On 11 Feb 2015 01:35, Todd Nist tsind...@gmail.com wrote: Hi, I'm trying to understand how and what the Tableau connector to SparkSQL is able to access. My understanding is it needs to connect to the thriftserver, and I am not sure how or if it exposes parquet, json, schemaRDDs, or whether it only exposes schemas defined in the metastore / hive. For example, I do the following from the spark-shell, which generates a schemaRDD from a csv file and saves it as a JSON file as well as a parquet file:

   import org.apache.spark.sql.SQLContext
   import com.databricks.spark.csv._
   val sqlContext = new SQLContext(sc)
   val test = sqlContext.csvFile("/data/test.csv")
   test.toJSON().saveAsTextFile("/data/out")
   test.saveAsParquetFile("/data/out")

When I connect from Tableau, the only thing I see is the "default" schema and nothing in the tables section. So my questions are: 1. Can the connector fetch or query schemaRDD's saved to Parquet or JSON files? 2. Do I need to do something to expose these via hive / metastore other than creating a table in hive? 3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2. TIA for the assistance. -Todd -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
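A minimal sketch of pointing both processes at the same metastore, as described above (paths and master URL are hypothetical):

   ./sbin/start-thriftserver.sh --master spark://<host>:7077 --files /path/to/hive-site.xml
   ./bin/spark-shell --master spark://<host>:7077 --files /path/to/hive-site.xml

Tables registered in that shared metastore then show up to the Tableau connector through the Thrift server; temporary tables registered only in one shell session will not.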
Re: Error while querying hive table from spark shell
(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410) ... 93 more Caused by: java.lang.NoClassDefFoundError: org/datanucleus/exceptions/NucleusException at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at javax.jdo.JDOHelper$18.run(JDOHelper.java:2018) at javax.jdo.JDOHelper$18.run(JDOHelper.java:2016) at java.security.AccessController.doPrivileged(Native Method) at javax.jdo.JDOHelper.forName(JDOHelper.java:2015) at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310) at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339) at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248) at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.hadoop.hive.metastore.RawStoreProxy.init(RawStoreProxy.java:58) at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:497) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:475) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:523) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:397) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:356) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.init(RetryingHMSHandler.java:54) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59) at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4944) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:171) ... 98 more Caused by: java.lang.ClassNotFoundException: org.datanucleus.exceptions.NucleusException at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 124 more Regards, Kundan -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: SparkSQL + Tableau Connector
BTW, what Tableau connector are you using? On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: I am a little confused here: why do you want to create the tables in Hive? You want to create the tables in Spark SQL, right? If you are not able to find the same tables through Tableau, then Thrift is connecting to a different metastore than your spark-shell. One way to specify a metastore to Thrift is to provide the path to hive-site.xml while starting Thrift, using --files hive-site.xml. Similarly, you can specify the same metastore to your spark-submit or spark-shell using the same option.

On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist tsind...@gmail.com wrote: Arush, As for #2 do you mean something like this from the docs:

   // sc is an existing SparkContext.
   val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
   sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
   sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
   // Queries are expressed in HiveQL
   sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Or did you have something else in mind? -Todd

On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist tsind...@gmail.com wrote: Arush, Thank you, will take a look at that approach in the morning. I sort of figured the answer to #1 was NO and that I would need to do 2 and 3, thanks for clarifying it for me. -Todd

On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: 1. Can the connector fetch or query schemaRDD's saved to Parquet or JSON files? NO. 2. Do I need to do something to expose these via hive / metastore other than creating a table in hive? Create a table in Spark SQL to expose it via Spark SQL. 3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2? You would need to configure Thrift to read from the metastore you expect it to read from - by default it reads from the metastore_db directory present in the directory used to launch the Thrift server.

On 11 Feb 2015 01:35, Todd Nist tsind...@gmail.com wrote: Hi, I'm trying to understand how and what the Tableau connector to SparkSQL is able to access. My understanding is it needs to connect to the thriftserver, and I am not sure how or if it exposes parquet, json, schemaRDDs, or whether it only exposes schemas defined in the metastore / hive. For example, I do the following from the spark-shell, which generates a schemaRDD from a csv file and saves it as a JSON file as well as a parquet file:

   import org.apache.spark.sql.SQLContext
   import com.databricks.spark.csv._
   val sqlContext = new SQLContext(sc)
   val test = sqlContext.csvFile("/data/test.csv")
   test.toJSON().saveAsTextFile("/data/out")
   test.saveAsParquetFile("/data/out")

When I connect from Tableau, the only thing I see is the "default" schema and nothing in the tables section. So my questions are: 1. Can the connector fetch or query schemaRDD's saved to Parquet or JSON files? 2. Do I need to do something to expose these via hive / metastore other than creating a table in hive? 3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2. TIA for the assistance. -Todd -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: ZeroMQ and pyspark.streaming
No, the ZeroMQ API is not supported in Python as of now. On 5 Feb 2015 21:27, Sasha Kacanski skacan...@gmail.com wrote: Does pyspark support ZeroMQ? I see that Java does it, but I am not sure for Python? regards -- Aleksandar Kacanski
Re: SparkSQL + Tableau Connector
1. Can the connector fetch or query schemaRDD's saved to Parquet or JSON files? NO. 2. Do I need to do something to expose these via hive / metastore other than creating a table in hive? Create a table in Spark SQL to expose it via Spark SQL. 3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2? You would need to configure Thrift to read from the metastore you expect it to read from - by default it reads from the metastore_db directory present in the directory used to launch the Thrift server.

On 11 Feb 2015 01:35, Todd Nist tsind...@gmail.com wrote: Hi, I'm trying to understand how and what the Tableau connector to SparkSQL is able to access. My understanding is it needs to connect to the thriftserver, and I am not sure how or if it exposes parquet, json, schemaRDDs, or whether it only exposes schemas defined in the metastore / hive. For example, I do the following from the spark-shell, which generates a schemaRDD from a csv file and saves it as a JSON file as well as a parquet file:

   import org.apache.spark.sql.SQLContext
   import com.databricks.spark.csv._
   val sqlContext = new SQLContext(sc)
   val test = sqlContext.csvFile("/data/test.csv")
   test.toJSON().saveAsTextFile("/data/out")
   test.saveAsParquetFile("/data/out")

When I connect from Tableau, the only thing I see is the "default" schema and nothing in the tables section. So my questions are: 1. Can the connector fetch or query schemaRDD's saved to Parquet or JSON files? 2. Do I need to do something to expose these via hive / metastore other than creating a table in hive? 3. Does the thriftserver need to be configured to expose these in some fashion, sort of related to question 2. TIA for the assistance. -Todd
Re: Spark 1.2.x Yarn Auxiliary Shuffle Service
Is this what you are looking for?

1. Build Spark with the YARN profile http://spark.apache.org/docs/1.2.0/building-spark.html. Skip this step if you are using a pre-packaged distribution.
2. Locate the spark-<version>-yarn-shuffle.jar. This should be under $SPARK_HOME/network/yarn/target/scala-<version> if you are building Spark yourself, and under lib if you are using a distribution.
3. Add this jar to the classpath of all NodeManagers in your cluster.
4. In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services, then set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService. Additionally, set all relevant spark.shuffle.service.* configurations http://spark.apache.org/docs/1.2.0/configuration.html.
5. Restart all NodeManagers in your cluster.

On Wed, Jan 28, 2015 at 1:30 AM, Corey Nolet cjno...@gmail.com wrote: I've read that this is supposed to be a rather significant optimization to the shuffle system in 1.1.0 but I'm not seeing much documentation on enabling this in Yarn. I see github classes for it in 1.2.0 and a property spark.shuffle.service.enabled in the spark-defaults.conf. The code mentions that this is supposed to be run inside the NodeManager, so I'm assuming it needs to be wired up in the yarn-site.xml under the yarn.nodemanager.aux-services property? -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
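A hedged sketch of the yarn-site.xml entries from step 4, keeping any aux-services that are already configured (such as mapreduce_shuffle):

   <property>
     <name>yarn.nodemanager.aux-services</name>
     <value>mapreduce_shuffle,spark_shuffle</value>
   </property>
   <property>
     <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
     <value>org.apache.spark.network.yarn.YarnShuffleService</value>
   </property>

On the Spark side, spark.shuffle.service.enabled, the property Corey mentions, must be set to true in spark-defaults.conf.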
Re: checking
Yes, they are. On Fri, Feb 6, 2015 at 5:06 PM, Mohit Durgapal durgapalmo...@gmail.com wrote: Just wanted to know if my emails are reaching the user list. Regards Mohit -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Use Spark as multi-threading library and deprecate web UI
You can use Akka, which is the underlying multithreading library Spark uses. On Thu, Feb 5, 2015 at 9:56 PM, Shuai Zheng szheng.c...@gmail.com wrote: Nice. I just tried it and it works. Thanks very much! And I notice the following in the log: 15/02/05 11:19:09 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@NY02913D.global.local:8162] 15/02/05 11:19:10 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@NY02913D.global.local:8162/user/HeartbeatReceiver As I understand it, local mode will have the driver and executors in the same Java process. So is there any way for me to also disable the above two listeners? Or are they not optional even in local mode? Regards, Shuai -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, February 05, 2015 10:53 AM To: Shuai Zheng Cc: user@spark.apache.org Subject: Re: Use Spark as multi-threading library and deprecate web UI Do you mean disable the web UI? spark.ui.enabled=false Sure, it's useful with master = local[*] too. On Thu, Feb 5, 2015 at 9:30 AM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, It might sound weird, but I think Spark is perfect to be used as a multi-threading library in some cases. Local mode will naturally spawn multiple threads when required. Because it is more restricted, there is less chance of a potential bug in the code (because it is more data-oriented, not thread-oriented). Of course, it cannot be used for all cases, but in most of my applications it is enough (90%). I want to hear other people's ideas about this. BTW: if I run Spark in local mode, how do I deprecate the web UI (which by default listens on 4040)? I don't want to start the UI every time if I use Spark as a local library. Regards, Shuai - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
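A minimal sketch of local mode with the web UI disabled, using the setting from Sean's reply:

   import org.apache.spark.{SparkConf, SparkContext}

   val conf = new SparkConf()
     .setAppName("LocalCompute")
     .setMaster("local[*]")             // one task slot per local core
     .set("spark.ui.enabled", "false")  // do not start the UI on port 4040
   val sc = new SparkContext(conf)
   val sumOfSquares = sc.parallelize(1 to 1000).map(x => x.toLong * x).sum()
   sc.stop()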
Re: how to specify hive connection options for HiveContext
Hi, Are you trying to run a Spark job from inside Eclipse, and want the job to access Hive configuration options in order to access Hive tables? Thanks Arush On Tue, Feb 3, 2015 at 7:24 AM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi, I know two options, one for spark-submit, the other one for spark-shell, but how do I set them for programs running inside Eclipse? Regards, -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Errors in the workers machines
1. For what reasons is Spark using the above ports? What internal component is triggering them? - Akka (guessing from the error log) is used to schedule tasks and to notify executors; the ports used are random by default.
2. How can I get rid of these errors? - Probably the ports are not open on your server. You can set specific ports and open them using spark.driver.port and spark.executor.port, or you can open all ports between the masters and slaves. For a cluster on EC2, the ec2 script takes care of what is required.
3. Why did the application still finish with success? - Do you have more workers in the cluster which are able to connect?
4. Why is it trying more ports? - Not sure; it is picking the ports randomly.

On Thu, Feb 5, 2015 at 2:30 PM, Spico Florin spicoflo...@gmail.com wrote: Hello! I received the following errors in the workerLog.log files: ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@stream4:33660] - [akka.tcp://sparkExecutor@stream4:47929]: Error [Association failed with [akka.tcp://sparkExecutor@stream4:47929]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@stream4:47929] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: stream4/x.x.x.x:47929 ] (For security reasons I have masked the IP with x.x.x.x). The same errors occur for different ports (42395, 39761). Even though I have these errors, the application finishes with success. I have the following questions: 1. For what reasons is Spark using the above ports? What internal component is triggering them? 2. How can I get rid of these errors? 3. Why does the application still finish with success? 4. Why is it trying more ports? I look forward to your answers. Regards, Florin -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com Arush Kharbanda || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
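A hedged sketch of pinning the otherwise random ports in conf/spark-defaults.conf so they can be opened explicitly between machines (the port numbers are arbitrary examples):

   spark.driver.port        7078
   spark.executor.port      7079
   spark.fileserver.port    7080
   spark.blockManager.port  7081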
Re: Union in Spark
Hi Deep, What is your configuration and what is the size of the two data sets? Thanks Arush On Mon, Feb 2, 2015 at 11:56 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: I did not check the console because once the job starts I cannot run anything else and have to force shutdown the system. I commented out parts of the code and tested. I suspect it is because of union. So, I want to change it to something else and see if the problem persists. Thank you On Mon, Feb 2, 2015 at 11:53 AM, Jerry Lam chiling...@gmail.com wrote: Hi Deep, How do you know the cluster is not responsive because of Union? Did you check the spark web console? Best Regards, Jerry On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: The cluster hangs. On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling...@gmail.com wrote: Hi Deep, what do you mean by stuck? Jerry On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Is there any better operation than Union? I am using union and the cluster is getting stuck with a large data set. Thank you -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Error when running spark in debug mode
Hi Ankur, Its running fine for me for spark 1.1 and changes to log4j properties file. Thanks Arush On Fri, Jan 30, 2015 at 9:49 PM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Hi Arush I have configured log4j by updating the file log4j.properties in SPARK_HOME/conf folder. If it was a log4j defect we would get error in debug mode in all apps. Thanks Ankur Hi Ankur, How are you enabling the debug level of logs. It should be a log4j configuration. Even if there would be some issue it would be in log4j and not in spark. Thanks Arush On Fri, Jan 30, 2015 at 4:24 AM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Hi, When ever I enable DEBUG level logs for my spark cluster, on running a job all the executors die with the below exception. On disabling the DEBUG logs my jobs move to the next step. I am on spark-1.1.0 Is this a known issue with spark? Thanks Ankur 2015-01-29 22:27:42,467 [main] INFO org.apache.spark.SecurityManager - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu) 2015-01-29 22:27:42,478 [main] DEBUG org.apache.spark.util.AkkaUtils - In createActorSystem, requireCookie is: off 2015-01-29 22:27:42,871 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started 2015-01-29 22:27:42,912 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO Remoting - Starting remoting 2015-01-29 22:27:43,057 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO Remoting - Remoting started; listening on addresses :[akka.tcp:// driverPropsFetcher@10.77.9.155:36035] 2015-01-29 22:27:43,060 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO Remoting - Remoting now listens on addresses: [akka.tcp:// driverPropsFetcher@10.77.9.155:36035] 2015-01-29 22:27:43,067 [main] INFO org.apache.spark.util.Utils - Successfully started service 'driverPropsFetcher' on port 36035. 2015-01-29 22:28:13,077 [main] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:ubuntu cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread main java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) ... 
4 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52) ... 7 more -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Error when running spark in debug mode
Can you share your log4j file. On Sat, Jan 31, 2015 at 1:35 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Hi Ankur, Its running fine for me for spark 1.1 and changes to log4j properties file. Thanks Arush On Fri, Jan 30, 2015 at 9:49 PM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Hi Arush I have configured log4j by updating the file log4j.properties in SPARK_HOME/conf folder. If it was a log4j defect we would get error in debug mode in all apps. Thanks Ankur Hi Ankur, How are you enabling the debug level of logs. It should be a log4j configuration. Even if there would be some issue it would be in log4j and not in spark. Thanks Arush On Fri, Jan 30, 2015 at 4:24 AM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Hi, When ever I enable DEBUG level logs for my spark cluster, on running a job all the executors die with the below exception. On disabling the DEBUG logs my jobs move to the next step. I am on spark-1.1.0 Is this a known issue with spark? Thanks Ankur 2015-01-29 22:27:42,467 [main] INFO org.apache.spark.SecurityManager - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu) 2015-01-29 22:27:42,478 [main] DEBUG org.apache.spark.util.AkkaUtils - In createActorSystem, requireCookie is: off 2015-01-29 22:27:42,871 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started 2015-01-29 22:27:42,912 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO Remoting - Starting remoting 2015-01-29 22:27:43,057 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO Remoting - Remoting started; listening on addresses :[akka.tcp:// driverPropsFetcher@10.77.9.155:36035] 2015-01-29 22:27:43,060 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO Remoting - Remoting now listens on addresses: [akka.tcp:// driverPropsFetcher@10.77.9.155:36035] 2015-01-29 22:27:43,067 [main] INFO org.apache.spark.util.Utils - Successfully started service 'driverPropsFetcher' on port 36035. 2015-01-29 22:28:13,077 [main] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:ubuntu cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread main java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) ... 
4 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52) ... 7 more -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
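A hedged workaround for the thread above: raise the log level selectively in conf/log4j.properties instead of globally, so application classes log at DEBUG while the chatty Akka remoting that sits on the executor registration path stays at INFO (the application package is hypothetical):

   log4j.rootCategory=INFO, console
   log4j.logger.org.apache.spark=INFO
   log4j.logger.akka=INFO
   # hypothetical application package
   log4j.logger.com.mycompany.myapp=DEBUG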
Re: Spark streaming - tracking/deleting processed files
Hi Ganterm, That is expected behavior. If you look at the documentation for textFileStream: "Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files (using key as LongWritable, value as Text and input format as TextInputFormat). Files must be written to the monitored directory by moving them from another location within the same file system. File names starting with . are ignored." So you need to move files into the directory while the system is up, and you need to manage that with a shell script: moving pending files in one at a time, and moving processed files out via another script. On Fri, Jan 30, 2015 at 11:37 PM, ganterm gant...@gmail.com wrote: We are running a Spark streaming job that retrieves files from a directory (using textFileStream). One concern we are having is the case where the job is down but files are still being added to the directory. Once the job starts up again, those files are not being picked up (since they are not new or changed while the job is running) but we would like them to be processed. Is there a solution for that? Is there a way to keep track of what files have been processed, and can we force older files to be picked up? Is there a way to delete the processed files? Thanks! Markus -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-tracking-deleting-processed-files-tp21444.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
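A minimal sketch of the move-into-place pattern for files that arrived while the job was down (paths are hypothetical); the move is atomic within one filesystem, so textFileStream sees each file exactly once:

   # stage the file somewhere on the same filesystem, then move it in
   hdfs dfs -put backlog/events.log /staging/events.log
   hdfs dfs -mv /staging/events.log /watched/events.log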
Re: groupByKey is not working
Hi Amit, What error does it throw? Thanks Arush On Sat, Jan 31, 2015 at 1:50 AM, Amit Behera amit.bd...@gmail.com wrote: hi all, my sbt file is like this:

   name := "Spark"
   version := "1.0"
   scalaVersion := "2.10.4"
   libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
   libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"

code:

   object SparkJob {
     def pLines(lines: Iterator[String]) = {
       val parser = new CSVParser()
       lines.map { l =>
         val vs = parser.parseLine(l)
         (vs(0), vs(1).toInt)
       }
     }
     def main(args: Array[String]) {
       val conf = new SparkConf().setAppName("Spark Job").setMaster("local")
       val sc = new SparkContext(conf)
       val data = sc.textFile("/home/amit/testData.csv").cache()
       val result = data.mapPartitions(pLines).groupByKey
       //val list = result.filter(x => (x._1).contains("24050881"))
     }
   }

Here groupByKey is not working. But the same thing works from spark-shell. Please help me Thanks Amit -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
Re: Building Spark behind a proxy
Hi Soumya, I meant: is there any difference in the error message between when you configure JAVA_OPTS and when you don't? And are you facing the same issue when you build using Maven? Thanks Arush On Thu, Jan 29, 2015 at 10:22 PM, Soumya Simanta soumya.sima...@gmail.com wrote: I can do a wget http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and get the file successfully on a shell. On Thu, Jan 29, 2015 at 11:51 AM, Boromir Widas vcsub...@gmail.com wrote: At least a part of it is due to connection refused; can you check if curl can reach the URL with proxies - [FATAL] Non-resolvable parent POM: Could not transfer artifact org.apache:apache:pom:14 from/to central ( http://repo.maven.apache.org/maven2): Error transferring file: Connection refused from http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom On Thu, Jan 29, 2015 at 11:35 AM, Soumya Simanta soumya.sima...@gmail.com wrote: On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Does the error change when you build with and without the build options? What do you mean by build options? I'm just doing ./sbt/sbt assembly from $SPARK_HOME Did you try using Maven and doing the proxy settings there? No, I've not tried Maven yet. However, I did set proxy settings inside my .m2/settings.xml, but it didn't make any difference. -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
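For reference, a hedged sketch of passing proxy settings to the sbt build through JVM options (host and port are hypothetical); the same -D flags can go into MAVEN_OPTS for a Maven build:

   export JAVA_OPTS="-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 \
     -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080"
   ./sbt/sbt assembly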
Re: Dependency unresolved hadoop-yarn-common 1.0.4 when running quickstart example
Hi Sarwar, For a quick fix you can exclude the YARN dependencies (you won't be needing them if you are running locally), using sbt's exclude, e.g.:

   libraryDependencies += "log4j" % "log4j" % "1.2.15" exclude("javax.jms", "jms")

You can also analyze your dependencies using this plugin https://github.com/jrudolph/sbt-dependency-graph Thanks Arush On Thu, Jan 29, 2015 at 4:20 AM, Sarwar Bhuiyan sarwar.bhui...@gmail.com wrote: Hello all, I'm trying to build the sample application on the spark 1.2.0 quickstart page (https://spark.apache.org/docs/latest/quick-start.html) using the following build.sbt file:

   name := "Simple Project"
   version := "1.0"
   scalaVersion := "2.10.4"
   libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"

Upon calling sbt package, it downloaded a lot of dependencies but eventually failed with some warnings and errors. Here's the snippet:

   [warn] :: UNRESOLVED DEPENDENCIES ::
   [warn] :: org.apache.hadoop#hadoop-yarn-common;1.0.4: not found
   [warn] :: org.apache.hadoop#hadoop-yarn-client;1.0.4: not found
   [warn] :: org.apache.hadoop#hadoop-yarn-api;1.0.4: not found
   [warn] :: FAILED DOWNLOADS (see resolution messages for details) ::
   [warn] :: org.eclipse.jetty.orbit#javax.transaction;1.1.1.v201105210645!javax.transaction.orbit
   [warn] :: org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit
   [warn] :: org.eclipse.jetty.orbit#javax.mail.glassfish;1.4.1.v201005082020!javax.mail.glassfish.orbit
   [warn] :: org.eclipse.jetty.orbit#javax.activation;1.1.0.v201105071233!javax.activation.orbit

Upon checking the maven repositories there doesn't seem to be any hadoop-yarn-common 1.0.4. I've tried explicitly setting a dependency on hadoop-yarn-common 2.4.0, for example, but to no avail. I've also tried setting a number of different repositories to see if maybe one of them might have that dependency. Still no dice. What's the best way to resolve this for a quickstart situation? Do I have to set some sort of profile or environment variable which doesn't try to bring in the 1.0.4 yarn version? Any help would be greatly appreciated. Sarwar -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
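Applying the same quick fix to the artifacts that actually fail above, a hedged build.sbt sketch; whether this resolves the root cause depends on why sbt tried to resolve YARN 1.0.4 in the first place:

   libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" excludeAll(
     ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-common"),
     ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-client"),
     ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-api")
   )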
Re: unknown issue in submitting a spark job
) at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363) at java.lang.Thread.run(Thread.java:745) 15/01/29 08:54:33 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-10-8-191.us-west-2.compute.internal, 47722, 0) with no recent heart beats: 82575ms exceeds 45000ms 15/01/29 08:54:33 INFO spark.ContextCleaner: Cleaned RDD 1 15/01/29 08:54:33 WARN util.AkkaUtils: Error sending message in 1 attempts akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/BlockManagerMaster#-538003375]] had already been terminated. at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:175) at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:218) at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:126) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/unknown-issue-in-submitting-a-spark-job-tp21418.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
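For context, the "no recent heart beats: 82575ms exceeds 45000ms" warning means an executor failed to heartbeat to the driver within the timeout, often because it was stalled in a long GC pause. A commonly suggested mitigation on this list was to raise the timeout; a sketch, assuming the Spark 1.x configuration key spark.storage.blockManagerSlaveTimeoutMs (the 45000ms default in the log matches that key):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyJob") // placeholder app name
      .set("spark.storage.blockManagerSlaveTimeoutMs", "300000") // 5 minutes instead of the 45s default
    val sc = new SparkContext(conf)

Raising the timeout only masks the symptom; if the pauses come from memory pressure, increasing executor memory is usually the better fix.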
Re: is there a master for spark cluster in ec2
Hi Mohit, You can set the master instance type with -m. To set up a cluster you need to use the ec2/spark-ec2 script. You need to create an AWS_ACCESS_KEY_ID and an AWS_SECRET_ACCESS_KEY in your AWS web console under Security Credentials, and pass them to the script above. Once you do that you should be able to set up your cluster using the spark-ec2 options. Thanks Arush On Thu, Jan 29, 2015 at 6:41 AM, Mohit Singh mohit1...@gmail.com wrote: Hi, Probably a naive question, but I am creating a spark cluster on ec2 using the ec2 scripts in there. Is there a master param I need to set: ./bin/pyspark --master [ ]? I don't yet fully understand the ec2 concepts, so I just wanted to confirm this. Thanks -- Mohit When you want success as badly as you want the air, then you will get it. There is no other secret of success. -Socrates -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
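As an illustration, a typical launch might look like the following (key pair, identity file, instance type, and cluster name are all placeholders; flags are from the 1.x-era spark-ec2 script):

    export AWS_ACCESS_KEY_ID=...
    export AWS_SECRET_ACCESS_KEY=...
    ./ec2/spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 2 -m m3.xlarge launch my-cluster
    ./ec2/spark-ec2 get-master my-cluster   # prints the master's hostname

Once the cluster is up, you point the shell at the standalone master the script set up for you:

    ./bin/pyspark --master spark://<master-hostname>:7077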
Re: Building Spark behind a proxy
) at xsbt.boot.Launch$.launch(Launch.scala:117) at xsbt.boot.Launch$.apply(Launch.scala:19) at xsbt.boot.Boot$.runImpl(Boot.scala:44) at xsbt.boot.Boot$.main(Boot.scala:20) at xsbt.boot.Boot.main(Boot.scala) [error] org.apache.maven.model.building.ModelBuildingException: 1 problem was encountered while building the effective model for org.apache.spark:spark-parent:1.1.1 [error] [FATAL] Non-resolvable parent POM: Could not transfer artifact org.apache:apache:pom:14 from/to central ( http://repo.maven.apache.org/maven2): Error transferring file: Connection refused from http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom and 'parent.relativePath' points at wrong local POM @ line 21, column 11 [error] Use 'last' for the full log. Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
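The connection-refused error above is the proxy blocking the direct fetch. For the sbt build, the proxy settings have to reach the JVM as system properties; a sketch (proxy host and port are placeholders, and whether JAVA_OPTS is picked up depends on the sbt wrapper script in your checkout):

    export JAVA_OPTS="-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080"
    ./sbt/sbt assembly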
Re: Hive on Spark vs. SparkSQL using Hive ?
Spark SQL on Hive: 1. The purpose of Spark SQL is to let Spark users selectively use SQL expressions (with a fairly limited set of functions currently supported) when writing Spark jobs. 2. Already available. Hive on Spark: 1. Spark users will automatically get the whole set of Hive's rich features, including any new features that Hive might introduce in the future. 2. Under development. On Thu, Jan 29, 2015 at 4:54 AM, ogoh oke...@gmail.com wrote: Hello, probably this question was already asked, but still I'd like to confirm it with Spark users. This blog shows 'Hive on Spark': http://blog.cloudera.com/blog/2014/12/hands-on-hive-on-spark-in-the-aws-cloud/ . How is it different from using Hive as the data storage of Spark SQL (http://spark.apache.org/docs/latest/sql-programming-guide.html)? Also, is there any update about Spark SQL's next release (the current one is still alpha)? Thanks, OGoh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hive-on-Spark-vs-SparkSQL-using-Hive-tp21412.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
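To make the "SparkSQL using Hive" side concrete, a minimal sketch with the Spark 1.2-era API, assuming Spark was built with Hive support and a Hive table named src already exists in the metastore:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("SparkSQLOnHive"))
    val hiveContext = new HiveContext(sc)
    // HiveQL runs against tables registered in the existing Hive metastore
    hiveContext.sql("SELECT key, value FROM src LIMIT 10").collect().foreach(println)

Hive on Spark, by contrast, is configured on the Hive side (hive.execution.engine=spark) and involves no Spark code at all.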
Re: Spark serialization issues with third-party libraries
Hi, You can see my code here. It's a POC to implement UIMA on Spark: https://bitbucket.org/SigmoidDev/uimaspark https://bitbucket.org/SigmoidDev/uimaspark/src/8476fdf16d84d0f517cce45a8bc1bd3410927464/UIMASpark/src/main/scala/*UIMAProcessor.scala*?at=master is the class where the major part of the integration happens. Thanks Arush On Sun, Nov 23, 2014 at 7:52 PM, jatinpreet jatinpr...@gmail.com wrote: Thanks Sean, I was actually using instances created elsewhere inside my RDD transformations, which, as I understand it, is against the Spark programming model. I was referred to a talk about UIMA and Spark integration from this year's Spark Summit, which had a workaround for this problem: I just had to make some class members transient. http://spark-summit.org/2014/talk/leveraging-uima-in-spark Thanks - Novice Big Data Programmer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-serialization-issues-with-third-party-libraries-tp19454p19589.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
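The transient-member workaround generalizes to most non-serializable third-party objects. A hedged sketch of the pattern, where HeavyAnalysisEngine is a hypothetical stand-in for something like a UIMA AnalysisEngine:

    // Hypothetical stand-in for a non-serializable third-party engine
    class HeavyAnalysisEngine { def process(doc: String): String = doc.toUpperCase }

    class DocumentAnalyzer extends Serializable {
      // @transient keeps the engine out of the serialized closure;
      // lazy val re-creates it on first use after deserialization on each executor
      @transient lazy val engine = new HeavyAnalysisEngine
      def analyze(doc: String): String = engine.process(doc)
    }

    // usage inside a job (analyzer itself serializes cheaply):
    // val analyzer = new DocumentAnalyzer
    // rdd.map(analyzer.analyze)

The alternative is mapPartitions, which lets you create the heavy object once per partition instead of shipping it at all.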