How to separate messages from different topics
I want to read from many topics in Kafka and know which topic each message is coming from (topic1, topic2 and so on).

val kafkaParams = Map[String, String]("metadata.broker.list" -> "myKafka:9092")
val topics = Set("EntryLog", "presOpManager")
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

Is there some way to separate the messages per topic with just one directStream, or should I create a different stream for each topic?
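One way to recover the topic while keeping a single direct stream (a sketch, not from the original thread) is to read the offset-range metadata that every direct-stream RDD carries; partition i of the RDD corresponds to offsetRanges(i). The stream and variable names follow the snippet above.

import org.apache.spark.streaming.kafka.HasOffsetRanges

directKafkaStream.foreachRDD { rdd =>
  // Each partition of a direct-stream RDD maps to exactly one Kafka topic/partition,
  // so the topic name can be looked up from the RDD's offset ranges.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.mapPartitionsWithIndex { (i, messages) =>
    val topic = offsetRanges(i).topic
    messages.map { case (key, value) => (topic, value) }
  }.foreach(println)
}

Spark 1.3 also has a createDirectStream overload that takes a messageHandler (MessageAndMetadata[K, V] => R) and so exposes the topic per record, but that overload requires passing explicit starting offsets.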
Re: JAVA for SPARK certification
And how important is to have production environment? On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote: There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have similar question. My understanding is since Spark written in scala, having done in Scala will be ok for certification. If someone who has done certification can confirm. Thanks, Kartik On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav
Spark + Kafka with directStream
I'm trying to execute the Hello World example with Spark + Kafka (https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala) with createDirectStream and I get this error.

java.lang.NoSuchMethodError: kafka.message.MessageAndMetadata.<init>(Ljava/lang/String;ILkafka/message/Message;JLkafka/serializer/Decoder;Lkafka/serializer/Decoder;)V
  at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:176)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

I have checked the jars and it's the kafka-2.10-0.8.2.jar in the classpath with the MessageAndMetadata class. I even navigated through Eclipse and came to this class with the right parameters. My pom.xml has these dependencies:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.3.1</version>
  <!-- <scope>provided</scope> -->
  <exclusions>
    <exclusion>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka_2.10</artifactId>
  <version>0.8.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka_2.10</artifactId>
  <version>1.3.1</version>
  <!-- <scope>provided</scope> -->
  <exclusions>
    <exclusion>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
    </exclusion>
  </exclusions>
</dependency>

Does somebody know where the error is? Thanks.
RE: Re: sparksql running slow while joining 2 tables.
56mb / 26mb is a very small size; do you observe data skew? More precisely, many records with the same chrname / name? And can you also double-check the JVM settings for the executor process? From: luohui20...@sina.com Sent: Tuesday, May 5, 2015 7:50 PM To: Cheng, Hao; Wang, Daoyuan; Olivier Girardot; user Subject: Re: sparksql running slow while joining 2 tables. Hi guys, attached the pic of the physical plan and the logs. Thanks. Thanks & Best regards! 罗辉 San.Luo ----- Original Message ----- From: Cheng, Hao hao.ch...@intel.com To: Wang, Daoyuan daoyuan.w...@intel.com, luohui20...@sina.com, Olivier Girardot ssab...@gmail.com, user user@spark.apache.org Subject: Re: sparksql running slow while joining 2 tables. Date: 2015-05-05 13:18 I assume you're using the DataFrame API within your application. sql("SELECT…").explain(true) From: Wang, Daoyuan Sent: Tuesday, May 05, 2015 10:16 AM To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user Subject: RE: Re: sparksql running slow while joining 2 tables. You can use Explain extended select …. From: luohui20...@sina.com Sent: Tuesday, May 05, 2015 9:52 AM To: Cheng, Hao; Olivier Girardot; user Subject: Re: sparksql running slow while joining 2 tables. As I know, broadcast join is automatically enabled by spark.sql.autoBroadcastJoinThreshold, refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options. And how do I check my app's physical plan, and other things like the optimized plan, executable plan, etc.? Thanks. Thanks & Best regards! 罗辉 San.Luo ----- Original Message ----- From: Cheng, Hao hao.ch...@intel.com To: Cheng, Hao hao.ch...@intel.com, luohui20...@sina.com, Olivier Girardot ssab...@gmail.com, user user@spark.apache.org Subject: RE: Re: sparksql running slow while joining 2 tables. Date: 2015-05-05 08:38 Or, have you ever tried a broadcast join? From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Tuesday, May 5, 2015 8:33 AM To: luohui20...@sina.com; Olivier Girardot; user Subject: RE: Re: sparksql running slow while joining 2 tables. Can you print out the physical plan? EXPLAIN SELECT xxx… From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: Monday, May 4, 2015 9:08 PM To: Olivier Girardot; user Subject: Re: sparksql running slow while joining 2 tables. hi Olivier, spark 1.3.1, with java 1.8.0.45, and added 2 pics. It seems like a GC issue. I also tried different parameters like memory size of driver/executor, memory fraction, java opts... but this issue still happens. Thanks & Best regards! 罗辉 San.Luo ----- Original Message ----- From: Olivier Girardot ssab...@gmail.com To: luohui20...@sina.com, user user@spark.apache.org Subject: Re: sparksql running slow while joining 2 tables. Date: 2015-05-04 20:46 Hi, What is your Spark version? Regards, Olivier. On Mon,
4 May 2015 at 11:03, luohui20...@sina.com wrote: hi guys, when I am running a SQL like

select a.name, a.startpoint, a.endpoint, a.piece from db a join sample b on (a.name = b.name) where (b.startpoint > a.startpoint + 25);

I found sparksql running slow, taking minutes, which may be caused by very long GC and shuffle time. Table db is created from a txt file sized at 56mb while table sample is sized at 26mb, both small. My spark cluster is a standalone pseudo-distributed spark cluster with an 8g executor and 4g driver manager. Any advice? Thank you guys. Thanks & Best regards! 罗辉 San.Luo
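To make the suggestions in this thread concrete, here is a minimal sketch (assuming the Spark 1.3 DataFrame/SQL API, with the table and column names from the query above; the 64 MB threshold is an illustrative value, not from the thread) of printing the plans and nudging the small table into a broadcast join.

// Raise the auto-broadcast threshold above the 56 MB table size so the
// smaller side can be broadcast instead of shuffled.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (64 * 1024 * 1024).toString)

val joined = sqlContext.sql(
  """SELECT a.name, a.startpoint, a.endpoint, a.piece
    |FROM db a JOIN sample b ON (a.name = b.name)
    |WHERE b.startpoint > a.startpoint + 25""".stripMargin)

// Prints the parsed, analyzed, optimized and physical plans.
joined.explain(true)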
multiple hdfs folder files input to PySpark
Hi, we are using pyspark 1.3 and the input is text files located on hdfs.

File structure:
day1/ file1.txt file2.txt
day2/ file1.txt file2.txt
...

Questions: 1) What is the way to provide multiple files located in multiple folders (on hdfs) as input for a PySpark job? Using the textFile method works fine for a single file or folder, but how can I do it for multiple folders? Is there a way to pass an array or list of files? 2) What is the meaning of the partition parameter in the textFile method? sc = SparkContext(appName="TAD") lines = sc.textFile("my input", 1) Thanks Oleg.
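A sketch of one answer (shown in Scala for brevity; the PySpark textFile call accepts the same arguments, and the paths below are placeholders): textFile takes comma-separated paths and glob patterns, so several day folders can be read in one call, and its second parameter is minPartitions, a minimum number of partitions rather than an exact count.

// Multiple folders via comma-separated paths and/or globs (placeholder paths).
val lines = sc.textFile("hdfs:///data/day1,hdfs:///data/day2")
val allDays = sc.textFile("hdfs:///data/day*/*.txt")

// The second argument is a *minimum* number of partitions (a hint passed to the
// Hadoop input format), not an exact partition count.
val hinted = sc.textFile("hdfs:///data/day1/file1.txt", 4)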
Re: JAVA for SPARK certification
Production - not whole lot of companies have implemented Spark in production and so though it is good to have, not must. If you are on LinkedIn, a group of folks including myself are preparing for Spark certification, learning in group makes learning easy and fun. Kartik On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote: And how important is to have production environment? On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote: There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have similar question. My understanding is since Spark written in scala, having done in Scala will be ok for certification. If someone who has done certification can confirm. Thanks, Kartik On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav
Re: Unable to join table across data sources using sparkSQL
I suggest you to try with date_dim.d_year in the query On Tue, May 5, 2015 at 10:47 PM, Ishwardeep Singh ishwardeep.si...@impetus.co.in wrote: Hi Ankit, printSchema() works fine for all the tables. hiveStoreSalesDF.printSchema() root |-- store_sales.ss_sold_date_sk: integer (nullable = true) |-- store_sales.ss_sold_time_sk: integer (nullable = true) |-- store_sales.ss_item_sk: integer (nullable = true) |-- store_sales.ss_customer_sk: integer (nullable = true) |-- store_sales.ss_cdemo_sk: integer (nullable = true) |-- store_sales.ss_hdemo_sk: integer (nullable = true) |-- store_sales.ss_addr_sk: integer (nullable = true) |-- store_sales.ss_store_sk: integer (nullable = true) |-- store_sales.ss_promo_sk: integer (nullable = true) |-- store_sales.ss_ticket_number: integer (nullable = true) |-- store_sales.ss_quantity: integer (nullable = true) |-- store_sales.ss_wholesale_cost: double (nullable = true) |-- store_sales.ss_list_price: double (nullable = true) |-- store_sales.ss_sales_price: double (nullable = true) |-- store_sales.ss_ext_discount_amt: double (nullable = true) |-- store_sales.ss_ext_sales_price: double (nullable = true) |-- store_sales.ss_ext_wholesale_cost: double (nullable = true) |-- store_sales.ss_ext_list_price: double (nullable = true) |-- store_sales.ss_ext_tax: double (nullable = true) |-- store_sales.ss_coupon_amt: double (nullable = true) |-- store_sales.ss_net_paid: double (nullable = true) |-- store_sales.ss_net_paid_inc_tax: double (nullable = true) |-- store_sales.ss_net_profit: double (nullable = true) dateDimDF.printSchema() root |-- d_date_sk: integer (nullable = false) |-- d_date_id: string (nullable = false) |-- d_date: date (nullable = true) |-- d_month_seq: integer (nullable = true) |-- d_week_seq: integer (nullable = true) |-- d_quarter_seq: integer (nullable = true) |-- d_year: integer (nullable = true) |-- d_dow: integer (nullable = true) |-- d_moy: integer (nullable = true) |-- d_dom: integer (nullable = true) |-- d_qoy: integer (nullable = true) |-- d_fy_year: integer (nullable = true) |-- d_fy_quarter_seq: integer (nullable = true) |-- d_fy_week_seq: integer (nullable = true) |-- d_day_name: string (nullable = true) |-- d_quarter_name: string (nullable = true) |-- d_holiday: string (nullable = true) |-- d_weekend: string (nullable = true) |-- d_following_holiday: string (nullable = true) |-- d_first_dom: integer (nullable = true) |-- d_last_dom: integer (nullable = true) |-- d_same_day_ly: integer (nullable = true) |-- d_same_day_lq: integer (nullable = true) |-- d_current_day: string (nullable = true) |-- d_current_week: string (nullable = true) |-- d_current_month: string (nullable = true) |-- d_current_quarter: string (nullable = true) |-- d_current_year: string (nullable = true) itemDF.printSchema() root |-- i_item_sk: integer (nullable = false) |-- i_item_id: string (nullable = false) |-- i_rec_start_date: date (nullable = true) |-- i_rec_end_date: date (nullable = true) |-- i_item_desc: string (nullable = true) |-- i_current_price: decimal (nullable = true) |-- i_wholesale_cost: decimal (nullable = true) |-- i_brand_id: integer (nullable = true) |-- i_brand: string (nullable = true) |-- i_class_id: integer (nullable = true) |-- i_class: string (nullable = true) |-- i_category_id: integer (nullable = true) |-- i_category: string (nullable = true) |-- i_manufact_id: integer (nullable = true) |-- i_manufact: string (nullable = true) |-- i_size: string (nullable = true) |-- i_formulation: string (nullable = true) |-- i_color: string (nullable = true) 
|-- i_units: string (nullable = true) |-- i_container: string (nullable = true) |-- i_manager_id: integer (nullable = true) |-- i_product_name: string (nullable = true) Regards, Ishwardeep From: ankitjindal [via Apache Spark User List] Sent: Tuesday, May 5, 2015 5:00 PM To: Ishwardeep Singh Subject: RE: Unable to join table across data sources using sparkSQL Just check the Schema of both the tables using frame.printSchema();
Re: Spark + Kafka with directStream
Sorry, I had a duplicated kafka dependency with another older version in another pom.xml. 2015-05-05 14:46 GMT+02:00 Guillermo Ortiz konstt2...@gmail.com: I'm trying to execute the Hello World example with Spark + Kafka (https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala) with createDirectStream and I get this error: java.lang.NoSuchMethodError: kafka.message.MessageAndMetadata.<init>(Ljava/lang/String;ILkafka/message/Message;JLkafka/serializer/Decoder;Lkafka/serializer/Decoder;)V
Re: JAVA for SPARK certification
I might join in to this conversation with an ask. Would someone point me to a decent exercise that would approximate the level of this exam (from above)? Thanks! On Tue, May 5, 2015 at 3:37 PM Kartik Mehta kartik.meht...@gmail.com wrote: Production - not whole lot of companies have implemented Spark in production and so though it is good to have, not must. If you are on LinkedIn, a group of folks including myself are preparing for Spark certification, learning in group makes learning easy and fun. Kartik On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote: And how important is to have production environment? On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote: There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have similar question. My understanding is since Spark written in scala, having done in Scala will be ok for certification. If someone who has done certification can confirm. Thanks, Kartik On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav
RE: Unable to join table across data sources using sparkSQL
Hi Ankit, printSchema() works fine for all the tables. hiveStoreSalesDF.printSchema() root |-- store_sales.ss_sold_date_sk: integer (nullable = true) |-- store_sales.ss_sold_time_sk: integer (nullable = true) |-- store_sales.ss_item_sk: integer (nullable = true) |-- store_sales.ss_customer_sk: integer (nullable = true) |-- store_sales.ss_cdemo_sk: integer (nullable = true) |-- store_sales.ss_hdemo_sk: integer (nullable = true) |-- store_sales.ss_addr_sk: integer (nullable = true) |-- store_sales.ss_store_sk: integer (nullable = true) |-- store_sales.ss_promo_sk: integer (nullable = true) |-- store_sales.ss_ticket_number: integer (nullable = true) |-- store_sales.ss_quantity: integer (nullable = true) |-- store_sales.ss_wholesale_cost: double (nullable = true) |-- store_sales.ss_list_price: double (nullable = true) |-- store_sales.ss_sales_price: double (nullable = true) |-- store_sales.ss_ext_discount_amt: double (nullable = true) |-- store_sales.ss_ext_sales_price: double (nullable = true) |-- store_sales.ss_ext_wholesale_cost: double (nullable = true) |-- store_sales.ss_ext_list_price: double (nullable = true) |-- store_sales.ss_ext_tax: double (nullable = true) |-- store_sales.ss_coupon_amt: double (nullable = true) |-- store_sales.ss_net_paid: double (nullable = true) |-- store_sales.ss_net_paid_inc_tax: double (nullable = true) |-- store_sales.ss_net_profit: double (nullable = true) dateDimDF.printSchema() root |-- d_date_sk: integer (nullable = false) |-- d_date_id: string (nullable = false) |-- d_date: date (nullable = true) |-- d_month_seq: integer (nullable = true) |-- d_week_seq: integer (nullable = true) |-- d_quarter_seq: integer (nullable = true) |-- d_year: integer (nullable = true) |-- d_dow: integer (nullable = true) |-- d_moy: integer (nullable = true) |-- d_dom: integer (nullable = true) |-- d_qoy: integer (nullable = true) |-- d_fy_year: integer (nullable = true) |-- d_fy_quarter_seq: integer (nullable = true) |-- d_fy_week_seq: integer (nullable = true) |-- d_day_name: string (nullable = true) |-- d_quarter_name: string (nullable = true) |-- d_holiday: string (nullable = true) |-- d_weekend: string (nullable = true) |-- d_following_holiday: string (nullable = true) |-- d_first_dom: integer (nullable = true) |-- d_last_dom: integer (nullable = true) |-- d_same_day_ly: integer (nullable = true) |-- d_same_day_lq: integer (nullable = true) |-- d_current_day: string (nullable = true) |-- d_current_week: string (nullable = true) |-- d_current_month: string (nullable = true) |-- d_current_quarter: string (nullable = true) |-- d_current_year: string (nullable = true) itemDF.printSchema() root |-- i_item_sk: integer (nullable = false) |-- i_item_id: string (nullable = false) |-- i_rec_start_date: date (nullable = true) |-- i_rec_end_date: date (nullable = true) |-- i_item_desc: string (nullable = true) |-- i_current_price: decimal (nullable = true) |-- i_wholesale_cost: decimal (nullable = true) |-- i_brand_id: integer (nullable = true) |-- i_brand: string (nullable = true) |-- i_class_id: integer (nullable = true) |-- i_class: string (nullable = true) |-- i_category_id: integer (nullable = true) |-- i_category: string (nullable = true) |-- i_manufact_id: integer (nullable = true) |-- i_manufact: string (nullable = true) |-- i_size: string (nullable = true) |-- i_formulation: string (nullable = true) |-- i_color: string (nullable = true) |-- i_units: string (nullable = true) |-- i_container: string (nullable = true) |-- i_manager_id: integer (nullable = true) |-- 
i_product_name: string (nullable = true) Regards, Ishwardeep From: ankitjindal [via Apache Spark User List] [mailto:ml-node+s1001560n22766...@n3.nabble.com] Sent: Tuesday, May 5, 2015 5:00 PM To: Ishwardeep Singh Subject: RE: Unable to join table across data sources using sparkSQL Just check the Schema of both the tables using frame.printSchema();
Re: Maximum Core Utilization
Also, if not already done, you may want to try repartitioning your data to 50 partitions. On 6 May 2015 05:56, Manu Kaul manohar.k...@gmail.com wrote: Hi All, For a job I am running on Spark with a dataset of say 350,000 lines (not big), I am finding that even though my cluster has a large number of cores available (like 100 cores), the Spark system seems to stop after using just 4 cores and after that the runtime is pretty much a straight line no matter how many more cores are thrown at it. I am wondering if Spark tries to figure out the maximum no. of cores to use based on the size of the dataset? If yes, is there a way to disable this feature and force it to use all the cores available? Thanks, Manu -- The greater danger for most of us lies not in setting our aim too high and falling short; but in setting our aim too low, and achieving our mark. - Michelangelo
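As a sketch of that suggestion (placeholder path and partition count, not from the thread): the number of concurrent tasks, and therefore the number of cores actually used, follows the number of partitions of the RDD, so splitting the 350,000-line input into more partitions lets more cores work at once.

// Ask for more partitions when loading (minPartitions hint) ...
val data = sc.textFile("hdfs:///path/to/input", 100)
// ... or force an explicit reshuffle afterwards.
val spread = data.repartition(100)
println(spread.partitions.length)   // verify how many partitions (and tasks) you get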
AvroFiles
Hi, I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro file was created using Avro 1.7.7. Similar to the example mentioned in http://www.infoobjects.com/spark-with-avro/, I am getting a NullPointerException on schema read. It could be an Avro version mismatch. Has anybody had a similar issue with avro? Thanks
Re: what does Container exited with a non-zero exit code 10 means?
What Spark tarball are you using? You may want to try the one for hadoop 2.6 (the one for hadoop 2.4 may cause that issue, IIRC). On Tue, May 5, 2015 at 6:54 PM, felicia shsh...@tsmc.com wrote: Hi all, We're trying to implement SparkSQL on CDH5.3.0 with cluster mode, and we get this error either using java or python; Application application_1430482716098_0607 failed 2 times due to AM Container for appattempt_1430482716098_0607_02 exited with exitCode: 10 due to: Exception from container-launch. Container id: container_e10_1430482716098_0607_02_01 Exit code: 10 Stack trace: ExitCodeException exitCode=10: at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Container exited with a non-zero exit code 10 .Failing this attempt.. Failing the application. where the detail log in nodes are: ERROR yarn.ApplicationMaster: Uncaught exception: java.lang.IllegalArgumentException: Invalid ContainerId: container_e10_1430482716098_0607_02_01 at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182) at org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:83) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:576) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:574) at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:597) at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) Caused by: java.lang.NumberFormatException: For input string: e10 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137) at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177) ... 12 more we’ve already tried the solution described as following but it doesn’t seems to work. Please advise if there’s any environmental settings we should exploit for clarifying our question, thanks! 
https://github.com/abhibasu/sparksql/wiki/SparkSQL-Configuration-in-CDH-5.3 -- Marcelo
Re: AvroFiles
I am not using Kryo. I was using the regular sqlContext.avroFile to open it. The file loads properly with the schema. The exception happens when I try to read it. Will try the Kryo serializer and see if that helps. On May 5, 2015 9:02 PM, Todd Nist tsind...@gmail.com wrote: Are you using Kryo or Java serialization? I found this post useful: http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist If using Kryo, you need to register the classes with Kryo, something like this:

sparkConf.registerKryoClasses(Array(classOf[ConfigurationProperty], classOf[Event]))

Or create a registrator, something like this:

class ODSKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[ConfigurationProperty], new AvroSerializer[ConfigurationProperty]())
    kryo.register(classOf[Event], new AvroSerializer[Event]())
  }
}

I encountered a similar error since several of the Avro core classes are not marked Serializable. HTH. Todd On Tue, May 5, 2015 at 7:09 PM, Pankaj Deshpande ppa...@gmail.com wrote: Hi, I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro file was created using Avro 1.7.7. Similar to the example mentioned in http://www.infoobjects.com/spark-with-avro/, I am getting a NullPointerException on schema read. It could be an Avro version mismatch. Has anybody had a similar issue with avro? Thanks
Re: saveAsTextFile() to save output of Spark program to HDFS
Another thing - could it be a permission problem ? It creates all the directory structure (in red)/tmp/wordcount/ _temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1 so I am guessing not. On Tue, May 5, 2015 at 7:27 PM, Sudarshan Murty njmu...@gmail.com wrote: You are most probably right. I assumed others may have run into this. When I try to put the files in there, it creates a directory structure with the part-0 and part1 files but these files are of size 0 - no content. The client error and the server logs have the error message shown - which seem to indicate that the system is aware that a datanode exists but is excluded from the operation. So, it looks like it is not partitioned and Ambari indicates that HDFS is in good health with one NN, one SN, one DN. I am unable to figure out what the issue is. thanks for your help. On Tue, May 5, 2015 at 6:39 PM, ayan guha guha.a...@gmail.com wrote: What happens when you try to put files to your hdfs from local filesystem? Looks like its a hdfs issue rather than spark thing. On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote: I have searched all replies to this question not found an answer. I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by side, on the same machine and trying to write output of wordcount program into HDFS (works fine writing to a local file, /tmp/wordcount). Only line I added to the wordcount program is: (where 'counts' is the JavaPairRDD) *counts.saveAsTextFile(hdfs://sandbox.hortonworks.com:8020/tmp/wordcount http://sandbox.hortonworks.com:8020/tmp/wordcount);* When I check in HDFS at that location (/tmp) here's what I find. /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0 and /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1 and *both part-000[01] are 0 size files*. The wordcount client output error is: [Stage 1: (0 + 2) / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1 *could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.* at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642) I tried this with Spark 1.2.1 same error. I have plenty of space on the DFS. The Name Node, Sec Name Node the one Data Node are all healthy. Any hint as to what may be the problem ? thanks in advance. Sudarshan -- View this message in context: saveAsTextFile() to save output of Spark program to HDFS http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-to-save-output-of-Spark-program-to-HDFS-tp22774.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
RE: Multilabel Classification in spark
If you are interested in multilabel (not multiclass), you might want to take a look at SPARK-7015 https://github.com/apache/spark/pull/5830/files. It is supposed to perform one-versus-all transformation on classes, which is usually how multilabel classifiers are built. Alexander From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Tuesday, May 05, 2015 3:44 PM To: DB Tsai Cc: peterg; user@spark.apache.org Subject: Re: Multilabel Classification in spark If you mean multilabel (predicting multiple label values), then MLlib does not yet support that. You would need to predict each label separately. If you mean multiclass (1 label taking 2 categorical values), then MLlib supports it via LogisticRegression (as DB said), as well as DecisionTree and RandomForest. Joseph On Tue, May 5, 2015 at 1:27 PM, DB Tsai dbt...@dbtsai.commailto:dbt...@dbtsai.com wrote: LogisticRegression in MLlib package supports multilable classification. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Tue, May 5, 2015 at 1:13 PM, peterg pe...@garbers.memailto:pe...@garbers.me wrote: Hi all, I'm looking to implement a Multilabel classification algorithm but I am surprised to find that there are not any in the spark-mllib core library. Am I missing something? Would someone point me in the right direction? Thanks! Peter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multilabel-Classification-in-spark-tp22775.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.orgmailto:user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.orgmailto:user-h...@spark.apache.org
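Until something like SPARK-7015 lands, a common workaround for multilabel data is "binary relevance": train one independent binary classifier per label. A rough Scala sketch against MLlib 1.3 follows; the examples RDD, its shape (a feature vector plus the set of labels each point carries), and numLabels are assumptions for illustration, not APIs from the thread.

import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// One binary logistic model per label: label l becomes 1.0 if the point carries l.
def trainBinaryRelevance(examples: RDD[(Vector, Set[Int])], numLabels: Int): Map[Int, LogisticRegressionModel] =
  (0 until numLabels).map { l =>
    val binary = examples.map { case (features, labels) =>
      LabeledPoint(if (labels.contains(l)) 1.0 else 0.0, features)
    }
    l -> new LogisticRegressionWithLBFGS().run(binary)
  }.toMap

// Prediction: every label whose binary model fires on the feature vector.
def predictLabels(models: Map[Int, LogisticRegressionModel], features: Vector): Set[Int] =
  models.collect { case (l, m) if m.predict(features) > 0.5 => l }.toSet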
Re: overloaded method constructor Strategy with alternatives
Can you give us a bit more information ? Such as release of Spark you're using, version of Scala, etc. Thanks On Tue, May 5, 2015 at 6:37 PM, xweb ashish8...@gmail.com wrote: I am getting on following code Error:(164, 25) *overloaded method constructor Strategy with alternatives:* (algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClasses: Int,maxBins: Int,categoricalFeaturesInfo: java.util.Map[Integer,Integer])org.apache.spark.mllib.tree.configuration.Strategy and (algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClasses: Int,maxBins: Int,quantileCalculationStrategy: org.apache.spark.mllib.tree.configuration.QuantileStrategy.QuantileStrategy,categoricalFeaturesInfo: scala.collection.immutable.Map[Int,Int],minInstancesPerNode: Int,minInfoGain: Double,maxMemoryInMB: Int,subsamplingRate: Double,useNodeIdCache: Boolean,checkpointInterval: Int)org.apache.spark.mllib.tree.configuration.Strategy cannot be applied to (org.apache.spark.mllib.tree.configuration.Algo.Value, org.apache.spark.mllib.tree.impurity.Gini.type, Int, Int, Int, scala.collection.immutable.Map[Int,Int]) val dTreeStrategy = new Strategy(algo, impurity, maxDepth, numClasses, maxBins, categoricalFeaturesInfo) ^ code val categoricalFeaturesInfo = Map[Int, Int]() val impurity = Gini val maxDepth = 4 val maxBins = 32 val algo = Algo.Classification val numClasses = 7 val dTreeStrategy = new Strategy(algo, impurity, maxDepth, numClasses, maxBins, categoricalFeaturesInfo) /code -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/overloaded-method-constructor-Strategy-with-alternatives-tp22777.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Using spark streaming to load data from Kafka to HDFS
Hi all, I am planning to load data from Kafka to HDFS. Is it normal to use Spark Streaming to load data from Kafka to HDFS? What are the concerns in doing this? There is no processing to be done by Spark, only storing data to HDFS from Kafka, for storage and for further Spark processing. Rendy
Re: saveAsTextFile() to save output of Spark program to HDFS
You are most probably right. I assumed others may have run into this. When I try to put the files in there, it creates a directory structure with the part-0 and part1 files but these files are of size 0 - no content. The client error and the server logs have the error message shown - which seem to indicate that the system is aware that a datanode exists but is excluded from the operation. So, it looks like it is not partitioned and Ambari indicates that HDFS is in good health with one NN, one SN, one DN. I am unable to figure out what the issue is. thanks for your help. On Tue, May 5, 2015 at 6:39 PM, ayan guha guha.a...@gmail.com wrote: What happens when you try to put files to your hdfs from local filesystem? Looks like its a hdfs issue rather than spark thing. On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote: I have searched all replies to this question not found an answer. I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by side, on the same machine and trying to write output of wordcount program into HDFS (works fine writing to a local file, /tmp/wordcount). Only line I added to the wordcount program is: (where 'counts' is the JavaPairRDD) *counts.saveAsTextFile(hdfs://sandbox.hortonworks.com:8020/tmp/wordcount http://sandbox.hortonworks.com:8020/tmp/wordcount);* When I check in HDFS at that location (/tmp) here's what I find. /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0 and /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1 and *both part-000[01] are 0 size files*. The wordcount client output error is: [Stage 1: (0 + 2) / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1 *could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.* at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642) I tried this with Spark 1.2.1 same error. I have plenty of space on the DFS. The Name Node, Sec Name Node the one Data Node are all healthy. Any hint as to what may be the problem ? thanks in advance. Sudarshan -- View this message in context: saveAsTextFile() to save output of Spark program to HDFS http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-to-save-output-of-Spark-program-to-HDFS-tp22774.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
Re: Join between Streaming data vs Historical Data in spark
Thanks. Since the join will be done on a regular basis over short periods of time (let's say 20s), do you have any suggestions on how to make it faster? I am thinking of partitioning the data set and caching it. Rendy On Apr 30, 2015 6:31 AM, Tathagata Das t...@databricks.com wrote: Have you taken a look at the join section in the streaming programming guide? http://spark.apache.org/docs/latest/streaming-programming-guide.html#stream-dataset-joins On Wed, Apr 29, 2015 at 7:11 AM, Rendy Bambang Junior rendy.b.jun...@gmail.com wrote: Let's say I have transaction data and visit data.

visit
| userId | Visit source | Timestamp |
| A | google ads | 1 |
| A | facebook ads | 2 |

transaction
| userId | total price | timestamp |
| A | 100 | 248384 |
| B | 200 | 43298739 |

I want to join transaction data and visit data to do sales attribution. I want to do it in realtime whenever a transaction occurs (streaming). Is it scalable to do a join between one batch of data and very big historical data using the join function in Spark? If it is not, how is it usually done? Visits need to be historical, since a visit can be anytime before the transaction (e.g. the visit is one year before the transaction occurs). Rendy
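A rough Scala sketch of that idea (the RDD/DStream names, key type and partition count are placeholders, not code from the thread): pre-partition and cache the historical visit data once, so each 20s batch joins against it without re-shuffling the large side every time.

import org.apache.spark.HashPartitioner

// Historical visits keyed by userId, partitioned once and kept in memory.
val partitioner = new HashPartitioner(100)
val historicalVisits = visitRdd            // placeholder: RDD[(String, VisitInfo)]
  .partitionBy(partitioner)
  .persist()

// Stream of transactions keyed by userId, joined against the cached history per batch.
val attributed = transactionStream.transform { batch =>   // placeholder: DStream[(String, Transaction)]
  batch.partitionBy(partitioner).join(historicalVisits)
}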
MLlib libsvm issues with data
hi all: I've met an issue with MLlib. I posted it to the community before but it seems I put it in the wrong place :( Then I put it on Stack Overflow; for well-formatted details please see http://stackoverflow.com/questions/30048344/spark-mllib-libsvm-isssues-with-data. Hope someone could help. I guess it's due to my data, but I've tested it with the libsvm tools and it worked well, and I've used the libsvm Python data-format checking tool and it's OK. I just don't know why it errors with java.lang.ArrayIndexOutOfBoundsException: -1 :( This is my first time using the mailing list to ask for help. If I did something wrong or described it unclearly, please tell me. doye
what does Container exited with a non-zero exit code 10 means?
Hi all, We're trying to implement SparkSQL on CDH5.3.0 with cluster mode, and we get this error either using java or python; Application application_1430482716098_0607 failed 2 times due to AM Container for appattempt_1430482716098_0607_02 exited with exitCode: 10 due to: Exception from container-launch. Container id: container_e10_1430482716098_0607_02_01 Exit code: 10 Stack trace: ExitCodeException exitCode=10: at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Container exited with a non-zero exit code 10 .Failing this attempt.. Failing the application. where the detail log in nodes are: ERROR yarn.ApplicationMaster: Uncaught exception: java.lang.IllegalArgumentException: Invalid ContainerId: container_e10_1430482716098_0607_02_01 at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182) at org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:83) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:576) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:574) at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:597) at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) Caused by: java.lang.NumberFormatException: For input string: e10 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137) at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177) ... 12 more we’ve already tried the solution described as following but it doesn’t seems to work. Please advise if there’s any environmental settings we should exploit for clarifying our question, thanks! https://github.com/abhibasu/sparksql/wiki/SparkSQL-Configuration-in-CDH-5.3 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/what-does-Container-exited-with-a-non-zero-exit-code-10-means-tp22778.html Sent from the Apache Spark User List mailing list archive at Nabble.com. 
Re: Two DataFrames with different schema, unionAll issue.
You need to add a select clause to at least one dataframe to give them the same schema before you can union them (much like in SQL). On Tue, May 5, 2015 at 3:24 AM, Wilhelm niznik.pa...@gmail.com wrote: Hey there, 1.) I'm loading 2 avro files with that have slightly different schema df1 = sqlc.load(file1, com.databricks.spark.avro) df2 = sqlc.load(file2, com.databricks.spark.avro) 2.) I want to unionAll them nfd = dfs1.unionAll(dfs2) 3.) Getting the following error --- Py4JJavaError Traceback (most recent call last) ipython-input-190-a86d9adbea83 in module() 17 18 --- 19 nfd = dfs1.unionAll(dfs2) 20 21 /home/hadoop/spark/python/pyspark/sql/dataframe.pyc in unionAll(self, other) 669 This is equivalent to `UNION ALL` in SQL. 670 -- 671 return DataFrame(self._jdf.unionAll(other._jdf), self.sql_ctx) 672 673 def intersect(self, other): /home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, -- 538 self.target_id, self.name) 539 540 for temp_arg in temp_args: /home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. -- 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o76196.unionAll. : org.apache.spark.sql.AnalysisException: unresolved operator 'Union ; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:97) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:43) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:88) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.apply(CheckAnalysis.scala:43) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1069) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:133) at org.apache.spark.sql.DataFrame.logicalPlanToDataFrame(DataFrame.scala:157) at org.apache.spark.sql.DataFrame.unionAll(DataFrame.scala:641) at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) --- 4.) Is it possible to automatically merge 2 DFs with different schemas like that? Am I doing sth. wrong? Much appreciated! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Two-DataFrames-with-different-schema-unionAll-issue-tp22765.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
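A sketch of that suggestion (written in Scala; the PySpark select/unionAll calls are analogous, and the column names are placeholders since the actual avro schemas are not shown): project both frames onto the same column list, adding a typed null for anything missing on one side, then unionAll.

import org.apache.spark.sql.functions.lit

// Columns shared by both frames, in the order we want them.
val common = Seq("id", "name")
val aligned1 = df1.select(common.map(df1(_)): _*)

// df2 is missing "name" in this example, so add it as a typed null placeholder.
val cols2 = Seq(df2("id"), lit(null).cast("string").as("name"))
val aligned2 = df2.select(cols2: _*)

val unioned = aligned1.unionAll(aligned2)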
Re: overloaded method constructor Strategy with alternatives
I am using Spark 1.3.0 and Scala 2.10. Thanks On Tue, May 5, 2015 at 6:48 PM, Ted Yu yuzhih...@gmail.com wrote: Can you give us a bit more information ? Such as release of Spark you're using, version of Scala, etc. Thanks On Tue, May 5, 2015 at 6:37 PM, xweb ashish8...@gmail.com wrote: I am getting on following code Error:(164, 25) *overloaded method constructor Strategy with alternatives:* (algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClasses: Int,maxBins: Int,categoricalFeaturesInfo: java.util.Map[Integer,Integer])org.apache.spark.mllib.tree.configuration.Strategy and (algo: org.apache.spark.mllib.tree.configuration.Algo.Algo,impurity: org.apache.spark.mllib.tree.impurity.Impurity,maxDepth: Int,numClasses: Int,maxBins: Int,quantileCalculationStrategy: org.apache.spark.mllib.tree.configuration.QuantileStrategy.QuantileStrategy,categoricalFeaturesInfo: scala.collection.immutable.Map[Int,Int],minInstancesPerNode: Int,minInfoGain: Double,maxMemoryInMB: Int,subsamplingRate: Double,useNodeIdCache: Boolean,checkpointInterval: Int)org.apache.spark.mllib.tree.configuration.Strategy cannot be applied to (org.apache.spark.mllib.tree.configuration.Algo.Value, org.apache.spark.mllib.tree.impurity.Gini.type, Int, Int, Int, scala.collection.immutable.Map[Int,Int]) val dTreeStrategy = new Strategy(algo, impurity, maxDepth, numClasses, maxBins, categoricalFeaturesInfo) ^ code val categoricalFeaturesInfo = Map[Int, Int]() val impurity = Gini val maxDepth = 4 val maxBins = 32 val algo = Algo.Classification val numClasses = 7 val dTreeStrategy = new Strategy(algo, impurity, maxDepth, numClasses, maxBins, categoricalFeaturesInfo) /code -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/overloaded-method-constructor-Strategy-with-alternatives-tp22777.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
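One likely fix (a sketch, assuming the Scala primary constructor of Strategy in 1.3, which gives quantileCalculationStrategy and the later parameters default values): pass categoricalFeaturesInfo by name, so the call resolves to the constructor that takes a scala.collection.immutable.Map[Int, Int] rather than the Java-friendly overload expecting java.util.Map[Integer, Integer].

import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Gini

// The named argument skips the positional slot for quantileCalculationStrategy,
// letting its default value apply.
val dTreeStrategy = new Strategy(
  algo = Algo.Classification,
  impurity = Gini,
  maxDepth = 4,
  numClasses = 7,
  maxBins = 32,
  categoricalFeaturesInfo = Map[Int, Int]())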
Possible to use hive-config.xml instead of hive-site.xml for HiveContext?
I am running hive queries from HiveContext, for which we need a hive-site.xml. Is it possible to replace it with hive-config.xml? I tried but it does not work. Just want a confirmation.
Re: AvroFiles
Are you using Kryo or Java serialization? I found this post useful: http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist If using Kryo, you need to register the classes with Kryo, something like this:

sparkConf.registerKryoClasses(Array(classOf[ConfigurationProperty], classOf[Event]))

Or create a registrator, something like this:

class ODSKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[ConfigurationProperty], new AvroSerializer[ConfigurationProperty]())
    kryo.register(classOf[Event], new AvroSerializer[Event]())
  }
}

I encountered a similar error since several of the Avro core classes are not marked Serializable. HTH. Todd On Tue, May 5, 2015 at 7:09 PM, Pankaj Deshpande ppa...@gmail.com wrote: Hi, I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro file was created using Avro 1.7.7. Similar to the example mentioned in http://www.infoobjects.com/spark-with-avro/, I am getting a NullPointerException on schema read. It could be an Avro version mismatch. Has anybody had a similar issue with avro? Thanks
Re: Using spark streaming to load data from Kafka to HDFS
Why not try https://github.com/linkedin/camus - Camus is a Kafka-to-HDFS pipeline. On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior rendy.b.jun...@gmail.com wrote: Hi all, I am planning to load data from Kafka to HDFS. Is it normal to use Spark Streaming to load data from Kafka to HDFS? What are the concerns in doing this? There is no processing to be done by Spark, only storing data to HDFS from Kafka, for storage and for further Spark processing. Rendy
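If you do stay with Spark Streaming, a minimal sketch looks like the following (broker address, topic name, batch interval and output path are placeholders): each batch interval produces one output directory under the given prefix.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("KafkaToHdfs"), Seconds(60))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

// One directory per batch: hdfs:///data/events-<batch time>/part-*
stream.map(_._2).saveAsTextFiles("hdfs:///data/events")

ssc.start()
ssc.awaitTermination()

One practical concern with this approach is that short batch intervals produce many small HDFS files, which is part of why dedicated ingestion tools such as Camus are often preferred when no Spark processing is needed.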
Re: How to skip corrupted avro files
You might be interested in https://issues.apache.org/jira/browse/SPARK-6593 and the discussion around the PRs. This is probably more complicated than what you are looking for, but you could copy the code for HadoopReliableRDD in the PR into your own code and use it, without having to wait for the issue to get resolved. On Sun, May 3, 2015 at 12:57 PM, Shing Hing Man mat...@yahoo.com.invalid wrote: Hi, I am using Spark 1.3.1 to read a directory of about 2000 avro files. The avro files are from a third party and a few of them are corrupted. val path = {myDirecotry of avro files} val sparkConf = new SparkConf().setAppName(avroDemo).setMaster(local) val sc = new SparkContext(sparkConf) val sqlContext = new SQLContext(sc) val data = sqlContext.avroFile(path); data.select(.) When I run the above code, I get the following exception. org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync! at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:222) ~[classes/:1.7.7] at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64) ~[avro-mapred-1.7.7-hadoop2.jar:1.7.7] at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32) ~[avro-mapred-1.7.7-hadoop2.jar:1.7.7] at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:245) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:212) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) ~[spark-core_2.10-1.3.1.jar:1.3.1] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library.jar:na] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library.jar:na] at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129) ~[spark-sql_2.10-1.3.1.jar:1.3.1] at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126) ~[spark-sql_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.scheduler.Task.run(Task.scala:64) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) ~[spark-core_2.10-1.3.1.jar:1.3.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_71] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71] at 
java.lang.Thread.run(Thread.java:745) [na:1.7.0_71] Caused by: java.io.IOException: Invalid sync! at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:314) ~[classes/:1.7.7] at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:209) ~[classes/:1.7.7] ... 25 common frames omitted Is there an easy way to skip a corrupted avro file without reading the files one by one using sqlContext.avroFile(file) ? At present, my solution (hack) is to have my own version of org.apache.avro.file.DataFileStream with method hasNext returns false ( to signal the end file), when java.io.IOException: Invalid sync! is thrown. Please see line 210 in https://github.com/apache/avro/blob/branch-1.7/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java Thanks in advance for any assistance ! Shing
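As a cruder interim workaround (a sketch under assumptions: path points at the directory of avro files, the spark-avro implicits used in the original snippet are imported, and a full count() per file is an acceptable probing cost), you can test each file individually and only union the ones that read cleanly, instead of patching DataFileStream.

import scala.util.Try
import com.databricks.spark.avro._
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val avroPaths = fs.listStatus(new Path(path))
  .map(_.getPath.toString)
  .filter(_.endsWith(".avro"))

// Keep only files that can be read end to end; corrupted ones throw "Invalid sync!".
val readable = avroPaths.filter(p => Try(sqlContext.avroFile(p).count()).isSuccess)

val data = readable.map(p => sqlContext.avroFile(p)).reduce((a, b) => a.unionAll(b))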
Re: Parquet number of partitions
Hi Eric. Q1: When I read parquet files, I've observed that Spark generates as many partitions as there are parquet files in the path. Q2: To reduce the number of partitions you can use rdd.repartition(x), x = number of partitions. Depending on your case, repartition could be a heavy task. Regards. Miguel. On Tue, May 5, 2015 at 3:56 PM, Eric Eijkelenboom eric.eijkelenb...@gmail.com wrote: Hello guys Q1: How does Spark determine the number of partitions when reading a Parquet file? val df = sqlContext.parquetFile(path) Is it in some way related to the number of Parquet row groups in my input? Q2: How can I reduce this number of partitions? Doing this: df.rdd.coalesce(200).count from the spark-shell causes job execution to hang… Any ideas? Thank you in advance. Eric -- Saludos. Miguel Ángel
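A short sketch of that advice (using the same parquetFile call as above): repartition(n) always shuffles, which is usually the safer way to change the partition count, while coalesce(n) without a shuffle only narrows partitions and can funnel too much work into too few tasks.

val df = sqlContext.parquetFile(path)
println(df.rdd.partitions.length)          // how many partitions the read produced

val smaller = df.rdd.repartition(200)      // shuffle-based resize
// or, equivalently: df.rdd.coalesce(200, shuffle = true)
println(smaller.count())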
Re: JAVA for SPARK certification
Very interested @Kartik/Zoltan. Please let me know how to connect on LI On Tue, May 5, 2015 at 11:47 PM, Zoltán Zvara zoltan.zv...@gmail.com wrote: I might join in to this conversation with an ask. Would someone point me to a decent exercise that would approximate the level of this exam (from above)? Thanks! On Tue, May 5, 2015 at 3:37 PM Kartik Mehta kartik.meht...@gmail.com wrote: Production - not whole lot of companies have implemented Spark in production and so though it is good to have, not must. If you are on LinkedIn, a group of folks including myself are preparing for Spark certification, learning in group makes learning easy and fun. Kartik On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote: And how important is to have production environment? On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote: There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have similar question. My understanding is since Spark written in scala, having done in Scala will be ok for certification. If someone who has done certification can confirm. Thanks, Kartik On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav -- Best Regards, Ayan Guha
Re: How to deal with code that runs before foreach block in Apache Spark?
Gerard is totally correct -- to expand a little more, I think what you want to do is a solrInputDocumentJavaRDD.foreachPartition, instead of solrInputDocumentJavaRDD.foreach: solrInputDocumentJavaRDD.foreachPartition( new VoidFunctionIteratorSolrInputDocument() { @Override public void call(IteratorSolrInputDocument docItr) { ListSolrInputDocument docs = new ArrayListSolrInputDocument(); for(SolrInputDocument solrInputDocument: docItr) { // Add the solrInputDocument to the list of SolrInputDocuments docs.add(solrInputDocument); } // push things to solr **from the executor, for this partition** // so for this make sense, you need to be sure solr can handle a bunch // of executors pushing into it simultaneously. addThingsToSolr(docs); } }); On Mon, May 4, 2015 at 8:44 AM, Gerard Maas gerard.m...@gmail.com wrote: I'm not familiar with the Solr API but provided that ' SolrIndexerDriver' is a singleton, I guess that what's going on when running on a cluster is that the call to: SolrIndexerDriver.solrInputDocumentList.add(elem) is happening on different singleton instances of the SolrIndexerDriver on different JVMs while SolrIndexerDriver.solrServer.commit is happening on the driver. In practical terms, the lists on the executors are being filled-in but they are never committed and on the driver the opposite is happening. -kr, Gerard On Mon, May 4, 2015 at 3:34 PM, Emre Sevinc emre.sev...@gmail.com wrote: I'm trying to deal with some code that runs differently on Spark stand-alone mode and Spark running on a cluster. Basically, for each item in an RDD, I'm trying to add it to a list, and once this is done, I want to send this list to Solr. This works perfectly fine when I run the following code in stand-alone mode of Spark, but does not work when the same code is run on a cluster. When I run the same code on a cluster, it is like send to Solr part of the code is executed before the list to be sent to Solr is filled with items. I try to force the execution by solrInputDocumentJavaRDD.collect(); after foreach, but it seems like it does not have any effect. // For each RDD solrInputDocumentJavaDStream.foreachRDD( new FunctionJavaRDDSolrInputDocument, Void() { @Override public Void call(JavaRDDSolrInputDocument solrInputDocumentJavaRDD) throws Exception { // For each item in a single RDD solrInputDocumentJavaRDD.foreach( new VoidFunctionSolrInputDocument() { @Override public void call(SolrInputDocument solrInputDocument) { // Add the solrInputDocument to the list of SolrInputDocuments SolrIndexerDriver.solrInputDocumentList.add(solrInputDocument); } }); // Try to force execution solrInputDocumentJavaRDD.collect(); // After having finished adding every SolrInputDocument to the list // add it to the solrServer, and commit, waiting for the commit to be flushed try { // Seems like when run in cluster mode, the list size is zero, // therefore the following part is never executed if (SolrIndexerDriver.solrInputDocumentList != null SolrIndexerDriver.solrInputDocumentList.size() 0) { SolrIndexerDriver.solrServer.add(SolrIndexerDriver.solrInputDocumentList); SolrIndexerDriver.solrServer.commit(true, true); SolrIndexerDriver.solrInputDocumentList.clear(); } } catch (SolrServerException | IOException e) { e.printStackTrace(); } return null; } } ); What should I do, so that sending-to-Solr part executes after the list of SolrDocuments are added to solrInputDocumentList (and works also in cluster mode)? -- Emre Sevinç
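The same pattern reads a little shorter in Scala; a rough sketch, where solrInputDocumentStream and addThingsToSolr stand in for the poster's DStream and Solr helper:

  solrInputDocumentStream.foreachRDD { rdd =>
    rdd.foreachPartition { docItr =>
      val docs = docItr.toList        // buffer one partition only, on the executor
      if (docs.nonEmpty) {
        addThingsToSolr(docs)         // one Solr request per partition, from the executor
      }
    }
  }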
Escaping user input for Hive queries
Hi folks, we have been using the a JDBC connection to Spark's Thrift Server so far and using JDBC prepared statements to escape potentially malicious user input. I am trying to port our code directly to HiveContext now (i.e. eliminate the use of Thrift Server) and I am not quite sure how to generate a properly escaped sql statement... Wondering if someone has ideas on proper way to do this? To be concrete, I would love to issue this statement val df = myHiveCtxt.(sqlText) but I would like to defend against potential SQL injection.
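HiveContext.sql has no bind parameters, so the usual options are either to keep user input out of the SQL string entirely by building the predicate with the DataFrame/Column API, or to escape the value before interpolating. A hedged sketch (the table and column names are made up):

  import org.apache.spark.sql.functions.lit
  import myHiveCtxt.implicits._

  // a) the value never enters the SQL text
  val safe = myHiveCtxt.table("events").where($"user_name" === lit(userInput))

  // b) if raw SQL is unavoidable, at least double the single quotes (crude, last resort)
  val escaped = userInput.replace("'", "''")
  val df = myHiveCtxt.sql(s"SELECT * FROM events WHERE user_name = '$escaped'")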
Possible to disable Spark HTTP server ?
Hi, When we start spark job it start new HTTP server for each new job. Is it possible to disable HTTP server for each job ? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-disable-Spark-HTTP-server-tp22772.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark job concurrency problem
can you give your entire spark submit command? Are you missing --executor-cores num_cpu? Also, if you intend to use all 6 nodes, you also need --num-executors 6 On Mon, May 4, 2015 at 2:07 AM, Xi Shen davidshe...@gmail.com wrote: Hi, I have two small RDD, each has about 600 records. In my code, I did val rdd1 = sc...cache() val rdd2 = sc...cache() val result = rdd1.cartesian(rdd2).*repartition*(num_cpu).map {case (a,b) = some_expensive_job(a,b) } I ran my job in YARN cluster with --master yarn-cluster, I have 6 executor, and each has a large memory volume. However, I noticed my job is very slow. I went to the RM page, and found there are two containers, one is the driver, one is the worker. I guess this is correct? I went to the worker's log, and monitor the log detail. My app print some information, so I can use them to estimate the progress of the map operation. Looking at the log, it feels like the jobs are done one by one sequentially, rather than #cpu batch at a time. I checked the worker node, and their CPU are all busy. [image: --] Xi Shen [image: http://]about.me/davidshen http://about.me/davidshen?promo=email_sig http://about.me/davidshen
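For reference, a submit line along these lines (values are purely illustrative) lets all six nodes contribute: bin/spark-submit --class your.Main --master yarn-cluster --num-executors 6 --executor-cores 4 --executor-memory 8g your-app.jar. Even then the map stage can only run as many concurrent tasks as the repartition(num_cpu) above leaves partitions, so num_cpu needs to be at least the total core count you want busy.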
Where does Spark persist RDDs on disk?
Hi, I'm using persist on different storage levels, but I found no difference on performance when I was using MEMORY_ONLY and DISK_ONLY. I think there might be something wrong with my code... So where can I find the persisted RDDs on disk so that I can make sure they were persisted indeed? Thank you.
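A quick way to check whether the persist actually took effect (a sketch; with DISK_ONLY the blocks end up under spark.local.dir, /tmp by default, in a spark-local-* directory):

  import org.apache.spark.storage.StorageLevel

  val rdd = sc.textFile(path).persist(StorageLevel.DISK_ONLY)
  rdd.count()                         // materialise it first; persist alone is lazy

  // Same information as the Storage tab of the web UI on port 4040.
  sc.getRDDStorageInfo.foreach(println)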
Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:
Are you setting a really large max buffer size for kryo? Was this fixed by https://issues.apache.org/jira/browse/SPARK-6405 ? If not, we should open up another issue to get a better warning in these cases. On Tue, May 5, 2015 at 2:47 AM, shahab shahab.mok...@gmail.com wrote: Thanks Tristan for sharing this. Actually this happens when I am reading a csv file of 3.5 GB. best, /Shahab On Tue, May 5, 2015 at 9:15 AM, Tristan Blakers tris...@blackfrog.org wrote: Hi Shahab, I’ve seen exceptions very similar to this (it also manifests as negative array size exception), and I believe it’s a really bug in Kryo. See this thread: http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccag02ijuw3oqbi2t8acb5nlrvxso2xmas1qrqd_4fq1tgvvj...@mail.gmail.com%3E Manifests in all of the following situations when working with an object graph in excess of a few GB: Joins, Broadcasts, and when using the hadoop save APIs. Tristan On 3 May 2015 at 07:26, Olivier Girardot ssab...@gmail.com wrote: Can you post your code, otherwise there's not much we can do. Regards, Olivier. Le sam. 2 mai 2015 à 21:15, shahab shahab.mok...@gmail.com a écrit : Hi, I am using sprak-1.2.0 and I used Kryo serialization but I get the following excepton. java.io.IOException: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1 I do apprecciate if anyone could tell me how I can resolve this? best, /Shahab
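If it is the buffer, the relevant settings look roughly like this (the property name changed in 1.4; on 1.3.x and earlier it takes a size in MB):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryoserializer.buffer.max.mb", "512")   // Spark <= 1.3.x
  // .set("spark.kryoserializer.buffer.max", "512m")    // Spark >= 1.4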
Re: How to separate messages of different topics.
Make sure to read https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md The directStream / KafkaRDD has a 1 : 1 relationship between kafka topic/partition and spark partition. So a given spark partition only has messages from 1 kafka topic. You can tell what topic that is using HasOffsetRanges, as discussed in the post. This 1 : 1 relationship only holds until the first transformation that incurs a shuffle. On Tue, May 5, 2015 at 8:29 AM, Guillermo Ortiz konstt2...@gmail.com wrote: I want to read from many topics in Kafka and know from where each message is coming (topic1, topic2 and so on). val kafkaParams = Map[String, String](metadata.broker.list - myKafka:9092) val topics = Set(EntryLog, presOpManager) val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) Is there some way to separate the messages for topics with just one directStream? or should I create different streamings for each topic?
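Roughly the pattern from that post for tagging each record with its topic; it has to run before any shuffle, while the 1:1 Kafka-partition-to-Spark-partition mapping still holds:

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

  val tagged = stream.transform { rdd =>
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.mapPartitionsWithIndex { (i, iter) =>
      val topic = ranges(i).topic          // the spark partition index matches the offset range index
      iter.map { case (_, value) => (topic, value) }
    }
  }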
Parquet number of partitions
Hello guys Q1: How does Spark determine the number of partitions when reading a Parquet file? val df = sqlContext.parquetFile(path) Is it some way related to the number of Parquet row groups in my input? Q2: How can I reduce this number of partitions? Doing this: df.rdd.coalesce(200).count from the spark-shell causes job execution to hang… Any ideas? Thank you in advance. Eric - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: JAVA for SPARK certification
Hi, I think all the required materials for reference are mentioned here: http://www.oreilly.com/data/sparkcert.html?cmp=ex-strata-na-lp-na_apache_spark_certification My question was regarding the proficiency level required for Java. There are detailed examples and code mentioned for JAVA, Python and Scala in most of the SCALA tutorials mentioned in the above link for reference. Regards, Gourav On Tue, May 5, 2015 at 3:03 PM, ayan guha guha.a...@gmail.com wrote: Very interested @Kartik/Zoltan. Please let me know how to connect on LI On Tue, May 5, 2015 at 11:47 PM, Zoltán Zvara zoltan.zv...@gmail.com wrote: I might join in to this conversation with an ask. Would someone point me to a decent exercise that would approximate the level of this exam (from above)? Thanks! On Tue, May 5, 2015 at 3:37 PM Kartik Mehta kartik.meht...@gmail.com wrote: Production - not whole lot of companies have implemented Spark in production and so though it is good to have, not must. If you are on LinkedIn, a group of folks including myself are preparing for Spark certification, learning in group makes learning easy and fun. Kartik On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote: And how important is to have production environment? On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote: There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have similar question. My understanding is since Spark written in scala, having done in Scala will be ok for certification. If someone who has done certification can confirm. Thanks, Kartik On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav -- Best Regards, Ayan Guha
[ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)
Hi all, We will drop support for Java 6 starting Spark 1.5, tentative scheduled to be released in Sep 2015. Spark 1.4, scheduled to be released in June 2015, will be the last minor release that supports Java 6. That is to say: Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8. Spark 1.5+ (~ Sep 2015): will NOT work with Java 6, but work with Java 7, 8. PS: Oracle ended Java 6 updates in Feb 2013.
Re: saveAsTextFile() to save output of Spark program to HDFS
What happens when you try to put files to your hdfs from local filesystem? Looks like its a hdfs issue rather than spark thing. On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote: I have searched all replies to this question not found an answer. I am running standalone Spark 1.3.1 and Hortonwork's HDP 2.2 VM, side by side, on the same machine and trying to write output of wordcount program into HDFS (works fine writing to a local file, /tmp/wordcount). Only line I added to the wordcount program is: (where 'counts' is the JavaPairRDD) *counts.saveAsTextFile(hdfs://sandbox.hortonworks.com:8020/tmp/wordcount http://sandbox.hortonworks.com:8020/tmp/wordcount);* When I check in HDFS at that location (/tmp) here's what I find. /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0 and /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1 and *both part-000[01] are 0 size files*. The wordcount client output error is: [Stage 1: (0 + 2) / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1 *could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.* at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642) I tried this with Spark 1.2.1 same error. I have plenty of space on the DFS. The Name Node, Sec Name Node the one Data Node are all healthy. Any hint as to what may be the problem ? thanks in advance. Sudarshan -- View this message in context: saveAsTextFile() to save output of Spark program to HDFS http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-to-save-output-of-Spark-program-to-HDFS-tp22774.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
Re: Multilabel Classification in spark
If you mean multilabel (predicting multiple label values), then MLlib does not yet support that. You would need to predict each label separately. If you mean multiclass (1 label taking 2 categorical values), then MLlib supports it via LogisticRegression (as DB said), as well as DecisionTree and RandomForest. Joseph On Tue, May 5, 2015 at 1:27 PM, DB Tsai dbt...@dbtsai.com wrote: LogisticRegression in MLlib package supports multilable classification. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Tue, May 5, 2015 at 1:13 PM, peterg pe...@garbers.me wrote: Hi all, I'm looking to implement a Multilabel classification algorithm but I am surprised to find that there are not any in the spark-mllib core library. Am I missing something? Would someone point me in the right direction? Thanks! Peter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multilabel-Classification-in-spark-tp22775.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
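A small MLlib sketch of the multiclass case (the data path and the number of classes are illustrative); true multilabel would mean training one such binary model per label and combining the predictions yourself:

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.util.MLUtils

  val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/multiclass.libsvm")
  val model = new LogisticRegressionWithLBFGS().setNumClasses(5).run(data)
  val prediction = model.predict(data.first().features)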
Spark SQL Standalone mode missing parquet?
Hi All, When I try and run Spark SQL in standalone mode it appears to be missing the parquet jar, I have to pass it as -jars and that works.. sbin/start-thriftserver.sh --jars lib/parquet-hive-bundle-1.6.0.jar --driver-memory 28g --master local[10] Any ideas on why? I downloaded the one pre built for Hadoop 2.6 and later thanks, Manu
Re: Possible to disable Spark HTTP server ?
SPARK-3490 introduced spark.ui.enabled FYI On Tue, May 5, 2015 at 8:41 AM, roy rp...@njit.edu wrote: Hi, When we start spark job it start new HTTP server for each new job. Is it possible to disable HTTP server for each job ? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-disable-Spark-HTTP-server-tp22772.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
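That is, something along these lines (it disables the per-application UI only; the standalone master and worker UIs are separate):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("no-ui-job").set("spark.ui.enabled", "false")
  val sc = new SparkContext(conf)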
Re: How to skip corrupted avro files
Thanks for the info ! Shing On Tuesday, 5 May 2015, 15:11, Imran Rashid iras...@cloudera.com wrote: You might be interested in https://issues.apache.org/jira/browse/SPARK-6593 and the discussion around the PRs. This is probably more complicated than what you are looking for, but you could copy the code for HadoopReliableRDD in the PR into your own code and use it, without having to wait for the issue to get resolved. On Sun, May 3, 2015 at 12:57 PM, Shing Hing Man mat...@yahoo.com.invalid wrote: Hi, I am using Spark 1.3.1 to read a directory of about 2000 avro files. The avro files are from a third party and a few of them are corrupted. val path = {myDirecotry of avro files} val sparkConf = new SparkConf().setAppName(avroDemo).setMaster(local) val sc = new SparkContext(sparkConf) val sqlContext = new SQLContext(sc) val data = sqlContext.avroFile(path); data.select(.) When I run the above code, I get the following exception. org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync! at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:222) ~[classes/:1.7.7] at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64) ~[avro-mapred-1.7.7-hadoop2.jar:1.7.7] at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32) ~[avro-mapred-1.7.7-hadoop2.jar:1.7.7] at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:245) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:212) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) ~[spark-core_2.10-1.3.1.jar:1.3.1] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library.jar:na] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library.jar:na] at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129) ~[spark-sql_2.10-1.3.1.jar:1.3.1] at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126) ~[spark-sql_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.scheduler.Task.run(Task.scala:64) ~[spark-core_2.10-1.3.1.jar:1.3.1] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) ~[spark-core_2.10-1.3.1.jar:1.3.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_71] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]Caused by: java.io.IOException: Invalid sync! at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:314) ~[classes/:1.7.7] at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:209) ~[classes/:1.7.7] ... 25 common frames omitted Is there an easy way to skip a corrupted avro file without reading the files one by one using sqlContext.avroFile(file) ? At present, my solution (hack) is to have my own version of org.apache.avro.file.DataFileStream with method hasNext returns false (to signal the end file), when java.io.IOException: Invalid sync! is thrown. Please see line 210 in https://github.com/apache/avro/blob/branch-1.7/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java Thanks in advance for any assistance ! Shing
Inserting Nulls
Hi. I have a spark application where I store the results into table (with HiveContext). Some of these columns allow nulls. In Scala, this columns are represented through Option[Int] or Option[Double].. Depend on the data type. For example: *val hc = new HiveContext(sc)* *var col1: Option[Ingeger] = None* *...* *val myRow = org.apache.spark.sql.Row(col1, ...)* *val mySchema = StructType(Array(StructField(Column1, IntegerType, true)))* *val TableOutputSchemaRDD = hc.applySchema(myRow, mySchema)* *hc.registerRDDAsTable(TableOutputSchemaRDD, result_intermediate)* *hc.sql(CREATE TABLE table_output STORED AS PARQUET AS SELECT * FROM result_intermediate)* Produce: java.lang.ClassCastException: scala.Some cannot be cast to java.lang.Integer at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaIntObjectInspector.get(JavaIntObjectInspector.java:40) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createPrimitive(ParquetHiveSerDe.java:247) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createObject(ParquetHiveSerDe.java:301) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createStruct(ParquetHiveSerDe.java:178) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.serialize(ParquetHiveSerDe.java:164) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:123) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:114) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org $apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:114) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Thanks! -- Regards. Miguel Ángel
Spark applications Web UI at 4040 doesn't exist
Hi all, I'm not able to access to the Spark Streaming running applications that I'm submitting to the EC2 standalone cluster (spark 1.3.1) via port 4040. The problem is that I don't even see running applications in the master's web UI (I do see running drivers). This is the command I use to submit the app: bin/spark-submit --class my.domain.mainClass --deploy-mode cluster --conf spark.executor.logs.rolling.size.maxBytes=40 spark.executor.logs.rolling.strategy=size spark.executor.logs.rolling.maxRetainedFiles=4 /home/ec2-user/application.jar The url I'm trying to access to see that web UI is the driver EC2 public address with port 4040. I also tried the solution provided in this post http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Web-UI-is-not-showing-Running-Completed-Active-Applications-td18475.html with no luck. Thanks in advance mates!! Marco -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-applications-Web-UI-at-4040-doesn-t-exist-tp22773.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Number of files to load
Let say I am storing my data in HDFS with folder structure and file partitioning as per below: /analytics/2015/05/02/partition-2015-05-02-13-50- Note that new file is created every 5 minutes. As per my understanding, storing 5minutes file means we could not create RDD more granular than 5minutes. In the other hand, when we want to aggregate monthly data, number of file will be enormous (around 84000 files). My question is, what are the consideration to say that the number of file to be loaded to RDD is just 'too many'? Is 84000 'too many' files? One thing that comes to my mind is overhead when spark try to open file, however Im not sure whether it is a valid concern. Rendy
Re: Number of files to load
As per my understanding, storing 5minutes file means we could not create RDD more granular than 5minutes. This depends on the file format. Many file formats are splittable (like parquet), meaning that you can seek into various points of the file. 2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior rendy.b.jun...@gmail.com: Let say I am storing my data in HDFS with folder structure and file partitioning as per below: /analytics/2015/05/02/partition-2015-05-02-13-50- Note that new file is created every 5 minutes. As per my understanding, storing 5minutes file means we could not create RDD more granular than 5minutes. In the other hand, when we want to aggregate monthly data, number of file will be enormous (around 84000 files). My question is, what are the consideration to say that the number of file to be loaded to RDD is just 'too many'? Is 84000 'too many' files? One thing that comes to my mind is overhead when spark try to open file, however Im not sure whether it is a valid concern. Rendy
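A rough sketch of loading a month of 5-minute files in one go and then collapsing the partition count, so the job is not dominated by per-file open overhead (the glob follows the layout in the question and is illustrative):

  val month = sc.textFile("/analytics/2015/05/*/partition-*")
  println(month.partitions.length)      // typically at least one partition per file

  val compacted = month.coalesce(200)   // narrow dependency, no shuffle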
saveAsTextFile() to save output of Spark program to HDFS
I have searched all replies to this question and not found an answer. I am running standalone Spark 1.3.1 and Hortonworks HDP 2.2 VM, side by side, on the same machine and trying to write the output of the wordcount program into HDFS (it works fine writing to a local file, /tmp/wordcount). The only line I added to the wordcount program is (where 'counts' is the JavaPairRDD): counts.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/tmp/wordcount"); When I check in HDFS at that location (/tmp) here's what I find: /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_00_2/part-0 and /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1, and both part-000[01] are 0 size files. The wordcount client output error is: [Stage 1: (0 + 2) / 2]15/05/05 14:40:45 WARN DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3447) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:642) I tried this with Spark 1.2.1 and got the same error. I have plenty of space on the DFS. The Name Node, Secondary Name Node and the one Data Node are all healthy. Any hint as to what may be the problem? Thanks in advance. Sudarshan
Re: Inserting Nulls
Option only works when you are going from case classes. Just put null into the Row, when you want the value to be null. On Tue, May 5, 2015 at 9:00 AM, Masf masfwo...@gmail.com wrote: Hi. I have a spark application where I store the results into table (with HiveContext). Some of these columns allow nulls. In Scala, this columns are represented through Option[Int] or Option[Double].. Depend on the data type. For example: *val hc = new HiveContext(sc)* *var col1: Option[Ingeger] = None* *...* *val myRow = org.apache.spark.sql.Row(col1, ...)* *val mySchema = StructType(Array(StructField(Column1, IntegerType, true)))* *val TableOutputSchemaRDD = hc.applySchema(myRow, mySchema)* *hc.registerRDDAsTable(TableOutputSchemaRDD, result_intermediate)* *hc.sql(CREATE TABLE table_output STORED AS PARQUET AS SELECT * FROM result_intermediate)* Produce: java.lang.ClassCastException: scala.Some cannot be cast to java.lang.Integer at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaIntObjectInspector.get(JavaIntObjectInspector.java:40) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createPrimitive(ParquetHiveSerDe.java:247) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createObject(ParquetHiveSerDe.java:301) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createStruct(ParquetHiveSerDe.java:178) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.serialize(ParquetHiveSerDe.java:164) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:123) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:114) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org $apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:114) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Thanks! -- Regards. Miguel Ángel
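Concretely, a sketch of unwrapping the Options before building the Row, so that None becomes a SQL NULL (Int.box / Double.box are needed because orNull only works on reference types):

  import org.apache.spark.sql.Row

  val col1: Option[Int] = None
  val col2: Option[Double] = Some(3.14)

  val myRow = Row(col1.map(Int.box).orNull, col2.map(Double.box).orNull)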
Re: spark sql, creating literal columns in java.
This should work from java too: http://spark.apache.org/docs/1.3.1/api/java/index.html#org.apache.spark.sql.functions$ On Tue, May 5, 2015 at 4:15 AM, Jan-Paul Bultmann janpaulbultm...@me.com wrote: Hey, What is the recommended way to create literal columns in java? Scala has the `lit` function from `org.apache.spark.sql.functions`. Should it be called from java as well? Cheers jan - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Parquet Partition Strategy - how to partition data correctly
Hi, I have a DataFrame that represents my data looks like this: +-++ | col_name| data_type | +-++ | obj_id | string | | type| string | | name| string | | metric_name | string | | value | double | | ts | timestamp | +-++ It is working fine, and I can store it to parquet with: df.saveAsParquetFile(/user/data/metrics) I would like to leverage parquet partitioning as referenced here, https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files I would like to see a representation something like this: usr |__ data |__ metrics |__ type=Virtual Machine |__ objId=1234 |__ metricName=CPU Demand |__ mmdd |__ data.parquet |__ metricName=CPU Utilization |__ mmdd |__ data.parquet |__ objId=5678 |__ metricName=CPU Demand |__ mmdd |__ data.parquet |__ type=Application |__ objId=0009 |__ metricName=Response Time |__ mmdd |__ data.parquet |__ metricName=Slow Response |__ mmdd |__ data.parquet |__ objId=0303 |__ metricName=Response Time |__ mmdd |__ data.parquet What is the correct way to achieve this? I can do something like: df.map{ case Row(nodeType: String, objId: String, name: String, metricName: String, value: Double, ts: java.sql.Timestamp) = ... // construct path val path = s/usr/data/metrics/type=${Row.nodeType}/objId=${Row.objId}/metricName=${Row.metricName}/floorToDay(ts) // save record as parquet df.saveAsParquet(path, Row) ...} Is this the right approach or is there a more optimal approach? This would save every row as an individual file. I will receive multiple entries for a given metric, type and objId combination in a given day. TIA for the assistance. -Todd
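For what it's worth, Spark 1.4's DataFrameWriter can produce that layout directly with df.write.partitionBy("type", "objId", "metricName").parquet("/user/data/metrics"). On 1.3 one workable, if cruder, sketch is to loop over the distinct partition values on the driver and filter, which only makes sense while the number of combinations stays modest (paths follow the layout above; the date level is omitted here):

  import sqlContext.implicits._
  import org.apache.spark.sql.Row

  val combos = df.select("type", "objId", "metricName").distinct().collect()
  combos.foreach { case Row(t: String, o: String, m: String) =>
    df.filter($"type" === t && $"objId" === o && $"metricName" === m)
      .saveAsParquetFile(s"/user/data/metrics/type=$t/objId=$o/metricName=$m")
  }

Saving per row, as in the pseudocode above, would create one tiny file per record and is best avoided; writing per partition value keeps related records together in one parquet file.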
Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Have you configured SPARK_CLASSPATH with the mysql connector jar in spark-env.sh? For example: export SPARK_CLASSPATH+=:/path/to/mysql-connector-java-5.1.18-bin.jar On 5 May 2015, at 15:32, 980548...@qq.com wrote: my metastore is configured like this: <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://192.168.1.40:3306/hive</value> </property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> <description>Driver class name for a JDBC metastore</description> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> <value>hive</value> </property> <property> <name>javax.jdo.option.ConnectionPassword</name> <value>hive123!@#</value> </property> <property> <name>hive.metastore.warehouse.dir</name> <value>/user/${user.name}/hive-warehouse</value> <description>location of default database for the warehouse</description> </property> ------ Original message ------ From: Wang, Daoyuan <daoyuan.w...@intel.com> Sent: Tuesday, May 5, 2015 3:22 PM To: 鹰 <980548...@qq.com>; luohui20001 <luohui20...@sina.com> Cc: user <user@spark.apache.org> Subject: RE: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient How did you configure your metastore? Thanks, Daoyuan From: 鹰 [mailto:980548...@qq.com] Sent: Tuesday, May 05, 2015 3:11 PM To: luohui20001 Cc: user Subject: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient hi luo, thanks for your reply. In fact I can use hive from spark on my spark master machine, but when I copy my spark files to another machine and try to access hive from spark I get the error "Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient". I have copied hive-site.xml to the spark conf directory and I am authenticated to access the hive metastore warehouse. Thanks, Best regards! ------ Original message ------ From: luohui20001 <luohui20...@sina.com> Sent: Tuesday, May 5, 2015 9:56 AM To: 鹰 <980548...@qq.com>; user <user@spark.apache.org> Subject: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient You may need to copy hive-site.xml to your spark conf directory and check your hive metastore warehouse setting, and also check if you are authenticated to access the hive metastore warehouse. Thanks & Best regards! San.Luo ------ Original message ------ From: 鹰 <980548...@qq.com> To: user <user@spark.apache.org> Subject: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient Date: 2015-05-05 08:49 hi all, when I submit a spark-sql program to select data from my hive database I get an error like this: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient. What's wrong with my spark configuration? Thanks for any help!
Re: Re: sparksql running slow while joining 2 tables.
Can you activate your event logs and send them to us? Thank you! On Tue, 5 May 2015 at 04:56, luohui20001 <luohui20...@sina.com> wrote: Yes, just the default 1 executor. Thanks. Sent from my Xiaomi phone. On 4 May 2015 at 22:01, ayan guha <guha.a...@gmail.com> wrote: Are you using only 1 executor? On Mon, May 4, 2015 at 11:07 PM, luohui20...@sina.com wrote: hi Olivier, spark 1.3.1, with java 1.8.0.45, and I attached 2 pics. It seems like a GC issue. I also tried different parameters like memory size of driver/executor, memory fraction, java opts... but this issue still happens. Thanks & Best regards! 罗辉 San.Luo ------ Original message ------ From: Olivier Girardot <ssab...@gmail.com> To: luohui20...@sina.com, user <user@spark.apache.org> Subject: Re: sparksql running slow while joining 2 tables. Date: 2015-05-04 20:46 Hi, What is your Spark version? Regards, Olivier. On Mon, 4 May 2015 at 11:03, luohui20...@sina.com wrote: hi guys, when I am running a sql like select a.name,a.startpoint,a.endpoint, a.piece from db a join sample b on (a.name = b.name) where (b.startpoint a.startpoint + 25); I found sparksql running slowly, taking minutes, which may be caused by very long GC and shuffle time. Table db is created from a txt file sized at 56mb while table sample is sized at 26mb, both quite small. My spark cluster is a standalone pseudo-distributed spark cluster with an 8g executor and 4g driver manager. Any advice? Thank you guys. Thanks & Best regards! 罗辉 San.Luo -- Best Regards, Ayan Guha
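To turn the event logs on (either programmatically as below or via the same keys in conf/spark-defaults.conf; the log directory is illustrative and, if you run a history server, must be readable by it):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs:///user/spark/eventlogs")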
Re: Is LIMIT n in Spark SQL useful?
Michael Are there plans to add LIMIT push down? It's quite a natural thing to do in interactive querying. Sent from my iPhone On 4 May 2015, at 22:57, Michael Armbrust mich...@databricks.com wrote: The JDBC interface for Spark SQL does not support pushing down limits today. On Mon, May 4, 2015 at 8:06 AM, Robin East robin.e...@xense.co.uk wrote: and a further question - have you tried running this query in pqsl? what’s the performance like there? On 4 May 2015, at 16:04, Robin East robin.e...@xense.co.uk wrote: What query are you running. It may be the case that your query requires PosgreSQL to do a large amount of work before identifying the first n rows On 4 May 2015, at 15:52, Yi Zhang zhangy...@yahoo.com.INVALID wrote: I am trying to query PostgreSQL using LIMIT(n) to reduce memory size and improve query performance, but I found it took long time as same as querying not using LIMIT. It let me confused. Anybody know why? Thanks. Regards, Yi
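Until then, one workaround is to push the LIMIT into PostgreSQL yourself by making it part of the JDBC dbtable option, which accepts any subquery usable in a FROM clause (connection details here are made up):

  val limited = sqlContext.load("jdbc", Map(
    "url"     -> "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",
    "dbtable" -> "(SELECT * FROM big_table LIMIT 1000) AS t"))
  limited.count()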
Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:
Thanks Tristan for sharing this. Actually this happens when I am reading a csv file of 3.5 GB. best, /Shahab On Tue, May 5, 2015 at 9:15 AM, Tristan Blakers tris...@blackfrog.org wrote: Hi Shahab, I’ve seen exceptions very similar to this (it also manifests as negative array size exception), and I believe it’s a really bug in Kryo. See this thread: http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccag02ijuw3oqbi2t8acb5nlrvxso2xmas1qrqd_4fq1tgvvj...@mail.gmail.com%3E Manifests in all of the following situations when working with an object graph in excess of a few GB: Joins, Broadcasts, and when using the hadoop save APIs. Tristan On 3 May 2015 at 07:26, Olivier Girardot ssab...@gmail.com wrote: Can you post your code, otherwise there's not much we can do. Regards, Olivier. Le sam. 2 mai 2015 à 21:15, shahab shahab.mok...@gmail.com a écrit : Hi, I am using sprak-1.2.0 and I used Kryo serialization but I get the following excepton. java.io.IOException: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1 I do apprecciate if anyone could tell me how I can resolve this? best, /Shahab
Re: java.io.IOException: No space left on device while doing repartitioning in Spark
It could be filling up your /tmp directory. You need to set your spark.local.dir or you can also specify SPARK_WORKER_DIR to another location which has sufficient space. Thanks Best Regards On Mon, May 4, 2015 at 7:27 PM, shahab shahab.mok...@gmail.com wrote: Hi, I am getting No space left on device exception when doing repartitioning of approx. 285 MB of data while these is still 2 GB space left ?? does it mean that repartitioning needs more space (more than 2 GB) for repartitioning of 285 MB of data ?? best, /Shahab java.io.IOException: No space left on device at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:51) at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205) at sun.nio.ch.FileChannelImpl.transferToTrustedChannel(FileChannelImpl.java:473) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:569) at org.apache.spark.util.Utils$.copyStream(Utils.scala:331) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:728) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
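That is, point the shuffle/spill directory at a volume with room, either via the config property or the worker environment (paths are illustrative):

  import org.apache.spark.SparkConf

  val conf = new SparkConf().set("spark.local.dir", "/mnt/large-disk/spark-tmp")
  // or, in conf/spark-env.sh on each worker:  SPARK_LOCAL_DIRS=/mnt/large-disk/spark-tmp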
Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:
Hi Shahab, I’ve seen exceptions very similar to this (it also manifests as a negative array size exception), and I believe it’s really a bug in Kryo. See this thread: http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccag02ijuw3oqbi2t8acb5nlrvxso2xmas1qrqd_4fq1tgvvj...@mail.gmail.com%3E It manifests in all of the following situations when working with an object graph in excess of a few GB: joins, broadcasts, and when using the hadoop save APIs. Tristan On 3 May 2015 at 07:26, Olivier Girardot ssab...@gmail.com wrote: Can you post your code, otherwise there's not much we can do. Regards, Olivier. On Sat, 2 May 2015 at 21:15, shahab shahab.mok...@gmail.com wrote: Hi, I am using spark-1.2.0 and I used Kryo serialization but I get the following exception. java.io.IOException: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1 I would appreciate it if anyone could tell me how I can resolve this. best, /Shahab
RE: 回复:spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
How did you configure your metastore? Thanks, Daoyuan From: 鹰 [mailto:980548...@qq.com] Sent: Tuesday, May 05, 2015 3:11 PM To: luohui20001 Cc: user Subject: 回复:spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient hi luo, thanks for your reply in fact I can use hive by spark on my spark master machine, but when I copy my spark files to another machine and when I want to access the hive by spark get the error Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient , I have copy hive-site.xml to spark conf directory and I have the authenticated to access hive metastore warehouse; Thanks , Best regards! -- 原始邮件 -- 发件人: luohui20001;luohui20...@sina.commailto:luohui20...@sina.com; 发送时间: 2015年5月5日(星期二) 上午9:56 收件人: 鹰980548...@qq.commailto:980548...@qq.com; useruser@spark.apache.orgmailto:user@spark.apache.org; 主题: 回复:spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient you may need to copy hive-site.xml to your spark conf directory and check your hive metastore warehouse setting, and also check if you are authenticated to access hive metastore warehouse. Thanksamp;Best regards! 罗辉 San.Luo - 原始邮件 - 发件人:鹰 980548...@qq.commailto:980548...@qq.com 收件人:user user@spark.apache.orgmailto:user@spark.apache.org 主题:spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient 日期:2015年05月05日 08点49分 hi all, when i use submit a spark-sql programe to select data from my hive database I get an error like this: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient ,what's wrong with my spark configure ,thank any help !
Re: RE: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
my metastore is configured like this: <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://192.168.1.40:3306/hive</value> </property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> <description>Driver class name for a JDBC metastore</description> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> <value>hive</value> </property> <property> <name>javax.jdo.option.ConnectionPassword</name> <value>hive123!@#</value> </property> <property> <name>hive.metastore.warehouse.dir</name> <value>/user/${user.name}/hive-warehouse</value> <description>location of default database for the warehouse</description> </property> ------ Original message ------ From: Wang, Daoyuan <daoyuan.w...@intel.com> Sent: Tuesday, May 5, 2015 3:22 PM To: 鹰 <980548...@qq.com>; luohui20001 <luohui20...@sina.com> Cc: user <user@spark.apache.org> Subject: RE: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient How did you configure your metastore? Thanks, Daoyuan From: 鹰 [mailto:980548...@qq.com] Sent: Tuesday, May 05, 2015 3:11 PM To: luohui20001 Cc: user Subject: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient hi luo, thanks for your reply. In fact I can use hive from spark on my spark master machine, but when I copy my spark files to another machine and try to access hive from spark I get the error "Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient". I have copied hive-site.xml to the spark conf directory and I am authenticated to access the hive metastore warehouse. Thanks, Best regards! ------ Original message ------ From: luohui20001 <luohui20...@sina.com> Sent: Tuesday, May 5, 2015 9:56 AM To: 鹰 <980548...@qq.com>; user <user@spark.apache.org> Subject: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient You may need to copy hive-site.xml to your spark conf directory and check your hive metastore warehouse setting, and also check if you are authenticated to access the hive metastore warehouse. Thanks & Best regards! San.Luo ------ Original message ------ From: 鹰 <980548...@qq.com> To: user <user@spark.apache.org> Subject: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient Date: 2015-05-05 08:49 hi all, when I submit a spark-sql program to select data from my hive database I get an error like this: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient. What's wrong with my spark configuration? Thanks for any help!
Re: Spark partitioning question
Turned out that is was sufficient do to repartitionAndSortWithinPartitions ... so far so good ;) On Tue, May 5, 2015 at 9:45 AM Marius Danciu marius.dan...@gmail.com wrote: Hi Imran, Yes that's what MyPartitioner does. I do see (using traces from MyPartitioner) that the key is partitioned on partition 0 but then I see this record arriving in both Yarn containers (I see it in the logs). Basically I need to emulate a Hadoop map-reduce job in Spark and groupByKey seemed a natural fit ( ... I am aware of its limitations). Thanks, Marius On Mon, May 4, 2015 at 10:45 PM Imran Rashid iras...@cloudera.com wrote: Hi Marius, I am also a little confused -- are you saying that myPartitions is basically something like: class MyPartitioner extends Partitioner { def numPartitions = 1 def getPartition(key: Any) = 0 } ?? If so, I don't understand how you'd ever end up data in two partitions. Indeed, than everything before the call to partitionBy(myPartitioner) is somewhat irrelevant. The important point is the partitionsBy should put all the data in one partition, and then the operations after that do not move data between partitions. so if you're really observing data in two partitions, then it would good to know more about what version of spark you are on, your data etc. as it sounds like a bug. But, I have a feeling there is some misunderstanding about what your partitioner is doing. Eg., I think doing groupByKey followed by sortByKey doesn't make a lot of sense -- in general one sortByKey is all you need (its not exactly the same, but most probably close enough, and avoids doing another expensive shuffle). If you can share a bit more information on your partitioner, and what properties you need for your f, that might help. thanks, Imran On Tue, Apr 28, 2015 at 7:10 AM, Marius Danciu marius.dan...@gmail.com wrote: Hello all, I have the following Spark (pseudo)code: rdd = mapPartitionsWithIndex(...) .mapPartitionsToPair(...) .groupByKey() .sortByKey(comparator) .partitionBy(myPartitioner) .mapPartitionsWithIndex(...) .mapPartitionsToPair( *f* ) The input data has 2 input splits (yarn 2.6.0). myPartitioner partitions all the records on partition 0, which is correct, so the intuition is that f provided to the last transformation (mapPartitionsToPair) would run sequentially inside a single yarn container. However from yarn logs I do see that both yarn containers are processing records from the same partition ... and *sometimes* the over all job fails (due to the code in f which expects a certain order of records) and yarn container 1 receives the records as expected, whereas yarn container 2 receives a subset of records ... for a reason I cannot explain and f fails. The overall behavior of this job is that sometimes it succeeds and sometimes it fails ... apparently due to inconsistent propagation of sorted records to yarn containers. If any of this makes any sense to you, please let me know what I am missing. Best, Marius
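For anyone hitting the same thing, the shape of that solution is roughly the following (pairRdd, myPartitioner and f stand in for the RDD, partitioner and per-partition function from the thread; the key type needs an Ordering in scope):

  val sorted = pairRdd.repartitionAndSortWithinPartitions(myPartitioner)
  val result = sorted.mapPartitionsWithIndex { (idx, iter) => f(idx, iter) }

This does the partitioning and the per-partition sort in one shuffle, instead of groupByKey followed by sortByKey followed by partitionBy.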
Re: Spark partitioning question
Hi Imran, Yes that's what MyPartitioner does. I do see (using traces from MyPartitioner) that the key is partitioned on partition 0 but then I see this record arriving in both Yarn containers (I see it in the logs). Basically I need to emulate a Hadoop map-reduce job in Spark and groupByKey seemed a natural fit ( ... I am aware of its limitations). Thanks, Marius On Mon, May 4, 2015 at 10:45 PM Imran Rashid iras...@cloudera.com wrote: Hi Marius, I am also a little confused -- are you saying that myPartitions is basically something like: class MyPartitioner extends Partitioner { def numPartitions = 1 def getPartition(key: Any) = 0 } ?? If so, I don't understand how you'd ever end up data in two partitions. Indeed, than everything before the call to partitionBy(myPartitioner) is somewhat irrelevant. The important point is the partitionsBy should put all the data in one partition, and then the operations after that do not move data between partitions. so if you're really observing data in two partitions, then it would good to know more about what version of spark you are on, your data etc. as it sounds like a bug. But, I have a feeling there is some misunderstanding about what your partitioner is doing. Eg., I think doing groupByKey followed by sortByKey doesn't make a lot of sense -- in general one sortByKey is all you need (its not exactly the same, but most probably close enough, and avoids doing another expensive shuffle). If you can share a bit more information on your partitioner, and what properties you need for your f, that might help. thanks, Imran On Tue, Apr 28, 2015 at 7:10 AM, Marius Danciu marius.dan...@gmail.com wrote: Hello all, I have the following Spark (pseudo)code: rdd = mapPartitionsWithIndex(...) .mapPartitionsToPair(...) .groupByKey() .sortByKey(comparator) .partitionBy(myPartitioner) .mapPartitionsWithIndex(...) .mapPartitionsToPair( *f* ) The input data has 2 input splits (yarn 2.6.0). myPartitioner partitions all the records on partition 0, which is correct, so the intuition is that f provided to the last transformation (mapPartitionsToPair) would run sequentially inside a single yarn container. However from yarn logs I do see that both yarn containers are processing records from the same partition ... and *sometimes* the over all job fails (due to the code in f which expects a certain order of records) and yarn container 1 receives the records as expected, whereas yarn container 2 receives a subset of records ... for a reason I cannot explain and f fails. The overall behavior of this job is that sometimes it succeeds and sometimes it fails ... apparently due to inconsistent propagation of sorted records to yarn containers. If any of this makes any sense to you, please let me know what I am missing. Best, Marius
Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
hi luo, thanks for your reply. In fact I can use hive from spark on my spark master machine, but when I copy my spark files to another machine and try to access hive from spark I get the error "Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient". I have copied hive-site.xml to the spark conf directory and I am authenticated to access the hive metastore warehouse. Thanks, Best regards! ------ Original message ------ From: luohui20001 <luohui20...@sina.com> Sent: Tuesday, May 5, 2015 9:56 AM To: 鹰 <980548...@qq.com>; user <user@spark.apache.org> Subject: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient You may need to copy hive-site.xml to your spark conf directory and check your hive metastore warehouse setting, and also check if you are authenticated to access the hive metastore warehouse. Thanks & Best regards! San.Luo ------ Original message ------ From: 980548...@qq.com To: user <user@spark.apache.org> Subject: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient Date: 2015-05-05 08:49 hi all, when I submit a spark-sql program to select data from my hive database I get an error like this: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient. What's wrong with my spark configuration? Thanks for any help!
Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
thanks jeanlyn, it works. ------ Original message ------ From: jeanlyn <oujianl...@jd.com> Sent: Tuesday, May 5, 2015 3:40 PM To: 鹰 <980548...@qq.com> Cc: Wang, Daoyuan <daoyuan.w...@intel.com>; user <user@spark.apache.org> Subject: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient Have you configured SPARK_CLASSPATH with the mysql connector jar in spark-env.sh? For example: export SPARK_CLASSPATH+=:/path/to/mysql-connector-java-5.1.18-bin.jar On 5 May 2015, at 15:32, 980548...@qq.com wrote: my metastore is configured like this: <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://192.168.1.40:3306/hive</value> </property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> <description>Driver class name for a JDBC metastore</description> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> <value>hive</value> </property> <property> <name>javax.jdo.option.ConnectionPassword</name> <value>hive123!@#</value> </property> <property> <name>hive.metastore.warehouse.dir</name> <value>/user/${user.name}/hive-warehouse</value> <description>location of default database for the warehouse</description> </property> ------ Original message ------ From: Wang, Daoyuan <daoyuan.w...@intel.com> Sent: Tuesday, May 5, 2015 3:22 PM To: 鹰 <980548...@qq.com>; luohui20001 <luohui20...@sina.com> Cc: user <user@spark.apache.org> Subject: RE: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient How did you configure your metastore? Thanks, Daoyuan From: 鹰 [mailto:980548...@qq.com] Sent: Tuesday, May 05, 2015 3:11 PM To: luohui20001 Cc: user Subject: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient hi luo, thanks for your reply. In fact I can use hive from spark on my spark master machine, but when I copy my spark files to another machine and try to access hive from spark I get the error "Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient". I have copied hive-site.xml to the spark conf directory and I am authenticated to access the hive metastore warehouse. Thanks, Best regards! ------ Original message ------ From: luohui20001 <luohui20...@sina.com> Sent: Tuesday, May 5, 2015 9:56 AM To: 鹰 <980548...@qq.com>; user <user@spark.apache.org> Subject: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient You may need to copy hive-site.xml to your spark conf directory and check your hive metastore warehouse setting, and also check if you are authenticated to access the hive metastore warehouse. Thanks & Best regards! San.Luo ------ Original message ------ From: 980548...@qq.com To: user <user@spark.apache.org> Subject: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient Date: 2015-05-05 08:49 hi all, when I submit a spark-sql program to select data from my hive database I get an error like this: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient. What's wrong with my spark configuration? Thanks for any help!
RE: Unable to join table across data sources using sparkSQL
Hi, I am using Spark 1.3.0. I was able to join a JSON file on HDFS, registered as a temp table, with a table in MySQL. Along the same lines I tried to join a table in Hive with another table in Teradata, but I get a query parse exception. Regards, Ishwardeep From: ankitjindal [via Apache Spark User List] Sent: Tuesday, May 5, 2015 1:26 PM To: Ishwardeep Singh Subject: Re: Unable to join table across data sources using sparkSQL Hi, I was doing the same but with a file in Hadoop as a temp table and one table in SQL Server, and I succeeded. Which Spark version are you using currently? Thanks, Ankit
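[Editor's note] For reference, a minimal sketch of the kind of cross-source join being discussed, using the Spark 1.3 API: a JSON file registered as a temp table joined with a JDBC table. The table names, path, and JDBC URL are illustrative, not from the thread.

import org.apache.spark.sql.SQLContext

// Assumes `sc` is an existing SparkContext and the JDBC driver jar is on the classpath.
val sqlContext = new SQLContext(sc)

// Register a JSON file on HDFS as a temp table.
sqlContext.jsonFile("hdfs:///data/events.json").registerTempTable("events")

// Register a JDBC (e.g. MySQL) table as a temp table via the jdbc data source.
sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://dbhost:3306/sales?user=u&password=p",
  "dbtable" -> "customers")).registerTempTable("customers")

// Join across the two sources with plain SQL.
val joined = sqlContext.sql(
  "SELECT e.id, c.name FROM events e JOIN customers c ON e.customer_id = c.id")
joined.show()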
Re: Event generator for SPARK-Streaming from csv
I know these methods, but I need to create events using the timestamps in the data tuples, meaning each new tuple is emitted according to the timestamp in the CSV file. This is useful to simulate the data rate over time, just like real sensor data. On Fri, May 1, 2015 at 2:52 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, maybe you could use streamingContext.fileStream like in the example from https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers; you can read from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.). You could split the file into several smaller files and move them to the target folder one by one, with some sleep time in between, to simulate a stream of data with custom granularity. Hope that helps. Greetings, Juan 2015-05-01 9:30 GMT+02:00 anshu shukla anshushuk...@gmail.com: I have the real DEBS taxi data in a CSV file; in order to operate over it, how do I simulate a Spout-like event generator using the timestamps in the CSV file? -- Thanks & Regards, Anshu Shukla
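[Editor's note] One way to get timestamp-paced replay, sketched below under the assumption that each CSV row starts with an epoch-millisecond timestamp: a small driver reads the file, sleeps for the gap between consecutive timestamps, and drops each row as a new file into a directory that a Spark Streaming textFileStream/fileStream is watching. File paths and the column layout are illustrative.

import java.io.{File, PrintWriter}
import scala.io.Source

// Replays a CSV file into `outDir`, pacing writes by the timestamp in column 0 (epoch millis).
// A Spark Streaming job can then consume the directory with ssc.textFileStream(outDir).
def replayCsv(csvPath: String, outDir: String): Unit = {
  val rows = Source.fromFile(csvPath).getLines().toList
  var prevTs: Option[Long] = None
  rows.zipWithIndex.foreach { case (line, i) =>
    val ts = line.split(",")(0).trim.toLong
    prevTs.foreach(p => Thread.sleep(math.max(0L, ts - p)))    // wait out the real inter-event gap
    val out = new PrintWriter(new File(outDir, s"part-$i.csv")) // one small file per event
    out.println(line)
    out.close()
    prevTs = Some(ts)
  }
}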
Maximum Core Utilization
Hi All, For a job I am running on Spark with a dataset of say 350,000 lines (not big), I am finding that even though my cluster has a large number of cores available (like 100 cores), the Spark system seems to stop after using just 4 cores and after that the runtime is pretty much a straight line no matter how many more cores are thrown at it. I am wondering if Spark tries to figure out the maximum no. of cores to use based on the size of the dataset? If yes, is there a way to disable this feature and force it to use all the cores available? Thanks, Manu -- The greater danger for most of us lies not in setting our aim too high and falling short; but in setting our aim too low, and achieving our mark. - Michelangelo
RE: Remoting warning when submitting to cluster
I downloaded the 1.3.1 source distribution and built it on Windows (laptop 8.0 and desktop 8.1). Here's what I'm running: Desktop: Spark Master (%SPARK_HOME%\bin\spark-class2.cmd org.apache.spark.deploy.master.Master -h desktop --port 7077), Spark Worker (%SPARK_HOME%\bin\spark-class2.cmd org.apache.spark.deploy.worker.Worker spark://desktop:7077), Kafka Broker, ZooKeeper Server. Laptop: 2 Kafka producers, each sending to a unique topic on the broker running on the desktop, and the Driver App. In this scenario, I get no messages showing up in the Driver App's console. If, on the other hand, I either move the driver app to the desktop or run the worker on the laptop instead of the desktop, then I see the counts as expected (meaning the driver and the worker/executor are on the same machine). When I moved this scenario to a set of machines in a separate network, separating the executor and driver worked as expected. So it seems a networking issue was causing the failure. Now to the follow-up question: which property do I set to configure the port so that I can ensure it's a port that isn't blocked by Systems? The candidates I see: spark.blockManager.port, spark.driver.host, spark.driver.port, spark.executor.port. -Javier From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Monday, May 4, 2015 12:42 AM To: Javier Delgadillo Cc: user@spark.apache.org Subject: Re: Remoting warning when submitting to cluster Looks like a version incompatibility, just make sure you have the proper version of Spark. Also look further in the stack trace for what is causing "Futures timed out" (it could also be a network issue if the ports aren't opened properly). Thanks, Best Regards On Sat, May 2, 2015 at 12:04 AM, javidelgadillo jdelgadi...@esri.com wrote: Hello all!! We've been prototyping some Spark applications to read messages from Kafka topics. The application is quite simple: we use KafkaUtils.createStream to receive a stream of CSV messages from a Kafka topic. We parse the CSV and count the number of messages we get in each RDD. At a high level (removing the abstractions of our application), it looks like this:

val sc = new SparkConf()
  .setAppName(appName)
  .set("spark.executor.memory", "1024m")
  .set("spark.cores.max", "3")
  .set("spark.app.name", appName)
  .set("spark.ui.port", sparkUIPort)
val ssc = new StreamingContext(sc, Milliseconds(emitInterval.toInt))
KafkaUtils
  .createStream(ssc, zookeeperQuorum, consumerGroup, topicMap)
  .map(_._2)
  .foreachRDD((rdd: RDD[String], time: Time) => {
    println("Time %s: (%s total records)".format(time, rdd.count()))
  })

When I submit this with the Spark master set to local[3], everything behaves as I'd expect. After some startup overhead, I'm seeing the count printed to be the same as the count I'm simulating (1 every second, for example). When I submit this to a Spark master using spark://master.host:7077, the behavior is different. The overhead to start receiving seems longer, and on some runs I don't see anything for 30 seconds even though my simulator is sending messages to the topic.
I also see the following error written to stderr by every executor assigned to the job: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/05/01 10:11:38 INFO SecurityManager: Changing view acls to: username 15/05/01 10:11:38 INFO SecurityManager: Changing modify acls to: username 15/05/01 10:11:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(javi4211); users with modify permissions: Set(username) 15/05/01 10:11:38 INFO Slf4jLogger: Slf4jLogger started 15/05/01 10:11:38 INFO Remoting: Starting remoting 15/05/01 10:11:39 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@master.host:56534] 15/05/01 10:11:39 INFO Utils: Successfully started service 'driverPropsFetcher' on port 56534. 15/05/01 10:11:40 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@driver.host:51837]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: no further information: driver.host/10.27.51.214:51837 15/05/01 10:12:09 ERROR UserGroupInformation: PriviledgedActionException as:username cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:128) at
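[Editor's note] As a hedged sketch of the follow-up question above: in standalone mode the driver and executors open several normally-ephemeral ports, each of which can be pinned to a fixed value so a firewall can allow it. The port numbers below are arbitrary examples; the exact set of spark.*.port properties varies slightly by Spark version (these appear in the 1.x configuration documentation).

import org.apache.spark.SparkConf

// Pin the normally-ephemeral ports so they can be opened on the firewall.
val conf = new SparkConf()
  .setAppName("kafka-word-count")
  .set("spark.driver.host", "driver.host")   // hostname executors use to reach the driver
  .set("spark.driver.port", "7001")          // port for the driver's Akka endpoint
  .set("spark.blockManager.port", "7003")    // block manager port on driver and executors
  .set("spark.broadcast.port", "7004")       // HTTP broadcast server port
  .set("spark.fileserver.port", "7005")      // driver HTTP file server port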
Re: Maximum Core Utilization
Hi, do you have information on how many partitions/tasks the stage/job is running? By default there is 1 core per task, and your number of concurrent tasks may be limiting core utilization. There are a few settings you could play with, assuming your issue is related to the above: spark.default.parallelism spark.cores.max spark.task.cpus On Tue, May 5, 2015 at 3:55 PM, Manu Kaul manohar.k...@gmail.com wrote: Hi All, For a job I am running on Spark with a dataset of say 350,000 lines (not big), I am finding that even though my cluster has a large number of cores available (like 100 cores), the Spark system seems to stop after using just 4 cores and after that the runtime is pretty much a straight line no matter how many more cores are thrown at it. I am wondering if Spark tries to figure out the maximum no. of cores to use based on the size of the dataset? If yes, is there a way to disable this feature and force it to use all the cores available? Thanks, Manu -- The greater danger for most of us lies not in setting our aim too high and falling short; but in setting our aim too low, and achieving our mark. - Michelangelo
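[Editor's note] A minimal sketch of the knobs the reply points at, namely forcing more partitions so more of the cluster's cores receive tasks. The numbers and input path are illustrative stand-ins for the original poster's job.

import org.apache.spark.{SparkConf, SparkContext}

// Ask for all 100 cores and raise the default shuffle parallelism.
val conf = new SparkConf()
  .setAppName("max-core-utilization")
  .set("spark.cores.max", "100")             // standalone/Mesos: total cores for the app
  .set("spark.default.parallelism", "200")   // default partition count for shuffles
val sc = new SparkContext(conf)

// A small input file may arrive with only a few partitions; repartitioning spreads
// the ~350,000 lines across many tasks so they can run on more cores at once.
val data = sc.textFile("hdfs:///input/data.txt").repartition(200)
println(data.partitions.length)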
Multilabel Classification in spark
Hi all, I'm looking to implement a Multilabel classification algorithm but I am surprised to find that there are not any in the spark-mllib core library. Am I missing something? Would someone point me in the right direction? Thanks! Peter
Re: Multilabel Classification in spark
LogisticRegression in the MLlib package supports multilabel classification. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Tue, May 5, 2015 at 1:13 PM, peterg pe...@garbers.me wrote: Hi all, I'm looking to implement a Multilabel classification algorithm but I am surprised to find that there are not any in the spark-mllib core library. Am I missing something? Would someone point me in the right direction? Thanks! Peter
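[Editor's note] For context, a sketch of what the reply most plausibly refers to: MLlib's multinomial logistic regression, which handles multiple classes (one label per example). This is an illustrative usage with a toy dataset, not code from the thread; true multi-label output (several labels per example) would still need to be layered on top, for instance one binary model per label.

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Toy training set with 3 classes (labels 0.0, 1.0, 2.0); `sc` is an existing SparkContext.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.1)),
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0)),
  LabeledPoint(2.0, Vectors.dense(2.0, -1.5))))

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(3)   // multinomial: one of 3 mutually exclusive classes per example
  .run(training)

println(model.predict(Vectors.dense(2.0, 1.0)))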
Help with datetime comparison in SparkSQL statement ...
Hello Friends: Here's sample output from a SparkSQL query that works, just so you can see the underlying data structure, followed by one that fails.

# Just so you can see the DataFrame structure ...
resultsRDD = sqlCtx.sql("SELECT * FROM rides WHERE trip_time_in_secs = 3780")
resultsRDD.collect()  # WORKS
[Row(pickup_datetime=datetime.datetime(2013, 10, 4, 8, 0), dropoff_datetime=datetime.datetime(2013, 10, 4, 9, 3), trip_time_in_secs=3780, trip_distance=17.10381469727), Row(pickup_datetime=datetime.datetime(2013, 10, 18, 8, 0), dropoff_datetime=datetime.datetime(2013, 10, 18, 9, 3), trip_time_in_secs=3780, trip_distance=17.92076293945), ... )

But the following SQL raises the exception shown below.

resultsRDD = sqlCtx.sql("SELECT * FROM rides WHERE pickup_datetime > datetime.datetime(2013,12,1,0,0,0)")
resultsRDD.collect()  # FAILS
py4j.protocol.Py4JJavaError: An error occurred while calling o53.sql.
java.lang.RuntimeException: [1.62] failure: ``union'' expected but `(' found

It's the first time I'm trying this and I'm seemingly doing it incorrectly. Can anyone show me how to correct this? Thank you! =:) nmv -- Sincerely yours, Team PRISMALYTICS, PRISMALYTICS, LLC. | www.prismalytics.com
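[Editor's note] The parse error comes from embedding a host-language expression inside the SQL string: SparkSQL only sees the literal text datetime.datetime(...), and its parser stops at the '('. A hedged sketch of the usual workaround, shown in Scala for consistency with the other sketches in this digest (the same SQL text can be passed to sqlCtx.sql from PySpark): put a timestamp literal, or a CAST of a string, in the SQL text. Whether the CAST form or a bare string comparison is accepted can depend on whether a plain SQLContext or a HiveContext backs the query; the "rides" table and cutoff value are from the thread, the rest is illustrative.

// Compare against a timestamp literal in the SQL text rather than a host-language expression.
// Assumes `sqlContext` exists and "rides" is already registered as a table.
val after = sqlContext.sql(
  "SELECT * FROM rides WHERE pickup_datetime >= CAST('2013-12-01 00:00:00' AS TIMESTAMP)")
after.show()

// Or interpolate a formatted value into the string before handing it to sql():
val cutoff = "2013-12-01 00:00:00"
val after2 = sqlContext.sql(
  s"SELECT * FROM rides WHERE pickup_datetime >= CAST('$cutoff' AS TIMESTAMP)")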
example code for current date in spark sql
Hi, in Hive I am using unix_timestamp() as 'update_on' to insert the current date into the 'update_on' column of the table. Now I am converting this to Spark SQL. Please suggest example code to insert the current date and time into a column of a table using Spark SQL. Cheers, Kiran.
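[Editor's note] A hedged sketch of one way this can work: when the query runs through a HiveContext, Hive's built-in UDFs such as unix_timestamp() and from_unixtime() are generally available, so the Hive-style INSERT can be ported almost verbatim. Table and column names below are illustrative, not from the thread.

import org.apache.spark.sql.hive.HiveContext

// `sc` is an existing SparkContext; HiveContext exposes Hive UDFs like unix_timestamp().
val hiveCtx = new HiveContext(sc)

hiveCtx.sql("""
  INSERT INTO TABLE target_table
  SELECT id, name, from_unixtime(unix_timestamp()) AS update_on
  FROM source_table
""")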
setting spark configuration properties problem
Hi all, I have declared a Spark context at the start of my program and then I want to change its configuration at some later stage in my code, as written below:

val conf = new SparkConf().setAppName("Cassandra Demo")
var sc: SparkContext = new SparkContext(conf)
sc.getConf.set("spark.cassandra.connection.host", host)
println(sc.getConf.get("spark.cassandra.connection.host"))

In the above code the following exception occurs:

Exception in thread "main" java.util.NoSuchElementException: spark.cassandra.connection.host

Actually I want to set the Cassandra host property, and I only want to set it if there is some job related to Cassandra, so I declare the SparkContext earlier and then want to set this property at some later stage. Any suggestions?
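[Editor's note] A hedged note on why this fails: sc.getConf returns a copy of the context's configuration, so set() on it neither persists nor reaches the running context, and get() then throws because the key was never in the original conf. Most spark.* settings have to be in place before the SparkContext is constructed, as in the sketch below (the host value and decision flag are illustrative).

import org.apache.spark.{SparkConf, SparkContext}

// Decide whether this run needs Cassandra *before* building the context,
// and put the property on the SparkConf the context is created from.
val conf = new SparkConf().setAppName("Cassandra Demo")
val needsCassandra = true   // stand-in for the real decision logic
if (needsCassandra) {
  conf.set("spark.cassandra.connection.host", "10.0.0.5")
}
val sc = new SparkContext(conf)
println(sc.getConf.get("spark.cassandra.connection.host", "<not set>"))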
JAVA for SPARK certification
Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav
Map one RDD into two RDD
Hi all, I have a large RDD that I map a function over. Based on the nature of each record in the input RDD, I will generate two types of data. I would like to save each type into its own RDD, but I can't seem to find an efficient way to do it. Any suggestions? Many thanks. Bill
Re: Map one RDD into two RDD
Have you looked at RDD#randomSplit() (as example) ? Cheers On Tue, May 5, 2015 at 2:42 PM, Bill Q bill.q@gmail.com wrote: Hi all, I have a large RDD that I map a function to it. Based on the nature of each record in the input RDD, I will generate two types of data. I would like to save each type into its own RDD. But I can't seem to find an efficient way to do it. Any suggestions? Many thanks. Bill -- Many thanks. Bill
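[Editor's note] randomSplit splits by random sampling rather than by record type, so for type-based separation a common pattern is to cache the mapped RDD and filter it once per type. A minimal sketch, with Either standing in for the two kinds of records; the input path, output paths, and classify() logic are illustrative stand-ins for the poster's job.

// `sc` is an existing SparkContext.
val input = sc.textFile("hdfs:///input/records")

// Stand-in for the poster's mapping function: Left for one record type, Right for the other.
def classify(r: String): Either[String, Int] =
  if (r.startsWith("#")) Left(r) else Right(r.length)

val mapped = input.map(classify).cache()   // cache so the two passes below don't recompute the map

val lefts  = mapped.collect { case Left(a)  => a }   // RDD of the first type
val rights = mapped.collect { case Right(b) => b }   // RDD of the second type

lefts.saveAsTextFile("hdfs:///out/typeA")
rights.saveAsTextFile("hdfs:///out/typeB")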
Configuring Number of Nodes with Standalone Scheduler
Hi, I have a 1.0.0 cluster with multiple worker nodes that deploy a number of external tasks through getRuntime().exec. Currently I have no control over how many nodes get used for a given task. At times the scheduler evenly distributes the executors among all nodes, and at other times it only uses one node. (The difficulty with the latter is that the deployed tasks run out of memory, at which point the kernel intervenes and kills them.) I've tried setting spark.cores.max to the available number of cores, spark.deploy.spreadOut to true, spark.scheduler.mode to FAIR, etc., to no avail. Is there an undocumented parameter or a priming procedure to control this? Cheers, Nastooh Avessta, Engineer, Software Engineering, nave...@cisco.com, Cisco Systems Limited, Vancouver, BC
Re: JAVA for SPARK certification
I too have similar question. My understanding is since Spark written in scala, having done in Scala will be ok for certification. If someone who has done certification can confirm. Thanks, Kartik On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav
Re: JAVA for SPARK certification
There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have similar question. My understanding is since Spark written in scala, having done in Scala will be ok for certification. If someone who has done certification can confirm. Thanks, Kartik On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav
spark sql, creating literal columns in java.
Hey, What is the recommended way to create literal columns in Java? Scala has the `lit` function from `org.apache.spark.sql.functions`. Should it be called from Java as well? Cheers, Jan
Re: RDD coalesce or repartition by #records or #bytes?
Hi, Spark experts: I did rdd.coalesce(numPartitions).saveAsSequenceFile(dir) in my code, which generates the RDDs in streamed batches. It generates numPartitions files as expected, with names dir/part-x. However, the first couple of files (e.g., part-0, part-1) have many times more records than the other files. This highly skewed distribution causes stragglers (and hence unpredictable execution time) and also the need to allocate memory for the worst case (because some of the records can be much larger than average). To solve this problem, I replaced coalesce(numPartitions) with repartition(numPartitions) or coalesce(numPartitions, shuffle=true), which are equivalent. As a result, the records are more evenly distributed over the output files and the execution time becomes more predictable. It of course incurs a lot of shuffle traffic. However, the GC time became prohibitively high, which crashed my app in just a few hours. Adding more memory to the executors didn't seem to help. Do you have any suggestion on how to spread the data without the GC costs? Does repartition() redistribute/shuffle every record by hash partitioner? Why does it drive the GC time so high? Thanks, Du On Wednesday, March 4, 2015 5:39 PM, Zhan Zhang zzh...@hortonworks.com wrote: It uses HashPartitioner to distribute the records to different partitions, but the key is just an integer spread evenly across output partitions. From the code, each resulting partition will get a very similar number of records. Thanks. Zhan Zhang On Mar 4, 2015, at 3:47 PM, Du Li l...@yahoo-inc.com.INVALID wrote: Hi, My RDDs are created from a Kafka stream. After receiving an RDD, I want to coalesce/repartition it so that the data will be processed on a set of machines in parallel, as evenly as possible. The number of processing nodes is larger than the number of receiving nodes. My question is how coalesce/repartition works. Does it distribute by the number of records or the number of bytes? In my app, my observation is that the distribution seems to be by number of records. The consequence is, however, that some executors have to process 1000x as much data when the sizes of records are very skewed. Then we have to allocate memory for the worst case. Is there a way to programmatically affect the coalesce/repartition scheme? Thanks, Du
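[Editor's note] For reference, a minimal sketch of the two variants being compared in this thread, applied to each batch of a DStream. Here `stream` stands in for the poster's Kafka DStream of key/value pairs (saveAsSequenceFile needs elements convertible to Hadoop Writables); numPartitions and the output path are illustrative.

val numPartitions = 48

stream.foreachRDD { (rdd, time) =>
  // coalesce(n) avoids a shuffle but can leave partitions (and output files) badly skewed;
  // repartition(n), i.e. coalesce(n, shuffle = true), hash-distributes records evenly
  // at the cost of shuffle traffic (and the GC pressure discussed above).
  val balanced = rdd.repartition(numPartitions)
  balanced.saveAsSequenceFile(s"hdfs:///out/batch-${time.milliseconds}")
}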