Re: unsubsribe
I have already send minimum 10 times! Today also I have send one! On Tue, Oct 30, 2018 at 3:51 PM Biplob Biswas wrote: > You need to send the email to user-unsubscr...@spark.apache.org and not > to the usergroup. > > Thanks & Regards > Biplob Biswas > > > On Tue, Oct 30, 2018 at 10:59 AM Anu B Nair wrote: > >> I am sending this Unsubscribe mail for last few months! It never happens! >> If anyone can help us to unsubscribe it wil be really helpful! >> >> On Tue, Oct 30, 2018 at 3:27 PM Mohan Palavancha < >> mohan.palavan...@gmail.com> wrote: >> >>> >>>
Re: unsubsribe
I am sending this Unsubscribe mail for last few months! It never happens! If anyone can help us to unsubscribe it wil be really helpful! On Tue, Oct 30, 2018 at 3:27 PM Mohan Palavancha wrote: > >
Unsubscribe
Hi, I have tried all possible way to unsubscripted from this group. Can anyone help? -- Anu
Unsubscribe
Unsubscribe
Unsubscribe
unsubscribe
Java heap space OutOfMemoryError in pyspark spark-submit (spark version:2.2)
Hi, I have a data set size of 10GB(example Test.txt). I wrote my pyspark script like below(Test.py): *from pyspark import SparkConf from pyspark.sql import SparkSession from pyspark.sql import SQLContext spark = SparkSession.builder.appName("FilterProduct").getOrCreate() sc = spark.sparkContext sqlContext = SQLContext(sc) lines = spark.read.text("C:/Users/test/Desktop/Test.txt").rdd lines.collect()* Then I am executing the above script using below command : spark-submit Test.py --executor-memory 15G --driver-memory 15G Then I am getting error like below: *17/12/29 13:27:18 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/Test.txt, range: 402653184-536870912, partition values: [empty row] 17/12/29 13:27:18 INFO CodeGenerator: Code generated in 22.743725 ms 17/12/29 13:27:44 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1) java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3230) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/12/29 13:27:44 ERROR Executor: Exception in task 2.0 in stage 0.0 (TID 2) java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3230) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93* Please let me know how to resolve this ? -- Anu
Fwd: [pyspark][MLlib] Getting WARN FPGrowth: Input data is not cached for cached data
Hi, Following is my pyspark code, (attached input sample_fpgrowth.txt and python code along with this mail. Even after I have done cache, I am getting Warning: Input data is not cached. *from pyspark.mllib.fpm import FPGrowthimport pysparkfrom pyspark.context import SparkContextfrom pyspark.sql.session import SparkSessionsc = SparkContext('local')data = sc.textFile("sample_fpgrowth.txt")transactions = data.map(lambda line: line.strip().split(' ')).cache()model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)result = model.freqItemsets().collect()print(result)* Understood that it is a warning, but just wanted to know in detail -- Anu r z h k p z y x w v u t s s x o n r x z y m t s q e z x z y r q t p from pyspark.mllib.fpm import FPGrowth import pyspark from pyspark.context import SparkContext from pyspark.sql.session import SparkSession sc = SparkContext('local') data = sc.textFile("sample_fpgrowth.txt") transactions = data.map(lambda line: line.strip().split(' ')).cache() model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10) result = model.freqItemsets().collect() print(result) - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: SparkSQL Timestamp query failure
Hi Alessandro Could you specify which query were you able to run successfully? 1. sqlContext.sql(SELECT * FROM Logs as l where l.timestamp = '2012-10-08 16:10:36' ).collect OR 2. sqlContext.sql(SELECT * FROM Logs as l where cast(l.timestamp as string) = '2012-10-08 16:10:36.0').collect I am able to run only the second query, i.e. the one with timestamp casted to string. What is the use of even parsing my data to store timestamp values when I can't do = and = comparisons on timestamp?? In the above query, I am ultimately doing string comparisons, while I actually want to do comparison on timestamp values. *My Spark version is 1.1.0* Please somebody clarify why am I not able to perform queries like Select * from table1 where endTime = '2015-01-01 00:00:00' and endTime = '2015-01-10 00:00:00' without getting anything in the output. Even, the following doesn't work Select * from table1 where endTime = CAST('2015-01-01 00:00:00' as timestamp) and endTime = CAST('2015-01-10 00:00:00' as timestamp) I get the this error : *java.lang.RuntimeException: [1.99] failure: ``STRING'' expected but identifier timestamp found* Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p22292.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Measuer Bytes READ and Peak Memory Usage for Query
Hi All I would like to measure Bytes Read and Peak Memory Usage for a Spark SQL Query. Please clarify if Bytes Read = aggregate size of all RDDs ?? All my RDDs are in memory and 0B spill to disk. And I am clueless how to measure Peak Memory Usage. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Measuer-Bytes-READ-and-Peak-Memory-Usage-for-Query-tp22159.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Hive on Spark with Spark as a service on CDH5.2
*I am not clear if spark sql supports HIve on Spark when spark is run as a service in CDH 5.2? * Can someone please clarify this. If this is possible, how what configuration changes have I to make to import hive context in spark shell as well as to be able to do a spark-submit for the job to be run on the entire cluster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hive-on-Spark-with-Spark-as-a-service-on-CDH5-2-tp22091.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Transform a Schema RDD to another Schema RDD with a different schema
I have a schema RDD with thw following Schema : scala mainRDD.printSchema root |-- COL1: integer (nullable = false) |-- COL2: integer (nullable = false) |-- COL3: string (nullable = true) |-- COL4: double (nullable = false) |-- COL5: string (nullable = true) Now, I transform the mainRDD like this : scala val sdf1 = new SimpleDateFormat(-mm-dd hh:mm:ss.SSS); val calendar = Calendar.getInstance() scala val mappedRDD : SchemaRDD = intf_ddRDD.map{ r = | val end_time = sdf1.parse(r(2).toString); | calendar.setTime(end_time); | val r2 = new java.sql.Timestamp(end_time.getTime); | val hour: Long = calendar.get(Calendar.HOUR_OF_DAY); | (r(0).toString.toInt, r(1).toString.toInt, r2, hour, r(3).toString.toDouble, r(4).toString) | } scalamappedRDD.printSchema root |-- _1: integer (nullable = false) |-- _2: integer (nullable = false) |-- _3: timestamp (nullable = true) |-- _4: long (nullable = false) |-- _5: double (nullable = false) |-- _6: string (nullable = true) But the issue is, despite specifying the mainRDD as SchemaRDD, it becomes just an RDD (notice that the column names are lost in mappedRDD) So, how can I do the above transformation on one SchemaRDD (mainRDD) to get another SchemaRDD (mappedRDD) with a different Schema. Please help me out. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Transform-a-Schema-RDD-to-another-Schema-RDD-with-a-different-schema-tp22112.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Iterate over contents of schemaRDD loaded from parquet file to extract timestamp
Spark Version - 1.1.0 Scala - 2.10.4 I have loaded following type data from a parquet file, stored in a schemaRDD [7654321,2015-01-01 00:00:00.007,0.49,THU] Since, in spark version 1.1.0, parquet format doesn't support saving timestamp valuues, I have saved the timestamp data as string. Can you please tell me how to iterate over the data in this schema RDD to retrieve the timestamp values and regsietr the mapped RDD as a Table and then be able to run queries like Select * from table where time = '2015-01-01 00:00:00.000' . I wrote the following code : val sdf = new SimpleDateFormat(-mm-dd hh:mm:ss.SSS); val calendar = Calendar.getInstance() val iddRDD = intf_ddRDD.map{ r = val end_time = sdf.parse(r(1).toString); calendar.setTime(end_time); val r1 = new java.sql.Timestamp(end_time.getTime); val hour: Long = calendar.get(Calendar.HOUR_OF_DAY); Row(r(0).toString.toInt, r1, hour, r(2).toString.toInt, r(3).toString) } This gives me * org.apache.spark.SparkException: Task not serializable* Please help !!! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Iterate-over-contents-of-schemaRDD-loaded-from-parquet-file-to-extract-timestamp-tp22089.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Optimizing SQL Query
I have a query that's like: Could you help in providing me pointers as to how to start to optimize it w.r.t. spark sql: sqlContext.sql( SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE FROM ( SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS SDP_USAGE FROM ( SELECT * FROM date_d dd JOIN interval_f intf ON intf.DATE_WID = dd.WID WHERE intf.DATE_WID = 20141116 AND intf.DATE_WID = 20141125 AND CAST(INTERVAL_END_TIME AS STRING) = '2014-11-16 00:00:00.000' AND CAST(INTERVAL_END_TIME AS STRING) = '2014-11-26 00:00:00.000' AND MEAS_WID = 3 ) test JOIN sdp_d sdp ON test.SDP_WID = sdp.WID WHERE sdp.UDC_ID = 'SP-1931201848' GROUP BY sdp.WID, DAY_OF_WEEK, HOUR, sdp.UDC_ID ) dw GROUP BY dw.DAY_OF_WEEK, dw.HOUR) Currently the query takes 15 minutes execution time where interval_f table holds approx 170GB of raw data, date_d -- 170 MB and sdp_d -- 490MB -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Optimizing-SQL-Query-tp21948.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SparkSQL Timestamp query failure
Thank you Alessandro :) On Tue, Mar 3, 2015 at 10:03 AM, whitebread [via Apache Spark User List] ml-node+s1001560n2188...@n3.nabble.com wrote: Anu, 1) I defined my class Header as it follows: case class Header(timestamp: java.sql.Timestamp, c_ip: String, cs_username: String, s_ip: String, s_port: String, cs_method: String, cs_uri_stem: String, cs_query: String, sc_status: Int, sc_bytes: Int, cs_bytes: Int, time_taken: Int, User_Agent: String, Referrer: String) 2) Defined a function to transform date to timestamp: implicit def date2timestamp(date: java.util.Date) = new java.sql.Timestamp(date.getTime) 3) Defined the format of my timestamp val formatTime = new java.text.SimpleDateFormat(-MM-dd hh:mm:ss) 4) Finally, I was able to parse my data: val tableMod = toProcessLogs.map(_.split( )).map(p = (Header(date2timestamp(formatTime3.parse(p(0)+ +p(1))),p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9).trim.toInt, p(10).trim.toInt, p(11).trim.toInt, p(12).trim.toInt, p(13), p(14 Hope this helps, Alessandro -- If you reply to this email, your message will be added to the discussion below: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p21884.html To unsubscribe from SparkSQL Timestamp query failure, click here http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=19502code=YW5hbWlrYS5ndW9wdGFAZ21haWwuY29tfDE5NTAyfDE1MjUxMDc5MQ== . NAML http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespacebreadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p21885.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: SparkSQL Timestamp query failure
Can you please post how did you overcome this issue. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Timestamp-query-failure-tp19502p21868.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Facing error: java.lang.ArrayIndexOutOfBoundsException while executing SparkSQL join query
I have three tables with the following schema: case class *date_d*(WID: Int, CALENDAR_DATE: java.sql.Timestamp, DATE_STRING: String, DAY_OF_WEEK: String, DAY_OF_MONTH: Int, DAY_OF_YEAR: Int, END_OF_MONTH_FLAG: String, YEARWEEK: Int, CALENDAR_MONTH: String, MONTH_NUM: Int, YEARMONTH: Int, QUARTER: Int, YEAR: Int) case class *interval_f*(ORG_ID: Int, CHANNEL_WID: Int, SDP_WID: Int, MEAS_WID: Int, DATE_WID: Int, TIME_WID: Int, VALIDATION_STATUS_CD: Int, VAL_FAIL_CD:Int, INTERVAL_FLAG_CD: Int, CHANGE_METHOD_WID:Int, SOURCE_LAST_UPD_TIME: java.sql.Timestamp, INTERVAL_END_TIME: java.sql.Timestamp, LOCKED: String, EXT_VERSION_TIME: java.sql.Timestamp, INTERVAL_VALUE: Double, INSERT_TIME: java.sql.Timestamp, LAST_UPD_TIME: java.sql.Timestamp) class * sdp_d*( WID :Option[Int], BATCH_ID :Option[Int], SRC_ID :Option[String], ORG_ID :Option[Int], CLASS_WID :Option[Int], DESC_TEXT :Option[String], PREMISE_WID :Option[Int], FEED_LOC :Option[String], GPS_LAT :Option[Double], GPS_LONG :Option[Double], PULSE_OUTPUT_BLOCK :Option[String], UDC_ID :Option[String], UNIVERSAL_ID :Option[String], IS_VIRTUAL_FLG :Option[String], SEAL_INFO :Option[String], ACCESS_INFO :Option[String], ALT_ACCESS_INFO :Option[String], LOC_INFO :Option[String], ALT_LOC_INFO :Option[String], TYPE :Option[String], SUB_TYPE :Option[String], TIMEZONE_ID :Option[Int], GIS_ID :Option[String], BILLED_UPTO_TIME :Option[java.sql.Timestamp], POWER_STATUS :Option[String], LOAD_STATUS :Option[String], BILLING_HOLD_STATUS :Option[String], INSERT_TIME :Option[java.sql.Timestamp], LAST_UPD_TIME :Option[java.sql.Timestamp]) extends Product{ @throws(classOf[IndexOutOfBoundsException]) override def productElement(n: Int) = n match { case 0 = WID; case 1 = BATCH_ID; case 2 = SRC_ID; case 3 = ORG_ID; case 4 = CLASS_WID; case 5 = DESC_TEXT; case 6 = PREMISE_WID; case 7 = FEED_LOC; case 8 = GPS_LAT; case 9 = GPS_LONG; case 10 = PULSE_OUTPUT_BLOCK; case 11 = UDC_ID; case 12 = UNIVERSAL_ID; case 13 = IS_VIRTUAL_FLG; case 14 = SEAL_INFO; case 15 = ACCESS_INFO; case 16 = ALT_ACCESS_INFO; case 17 = LOC_INFO; case 18 = ALT_LOC_INFO; case 19 = TYPE; case 20 = SUB_TYPE; case 21 = TIMEZONE_ID; case 22 = GIS_ID; case 23 = BILLED_UPTO_TIME; case 24 = POWER_STATUS; case 25 = LOAD_STATUS; case 26 = BILLING_HOLD_STATUS; case 27 = INSERT_TIME; case 28 = LAST_UPD_TIME; case _ = throw new IndexOutOfBoundsException(n.toString()) } override def productArity: Int = 29; override def canEqual(that: Any): Boolean = that.isInstanceOf[sdp_d] } Non-join queries work fine: *val q1 = sqlContext.sql(SELECT YEAR, DAY_OF_YEAR, MAX(WID), MIN(WID), COUNT(*) FROM date_d GROUP BY YEAR, DAY_OF_YEAR ORDER BY YEAR, DAY_OF_YEAR)* res4: Array[org.apache.spark.sql.Row] = Array([2014,305,20141101,20141101,1], [2014,306,20141102,20141102,1], [2014,307,20141103,20141103,1], [2014,308,20141104,20141104,1], [2014,309,20141105,20141105,1], [2014,310,20141106,20141106,1], [2014,311,20141107,20141107,1], [2014,312,20141108,20141108,1], [2014,313,20141109,20141109,1], [2014,314,20141110,20141110,1], [2014,315,2014,2014,1], [2014,316,20141112,20141112,1], [2014,317,20141113,20141113,1], [2014,318,20141114,20141114,1], [2014,319,20141115,20141115,1], [2014,320,20141116,20141116,1], [2014,321,20141117,20141117,1], [2014,322,20141118,20141118,1], [2014,323,20141119,20141119,1], [2014,324,20141120,20141120,1], [2014,325,20141121,20141121,1], [2014,326,20141122,20141122,1], [2014,327,20141123,20141123,1], [2014,328,20141... *But the join queries throw this error: java.lang.ArrayIndexOutOfBoundsException* *scala val q = sqlContext.sql(select * from date_d dd join interval_f intf on intf.DATE_WID = dd.WID Where intf.DATE_WID = 20141101 AND intf.DATE_WID = 20141110)* q: org.apache.spark.sql.SchemaRDD = SchemaRDD[38] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == Project [WID#0,CALENDAR_DATE#1,DATE_STRING#2,DAY_OF_WEEK#3,DAY_OF_MONTH#4,DAY_OF_YEAR#5,END_OF_MONTH_FLAG#6,YEARWEEK#7,CALENDAR_MONTH#8,MONTH_NUM#9,YEARMONTH#10,QUARTER#11,YEAR#12,ORG_ID#13,CHANNEL_WID#14,SDP_WID#15,MEAS_WID#16,DATE_WID#17,TIME_WID#18,VALIDATION_STATUS_CD#19,VAL_FAIL_CD#20,INTERVAL_FLAG_CD#21,CHANGE_METHOD_WID#22,SOURCE_LAST_UPD_TIME#23,INTERVAL_END_TIME#24,LOCKED#25,EXT_VERSION_TIME#26,INTERVAL_VALUE#27,INSERT_TIME#28,LAST_UPD_TIME#29] ShuffledHashJoin [WID#0], [DATE_WID#17], BuildRight Exchange (HashPartitioning [WID#0], 200) InMemoryColumnarTableScan [WID#0,CALENDAR_DATE#1,DATE_STRING#2,DAY_OF_WEEK#3,DAY_OF_MONTH#4,DAY_OF_YEAR#5,END_OF_MONTH_FLA... *scala q.take(5).foreach(println)* 15/02/27 15:50:26 INFO SparkContext: Starting job: runJob at basicOperators.scala:136 15/02/27 15:50:26 INFO DAGScheduler: Registering RDD 46 (mapPartitions at Exchange.scala:48) 15/02/27 15:50:26 INFO FileInputFormat: Total input paths to process : 1 15/02/27 15:50:26 INFO DAGScheduler: Registering RDD 42 (mapPartitions at Exchange.scala:48) 15/02/27 15:50:26 INFO DAGScheduler: Got job 2
Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell
My issue is posted here on stack-overflow. What am I doing wrong here? http://stackoverflow.com/questions/28689186/facing-error-while-extending-scala-class-with-product-interface-to-overcome-limi -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Facing-error-while-extending-scala-class-with-Product-interface-to-overcome-limit-of-22-fields-in-spl-tp21787.html Sent from the Apache Spark User List mailing list archive at Nabble.com.