streaming of binary files in PySpark
Hi, I want to use Spark Streaming to read binary files from HDFS. The documentation mentions binaryRecordsStream(directory, recordLength), but I don't understand what the record length means. Is it the size of the binary file, or something else? Regards, Yogesh
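For what it's worth, recordLength is the fixed size in bytes of each record, not the size of the file: the stream slices every new file into equal-sized chunks of that many bytes. A minimal sketch, where the record layout, the path, and the `ssc` variable are illustrative assumptions:

```python
import struct

# Assumed record layout: two 32-bit big-endian integers per record,
# so recordLength is 8 bytes. Adjust to the real file format.
RECORD_LENGTH = 8

def decode_record(raw):
    """Unpack one fixed-size record produced by binaryRecordsStream."""
    return struct.unpack(">ii", raw)

# Hypothetical streaming usage (needs a running StreamingContext `ssc`):
#   stream = ssc.binaryRecordsStream("hdfs:///data/records", RECORD_LENGTH)
#   decoded = stream.map(decode_record)

sample = struct.pack(">ii", 7, 42)
print(decode_record(sample))  # (7, 42)
```

If recordLength does not divide the file size evenly, the trailing partial record is a sign the length is wrong for that format.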
pandas DF DStream to Spark DF
Hi, I am writing a PySpark streaming job in which I return a pandas DataFrame inside a DStream. Now I want to save this DStream of DataFrames to a Parquet file. How can I do that? I am trying to convert it to a Spark DataFrame, but I am getting multiple errors. Please suggest how to do this. Regards, Yogesh
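One approach that typically works is to do the conversion on the driver inside foreachRDD: turn each pandas DataFrame into a Spark DataFrame with createDataFrame and append it to Parquet. A hedged sketch, assuming the DStream's elements really are pandas DataFrames and that a SQLContext named `sqlContext` exists (all names and paths are illustrative):

```python
import pandas as pd

def pandas_to_records(pdf):
    """Rows that createDataFrame can consume when converting the pandas
    DataFrame directly raises schema errors."""
    return pdf.to_dict(orient="records")

# Hypothetical driver-side usage (createDataFrame(pdf) alone often works too):
#   def save_batch(time, rdd):
#       for pdf in rdd.collect():          # elements are pandas DataFrames
#           sdf = sqlContext.createDataFrame(pandas_to_records(pdf))
#           sdf.write.mode("append").parquet("hdfs:///out/stream_parquet")
#   stream.foreachRDD(save_batch)

pdf = pd.DataFrame({"key": [1, 2], "value": ["a", "b"]})
print(pandas_to_records(pdf))
```

Collecting to the driver only scales to small per-batch DataFrames; for large batches the data should be kept distributed rather than returned as pandas objects.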
use UTF-16 decode in pyspark streaming
Hi, I am trying to decode binary data as UTF-16 in a Kafka consumer using Spark Streaming, but it gives this error: TypeError: 'str' object is not callable. I am doing it in the following way:

def utfTo16(msg):
    return msg.decode("utf-16")

kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1}, valueDecoder="utfTo16")

Please suggest whether I am doing this right or not. Regards, Yogesh
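The TypeError almost certainly comes from passing the string "utfTo16" rather than the function object: Spark tries to call the string itself. Passing the function (defined before the createStream call) should fix it. A small sketch, with the Kafka wiring shown only as comments since it needs a live cluster:

```python
def utf16_decoder(raw_bytes):
    """Decode one raw Kafka message as UTF-16; None stays None."""
    return raw_bytes.decode("utf-16") if raw_bytes is not None else None

# Pass the function object itself, not the string "utfTo16"
# (hypothetical wiring; needs a StreamingContext `ssc` and ZooKeeper):
#   kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer",
#                                 {topic: 1}, valueDecoder=utf16_decoder)

print(utf16_decoder(u"hello".encode("utf-16")))  # hello
```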
reading binary file in spark-kafka streaming
Hi, I have a binary file which I read in a Kafka producer and send to the message queue. I then read it in the Spark-Kafka consumer as a streaming job, but it gives me the following error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 112: invalid start byte. Can anyone please tell me why that error occurs and how to fix it? Regards, Yogesh
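The default valueDecoder assumes UTF-8 text, which fails on arbitrary binary payloads (0xa9 is not a valid UTF-8 start byte). Supplying a pass-through decoder keeps the raw bytes. A sketch, with the Kafka wiring illustrative and commented out:

```python
def raw_bytes_decoder(raw_bytes):
    """Return the Kafka message bytes untouched, bypassing the default
    UTF-8 decode that fails on binary payloads."""
    return raw_bytes

# Hypothetical wiring (the stream setup itself needs a live cluster):
#   kvs = KafkaUtils.createStream(ssc, zkQuorum, "consumer-group",
#                                 {topic: 1}, valueDecoder=raw_bytes_decoder)

payload = b"\x89BIN\x00\xa9"            # arbitrary non-UTF-8 bytes
print(raw_bytes_decoder(payload) == payload)  # True
```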
read binary file in PySpark
Hi, I am trying to read a binary file in PySpark using the API binaryRecords(path, recordLength), but it gives all values as ['\x00', '\x00', '\x00', '\x00']. When I try to read the same file using binaryFiles(), it gives me a correct RDD, but in the form of key-value pairs where the value is a string. I want to get a byte array out of the binary file. How can I do that? Regards, Yogesh
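If binaryRecords returns all zeros, one thing worth checking is whether recordLength matches the file's actual record size, since a wrong length slices the file at the wrong offsets. To get a byte array from the binaryFiles value, wrapping it in bytearray is usually enough. A sketch (the path is illustrative):

```python
def to_byte_array(content):
    """Convert the value half of a binaryFiles (path, content) pair,
    which arrives as a raw string of bytes, into a bytearray."""
    return bytearray(content)

# Hypothetical usage (needs a SparkContext `sc`):
#   pairs = sc.binaryFiles("hdfs:///data/blob.bin")
#   byte_arrays = pairs.map(lambda kv: to_byte_array(kv[1]))

print(list(to_byte_array(b"\x01\x02\x03")))  # [1, 2, 3]
```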
Disable logger in SparkR
Hi, Is there any way of disabling console logging in SparkR? Regards, Yogesh - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
UDF in SparkR
Hi, Is there any way of using a UDF in SparkR? Regards, Yogesh
XLConnect in SparkR
Hi, I am trying to load and read Excel sheets from HDFS in SparkR using the XLConnect package. Can anyone help me find out how to read .xls files from HDFS in SparkR? Regards, Yogesh
Handle empty kafka in Spark Streaming
Hi, Does anyone know how to handle an empty Kafka topic while a Spark Streaming job is running? Regards, Yogesh
Re: Handling Empty RDD
Hi, I finally got it working. I was using the updateStateByKey() function to maintain the previous value of the state, and I found that the event list was empty. Handling the empty event list with event.isEmpty() sorted out the problem.

On Sun, May 22, 2016 at 7:59 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> You mean when rdd.isEmpty() returned false, saveAsTextFile still produced an empty file?
> Can you show a code snippet that demonstrates this?
> Cheers
>
> On Sun, May 22, 2016 at 5:17 AM, Yogesh Vyas <informy...@gmail.com> wrote:
>> Hi, I am reading files using textFileStream, performing some action on them and then saving to HDFS using saveAsTextFile. But whenever there is no file to read, Spark writes an empty RDD ( [] ) to HDFS. So, how to handle the empty RDD?
>> I checked rdd.isEmpty() and rdd.count > 0, but neither of them works.
Handling Empty RDD
Hi, I am reading files using textFileStream, performing some action on them and then saving them to HDFS using saveAsTextFile. But whenever there is no file to read, Spark writes an empty RDD ( [] ) to HDFS. So, how can I handle the empty RDD? I checked rdd.isEmpty() and rdd.count > 0, but neither of them works.
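The usual guard is to test rdd.isEmpty() inside foreachRDD and skip the write for empty batches; per the follow-up earlier in this thread, when updateStateByKey is involved the emptiness check may instead need to happen on the event list inside the state-update function. A sketch of the foreachRDD guard (paths and names are illustrative):

```python
def save_if_nonempty(rdd, path):
    """Write the batch only when it has records, avoiding the empty
    "[]" files that saveAsTextFile otherwise leaves on HDFS."""
    if rdd.isEmpty():
        return False
    rdd.saveAsTextFile(path)
    return True

# Hypothetical streaming usage (needs a StreamingContext `ssc`):
#   lines = ssc.textFileStream("hdfs:///in")
#   lines.foreachRDD(lambda rdd: save_if_nonempty(rdd, next_output_path()))
```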
Filter out the elements from xml file in Spark
Hi, I have XML files which I am reading through textFileStream, and then filtering out the required elements using traditional conditions and loops. I would like to know whether there are any specific packages or functions provided in Spark to perform operations on an RDD of XML? Regards, Yogesh
File not found exception while reading from folder using textFileStream
Hi, I am trying to read files in a streaming way using Spark Streaming. For this I copy files from my local folder into the source folder from which Spark reads. After reading and printing some of the files, it gives the following error:

Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/hadoop/file17.xml._COPYING_

I guess Spark Streaming is trying to read the file before the copy completes. Does anyone know how to handle this type of exception? Regards, Yogesh
Save DataFrame to Hive Table
Hi, I have created a DataFrame in Spark, and now I want to save it directly into a Hive table. How can I do that? I created the Hive table using the following HiveContext:

HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
hiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key INT, value STRING)");

I am using the following to save into Hive:

dataFrame.write().mode(SaveMode.Append).insertInto("TableName");

But it gives the error:

Exception in thread "main" java.lang.RuntimeException: Table Not Found: TableName at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:139) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:257) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:266) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:264) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:264) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:254) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) at scala.collection.immutable.List.foldLeft(List.scala:84) at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:916) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:916) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:918) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:917) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:921) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:921) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:926) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:924) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:930) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:930) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:176) at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:164) at com.honeywell.Track.combine.App.main(App.java:451) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
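One detail visible in the stack trace: the lookup goes through SimpleCatalog, i.e. a plain SQLContext rather than the HiveContext, which suggests the DataFrame being written was not created through the same HiveContext that created the table. A hedged PySpark sketch of the intended flow (the Java original would be analogous; all names are illustrative):

```python
def qualified(table, db="default"):
    """Fully qualify a Hive table name; useful for ruling out a
    wrong-database lookup when a table is reported as not found."""
    return "{}.{}".format(db, table)

# Hypothetical PySpark flow: create the table, the DataFrame, and the
# write all through the *same* HiveContext.
#   hive = HiveContext(sc)
#   hive.sql("CREATE TABLE IF NOT EXISTS TableName (key INT, value STRING)")
#   df = hive.createDataFrame(rows, ["key", "value"])
#   df.write.mode("append").insertInto(qualified("TableName"))

print(qualified("TableName"))  # default.TableName
```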
Getting java.lang.IllegalArgumentException: requirement failed while calling Sparks MLLIB StreamingKMeans from java application
Hi, I am trying to run StreamingKMeans from a Java application, but it gives the following error: java.lang.IllegalArgumentException: requirement failed. Below is my code:

JavaDStream<Vector> v = trainingData.map(new Function<String, Vector>() {
    public Vector call(String arg0) throws Exception {
        String[] p = arg0.split(",");
        double[] d = new double[p.length];
        for (int i = 0; i
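With StreamingKMeans, "requirement failed" frequently indicates a dimension mismatch, for example the dimension given to setRandomCenters not matching the length of the parsed vectors. A hedged PySpark sketch of the same kind of job (names, paths, and parameters are illustrative):

```python
def parse_vector(line):
    """Parse one CSV line into a dense vector (as a list of floats).
    Its length must equal the dimension given to setRandomCenters,
    or training aborts with "requirement failed"."""
    return [float(x) for x in line.split(",")]

# Hypothetical PySpark equivalent (needs a StreamingContext `ssc`):
#   from pyspark.mllib.clustering import StreamingKMeans
#   training = ssc.textFileStream("hdfs:///train").map(parse_vector)
#   model = StreamingKMeans(k=2, decayFactor=1.0)
#   model.setRandomCenters(3, 0.0, 42)   # 3 must equal len(parse_vector(line))
#   model.trainOn(training)

print(parse_vector("1.0,2.0,3.0"))  # [1.0, 2.0, 3.0]
```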
Visualization of KMeans cluster in Spark
Hi, Is there any way to visualizing the KMeans clusters in spark? Can we connect Plotly with Apache Spark in Java? Thanks, Yogesh - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
NoSuchMethodError
Hi, While I am trying to read a JSON file using SQLContext, I get the following error:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.<init>(Lorg/apache/spark/api/java/JavaSparkContext;)V
    at com.honeywell.test.testhive.HiveSpark.main(HiveSpark.java:15)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I am using a pom.xml with the following dependencies and versions:
spark-core_2.11, version 1.5.1
spark-streaming_2.11, version 1.5.1
spark-sql_2.11, version 1.5.1

Can anyone please help me resolve this? Regards, Yogesh
Re: NoSuchMethodError
I am trying to just read a JSON file with SQLContext and print the DataFrame, as follows:

SparkConf conf = new SparkConf().setMaster("local").setAppName("AppName");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read().json(pathToJSONFile);
df.show();

On Mon, Nov 16, 2015 at 12:48 PM, Fengdong Yu <fengdo...@everstring.com> wrote:
> what's your SQL?
>
>> On Nov 16, 2015, at 3:02 PM, Yogesh Vyas <informy...@gmail.com> wrote:
>> Hi, While I am trying to read a json file using SQLContext, i get the following error:
>> Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.<init>(Lorg/apache/spark/api/java/JavaSparkContext;)V
>>     at com.honeywell.test.testhive.HiveSpark.main(HiveSpark.java:15)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> I am using pom.xml with following dependencies and versions:
>> spark-core_2.11 with version 1.5.1
>> spark-streaming_2.11 with version 1.5.1
>> spark-sql_2.11 with version 1.5.1
>> Can anyone please help me out in resolving this?
>> Regards, Yogesh
Re: JMX with Spark
Hi, Please let me elaborate my question so that you know what exactly I want. I am running a Spark Streaming job that counts the number of occurrences of an event. Right now I am using a key/value pair RDD which gives me the count of an event, where the key is the event and the value is the number of counts. What I want is to create a web-based monitoring system which connects to the MBean server and displays the count value as it changes.

On Thu, Nov 5, 2015 at 5:54 PM, Romi Kuntsman <r...@totango.com> wrote:
> Have you read this?
> https://spark.apache.org/docs/latest/monitoring.html
>
> Romi Kuntsman, Big Data Engineer
> http://www.totango.com
>
> On Thu, Nov 5, 2015 at 2:08 PM, Yogesh Vyas <informy...@gmail.com> wrote:
>> Hi, How can we use JMX and JConsole to monitor our Spark applications?
JMX with Spark
Hi, How can we use JMX and JConsole to monitor our Spark applications?
Fwd: Get the previous state string
-- Forwarded message --
From: Yogesh Vyas <informy...@gmail.com>
Date: Thu, Oct 15, 2015 at 6:08 PM
Subject: Get the previous state string
To: user@spark.apache.org

Hi, I am new to Spark and was trying to do some experiments with it. I have a JavaPairDStream<String, List> RDD, and I want to get the list of strings from its previous state. For that I use the updateStateByKey function as follows:

final Function2<List, Optional<List>, Optional<List>> updateFunc =
    new Function2<List, Optional<List>, Optional<List>>() {
        public Optional<List> call(List arg0, Optional<List> arg1) throws Exception {
            if (arg1.toString() == null)
                return Optional.of(arg0);
            else {
                arg0.add(arg1.toString());
                return Optional.of(arg0);
            }
        }
    };

I want the function to append the new list of strings to the previous list and return the new list, but I am not able to do so. I am getting a java.lang.UnsupportedOperationException. Can anyone help me out in getting the desired output?
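Two likely culprits in the Java version: arg1.toString() never returns null (Optional's toString always yields a string, so arg1.isPresent() is the right emptiness check), and UnsupportedOperationException typically comes from calling add() on an immutable list such as one produced by Arrays.asList; copying into a fresh list avoids it. A minimal sketch of the same accumulation with PySpark's updateStateByKey (function and variable names are illustrative):

```python
def update_fn(new_values, prev_state):
    """Append this batch's lists of strings to the accumulated state.
    prev_state is None the first time a key is seen; building a fresh
    list sidesteps the immutable-list mutation problem."""
    state = list(prev_state) if prev_state is not None else []
    for values in new_values:      # each element is one batch's list
        state.extend(values)
    return state

# Hypothetical usage on a DStream of (key, list-of-strings) pairs:
#   totals = pairs.updateStateByKey(update_fn)

print(update_fn([["a"], ["b"]], ["x"]))  # ['x', 'a', 'b']
```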
Get list of Strings from its Previous State
Hi, I am new to Spark and was trying to do some experiments with it. I have a JavaPairDStream<String, List> RDD, and I want to get the list of strings from its previous state. For that I use the updateStateByKey function as follows:

final Function2<List, Optional<List>, Optional<List>> updateFunc =
    new Function2<List, Optional<List>, Optional<List>>() {
        public Optional<List> call(List arg0, Optional<List> arg1) throws Exception {
            if (arg1.toString() == null)
                return Optional.of(arg0);
            else {
                arg0.add(arg1.toString());
                return Optional.of(arg0);
            }
        }
    };

I want the function to append the new list of strings to the previous list and return the new list, but I am not able to do so. I am getting a java.lang.UnsupportedOperationException. Can anyone help me out in getting the desired output?