streaming of binary files in PySpark

2017-05-22 Thread Yogesh Vyas
Hi,

I want to use Spark Streaming to read binary files from HDFS. The documentation mentions using binaryRecordsStream(directory, recordLength), but I don't understand what the record length means. Does it mean the size of the binary file, or something else?
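
For what it is worth, the record length is the fixed size in bytes of each individual record, not of the whole file; every record in every file is assumed to have that same length. A minimal PySpark sketch (the directory path and the 16-byte record length are only illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="BinaryRecordsExample")
ssc = StreamingContext(sc, 10)  # 10-second batches

# Each element of the DStream is one 16-byte record from new files in the directory
records = ssc.binaryRecordsStream("hdfs:///data/incoming", recordLength=16)
records.count().pprint()

ssc.start()
ssc.awaitTermination()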


Regards,
Yogesh


pandas DF Dstream to Spark DF

2017-04-09 Thread Yogesh Vyas
Hi,

I am writing a PySpark streaming job in which I am returning a pandas data frame as a DStream. Now I want to save this DStream of data frames to a Parquet file. How can I do that?

I tried converting it to a Spark data frame, but I am getting multiple errors. Please suggest how to do this.
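
For reference, a sketch of one way this is often handled, assuming each element of the DStream is a pandas DataFrame and a SQLContext is available on the driver (all names and paths are illustrative, not from the original job):

import pandas as pd
from pyspark.sql import SQLContext

sql_context = SQLContext(sc)  # sc is the job's existing SparkContext

def save_batch(time, rdd):
    if rdd.isEmpty():
        return
    pdfs = rdd.collect()                                 # pandas DataFrames in this micro-batch
    sdf = sql_context.createDataFrame(pd.concat(pdfs))   # pandas DataFrame -> Spark DataFrame
    sdf.write.mode("append").parquet("hdfs:///output/stream_parquet")

# pdf_stream is the DStream produced by the streaming job
pdf_stream.foreachRDD(save_batch)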

Regards,
Yogesh


use UTF-16 decode in pyspark streaming

2017-04-06 Thread Yogesh Vyas
Hi,

I am trying to decode binary data as UTF-16 in a Kafka consumer using Spark Streaming, but it gives the error:
TypeError: 'str' object is not callable
I am doing it in following way:

kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer",
{topic: 1},valueDecoder="utfTo16")

def utfTo16(msg):
  return msg.decode("utf-16")


Please let me know whether I am doing this correctly.
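
For comparison, a sketch in which the decoder is passed as a function object instead of the string "utfTo16"; when a plain string is supplied, the string itself gets called on each message, which matches the TypeError above. The rest of the setup is assumed unchanged:

from pyspark.streaming.kafka import KafkaUtils

def utf_to_16(msg):
    # Guard against empty/None payloads before decoding
    return msg.decode("utf-16") if msg is not None else None

kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer",
                              {topic: 1}, valueDecoder=utf_to_16)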


Regards,
Yogesh


reading binary file in spark-kafka streaming

2017-04-05 Thread Yogesh Vyas
Hi,

I have a binary file which I read in a Kafka producer and send to a message queue. I then read this in a Spark-Kafka consumer as a streaming job, but it gives me the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 112: invalid start byte

Can anyone please tell me why this error occurs and how to fix it?
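
The default key and value decoders in the PySpark Kafka receiver assume UTF-8 text, which is what fails on an arbitrary binary payload. A sketch that keeps the raw bytes by passing identity decoders (the group id and topic dictionary are illustrative):

from pyspark.streaming.kafka import KafkaUtils

def identity(b):
    # Leave the message bytes untouched instead of decoding them as UTF-8 text
    return b

stream = KafkaUtils.createStream(ssc, zkQuorum, "binary-consumer", {topic: 1},
                                 keyDecoder=identity, valueDecoder=identity)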

Regards,
Yogesh


read binary file in PySpark

2017-04-02 Thread Yogesh Vyas
Hi,

I am trying to read a binary file in PySpark using the API binaryRecords(path, recordLength), but it gives all values as ['\x00', '\x00', '\x00', '\x00', ...].

When I read the same file using binaryFiles(), it gives me the correct RDD, but in the form of key-value pairs where the value is a string.

I want to get the byte array out of the binary file. How can I get it?
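
A sketch of one way to get byte arrays out of binaryFiles(): each element is a (path, content) pair whose value is the complete file content, and wrapping that value in a bytearray gives mutable bytes to work with (the path is illustrative):

# (path, content) pairs; content is the raw content of one whole file
raw = sc.binaryFiles("hdfs:///data/binary_dir")

# Turn each file's content into a byte array
byte_arrays = raw.mapValues(bytearray)

first_path, first_bytes = byte_arrays.first()
print(first_path, first_bytes[:16])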

Regards,
Yogesh


Disable logger in SparkR

2016-08-22 Thread Yogesh Vyas
Hi,

Is there any way to disable console logging in SparkR?

Regards,
Yogesh




UDF in SparkR

2016-08-17 Thread Yogesh Vyas
Hi,

Is there any way of using a UDF in SparkR?

Regards,
Yogesh




XLConnect in SparkR

2016-07-20 Thread Yogesh Vyas
Hi,

I am trying to load and read Excel sheets from HDFS in SparkR using the XLConnect package.
Can anyone help me figure out how to read .xls files from HDFS in SparkR?

Regards,
Yogesh




Handle empty kafka in Spark Streaming

2016-06-15 Thread Yogesh Vyas
Hi,

Does anyone know how to handle an empty Kafka topic while a Spark Streaming job is running?
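
One common pattern, sketched below in PySpark, is simply to skip the work for micro-batches in which Kafka delivered nothing (the stream name and processing are illustrative):

def process(time, rdd):
    # When no Kafka messages arrived in this batch interval, do nothing
    if rdd.isEmpty():
        return
    print("batch at %s has %d records" % (time, rdd.count()))

stream.foreachRDD(process)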

Regards,
Yogesh




Re: Handling Empty RDD

2016-05-22 Thread Yogesh Vyas
Hi,
I finally got it working.
I was using the updateStateByKey() function to maintain the previous value of the state, and I found that the event list was empty. Handling the empty event list with event.isEmpty() sorted out the problem.
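
For reference, the same guard sketched in PySpark (the original job may well be in Java or Scala): updateStateByKey invokes the update function with an empty list of new events for keys that received no data in the current batch, so the function has to handle that case explicitly.

def update_func(new_events, old_state):
    # new_events is an empty list for keys with no new data in this batch
    if not new_events:
        return old_state
    return (old_state or []) + new_events

# pairs is the keyed DStream being tracked
state_stream = pairs.updateStateByKey(update_func)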

On Sun, May 22, 2016 at 7:59 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> You mean that when rdd.isEmpty() returned false, saveAsTextFile still produced
> an empty file?
>
> Can you show code snippet that demonstrates this ?
>
> Cheers
>
> On Sun, May 22, 2016 at 5:17 AM, Yogesh Vyas <informy...@gmail.com> wrote:
>>
>> Hi,
>> I am reading files using textFileStream, performing some action on them and
>> then saving them to HDFS using saveAsTextFile.
>> But whenever there is no file to read, Spark writes an empty RDD ( [] ) to HDFS.
>> So, how do I handle the empty RDD?
>>
>> I checked rdd.isEmpty() and rdd.count() > 0, but neither of them works.
>>
>




Handling Empty RDD

2016-05-22 Thread Yogesh Vyas
Hi,
I am reading files using textFileStream, performing some action on them and
then saving them to HDFS using saveAsTextFile.
But whenever there is no file to read, Spark writes an empty RDD ( [] ) to HDFS.
So, how do I handle the empty RDD?

I checked rdd.isEmpty() and rdd.count() > 0, but neither of them works.




Filter out the elements from xml file in Spark

2016-05-19 Thread Yogesh Vyas
Hi,
I have XML files which I am reading through textFileStream, and I am then filtering out the required elements using conventional conditions and loops. I would like to know whether Spark provides any specific packages or functions for performing operations on an RDD of XML.
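
There are packages aimed at XML on DataFrames (for example the spark-xml package), but for an RDD of XML strings a plain parser applied inside map/flatMap also works. A sketch using Python's standard xml.etree.ElementTree, assuming one XML document per record (element and attribute names are illustrative):

import xml.etree.ElementTree as ET

def extract_values(xml_string):
    # Parse one XML document and pull out the elements of interest
    root = ET.fromstring(xml_string)
    return [(item.get("id"), item.findtext("value")) for item in root.iter("item")]

# lines is the DStream (or RDD) of XML documents
parsed = lines.flatMap(extract_values)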

Regards,
Yogesh




File not found exception while reading from folder using textFileStream

2016-05-18 Thread Yogesh Vyas
Hi,
I am trying to read files in a streaming way using Spark Streaming. For this I am copying files from my local folder to the source folder from which Spark reads them.
After reading and printing some of the files, it gives the following error:

Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/hadoop/file17.xml._COPYING_

I guess Spark Streaming is trying to read the file before it has been copied completely.

Does anyone know how to handle this type of exception?
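
One common workaround, sketched below, is to upload each file to a staging directory first and then move it into the watched directory; a rename within HDFS is atomic, so the streaming job never observes a half-copied ._COPYING_ file (paths are illustrative):

import subprocess

LOCAL_FILE = "/tmp/file17.xml"
STAGING = "/user/hadoop/staging/file17.xml"      # not watched by Spark
WATCHED = "/user/hadoop/input/file17.xml"        # directory given to textFileStream

# 1. Upload into the staging directory
subprocess.check_call(["hdfs", "dfs", "-put", LOCAL_FILE, STAGING])
# 2. Atomically move the finished file into the watched directory
subprocess.check_call(["hdfs", "dfs", "-mv", STAGING, WATCHED])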

Regards,
Yogesh




Save DataFrame to Hive Table

2016-02-29 Thread Yogesh Vyas
Hi,

I have created a DataFrame in Spark, and now I want to save it directly into a Hive table. How do I do that?

I created the Hive table using the following HiveContext:

HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
hiveContext.sql("CREATE TABLE IF NOT EXISTS TableName (key INT, value STRING)");

I am using the following to save the DataFrame into Hive:
DataFrame.write().mode(SaveMode.Append).insertInto("TableName");

But it gives the error:
Exception in thread "main" java.lang.RuntimeException: Table Not
Found: TableName
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:139)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:257)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:266)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:264)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:264)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:254)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:916)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:916)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:918)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:917)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:921)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:921)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:926)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:924)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:930)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:930)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at 
org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:176)
at 
org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:164)
at com.honeywell.Track.combine.App.main(App.java:451)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)




Getting java.lang.IllegalArgumentException: requirement failed while calling Sparks MLLIB StreamingKMeans from java application

2016-02-15 Thread Yogesh Vyas
Hi,
I am trying to run StreamingKMeans from a Java application, but it gives the following error:

java.lang.IllegalArgumentException: requirement failed

Below is my code:

JavaDStream<Vector> v = trainingData.map(new Function<String, Vector>() {
    public Vector call(String arg0) throws Exception {
        // Parse one comma-separated line into a dense MLlib vector
        String[] p = arg0.split(",");
        double[] d = new double[p.length];
        for (int i = 0; i < p.length; i++) {
            d[i] = Double.parseDouble(p[i]);
        }
        return Vectors.dense(d);
    }
});

Visualization of KMeans cluster in Spark

2016-01-28 Thread Yogesh Vyas
Hi,

Is there any way to visualize the KMeans clusters in Spark?
Can we connect Plotly with Apache Spark in Java?
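
One low-tech approach, sketched in PySpark purely for illustration (the application in question is in Java), is to collect the fitted cluster centers plus a sample of points to the driver and plot them with any plotting library; the same collected data could equally be fed to Plotly:

import matplotlib.pyplot as plt
from pyspark.mllib.clustering import KMeans

# points is an RDD of 2-D feature vectors (illustrative)
model = KMeans.train(points, k=3, maxIterations=20)

sample = points.takeSample(False, 500)   # small sample, safe to bring to the driver
centers = model.clusterCenters

plt.scatter([p[0] for p in sample], [p[1] for p in sample], s=5, alpha=0.5)
plt.scatter([c[0] for c in centers], [c[1] for c in centers], marker="x", c="red")
plt.savefig("kmeans_clusters.png")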

Thanks,
Yogesh




NoSuchMethodError

2015-11-15 Thread Yogesh Vyas
Hi,

While trying to read a JSON file using SQLContext, I get the following error:

Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.spark.sql.SQLContext.<init>(Lorg/apache/spark/api/java/JavaSparkContext;)V
    at com.honeywell.test.testhive.HiveSpark.main(HiveSpark.java:15)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


I am using a pom.xml with the following dependencies and versions:
spark-core_2.11 with version 1.5.1
spark-streaming_2.11 with version 1.5.1
spark-sql_2.11 with version 1.5.1

Can anyone please help me resolve this?

Regards,
Yogesh




Re: NoSuchMethodError

2015-11-15 Thread Yogesh Vyas
I am trying to just read a JSON file in SQLContext and print the
dataframe as follows:

 SparkConf conf = new SparkConf().setMaster("local").setAppName("AppName");

 JavaSparkContext sc = new JavaSparkContext(conf);

 SQLContext sqlContext = new SQLContext(sc);

 DataFrame df = sqlContext.read().json(pathToJSONFile);

 df.show();

On Mon, Nov 16, 2015 at 12:48 PM, Fengdong Yu <fengdo...@everstring.com> wrote:
> what’s your SQL?
>
>
>
>
>> On Nov 16, 2015, at 3:02 PM, Yogesh Vyas <informy...@gmail.com> wrote:
>>
>> Hi,
>>
>> While trying to read a JSON file using SQLContext, I get the following error:
>>
>> Exception in thread "main" java.lang.NoSuchMethodError:
>> org.apache.spark.sql.SQLContext.<init>(Lorg/apache/spark/api/java/JavaSparkContext;)V
>>    at com.honeywell.test.testhive.HiveSpark.main(HiveSpark.java:15)
>>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>>    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>> I am using a pom.xml with the following dependencies and versions:
>> spark-core_2.11 with version 1.5.1
>> spark-streaming_2.11 with version 1.5.1
>> spark-sql_2.11 with version 1.5.1
>>
>> Can anyone please help me resolve this?
>>
>> Regards,
>> Yogesh
>>
>




Re: JMX with Spark

2015-11-05 Thread Yogesh Vyas
Hi,
Please let me elaborate on my question so that you know exactly what I want.

I am running a Spark Streaming job that counts the number of occurrences of events. Right now I am using a key/value pair RDD that gives me the count of each event, where the key is the event and the value is the number of occurrences. What I want is to create a web-based monitoring system that connects to the MBean server, so that the count value is displayed on the monitoring system and updated as it changes.

On Thu, Nov 5, 2015 at 5:54 PM, Romi Kuntsman <r...@totango.com> wrote:
> Have you read this?
> https://spark.apache.org/docs/latest/monitoring.html
>
> Romi Kuntsman, Big Data Engineer
> http://www.totango.com
>
> On Thu, Nov 5, 2015 at 2:08 PM, Yogesh Vyas <informy...@gmail.com> wrote:
>>
>> Hi,
>> How can we use JMX and JConsole to monitor our Spark applications?
>>
>




JMX with Spark

2015-11-05 Thread Yogesh Vyas
Hi,
How can we use JMX and JConsole to monitor our Spark applications?
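
A sketch of one way to expose the driver JVM over JMX so that JConsole can attach; the port and options are illustrative, the same flags can be set for executors via spark.executor.extraJavaOptions, and Spark's metrics system can also publish its metrics to JMX through a JmxSink configured in conf/metrics.properties:

spark-submit \
  --conf spark.driver.extraJavaOptions="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false" \
  my_streaming_app.py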




Fwd: Get the previous state string

2015-10-15 Thread Yogesh Vyas
-- Forwarded message --
From: Yogesh Vyas <informy...@gmail.com>
Date: Thu, Oct 15, 2015 at 6:08 PM
Subject: Get the previous state string
To: user@spark.apache.org


Hi,
I am new to Spark and was trying to do some experiments with it.

I have a JavaPairDStream<String, List<String>>.
I want to get the list of strings from its previous state. For that I use the updateStateByKey function as follows:

final Function2<List<String>, Optional<List<String>>, Optional<List<String>>> updateFunc =
    new Function2<List<String>, Optional<List<String>>, Optional<List<String>>>() {

        public Optional<List<String>> call(List<String> arg0, Optional<List<String>> arg1) throws Exception {
            if (arg1.toString() == null)
                return Optional.of(arg0);
            else {
                arg0.add(arg1.toString());
                return Optional.of(arg0);
            }
        }
    };

I want the function to append the new list of strings to the previous list and return the new list, but I am not able to do so: I am getting a java.lang.UnsupportedOperationException.
Can anyone help me out in getting the desired output?



