Best practices on how to use multiple Spark sessions

2018-09-16 Thread unk1102
Hi, I have an application which serves as an ETL job, and I have hundreds of
such ETL jobs which run daily. As of now I have just one Spark session which is
shared by all these jobs, and sometimes all of them run at the same time,
causing the Spark session to die, mostly due to memory issues. Is this a good
design? I am thinking of creating multiple Spark sessions, possibly one Spark
session per ETL job, but there is a delay in starting a Spark session which
seems to multiply by the number of ETL jobs. Please share best practices and
designs for such problems. Thanks in advance.
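
A minimal sketch, assuming Spark 2.x, of one possible middle ground: keep one
shared SparkContext but give each ETL job an isolated session via
SparkSession.newSession(), so each job gets its own SQL conf and temp views
without paying the full session-startup cost again (names and paths below are
placeholders):

import org.apache.spark.sql.SparkSession;

// one shared context/session for the whole application
SparkSession base = SparkSession.builder()
        .appName("etl-driver")
        .getOrCreate();

// per ETL job: an isolated session that shares the underlying SparkContext
SparkSession jobSession = base.newSession();
jobSession.conf().set("spark.sql.shuffle.partitions", "200");
jobSession.read().parquet("/path/to/input")
        .write().mode("append").parquet("/path/to/output");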






Is Spark DataFrame limit function action or transformation?

2018-05-31 Thread unk1102
Is the Spark DataFrame limit function an action or a transformation? It returns
a DataFrame, so it should be a transformation, but it seems to execute the
entire DAG, so I think it is an action. The same goes for the persist function.
Please guide. Thanks in advance.
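
A small sketch of the laziness in question, assuming a local session and a
placeholder input path: limit() and persist() only add to the plan, and nothing
runs until an action such as count() is called:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .master("local[*]").appName("limit-demo").getOrCreate();
Dataset<Row> df = spark.read().json("/path/to/data.json");  // placeholder path

Dataset<Row> limited = df.limit(10).persist();  // transformations: no job yet
long n = limited.count();                       // action: the DAG executes here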






Spark horizontal scaling is not supported in which cluster mode?

2018-05-21 Thread unk1102
Hi, I came across a Spark question about which Spark cluster manager does not
support horizontal scalability. The answer options were Mesos, YARN, Standalone,
and local mode. I believe all the cluster managers are horizontally scalable;
please correct me if I am wrong. I think the answer is local mode. Is that true?
Please guide. Thanks in advance.






Best practices to keep multiple versions of a schema in Spark

2018-04-30 Thread unk1102
Hi, I have a couple of datasets whose schema keeps changing, and I store them as
Parquet files. I currently use the mergeSchema option while loading these
different-schema Parquet files into a DataFrame, and it works fine. Now I have a
requirement to maintain the difference between schemas over time, basically to
maintain the list of columns which are the latest. Please guide if anybody has
done similar work, or share general best practices for tracking changes of
columns over time. Thanks in advance.
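
For reference, a minimal sketch of the mergeSchema load described above, plus
printing the merged schema back out; comparing these schema snapshots over time
is one simple way to see which columns were added (path and session are
placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("schema-check").getOrCreate();
Dataset<Row> df = spark.read()
        .option("mergeSchema", "true")
        .parquet("/path/to/parquet/history");
System.out.println(df.schema().treeString());   // union of all columns seen so far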






Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Thanks much, Nicolas, really appreciate it.






Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi Nicolas, thanks much for the guidance, it was very useful information. If you
can push that code to GitHub and share the URL, it would be a great help.
Looking forward to it. If you can find time to push it early, that would be an
even greater help, as I have to finish a POC on this use case ASAP.






Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi Nicolas, thanks much for the reply. Do you have any sample code somewhere? Do
you just keep the PDFs in Avro binary all the time? How often do you parse them
into text using PDFBox? Is it on an on-demand basis, or do you always parse to
text and keep the PDF binary in Avro only as an interim state?






Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi, I need guidance on dealing with a large number of PDF files when using
Hadoop and Spark. Should I store them as binary using sc.binaryFiles and then
convert them to text using PDF parsers like Apache Tika or PDFBox, or should I
convert them into text using these parsers first and store them as text files?
In doing the latter I am losing colors, formatting, etc. Please guide.
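
A rough sketch (not a recommendation) of the first option, assuming a
JavaSparkContext jsc, PDFBox 2.x on the classpath, and placeholder HDFS paths:
keep the PDFs as binary and extract text only when needed:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.input.PortableDataStream;

JavaPairRDD<String, PortableDataStream> pdfs =
        jsc.binaryFiles("hdfs:///path/to/pdfs");

// extract plain text with PDFBox; colors/formatting are lost at this step only
JavaPairRDD<String, String> texts = pdfs.mapValues(stream -> {
    try (PDDocument doc = PDDocument.load(stream.open())) {
        return new PDFTextStripper().getText(doc);
    }
});
texts.saveAsTextFile("hdfs:///path/to/pdf-text");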






Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi Vadim, thanks, I use the Hortonworks package. I don't think there are any seg
faults; the DataFrame I am trying to write is very small in size. Can it still
create a seg fault?






Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi, thanks Vadim, you are right. I already looked at line 468 and I don't see
any code there, it is just a comment. Yes, I am sure I am using spark-* jars
built for Spark 2.2.0 and Scala 2.11. Unfortunately I am also stuck with these
errors and not sure how to solve them.






Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi, thanks for the reply. I only see the NPE and "Task failed while writing
rows" all over the place; I don't see any other errors except the SparkException
"job aborted" followed by the two exceptions I pasted earlier.






org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi, I am getting the following exception when I try to write a DataFrame using
the following code. Please guide. I am using Spark 2.2.0.

df.write().format("parquet").mode(SaveMode.Append).save(outputPath); // outputPath: the target directory used in the job

org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:270)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468)
	at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:324)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)






Parquet error while saving in HDFS

2017-07-24 Thread unk1102
Hi, I am getting the following error and I am not sure why; it looks like a race
condition, but I don't use any threads, just the one thread which owns the Spark
context and writes to HDFS with one Parquet partition. I am using Scala 2.10 and
Spark 1.5.1. Please guide. Thanks in advance.


java.io.IOException: The file being written is in an invalid state. Probably
caused by an error thrown previously. Current state: COLUMN
	at parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:137)
	at parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:129)
	at parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:173)
	at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:152)
	at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
	at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
	at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:635)
	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)






Re: How to give name to Spark jobs shown in Spark UI

2016-07-27 Thread unk1102
Thanks Rahul, but I think you didn't read the question properly. I have one main
Spark job which I name using the approach you described. As part of the main
Spark job I create multiple threads which essentially become child Spark jobs,
and those child jobs have no direct way of being named.

On Jul 27, 2016 11:17, "rahulkumar-aws [via Apache Spark User List]" <
ml-node+s1001560n27414...@n3.nabble.com> wrote:

> You can set the name in SparkConf(), or if you are using spark-submit, set
> the --name flag:
>
>     val sparkconf = new SparkConf()
>       .setMaster("local[4]")
>       .setAppName("saveFileJob")
>     val sc = new SparkContext(sparkconf)
>
> or with spark-submit:
>
>     ./bin/spark-submit --name "FileSaveJob" --master local[4] fileSaver.jar
>
> On Mon, Jul 25, 2016 at 9:46 PM, neil90 [via Apache Spark User List] wrote:
>
>> As far as I know you can give a name to the SparkContext. I recommend
>> using a cluster monitoring tool like Ganglia to determine where it is slow
>> in your Spark jobs.
>
> Software Developer, Sigmoid (SigmoidAnalytics), India





How to give name to Spark jobs shown in Spark UI

2016-07-23 Thread unk1102
Hi, I have multiple child Spark jobs running at a time. Is there any way to name
these child Spark jobs so I can identify the slow-running ones? For e.g.
xyz_saveAsTextFile(), abc_saveAsTextFile(), etc. Please guide. Thanks in
advance.
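
A minimal sketch of one way to label them, assuming the child jobs are triggered
from separate threads: SparkContext.setJobGroup / setJobDescription are
thread-local, so each thread can tag the jobs it triggers and the labels show up
in the UI (jsc, the RDDs, names and paths below are illustrative):

import org.apache.spark.api.java.JavaSparkContext;

// assuming jsc, xyzRdd and abcRdd already exist in the driver
Runnable saveXyz = () -> {
    jsc.setJobGroup("xyz-save", "xyz_saveAsTextFile", true);
    xyzRdd.saveAsTextFile("/out/xyz");
};
Runnable saveAbc = () -> {
    jsc.setJobGroup("abc-save", "abc_saveAsTextFile", true);
    abcRdd.saveAsTextFile("/out/abc");
};
new Thread(saveXyz).start();
new Thread(saveAbc).start();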






How to change Spark DataFrame groupby("col1",..,"coln") into reduceByKey()?

2016-05-22 Thread unk1102
Hi, I have a Spark job which does a group by, and I can't avoid it because of my
use case. I have a large dataset, around 1 TB, which I need to process/update in
a DataFrame. My job shuffles huge amounts of data and slows down because of the
shuffling and the group by. One reason I see is that my data is skewed; some of
my group by keys are empty. How do I avoid empty group by keys in a DataFrame?
Does DataFrame avoid empty group by keys? I have around 8 keys on which I do the
group by.

sourceFrame.select("blabla").groupBy("col1","col2","col3",..."col8").agg("bla bla");

How do I change the above code into using reduceByKey(), and can we apply an
aggregation with reduceByKey()? Please guide. Thanks in advance.
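
A rough sketch of the reduceByKey route, assuming (purely for illustration) that
the aggregation is a sum over one numeric column and the 8 group-by columns are
strings; column positions and names are made up:

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, Double> keyed = sourceFrame.toJavaRDD().mapToPair(row -> {
    // concatenate the 8 group-by columns into one key (only 2 shown here)
    String key = row.getString(0) + "|" + row.getString(1);
    double value = row.getDouble(8);      // the column being aggregated
    return new Tuple2<>(key, value);
});
// reduceByKey combines map-side, so far less data is shuffled than with a raw group by
JavaPairRDD<String, Double> summed = keyed.reduceByKey((a, b) -> a + b);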






Does DataFrame has something like set hive.groupby.skewindata=true;

2016-05-21 Thread unk1102
Hi, I have a DataFrame with hugely skewed data, on the order of TBs, and I am
doing a group by on 8 fields which I unfortunately can't avoid. I am looking to
optimize this. I have found that Hive has

set hive.groupby.skewindata=true;

I don't use Hive; I have a Spark DataFrame. Can we achieve the above in Spark?
Please guide. Thanks in advance.
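
One common manual workaround (not a direct equivalent of
hive.groupby.skewindata) is to "salt" the skewed keys and aggregate in two
stages; a sketch below, assuming df is the skewed DataFrame and the aggregation
is a sum over a numeric column "val" (column names and the salt range are
illustrative):

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;

// stage 1: add a random salt so hot keys are spread over many partitions
DataFrame salted = df.withColumn("salt", floor(rand().multiply(16)));
DataFrame partial = salted.groupBy("col1", "col2", "salt")
        .agg(sum("val").as("partial_sum"));
// stage 2: drop the salt and combine the partial aggregates
DataFrame result = partial.groupBy("col1", "col2")
        .agg(sum("partial_sum").as("total"));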






How to avoid empty unavoidable group by keys in DataFrame?

2016-05-21 Thread unk1102
Hi, I have a Spark job which does a group by, and I can't avoid it because of my
use case. I have a large dataset, around 1 TB, which I need to process/update in
a DataFrame. My job shuffles huge amounts of data and slows down because of the
shuffling and the group by. One reason I see is that my data is skewed; some of
my group by keys are empty. How do I avoid empty group by keys in a DataFrame?
Does DataFrame avoid empty group by keys? I have around 8 keys on which I do the
group by.

sourceFrame.select("blabla").groupBy("col1","col2","col3",..."col8").agg("bla bla");






Spark 1.6.0 running jobs in yarn shows negative no of tasks in executor

2016-02-25 Thread unk1102
Hi, I have a Spark job which I run on YARN, and sometimes it behaves in a weird
manner: it shows a negative number of tasks in a few executors, and I keep
losing executors. I also see more executors than I requested. My job is highly
tuned and is not hitting OOM or any other problem. It is just that YARN
sometimes behaves in such a way that executors keep getting killed because of
resource crunching. Please guide me on how to keep YARN from behaving badly.






jssc.textFileStream(directory) how to ensure it read entire all incoming files

2016-02-09 Thread unk1102
Hi, my actual use case is streaming text files from an HDFS directory and
sending them to Kafka; please let me know if there is an existing solution for
this. Anyway, I have the following code:

// let's assume the directory contains one file a.txt and it has 100 lines
JavaDStream<String> logData = jssc.textFileStream(directory);

// how do I make sure jssc.textFileStream() has read all 100 lines so I can
// delete the file later?

Also, please let me know how I can send the above logData to Kafka using Spark
Streaming.
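
For the second part, a common pattern (this is not a built-in KafkaUtils sender)
is to create a producer per partition inside foreachRDD and write each line out;
the broker list and topic name below are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

logData.foreachRDD(rdd -> {
    rdd.foreachPartition(lines -> {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (lines.hasNext()) {
                producer.send(new ProducerRecord<>("my-topic", lines.next()));
            }
        }
    });
});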






Re: how to send JavaDStream RDD using foreachRDD using Java

2016-02-09 Thread unk1102
Hi Sachin, how did you write to Kafka from Spark? I can't find the following
methods sendString and sendDataAsString in KafkaUtils; can you please guide?

KafkaUtil.sendString(p, topic, result.get(0));
KafkaUtils.sendDataAsString(MTP, topicName, result.get(0));






Spark Streaming with Druid?

2016-02-06 Thread unk1102
Hi, has anybody tried Spark Streaming with Druid as a low-latency store? The
combination seems powerful; is it worth trying them together? Please guide and
share your experience. I am after building the best low-latency streaming
analytics.






Spark MLLlib Ideal way to convert categorical features into LabeledPoint RDD?

2016-02-01 Thread unk1102
Hi, I have a dataset which is completely categorical; it does not contain even
one numerical column. Now I want to apply classification using Naive Bayes; I
have to predict whether a given alert is actionable or not (YES/NO). Here is an
example of my dataset:

DayOfWeek(int),AlertType(String),Application(String),Router(String),Symptom(String),Action(String)
0,Network1,App1,Router1,Not reachable,YES
0,Network1,App2,Router5,Not reachable,NO

I am using Spark 1.6 and I see there is a StringIndexer class which is used in
the OneHotEncoder example given here
https://spark.apache.org/docs/latest/ml-features.html#onehotencoder, but I have
a huge number of unique words/features to map into continuous values; how do I
create such a huge map? I have my dataset in a CSV file; please guide me on how
to convert all the categorical features in the CSV file and use them in a Naive
Bayes model.
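
A sketch of the pipeline approach, using the column names from the sample above
and assuming the CSV has already been loaded into a DataFrame csvDf with
DayOfWeek as a numeric column: you do not build the category map by hand,
StringIndexer learns it from the data, and OneHotEncoder turns the indices into
sparse vectors:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.OneHotEncoder;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.DataFrame;

String[] catCols = {"AlertType", "Application", "Router", "Symptom"};
List<PipelineStage> stages = new ArrayList<>();
for (String c : catCols) {
    stages.add(new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"));
    stages.add(new OneHotEncoder().setInputCol(c + "_idx").setOutputCol(c + "_vec"));
}
// the label column gets its own StringIndexer (YES/NO -> 0.0/1.0)
stages.add(new StringIndexer().setInputCol("Action").setOutputCol("label"));
stages.add(new VectorAssembler()
        .setInputCols(new String[]{"DayOfWeek", "AlertType_vec", "Application_vec",
                                   "Router_vec", "Symptom_vec"})
        .setOutputCol("features"));

Pipeline pipeline = new Pipeline().setStages(stages.toArray(new PipelineStage[0]));
DataFrame featurized = pipeline.fit(csvDf).transform(csvDf);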






Can we use localIterator when we need to process data in one partition?

2016-01-14 Thread unk1102
Hi, I have a special requirement where I need to process data in one partition
at the end, after doing a lot of filtering, updating, etc. in a DataFrame.
Currently, to process the data in one partition, I am using coalesce(1), which
is killing me: it is painfully slow, my jobs hang for hours, even 5-6 hours, and
I don't know how to solve this. I came across toLocalIterator; will it be
helpful in my case? Please share an example if it is useful, or please share
ideas on how to solve this problem of processing data in one partition only.
Please guide.

JavaRDD<Row> maskedRDD =
    sourceRdd.coalesce(1, true).mapPartitionsWithIndex(
        new Function2<Integer, Iterator<Row>, Iterator<Row>>() {
            @Override
            public Iterator<Row> call(Integer ind, Iterator<Row> rowIterator)
                    throws Exception {
                List<Row> rowList = new ArrayList<>();
                while (rowIterator.hasNext()) {
                    Row row = rowIterator.next();
                    List<Object> rowAsList =
                        updateRowsMethod(JavaConversions.seqAsJavaList(row.toSeq()));
                    Row updatedRow = RowFactory.create(rowAsList.toArray());
                    rowList.add(updatedRow);
                }
                return rowList.iterator();
            }
        }, false);
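
A sketch of the toLocalIterator alternative, assuming the final pass really must
be sequential: instead of coalesce(1), pull the already filtered/updated rows to
the driver one partition at a time and do the single-threaded work there
(updateRowsMethod is the existing helper from the snippet above):

import java.util.Iterator;
import org.apache.spark.sql.Row;

Iterator<Row> it = sourceRdd.toLocalIterator();
while (it.hasNext()) {
    Row row = it.next();
    // sequential, driver-side processing of each row
    updateRowsMethod(scala.collection.JavaConversions.seqAsJavaList(row.toSeq()));
}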






Re: How to optimize and make this code faster using coalesce(1) and mapPartitionsWithIndex

2016-01-13 Thread unk1102
Hi, thanks for the reply. Actually I can't share details, as it is classified
and pretty complex to understand; it is not a general problem I am trying to
solve, it is related to dynamic SQL order execution against a database. I need
to use Spark, as my other jobs, which don't use coalesce, also use Spark. My
source data is Hive ORC table partitions, and with Spark it is easy to load ORC
files into a DataFrame. Initially I have 24 ORC files/splits and hence 24
partitions, but when I do sourceFrame.toJavaRDD().coalesce(1, true) it gets
stuck, hangs for hours and does nothing. I am sure it is not even hitting the
2 GB limit, as the dataset size is small; I don't understand why it just hangs
there. I have seen the same code run fine over the weekend, when the dataset is
smaller than its regular size.






How to optimize and make this code faster using coalesce(1) and mapPartitionsWithIndex

2016-01-12 Thread unk1102
Hi, I have the following code, which I run as part of a thread that becomes a
child job of my main Spark job. It takes hours to run for large data (around
1-2 GB) because of coalesce(1); if the data is in MB/KB then it finishes faster,
and with larger dataset sizes it sometimes does not complete at all. Please
guide me on what I am doing wrong. Thanks in advance.

JavaRDD<Row> maskedRDD =
    sourceRdd.coalesce(1, true).mapPartitionsWithIndex(
        new Function2<Integer, Iterator<Row>, Iterator<Row>>() {
            @Override
            public Iterator<Row> call(Integer ind, Iterator<Row> rowIterator)
                    throws Exception {
                List<Row> rowList = new ArrayList<>();
                while (rowIterator.hasNext()) {
                    Row row = rowIterator.next();
                    List<Object> rowAsList =
                        updateRowsMethod(JavaConversions.seqAsJavaList(row.toSeq()));
                    Row updatedRow = RowFactory.create(rowAsList.toArray());
                    rowList.add(updatedRow);
                }
                return rowList.iterator();
            }
        }, true);
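
One change that may be worth trying, sketched below: since the partition index
is not actually used above, the per-row update can run with the original
parallelism (all 24 input partitions), and the data is only funneled into a
single partition at the very end, if a single output is truly required:

JavaRDD<Row> updated = sourceRdd.map(row ->
        RowFactory.create(
            updateRowsMethod(JavaConversions.seqAsJavaList(row.toSeq())).toArray()));
// only the final step is single-partition; the expensive work above stays parallel
JavaRDD<Row> singlePartition = updated.coalesce(1);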






Do we need to enabled Tungsten sort in Spark 1.6?

2016-01-08 Thread unk1102
Hi, I was using Spark 1.5 with the Tungsten sort, and now that I am using Spark
1.6 I don't see any difference; I was expecting Spark 1.6 to be faster. Anyway,
do we need to enable the Tungsten and unsafe options, or are they enabled by
default? I see in the documentation that the default shuffle manager is sort; I
thought it was Tungsten, no? Please guide.






What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread unk1102
Hi, as part of the Spark 1.6 release, what should be the ideal value or unit for
spark.memory.offHeap.size? I have set it to 5000; I assume that means 5 GB, is
that correct? Please guide.
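
For reference (as of the Spark 1.6 configuration docs), spark.memory.offHeap.size
is an absolute amount in bytes and only takes effect when
spark.memory.offHeap.enabled is true, so 5 GB would look roughly like this
sketch:

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        .set("spark.memory.offHeap.enabled", "true")
        .set("spark.memory.offHeap.size", String.valueOf(5L * 1024 * 1024 * 1024));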






Why is this job running since one hour?

2016-01-06 Thread unk1102
Hi, I have one main Spark job which spawns multiple child Spark jobs. One of the
child Spark jobs has been running for an hour and it keeps hanging there; I have
taken a snapshot, please see the attachment.






Spark on Apache Ignite?

2016-01-05 Thread unk1102
Hi, has anybody tried and had success with Spark on Apache Ignite? It seems
promising: https://ignite.apache.org/






coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread unk1102
Hi, I am trying to save many partitions of a DataFrame into one CSV file, and it
takes forever for large datasets of around 5-6 GB.

sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("codec", "gzip").save("/path/hadoop");

For small data the above code works well, but for large data it hangs forever
and does not move on, because a single partition has to shuffle GBs of data.
Please help me.
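
One workaround sketch: let Spark write with its normal parallelism and merge the
part files afterwards with Hadoop's FileUtil.copyMerge (available in Hadoop
2.x). Paths are placeholders, and this variant writes uncompressed CSV, since
compressed part files are awkward to merge this way:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

sourceFrame.write().format("com.databricks.spark.csv").save("/tmp/csv-parts");

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileUtil.copyMerge(fs, new Path("/tmp/csv-parts"),
                   fs, new Path("/path/hadoop/output.csv"),
                   true, conf, null);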






How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread unk1102
Hi, I have a Spark job which hangs for around 7 hours or more, until the job is
killed by Autosys because of a timeout. The data is not huge; I am sure it gets
stuck because of GC, but I can't find the source code which causes the GC. I am
reusing almost all variables and trying to minimize creating local objects,
though I can't avoid creating many String objects in order to update DataFrame
values. When I look at the live thread debug in the executor where the job is
running, I see the attached running/waiting threads. Please guide me in finding
which waiting thread is the culprit preventing my job from finishing. My code
groups the DataFrame by around 8 fields and also uses coalesce(1) twice, so it
shuffles huge amounts of data, in terms of GBs per executor, as I can see in the
UI.

Here is the heap space error, which I don't understand how to resolve in my
code.


 







Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread unk1102
Sorry, please see the attached waiting thread log.


 

 






Spark DataFrame callUdf does not compile?

2015-12-28 Thread unk1102

 

Hi, I am trying to invoke a Hive UDF using
dataframe.select(callUdf("percentile_approx", col("C1"), lit(0.25))), but it
does not compile; however, the same call works in the Spark Scala console and I
don't understand why. I am using the Spark 1.5.2 Maven artifacts in my Java
code. I have also explicitly added the Maven dependency hive-exec-1.2.1.spark.jar,
where percentile_approx is located, but the code still does not compile; please
check the attached code image. Please guide. Thanks in advance.






How to call mapPartitions on DataFrame?

2015-12-23 Thread unk1102
Hi, I have the following code where I use mapPartitions on an RDD, but then I
need to convert the result back into a DataFrame. Why do I need to convert the
DataFrame into an RDD and back into a DataFrame just to call mapPartitions; why
can't I call it directly on the DataFrame?

sourceFrame.toJavaRDD().mapPartitions(new FlatMapFunction<Iterator<Row>, Row>() {

    @Override
    public Iterable<Row> call(Iterator<Row> rowIterator) throws Exception {
        List rowAsList = new ArrayList<>();
        while (rowIterator.hasNext()) {
            Row row = rowIterator.next();
            rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
            Row updatedRow = RowFactory.create(rowAsList.toArray());
            rowAsList.add(updatedRow);
        }
        return rowAsList;
    }
});

When I look at the method signature it is
mapPartitions(scala.Function1<Iterator<Row>, Iterator<R>> f, ClassTag<R> evidence$5).

How do I map the above code onto dataframe.mapPartitions? Please guide, I am new
to Spark.
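
For what it is worth, a sketch of the usual Java pattern in Spark 1.x, where the
DataFrame API has no Java-friendly mapPartitions, so the round trip is
JavaRDD<Row> -> transform -> createDataFrame with the original schema (iterate
is the helper from the snippet above; hiveContext, or any SQLContext, is
assumed to exist):

JavaRDD<Row> transformed = sourceFrame.toJavaRDD().mapPartitions(rowIterator -> {
    List<Row> out = new ArrayList<>();
    while (rowIterator.hasNext()) {
        Row row = rowIterator.next();
        List rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
        out.add(RowFactory.create(rowAsList.toArray()));
    }
    return out;   // Spark 1.x FlatMapFunction returns an Iterable
});
DataFrame back = hiveContext.createDataFrame(transformed, sourceFrame.schema());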






How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread unk1102
Hi, I have a Spark job which reads Hive ORC data, processes it, and generates a
CSV file at the end. These ORC files are Hive partitions and I have around 2000
partitions to process every day. These Hive partitions total around 800 GB in
HDFS. I have the following method, which I call from threads spawned by the
Spark driver, so 2000 such calls get processed, and they run painfully slowly,
around 12 hours, with huge data shuffling: each executor shuffles around 50 GB
of data. I am using 40 executors with 4 cores and 30 GB memory each. I am using
Hadoop 2.6 and the Spark 1.5.2 release.

public void callThisFromThread() {
    DataFrame sourceFrame =
        hiveContext.read().format("orc").load("/path/in/hdfs");
    DataFrame filterFrame1 = sourceFrame.filter(col("col1").contains("xyz"));
    DataFrame frameToProcess = sourceFrame.except(filterFrame1);
    JavaRDD<Row> updatedRDD = frameToProcess.toJavaRDD().mapPartitions() {
        .
    }
    DataFrame updatedFrame =
        hiveContext.createDataFrame(updatedRDD, sourceFrame.schema());
    DataFrame selectFrame = updatedFrame.select("col1", "col2...", "col8");
    DataFrame groupFrame =
        selectFrame.groupBy("col1", "col2", "col8").agg(".."); // 8-column group by
    groupFrame.coalesce(1).save(); // save as CSV, only one file, so coalesce(1)
}

Please guide me on how I can optimize the above code. I can't avoid the group
by, which I know is evil; I have to group on the 8 fields mentioned above.






No support to save DataFrame in existing database table using DataFrameWriter.jdbc()

2015-12-06 Thread unk1102
Hi, I would like to store/save a DataFrame into a database table which is
already created, and always insert into it without creating the table every
time. Unfortunately the Spark API seems to force me to create the table every
time. Looking at the Spark source code, the following calls use the same method
underneath, and if you read it carefully it always expects a new table to be
created. Any idea how I can save a DataFrame into an existing table?

dataframe.write().jdbc(); OR dataframe.insertInto()

def jdbc(url: String, table: String, connectionProperties: Properties): Unit
= {
val props = new Properties()
extraOptions.foreach { case (key, value) =>
  props.put(key, value)
}
// connectionProperties should override settings in extraOptions
props.putAll(connectionProperties)
val conn = JdbcUtils.createConnection(url, props)

try {
  var tableExists = JdbcUtils.tableExists(conn, url, table)

  if (mode == SaveMode.Ignore && tableExists) {
return
  }

  if (mode == SaveMode.ErrorIfExists && tableExists) {
sys.error(s"Table $table already exists.")
  }

  if (mode == SaveMode.Overwrite && tableExists) {
JdbcUtils.dropTable(conn, table)
tableExists = false
  }

  // Create the table if the table didn't exist.
  if (!tableExists) {
val schema = JdbcUtils.schemaString(df, url)
val sql = s"CREATE TABLE $table ($schema)"
conn.createStatement.executeUpdate(sql)
  }
} finally {
  conn.close()
}

JdbcUtils.saveTable(df, url, table, props)
  }
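
Reading the quoted method, when the save mode is Append and the table already
exists, none of the create/drop branches fire and it goes straight to
JdbcUtils.saveTable, i.e. the rows are inserted into the existing table. A small
sketch of that call (URL, credentials and table name are placeholders):

import java.util.Properties;
import org.apache.spark.sql.SaveMode;

Properties props = new Properties();
props.setProperty("user", "dbuser");
props.setProperty("password", "secret");
dataframe.write()
        .mode(SaveMode.Append)
        .jdbc("jdbc:postgresql://dbhost:5432/mydb", "existing_table", props);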







Why does a Spark job get stuck waiting for only the last tasks to finish?

2015-12-03 Thread unk1102
Hi, I have a Spark driver where I keep a queue of 12 Spark jobs to execute in
parallel. Now I see a job is almost completed and only one task is pending, and
because of that last task the job keeps waiting, as I can see in the UI. Please
see the attached snaps. Please help me understand how to stop Spark jobs from
waiting on their last tasks; the job does not move into the SUCCEEDED state, it
stays running, and it is choking other jobs from running. Please guide.

 

 






How to increase active job count to make spark job faster?

2015-10-27 Thread unk1102
Hi, I have a long-running Spark job which processes Hadoop ORC files and creates
Hive partitions. Even though I have created an ExecutorService thread pool and
use a pool of 15 threads, I see the active job count as always 1, which makes
the job slow. How do I increase the active job count in the UI? I remember it
used to work earlier; I could see as many active jobs as the thread pool count,
but now only 1 job is active. Please guide. Thanks.
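
One thing that may be worth checking (an assumption, not a diagnosis): jobs
submitted from different threads queue up FIFO by default, and the FAIR
scheduler is the documented way to let them make progress side by side; the pool
name below is illustrative:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("orc-loader")
        .set("spark.scheduler.mode", "FAIR");
JavaSparkContext jsc = new JavaSparkContext(conf);

// inside each worker thread, before triggering its actions:
jsc.setLocalProperty("spark.scheduler.pool", "etl-pool");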






Spark 1.5.1 hadoop 2.4 does not clear hive staging files after job finishes

2015-10-26 Thread unk1102
Hi, I have a Spark job which creates Hive table partitions. I have switched to
Spark 1.5.1, and Spark 1.5.1 creates many Hive staging files and does not delete
them after the job finishes. Is it a bug, or do I need to disable something to
prevent the Hive staging files from being created, or at least have them
deleted? The Hive staging files look like the following:

.hive-staging_hive_blabla






Spark can't create ORC files properly using 1.5.1 hadoop 2.6

2015-10-23 Thread unk1102
Hi, I am having a weird issue. I have a Spark job with a bunch of
hiveContext.sql() calls which creates ORC files as part of Hive tables with
partitions, and it runs fine on 1.4.1 and Hadoop 2.4.

Now I have tried to move to Spark 1.5.1/Hadoop 2.6, and the Spark job does not
work as expected: it does not create the ORC files. But if I use Spark
1.5.1/Hadoop 2.4 it works fine. I don't understand the reason; please guide.






How to calculate percentile of a column of DataFrame?

2015-10-09 Thread unk1102
Hi, how do I calculate the percentile of a column in a DataFrame? I can't find
any percentile_approx function among the Spark aggregation functions. For
example, in Hive we have percentile_approx and we can use it in the following
way:

hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable");

I can see the ntile function, but I am not sure how it would give the same
results as the query above; please guide.
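
A sketch of one way to get the same number through the DataFrame API, assuming
the DataFrame df comes from a HiveContext so Hive UDAFs such as
percentile_approx resolve; "mycol" is from the question:

DataFrame p25 = df.selectExpr("percentile_approx(mycol, 0.25) as p25");
p25.show();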






How to tune unavoidable group by query?

2015-10-09 Thread unk1102
Hi, I have the following group by query, which I have tried both via the
DataFrame API and via hiveContext.sql(), but both shuffle huge amounts of data
and are slow. I have around 8 fields passed in as the group by fields:

sourceFrame.select("blabla").groupBy("col1","col2","col3",..."col8").agg("bla bla");
OR
hiveContext.sql("insert into table partitions bla bla group by
"col1","col2","col3",..."col8"");

I have tried almost all tuning parameters, like Tungsten, lz4 shuffle, and a
shuffle storage fraction of around 0.6. I am using Spark 1.4.0. Please guide;
thanks in advance.






How to increase Spark partitions for the DataFrame?

2015-10-08 Thread unk1102
Hi, I have the following code, where I read ORC files from HDFS; it loads a
directory which contains 12 ORC files, so it creates 12 partitions by default.
This directory is huge, and when the ORC files get decompressed they grow to
around 10 GB. How do I increase the number of partitions for the code below so
that my Spark job runs faster and does not hang for a long time because it reads
10 GB through a shuffle in only 12 partitions? Please guide.

DataFrame df =
    hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
df.select().groupBy(..)
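
A sketch of bumping the parallelism right after the load (the 100 below is an
illustrative number, not a recommendation): repartition reshuffles the 12 input
splits into more, smaller partitions before the group by runs:

DataFrame df = hiveContext.read().format("orc")
        .load("/hdfs/path/to/orc/files/")
        .repartition(100);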







Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-08 Thread unk1102
Hi, as recommended, I am caching my Spark job's DataFrame with
dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER), but what I see in the
Spark job UI is that this persist stage runs for a very long time, showing 10 GB
of shuffle read and 5 GB of shuffle write; it takes too long to finish, and
because of that my Spark job sometimes times out or throws OOM, and then
executors get killed by YARN. I am using Spark 1.4.1 with all sorts of
optimizations like Tungsten and Kryo; I have set storage.memoryFraction to 0.2
and shuffle.memoryFraction to 0.2 as well. My data is huge, around 1 TB, and I
am using the default 200 partitions for spark.sql.shuffle.partitions. Please
help me, I am clueless; please guide.






How to avoid Spark shuffle spill memory?

2015-10-06 Thread unk1102
Hi, I have a Spark job which runs for around 4 hours; it shares a SparkContext
and runs many child jobs. When I look at each job in the UI, I see a shuffle
spill of around 30 to 40 GB, and because of that executors often get lost for
using physical memory beyond the limits. How do I avoid shuffle spill? I have
tried almost all optimisations and nothing is helping; I don't cache anything. I
am using Spark 1.4.1, with Tungsten, codegen, etc. I have set
spark.shuffle.memoryFraction to 0.2 and spark.storage.memoryFraction to 0.2. I
tried to increase the shuffle memory to 0.6, but then it halts in GC pauses,
causing my executors to time out and then eventually get lost.

Please guide. Thanks in advance.






ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread unk1102
Hi, I have a Spark job which creates ORC files in partitions using the following
code:

dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").save("baseTable");

The above code successfully creates ORC files which are readable in a Spark
DataFrame.

But when I try to load the ORC files generated by the above code into a Hive ORC
table or a Hive external table, nothing gets printed; it looks like the table is
empty. What is wrong here? I can see the ORC files in HDFS, but the Hive table
does not read them. Please guide.
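
One common cause worth ruling out (an assumption, not a confirmed fix): Hive
only sees partitions that are registered in the metastore, so an external table
pointed at the Spark output shows nothing until the entity=/date= directories
are added. A sketch, with illustrative column names and path:

hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS base_table (col1 STRING, col2 INT) "
        + "PARTITIONED BY (entity STRING, date STRING) STORED AS ORC LOCATION '/path/to/baseTable'");
// register the partition directories that Spark wrote
hiveContext.sql("MSCK REPAIR TABLE base_table");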






How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread unk1102
Hi, I have a couple of Spark jobs which use a group by query fired from
hiveContext.sql(). I know group by is evil, but in my use case I can't avoid it;
I have around 7-8 fields on which I need to group by. I am also using
df1.except(df2), which also seems to be a heavy operation and does lots of
shuffling; please see my UI snap.

I have tried almost every optimisation, including Spark 1.5, but nothing seems
to work, and my job fails or hangs because an executor reaches the physical
memory limit and YARN kills it. I have around 1 TB of data to process and it is
skewed. Please guide.






Can we using Spark Streaming to stream data from Hive table partitions?

2015-10-03 Thread unk1102
Hi, I have a couple of Spark jobs which read Hive table partition data and
process it independently in different threads in one driver. The data to process
is huge, in terms of TBs, and my jobs are not scaling and run slowly. So I am
thinking of using Spark Streaming as and when data is added into the Hive
partitions, so that I only need to process the newly loaded partitions.

Can we read Hive table partition data directly using Spark Streaming? Please
guide. Also, please share best practices for processing TBs of data generated
every day. Thanks in advance.






How to use registered Hive UDF in Spark DataFrame?

2015-10-02 Thread unk1102
Hi, I have registered my Hive UDF using the following code:

hiveContext.udf().register("MyUDF", new UDF1<String, String>() {
    public String call(String o) throws Exception {
        // bla bla
        return o;
    }
}, DataTypes.StringType);

Now I want to use the above MyUDF in a DataFrame. How do we use it? I know how
to use it in SQL and it works fine:

hiveContext.sql("select MyUDF('test') from myTable");

My hiveContext.sql() query involves a group by on multiple columns, so for
scaling purposes I am trying to convert this query into DataFrame APIs:

dataframe.select("col1","col2","coln").groupBy("col1","col2","coln").count();

Can we do the following: dataframe.select(MyUDF("col1"))? Please guide.
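
A sketch, assuming the registration above succeeded: once the UDF is registered
by name it can be referenced from the DataFrame API through a SQL expression,
which avoids calling the Java object directly the way
dataframe.select(MyUDF("col1")) would try to; column names are from the snippet
above:

DataFrame withUdf = dataframe.selectExpr("MyUDF(col1) as myudf_col1", "col2", "coln");
withUdf.groupBy("myudf_col1", "col2", "coln").count().show();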






How to save DataFrame as a Table in Hbase?

2015-10-01 Thread unk1102
Hi, has anybody tried to save a DataFrame in HBase? I have processed data in a
DataFrame which I need to store in HBase so that my web UI can access it from
HBase. Please guide. Thanks in advance.






Hive ORC Malformed while loading into spark data frame

2015-09-29 Thread unk1102
Hi, I have a Spark job which creates Hive tables in ORC format with partitions.
It works well; I can read the data back into the Hive table using the Hive
console. But if I try to further process the ORC files generated by the Spark
job by loading them into a DataFrame, then I get the following exception:

Caused by: java.io.IOException: Malformed ORC file
hdfs://localhost:9000/user/hive/warehouse/partorc/part_tiny.txt. Invalid
postscript.

DataFrame df = hiveContext.read().format("orc").load("/to/path");

Please guide.






Best practices to call small spark jobs as part of REST api

2015-09-29 Thread unk1102
Hi, I would like to know best practices for calling Spark jobs from a REST API.
My Spark jobs return results as JSON, and that JSON can be used by a UI
application.

Should we even have a direct HDFS/Spark backend layer in the UI for on-demand
queries? Please guide. Thanks much.






Best practices for scheduling Spark jobs on "shared" YARN cluster using Autosys

2015-09-25 Thread unk1102
Hi, I have 5 Spark jobs which need to run in parallel to speed up the process;
together they take around 6-8 hours. I have 93 container nodes with 8 cores each
and a memory capacity of around 2.8 TB. I run each job with around 30 executors
with 2 cores and 20 GB each, and each of my jobs processes around 1 TB of data.
Now, since my cluster is shared, many other teams spawn their jobs alongside
mine, so YARN kills my executors and does not add them back, because the cluster
is running at max capacity. I just want to know best practices for such a
resource-crunched environment. These jobs run every day, so I am looking for
innovative approaches to solve this problem. Before anyone says we can have our
own dedicated cluster: I am looking for alternative solutions. Please guide.
Thanks in advance.






How to use a Hive UDF in a Spark DataFrame?

2015-09-13 Thread unk1102
Hi, I am using a UDF inside a hiveContext.sql("") query which uses a group by,
and that forces a huge shuffle read of around 30 GB. I am thinking of converting
the above query into the DataFrame API so that I can avoid using the group by.

How do we use a Hive UDF in a Spark DataFrame? Please guide. Thanks much.






Best way to merge final output part files created by Spark job

2015-09-13 Thread unk1102
Hi, I have a Spark job which creates around 500 part files inside each directory
I process, and I have thousands of such directories, so I need to merge these
small 500 part files. I am using spark.sql.shuffle.partitions = 500, and my
final small files are ORC files. Is there a way to merge ORC files in Spark? If
not, please suggest the best way to merge files created by a Spark job in HDFS.
Please guide. Thanks much.
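
One sketch of avoiding the merge altogether (an assumption about the write path,
not the actual job code): reduce the number of partitions just before the write
so each directory ends up with a handful of larger ORC files; the 10 below is
illustrative and should be picked by target file size:

dataFrame.coalesce(10)
        .write()
        .format("orc")
        .save("/path/to/output");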






Why my Spark job is slow and it throws OOM which leads YARN killing executors?

2015-09-12 Thread unk1102
Hi, I have the following Spark driver program/job which reads ORC files (i.e.
Hive partitions as HDFS directories), processes them in DataFrames, and uses
them as tables in hiveContext.sql(). The job runs fine and gives correct
results, but it hits the physical memory limit after an hour or so, YARN kills
executors, and things get slower and slower. Please see the following code and
help me identify the problem. I create 20 threads from the driver program and
spawn them; the thread logic contains lambda functions which get executed on the
executors. Please guide, I am new to Spark. Thanks much.

  public class DataSpark {

public static final Map dMap = new LinkedHashMap<>();

public static final String[] colNameArr = new String[]
{"_col0","col2","bla bla 45 columns"};

public static void main(String[] args) throws Exception {


        Set<DataWorker> workers = new HashSet<>();

SparkConf sparkConf = new SparkConf().setAppName("SparkTest");
setSparkConfProperties(sparkConf);
SparkContext sc = new SparkContext(sparkConf);
final FileSystem fs = FileSystem.get(sc.hadoopConfiguration());
HiveContext hiveContext = createHiveContext(sc);

declareHiveUDFs(hiveContext);

DateTimeFormatter df = DateTimeFormat.forPattern("MMdd");
String yestday = "20150912";
hiveContext.sql(" use xyz ");
createTables(hiveContext);
DataFrame partitionFrame = hiveContext.sql(" show partitions
data partition(date=\""+ yestday + "\")");

//add csv files to distributed cache
Row[] rowArr = partitionFrame.collect();
for(Row row : rowArr) {
String[] splitArr = row.getString(0).split("/");
String entity = splitArr[0].split("=")[1];
int date =  Integer.parseInt(splitArr[1].split("=")[1]);

String sourcePath =
"/user/xyz/Hive/xyz.db/data/entity="+entity+"/date="+date;
Path spath = new Path(sourcePath);
if(fs.getContentSummary(spath).getFileCount() > 0) {
DataWorker worker = new DataWorker(hiveContext,entity,
date);
workers.add(worker);
}
}

ExecutorService executorService =
Executors.newFixedThreadPool(20);
executorService.invokeAll(workers);
executorService.shutdown();


sc.stop();
}

private static void setSparkConfProperties(SparkConf sparkConf) {
sparkConf.set("spark.rdd.compress","true");

sparkConf.set("spark.shuffle.consolidateFiles","true");
   
sparkConf.set("spark.executor.logs.rolling.maxRetainedFiles","90");
sparkConf.set("spark.executor.logs.rolling.strategy","time");
sparkConf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.shuffle.manager","tungsten-sort");

   sparkConf.set("spark.shuffle.memoryFraction","0.5");
   sparkConf.set("spark.storage.memoryFraction","0.2");

}

private static HiveContext createHiveContext(SparkContext sc) {
HiveContext hiveContext = new HiveContext(sc);
hiveContext.setConf("spark.sql.codgen","true");
hiveContext.setConf("spark.sql.unsafe.enabled","true");

hiveContext.setConf("spark.sql.shuffle.partitions","15");//need
to set this to avoid large no of small files by default spark creates 200
output part files
hiveContext.setConf("spark.sql.orc.filterPushdown","true");
return hiveContext;
}

private static void declareHiveUDFs(HiveContext hiveContext) {
hiveContext.sql("CREATE TEMPORARY FUNCTION UDF1 AS
'com.blab.blab.UDF1'");
hiveContext.sql("CREATE TEMPORARY FUNCTION UDF2 AS
'com.blab.blab.UDF2'");
}

private static void createTables(HiveContext hiveContext) {

hiveContext.sql(" create table if not exists abc blab bla );

 hiveContext.sql(" create table if not exists def blab bla );

}



private static void createBaseTableAfterProcessing(HiveContext
hiveContext,String entity,int date) {
String sourcePath =
"/user/xyz/Hive/xyz.db/data/entity="+entity+"/date="+date;

DataFrame sourceFrame =
hiveContext.read().format("orc").load(sourcePath);

//rename fields from _col* to actual column names
DataFrame renamedSourceFrame = sourceFrame.toDF(colNameArr);
//filter data from fields
        DataFrame dFrame =
            renamedSourceFrame.filter(renamedSourceFrame.col("col1").contains("ABC")
                .or(renamedSourceFrame.col("col1").contains("col3")))
                .orderBy("col2", "col3", "col4");

        DataFrame dRemovedFrame = renamedSourceFrame.except(dFrame);

        JavaRDD<Row> dRemovedRDD = dRemovedFrame.toJavaRDD();
        JavaRDD<Row> sourceRdd = 

How to create broadcast variable from Java String array?

2015-09-12 Thread unk1102
Hi, I have a Java String array which contains 45 strings which are basically
the schema:

String[] fieldNames = {"col1","col2",...};

Currently I am storing the above array of Strings in a static field on the
driver. My job is running slowly, so I am trying to refactor the code.

I am using the String array when creating a DataFrame:

DataFrame df = sourceFrame.toDF(fieldNames);

I want to do the above using a broadcast variable so that we don't ship the
huge string array to every executor. I believe we can do something like the
following to create the broadcast:

String[] brArray = sc.broadcast(fieldNames);
DataFrame df = sourceFrame.toDF(???); //how do I use the above broadcast? Can I
pass brArray as-is?

Please guide, thanks much.
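
For reference, a minimal sketch of the pattern I have in mind (jsc, sourceFrame
and the method name are placeholders; broadcast() wraps the array in a
Broadcast object and value() hands it back):

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.DataFrame;

public class BroadcastSchemaSketch {
    public static DataFrame renameColumns(JavaSparkContext jsc, DataFrame sourceFrame,
                                           String[] fieldNames) {
        // broadcast() returns a Broadcast<String[]> wrapper, not a plain String[]
        Broadcast<String[]> brFieldNames = jsc.broadcast(fieldNames);
        // value() gives back the original array wherever it is needed
        return sourceFrame.toDF(brFieldNames.value());
    }
}

Note that toDF() itself runs on the driver, so I understand the broadcast only
really pays off when the same array is also referenced inside functions that
get shipped to the executors.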



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-broadcast-variable-from-Java-String-array-tp24666.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to enable Tungsten in Spark 1.5 for Spark SQL?

2015-09-10 Thread unk1102
Hi, Spark 1.5 looks promising. How do we enable Project Tungsten for Spark SQL,
or is it enabled by default? Please guide.
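
From what I have read so far (please correct me if this is wrong), Tungsten
should be on by default in 1.5 and is controlled by the
spark.sql.tungsten.enabled flag, so I put together this small check:

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SQLContext;

public class TungstenCheck {
    public static void main(String[] args) {
        SparkContext sc = new SparkContext(
                new SparkConf().setAppName("tungsten-check").setMaster("local[2]"));
        SQLContext sqlContext = new SQLContext(sc);

        // explicit only for documentation; Spark 1.5 should already default this to true
        sqlContext.setConf("spark.sql.tungsten.enabled", "true");
        System.out.println("tungsten enabled: "
                + sqlContext.getConf("spark.sql.tungsten.enabled"));

        sc.stop();
    }
}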



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-enable-Tungsten-in-Spark-1-5-for-Spark-SQL-tp24642.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark rdd.mapPartitionsWithIndex() hits physical memory limit after huge data shuffle

2015-09-09 Thread unk1102
Hi, I have the following Spark code which involves huge data shuffling even
though I use mapPartitionsWithIndex() with the boolean flag set to false. I
have 2 TB of skewed data to process; I then convert the RDD into a DataFrame
and use it as a table in hiveContext.sql(). I am using 60 executors with 20 GB
memory and 4 cores each. I specify spark.yarn.executor.memoryOverhead as 8500,
which is high enough. I am using the default settings for
spark.shuffle.memoryFraction and spark.storage.memoryFraction; I also tried to
change them but nothing helped. I am using Spark 1.4.0. Please guide, I am new
to Spark; help me optimize the following code. Thanks in advance.

 JavaRDD<Row> indexedRdd = sourceRdd.cache().mapPartitionsWithIndex(
     new Function2<Integer, Iterator<Row>, Iterator<Row>>() {
         @Override
         public Iterator<Row> call(Integer ind, Iterator<Row> rowIterator)
                 throws Exception {
             List<Row> rowList = new ArrayList<>();
             while (rowIterator.hasNext()) {
                 Row row = rowIterator.next();
                 List<Object> rowAsList =
                         iterate(JavaConversions.seqAsJavaList(row.toSeq()));
                 Row updatedRow = RowFactory.create(rowAsList.toArray());
                 rowList.add(updatedRow);
             }
             return rowList.iterator();
         }
     }, false).union(remainingRdd);
 DataFrame baseFrame = hiveContext.createDataFrame(indexedRdd, MySchema.class);
 hiveContext.registerDataFrameAsTable(baseFrame, "baseTable");
 hiveContext.sql("insert into abc bla bla using baseTable group by bla bla");
 hiveContext.sql("insert into def bla bla using baseTable group by bla bla");



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-rdd-mapPartitionsWithIndex-hits-physical-memory-limit-after-huge-data-shuffle-tp24627.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



NPE while reading ORC file using Spark 1.4 API

2015-09-08 Thread unk1102
Hi, I read many ORC files in Spark and process them; those files are basically
Hive partitions. Most of the time the processing goes well, but for a few files
I get the following exception and I don't know why. These files work fine in
Hive using Hive queries. Please guide. Thanks in advance.

DataFrame df = hiveContext.read().format("orc").load("/path/in/hdfs");

java.lang.NullPointerException
at
org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:402)
at
org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:206)
at
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$8.apply(OrcRelation.scala:238)
at
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$8.apply(OrcRelation.scala:238)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at
org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:238)
at
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:290)
at
org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:288)
at
org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NPE-while-reading-ORC-file-using-Spark-1-4-API-tp24609.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-04 Thread unk1102
Hi, I have a Spark job which does some processing on ORC data and stores the
ORC data back using the DataFrameWriter save() API introduced in Spark 1.4.0. I
have the following piece of code which is using heavy shuffle memory. How do I
optimize the code below? Is there anything wrong with it? It works as expected;
it is only slow because of GC pauses, and it shuffles lots of data and so hits
memory issues. Please guide, I am new to Spark. Thanks in advance.

JavaRDD<Row> updatedDsqlRDD = orderedFrame.toJavaRDD().coalesce(1, false)
        .map(new Function<Row, Row>() {
            @Override
            public Row call(Row row) throws Exception {
                List<Object> rowAsList;
                Row row1 = null;
                if (row != null) {
                    rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
                    row1 = RowFactory.create(rowAsList.toArray());
                }
                return row1;
            }
        }).union(modifiedRDD);
DataFrame updatedDataFrame =
hiveContext.createDataFrame(updatedDsqlRDD,renamedSourceFrame.schema());
updatedDataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity",
"date").save("baseTable");



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-huge-data-shuffling-in-Spark-when-using-union-coalesce-1-false-on-DataFrame-tp24581.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

2015-09-02 Thread unk1102
Hi, I have a Spark DataFrame which I want to save as a Hive table with
partitions. I tried the following two statements but they don't work; I don't
see any ORC files in the HDFS directory, it is empty. I can see baseTable in
the Hive console, but obviously it is empty because there are no files inside
HDFS. The following two lines, saveAsTable() and insertInto(), do not work. The
registerDataFrameAsTable() method works, but it creates an in-memory table and
causes OOM in my use case, as I have thousands of Hive partitions to process.
Please guide, I am new to Spark. Thanks in advance.

dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").saveAsTable("baseTable");
 

dataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity","date").insertInto("baseTable");

//the following works but creates in memory table and seems to be reason for
OOM in my case

hiveContext.registerDataFrameAsTable(dataFrame, "baseTable");



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrame-saveAsTable-with-partitionBy-creates-no-ORC-file-in-HDFS-tp24562.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



What should be the optimal value for spark.sql.shuffle.partition?

2015-09-01 Thread unk1102
Hi, I am using Spark SQL, actually hiveContext.sql(), with group by queries,
and I am running into OOM issues. So I am thinking of increasing the value of
spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not
helping. Please correct me if I am wrong: these partitions will share the data
shuffle load, so the more partitions there are, the less data each has to hold.
Please guide, I am new to Spark. I am using Spark 1.4.0 and I have around 1 TB
of uncompressed data to process using hiveContext.sql() group by queries.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-should-be-the-optimal-value-for-spark-sql-shuffle-partition-tp24543.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark executor OOM issue on YARN

2015-08-31 Thread unk1102
Hi, I have a Spark job whose executors hit OOM issues after some time, and my
job hangs because of it, followed by a couple of IOException, Rpc client
disassociated, shuffle not found etc. errors.

I have tried almost everything and I don't know how to solve this OOM issue;
please guide, I am fed up now. Here is what I tried, but nothing worked:

-I tried 60 executors, each executor having 12 GB/2 cores
-I tried 30 executors, each executor having 20 GB/2 cores
-I tried 40 executors, each executor having 30 GB/6 cores (I also tried
7 and 8 cores)
-I tried to set spark.storage.memoryFraction to 0.2 in order to solve the OOM
issue; I also tried setting it to 0.0
-I tried to set spark.shuffle.memoryFraction to 0.4 since I need more
shuffle memory
-I tried to set spark.default.parallelism to 500, 1000 and 1500, but it did not
help avoid OOM; what is the ideal value for it?
-I also tried to set spark.sql.shuffle.partitions to 500, but it did not help,
it just creates 500 output part files. Please help me understand the difference
between spark.default.parallelism and spark.sql.shuffle.partitions.

My data is skewed but not that large. I don't understand why it is hitting OOM;
I don't cache anything, I just have four group by queries which I call using
hiveContext.sql(). I have around 1000 threads which I spawn from the driver,
and each thread executes these four queries.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-executor-OOM-issue-on-YARN-tp24522.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark YARN executors are not launching when using +UseG1GC

2015-08-23 Thread unk1102
Hi, I am hitting an issue of long GC pauses in my Spark job; because of it
YARN is killing executors one by one and the Spark job becomes slower and
slower. I came across the article below where they mention using G1GC. I
tried to use the same command, but something seems wrong.

https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html

./spark-submit --class com.xyz.MySpark --conf
"spark.executor.extraJavaOptions=-XX:MaxPermSize=512M -XX:+UseG1GC
-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy
-XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms25g -Xmx25g
-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThread=20"
--driver-java-options -XX:MaxPermSize=512m --driver-memory 3g --master
yarn-client --executor-memory 25G --executor-cores 8 --num-executors 12 
/home/myuser/myspark-1.0.jar

First it said you can't use Xms/Xmx for the executor, so I removed them, but
the executors never get launched when I use the above command. Please guide.
Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-YARN-executors-are-not-launching-when-using-UseG1GC-tp24407.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark executor lost because of GC overhead limit exceeded even though using 20 executors using 25GB each

2015-08-18 Thread unk1102
Hi, this GC overhead limit error is making me crazy. I have 20 executors using
25 GB each; I don't understand at all how it can throw GC overhead errors, and
I don't have that big a dataset either. Once this GC error occurs in an
executor it gets lost, and slowly other executors get lost because of
IOException, Rpc client disassociated, shuffle not found etc. Please help me
solve this, I am going mad as I am new to Spark. Thanks in advance.

WARN scheduler.TaskSetManager: Lost task 7.0 in stage 363.0 (TID 3373,
myhost.com): java.lang.OutOfMemoryError: GC overhead limit exceeded
at
org.apache.spark.sql.types.UTF8String.toString(UTF8String.scala:150)
at
org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala:120)
at
org.apache.spark.sql.columnar.STRING$.actualSize(ColumnType.scala:312)
at
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.gatherCompressibilityStats(compressionSchemes.scala:224)
at
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.gatherCompressibilityStats(CompressibleColumnBuilder.scala:72)
at
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:80)
at
org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
at
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
at
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
at
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
at
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-executor-lost-because-of-GC-overhead-limit-exceeded-even-though-using-20-executors-using-25GB-h-tp24322.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache Spark - Parallel Processing of messages from Kafka - Java

2015-08-17 Thread unk1102
val numStreams = 4
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }

In Java, inside a for loop you create four streams using
KafkaUtils.createStream() so that each receiver runs on its own
thread.
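
A minimal Java sketch of that pattern (the ZooKeeper quorum, group id and
topic name below are placeholders); each createStream() call gets its own
receiver, and the per-receiver streams are unioned back into one DStream:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class MultiReceiverKafka {
    public static void main(String[] args) throws Exception {
        JavaStreamingContext jssc =
                new JavaStreamingContext("local[8]", "multi-receiver-kafka", new Duration(2000));

        Map<String, Integer> topicMap = new HashMap<>();
        topicMap.put("mytopic", 1); // threads per receiver for this topic

        int numStreams = 4;
        List<JavaPairDStream<String, String>> streams = new ArrayList<>(numStreams);
        for (int i = 0; i < numStreams; i++) {
            // every call creates an independent receiver, so they run in parallel on the executors
            streams.add(KafkaUtils.createStream(jssc, "zkhost:2181", "mygroup", topicMap));
        }

        // union the per-receiver streams into one DStream before further processing
        JavaPairDStream<String, String> unified =
                jssc.union(streams.get(0), streams.subList(1, streams.size()));
        unified.print();

        jssc.start();
        jssc.awaitTermination();
    }
}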

for more information please visit
http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

Hope it helps!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Parallel-Processing-of-messages-from-Kafka-Java-tp24284p24297.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Calling hiveContext.sql(insert into table xyz...) in multiple threads?

2015-08-17 Thread unk1102
Hi, I have around 2000 Hive source partitions to process, inserting data into
the same table but different partitions. For example, I have the following
query:

hiveContext.sql("insert into table myTable
partition(mypartition=somepartition) bla bla")

If I call the above query in the Spark driver program it runs fine and creates
the corresponding partition in HDFS. This works, but it is very slow: it takes
4-5 hours to process all 2000 partitions. So I thought of using an
ExecutorService and calling the above query, along with a couple of similar
insert into queries, in Callable threads. Using threads is definitely faster,
but I don't see any partition created in HDFS. Is it a concurrency issue, since
every thread is trying to insert into the same table but a different partition?
I see the tasks running very fast and finishing, but I don't see any partition
in HDFS. Please guide, I am new to Spark and Hive.
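
For context, here is a stripped-down sketch of the threaded version I am
describing (table, column and partition names are placeholders; I am sharing
one HiveContext across the driver threads, as described above):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.spark.sql.hive.HiveContext;

public class ParallelPartitionInsert {
    public static void run(final HiveContext hc, List<String> partitions) throws Exception {
        List<Callable<Void>> tasks = new ArrayList<>();
        for (final String p : partitions) {
            tasks.add(new Callable<Void>() {
                @Override
                public Void call() {
                    // each call becomes its own Spark job writing one target partition
                    hc.sql("insert into table myTable partition (mypartition='" + p + "') "
                            + "select col1, col2 from sourceTable where srcpartition='" + p + "'");
                    return null;
                }
            });
        }
        ExecutorService pool = Executors.newFixedThreadPool(8); // cap the number of concurrent jobs
        pool.invokeAll(tasks);                                  // blocks until every insert finishes
        pool.shutdown();
    }
}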



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Calling-hiveContext-sql-insert-into-table-xyz-in-multiple-threads-tp24298.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark executor lost because of time out even after setting quite long time out value 1000 seconds

2015-08-16 Thread unk1102
Hi, I have written a Spark job which seems to work fine for almost an hour,
and after that executors start getting lost because of a timeout. I see the
following in the log:

15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no
recent heartbeats: 1051638 ms exceeds timeout 100 ms 

I don't see any errors, but I do see the above warning, and because of it the
executor gets removed by YARN; then I see Rpc client disassociated errors,
IOException connection refused, and FetchFailedException.

After an executor gets removed I see it being added again and starting to
work, and then some other executors fail again. My question is: is it normal
for executors to get lost? What happens to the tasks those lost executors were
working on? My Spark job keeps on running since it is long, around 4-5 hours. I
have a very good cluster with 1.2 TB of memory and a good number of CPU cores.
To solve the above timeout issue I tried to increase spark.akka.timeout to
1000 seconds, but no luck. I am using the following command to run my Spark
job. Please guide, I am new to Spark. I am using Spark 1.4.1. Thanks in
advance.

/spark-submit --class com.xyz.abc.MySparkJob  --conf
spark.executor.extraJavaOptions=-XX:MaxPermSize=512M --driver-java-options
-XX:MaxPermSize=512m --driver-memory 4g --master yarn-client
--executor-memory 25G --executor-cores 8 --num-executors 5 --jars
/path/to/spark-job.jar



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-executor-lost-because-of-time-out-even-after-setting-quite-long-time-out-value-1000-seconds-tp24289.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Example code to spawn multiple threads in driver program

2015-08-16 Thread unk1102
Hi, I have a Spark driver program which has one loop that iterates around
2000 times, and on each of those 2000 iterations it executes a job on YARN.
Since the loop does the work serially, I want to introduce parallelism. If I
create 2000 tasks/Runnables/Callables in my Spark driver program, will they get
executed in parallel on the YARN cluster? Please guide; it would be great if
you could share example code showing how to run multiple threads in the driver
program. I am new to Spark, thanks in advance.
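
To make the question concrete, here is a minimal sketch of what I mean by
running multiple threads in the driver (the paths and the count() action are
just placeholders):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.spark.api.java.JavaSparkContext;

public class ParallelDriverJobs {
    public static void main(String[] args) throws Exception {
        final JavaSparkContext jsc = new JavaSparkContext("yarn-client", "parallel-driver-jobs");

        List<String> inputPaths = new ArrayList<>(); // e.g. one HDFS path per loop iteration
        for (int i = 0; i < 2000; i++) {
            inputPaths.add("/data/input/part-" + i);
        }

        ExecutorService pool = Executors.newFixedThreadPool(10); // bounded pool of driver threads
        List<Future<Long>> futures = new ArrayList<>();
        for (final String path : inputPaths) {
            futures.add(pool.submit(new Callable<Long>() {
                @Override
                public Long call() {
                    // any action works here; each one is scheduled as a separate concurrent job
                    return jsc.textFile(path).count();
                }
            }));
        }

        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get(); // wait for each job and collect its result
        }
        pool.shutdown();
        System.out.println("total records: " + total);
        jsc.stop();
    }
}

My understanding is that the SparkContext is thread-safe for submitting jobs,
so the jobs should run concurrently as long as the cluster has free resources,
but please correct me if that is wrong.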



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Example-code-to-spawn-multiple-threads-in-driver-program-tp24290.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread unk1102
Hi, I have my own custom Hadoop InputFormat which I want to use in a DataFrame.
How do we do that? I know I can use sc.hadoopFile(..), but then how do I
convert it into a DataFrame?

JavaPairRDD<Void, MyRecordWritable> myFormatAsPairRdd =
jsc.hadoopFile("hdfs://tmp/data/myformat.xyz", MyInputFormat.class, Void.class, MyRecordWritable.class);
JavaRDD<MyRecordWritable> myformatRdd = myFormatAsPairRdd.values();
DataFrame myFormatAsDataframe = sqlContext.createDataFrame(myformatRdd, ??);

In the above code, what should I put in place of ??. I tried to put
MyRecordWritable.class, but it does not work as it is not a schema, it is a
RecordWritable. Please guide.
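
The only workaround I have been experimenting with (not sure it is the
intended way) is to map the writables to Rows and build an explicit
StructType; MyInputFormat/MyRecordWritable are my own classes and the getters
and field types below are assumptions:

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CustomInputFormatToDataFrame {
    public static DataFrame load(JavaSparkContext jsc, SQLContext sqlContext, String path) {
        JavaPairRDD<Void, MyRecordWritable> pairRdd =
                jsc.hadoopFile(path, MyInputFormat.class, Void.class, MyRecordWritable.class);

        // turn each writable into a generic Row (the accessors are assumptions about MyRecordWritable)
        JavaRDD<Row> rows = pairRdd.values().map(new Function<MyRecordWritable, Row>() {
            @Override
            public Row call(MyRecordWritable record) {
                return RowFactory.create(record.getField1(), record.getField2());
            }
        });

        // describe the same fields with an explicit schema instead of a bean class
        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("field1", DataTypes.StringType, true),
                DataTypes.createStructField("field2", DataTypes.IntegerType, true)));

        return sqlContext.createDataFrame(rows, schema);
    }
}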



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-custom-Hadoop-InputFormat-in-DataFrame-tp24198.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to create DataFrame from a binary file?

2015-08-08 Thread unk1102
Hi, how do we create a DataFrame from a binary file stored in HDFS? I was
thinking to use

JavaPairRDD<String, PortableDataStream> pairRdd =
javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
JavaRDD<PortableDataStream> javardd = pairRdd.values();

I can see that PortableDataStream has a method called toArray which can
convert it into a byte array. I was thinking, if I have a JavaRDD<byte[]>, can
I call the following and get a DataFrame?

DataFrame binDataFrame = sqlContext.createDataFrame(javaBinRdd, Byte.class);

Please guide, I am new to Spark. I have my own custom format which is a binary
format, and I was thinking that if I can convert my custom format into a
DataFrame using binary operations then I don't need to create my own custom
Hadoop format. Am I on the right track? Will reading binary data into a
DataFrame scale?
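
Here is a rough sketch of what I am experimenting with: one Row per file, with
a BinaryType column holding the raw bytes (the path and column names are
placeholders):

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.input.PortableDataStream;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import scala.Tuple2;

public class BinaryFilesToDataFrame {
    public static DataFrame load(JavaSparkContext jsc, SQLContext sqlContext, String dir) {
        // one record per file: (path, stream over the whole file content)
        JavaRDD<Row> rows = jsc.binaryFiles(dir).map(
                new Function<Tuple2<String, PortableDataStream>, Row>() {
                    @Override
                    public Row call(Tuple2<String, PortableDataStream> file) {
                        // toArray() pulls the whole file into memory, one file at a time
                        return RowFactory.create(file._1(), file._2().toArray());
                    }
                });

        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("path", DataTypes.StringType, false),
                DataTypes.createStructField("content", DataTypes.BinaryType, false)));

        return sqlContext.createDataFrame(rows, schema);
    }
}

Since toArray() pulls each whole file into memory, I assume this only scales
while the individual files stay reasonably small.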



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-DataFrame-from-a-binary-file-tp24179.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Best practices to call hiveContext in DataFrame.foreach in executor program or how to have a for loop in driver program

2015-08-05 Thread unk1102
Hi, I have the following code which fires hiveContext.sql() most of the time.
My task is that I want to create a few tables and insert values into them after
processing, for every Hive table partition. So I first fire show partitions,
and using its output in a for loop I call a few methods which create the table
if it does not exist and do the insert into using hiveContext.sql(). Now we
cannot execute hiveContext on an executor, so I have to execute this for loop
in the driver program, and it runs serially one by one. When I submit this
Spark job on the YARN cluster, almost all the time my executors get lost
because of a shuffle not found exception. This is happening because YARN is
killing my executors due to memory overload. I don't understand why: I have a
very small data set for each Hive partition, but it still causes YARN to kill
my executors. Please guide: why does the following code overload memory? Will
the following code do everything in parallel and try to accommodate all Hive
partition data in memory at the same time? Please guide, I am blocked because
of this issue.

 public static void main(String[] args) throws IOException {
  SparkConf conf = new SparkConf();
  SparkContext sc = new SparkContext(conf);
  HiveContext hc = new HiveContext(sc);
  FileSystem fs = FileSystem.get(sc.hadoopConfiguration());

  DataFrame partitionFrame = hc.sql("show partitions dbdata partition(date=\"2015-08-05\")");

  Row[] rowArr = partitionFrame.collect();
  for (Row row : rowArr) {
   String[] splitArr = row.getString(0).split("/");
   String entity = splitArr[0].split("=")[1];
   String date = splitArr[1].split("=")[1];
   String csvPath = "hdfs:///user/db/ext/" + entity + ".csv";
   if (fs.exists(new Path(csvPath))) {
    hc.sql("ADD FILE " + csvPath);
   }
   createInsertIntoTableABC(hc, entity, date);
   createInsertIntoTableDEF(hc, entity, date);
   createInsertIntoTableGHI(hc, entity, date);
   createInsertIntoTableJKL(hc, entity, date);
   createInsertIntoTableMNO(hc, entity, date);
  }

}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Best-practices-to-call-hiveContext-in-DataFrame-foreach-in-executor-program-or-how-to-have-a-for-loom-tp24141.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to create Spark DataFrame using custom Hadoop InputFormat?

2015-07-31 Thread unk1102
Hi, I have my own custom Hadoop InputFormat which I need to use when creating
a DataFrame. I tried to do the following:

JavaPairRDD<Void, MyRecordWritable> myFormatAsPairRdd =
jsc.hadoopFile("hdfs://tmp/data/myformat.xyz", MyInputFormat.class, Void.class, MyRecordWritable.class);
JavaRDD<MyRecordWritable> myformatRdd = myFormatAsPairRdd.values();
DataFrame myFormatAsDataframe =
sqlContext.createDataFrame(myformatRdd, MyFormatSchema.class);
myFormatAsDataframe.show();

The above code does not work and throws an exception saying
"java.lang.IllegalArgumentException: object is not an instance of declaring
class".

My custom Hadoop InputFormat works very well with Hive, MapReduce etc. How do
I make it work with Spark? Please guide, I am new to Spark. Thanks in advance.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-Spark-DataFrame-using-custom-Hadoop-InputFormat-tp24101.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to control Spark Executors from getting Lost when using YARN client mode?

2015-07-30 Thread unk1102
Hi, I have one Spark job which runs fine locally with less data, but when I
schedule it on YARN to execute I keep getting the following ERROR, and slowly
all executors get removed from the UI and my job fails:

15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 8 on
myhost1.com: remote Rpc client disassociated
15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 6 on
myhost2.com: remote Rpc client disassociated
I use the following command to schedule the Spark job in yarn-client mode:

 ./spark-submit --class com.xyz.MySpark --conf
spark.executor.extraJavaOptions=-XX:MaxPermSize=512M --driver-java-options
-XX:MaxPermSize=512m --driver-memory 3g --master yarn-client
--executor-memory 2G --executor-cores 8 --num-executors 12 
/home/myuser/myspark-1.0.jar

I don't know what the problem is, please guide. I am new to Spark. Thanks in
advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-control-Spark-Executors-from-getting-Lost-when-using-YARN-client-mode-tp24084.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread unk1102
Hi, I have Spark Streaming code which streams from a Kafka topic. It used to
work fine, but suddenly it started throwing the following exception:

Exception in thread "main" org.apache.spark.SparkException:
org.apache.spark.SparkException: Couldn't find leader offsets for Set()
at
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413)
at
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413)
at scala.util.Either.fold(Either.scala:97)
at
org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:412)
at
org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:528)
at
org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
My Spark Streaming client code is very simple: I just create one stream using
the following code and try to print the messages it consumes.

JavaPairInputDStream<String, String> messages =
    KafkaUtils.createDirectStream(jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicSet);

There is only one Kafka param I specify: kafka.ofset.reset=largest. The Kafka
topic has data, and I can see the data using other Kafka consumers, but the
above Spark Streaming code throws an exception saying the leader offsets were
not found. I tried both the smallest and largest offset. I wonder what
happened; this code used to work earlier. I am using Spark Streaming 1.3.1, as
it was working with this version; I tried 1.4.1 and got the same exception.
Please guide. I am new to Spark, thanks in advance.
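
For comparison, the kafkaParams setup I have seen quoted for the direct stream
looks roughly like this (broker host names and the topic are placeholders; as
far as I know the direct stream wants metadata.broker.list and
auto.offset.reset rather than any kafka.* key):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DirectStreamParams {
    // minimal kafkaParams for KafkaUtils.createDirectStream
    public static Map<String, String> kafkaParams() {
        Map<String, String> params = new HashMap<>();
        params.put("metadata.broker.list", "broker1:9092,broker2:9092"); // the direct stream needs the broker list
        params.put("auto.offset.reset", "smallest");                     // or "largest"
        return params;
    }

    public static Set<String> topicSet() {
        return new HashSet<>(Arrays.asList("mytopic"));
    }
}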




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Kafka-could-not-find-leader-offset-for-Set-tp24066.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark DataFrame created from JavaRDDRow copies all columns data into first column

2015-07-22 Thread unk1102
Hi, I have a DataFrame which I need to convert into a JavaRDD<Row> and back to
a DataFrame. I have the following code:

DataFrame sourceFrame =
    hiveContext.read().format("orc").load("/path/to/orc/file");
//I do an order by on the above sourceFrame and then I convert it into a JavaRDD
JavaRDD<Row> modifiedRDD = sourceFrame.toJavaRDD().map(new Function<Row, Row>() {
    public Row call(Row row) throws Exception {
        if (row != null) {
            //build the updated row by creating a new Row
            return RowFactory.create(updateRow);
        }
        return null;
    }
});
//now I convert the above JavaRDD<Row> into a DataFrame using the following
DataFrame modifiedFrame = sqlContext.createDataFrame(modifiedRDD, schema);

The sourceFrame and modifiedFrame schemas are the same. When I call
sourceFrame.show() the output is as expected: I see that every column has its
corresponding values and no column is empty. But when I call
modifiedFrame.show() I see all the column values merged into the first column
value. For example, assume the source DataFrame has 3 columns as shown below:

_col1  _col2  _col3
 ABC    10    DEF
 GHI    20    JKL

When I print modifiedFrame, which I converted from the JavaRDD, it shows the
following:

_col1        _col2   _col3
ABC,10,DEF
GHI,20,JKL

As shown above, _col1 has all the values and _col2 and _col3 are empty. I
don't know what I am doing wrong, please guide. I am new to Spark,
thanks in advance.
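
One thing I am double-checking on my side: RowFactory.create() is a varargs
method, so passing a List itself (instead of list.toArray()) produces a
single-column Row, which looks exactly like this symptom. A tiny sketch of the
difference:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

public class RowFactoryVarargs {
    public static void main(String[] args) {
        List<Object> values = Arrays.<Object>asList("ABC", 10, "DEF");

        Row merged = RowFactory.create(values);            // ONE column containing the whole list
        Row spread = RowFactory.create(values.toArray());  // three separate columns

        System.out.println(merged.size() + " vs " + spread.size()); // prints "1 vs 3"
    }
}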



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrame-created-from-JavaRDD-Row-copies-all-columns-data-into-first-column-tp23961.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkR sqlContext or sc not found in RStudio

2015-07-21 Thread unk1102
Hi, I could successfully install the SparkR package into my RStudio, but I
could not execute anything against sc or sqlContext. I did the following:

Sys.setenv(SPARK_HOME="/path/to/sparkE1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

The above code loads the package, and when I type the following I get Spark
references, which shows my installation is correct:

 sc
Java ref type org.apache.spark.api.java.JavaSparkContext id 0

 sparkRSQL.init(sc)
Java ref type org.apache.spark.sql.SQLContext id 3

But when I try to execute anything against sc or sqlContext it says the object
is not found. For example:

 df <- createDataFrame(sqlContext, faithful)

It fails saying sqlContext was not found. I don't know what is wrong with the
setup, please guide. Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-sqlContext-or-sc-not-found-in-RStudio-tp23928.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkR sqlContext or sc not found in RStudio

2015-07-21 Thread unk1102
Hi, thanks for the reply. I did download it from GitHub and build it, and it is
working fine; I can use spark-submit etc. When I use it in RStudio I don't know
why it says sqlContext is not found.

When I do the following:

 sqlContext <- sparkRSQL.init(sc)
Error: object 'sqlContext' not found

If I do the following:

 sparkRSQL.init(sc)
Java ref type org.apache.spark.sql.SQLContext id 3

I don't know what's wrong here.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-sqlContext-or-sc-not-found-in-RStudio-tp23928p23931.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



What is the correct syntax of using Spark streamingContext.fileStream()?

2015-07-20 Thread unk1102
Hi, I am trying to find the correct way to use the Spark Streaming API
streamingContext.fileStream(String, Class<K>, Class<V>, Class<F>).

I tried to find an example but could not find one anywhere in the Spark
documentation. I have to stream files in HDFS which are in a custom Hadoop
format.

  JavaPairDStream<Void, MyRecordWritable> input = streamingContext.fileStream(
      "/path/to/hdfs/stream/dir/",
      Void.class,
      MyRecordWritable.class,
      MyInputFormat.class,
      ??);

How do I implement the last argument, the Function type marked ?? above?
Please guide, I am new to Spark Streaming. Thanks in advance.
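
The only variants I have managed to piece together so far (so please correct
me) are the four-argument overload without any Function, and the longer
overload where the Function is a Path filter; this assumes MyInputFormat
extends the new org.apache.hadoop.mapreduce.InputFormat:

import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FileStreamSketch {
    public static JavaPairInputDStream<Void, MyRecordWritable> open(JavaStreamingContext ssc) {
        // plain version, shown for comparison: no Function argument at all
        JavaPairInputDStream<Void, MyRecordWritable> simple = ssc.fileStream(
                "/path/to/hdfs/stream/dir/",
                Void.class, MyRecordWritable.class, MyInputFormat.class);

        // longer overload: the Function is a filter deciding which new files to pick up
        JavaPairInputDStream<Void, MyRecordWritable> filtered = ssc.fileStream(
                "/path/to/hdfs/stream/dir/",
                Void.class, MyRecordWritable.class, MyInputFormat.class,
                new Function<Path, Boolean>() {
                    @Override
                    public Boolean call(Path path) {
                        return path.getName().endsWith(".xyz"); // the extension is a placeholder
                    }
                },
                true); // newFilesOnly
        return filtered;
    }
}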



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-correct-syntax-of-using-Spark-streamingContext-fileStream-tp23916.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Store DStreams into Hive using Hive Streaming

2015-07-17 Thread unk1102
Hi, I have a similar use case. Did you find a solution to this problem of
loading DStreams into Hive using Spark Streaming? Please guide. Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Store-DStreams-into-Hive-using-Hive-Streaming-tp18307p23885.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to maintain multiple JavaRDD created within another method like javaStreamRDD.forEachRDD

2015-07-14 Thread unk1102
I use Spark Streaming where messages read from Kafka topics are stored into a
JavaDStream<String>; this DStream contains the actual data. After going
through the documentation and other help, I have found that we traverse a
JavaDStream using foreachRDD:

javaDStreamRdd.foreachRDD(new Function<JavaRDD<String>, Void>() {
    public Void call(JavaRDD<String> rdd) {
        //now I want to call mapPartitions on the above rdd and generate a new JavaRDD<MyTable>
        JavaRDD<MyTable> rdd_records = rdd.mapPartitions(
            new FlatMapFunction<Iterator<String>, MyTable>() {
                public Iterable<MyTable> call(Iterator<String> stringIterator)
                        throws Exception {
                    //create a List<MyTable>; execute the following in a while loop
                    String[] fields = line.split(",");
                    Record record = create Record from above fields
                    MyTable table = new MyTable();
                    return table.append(record);
                }
            });
        return null;
    }
});

Now my question is: how does the above code work? I want to create a
JavaRDD<MyTable> for each RDD of the JavaDStream. How do I make sure the above
code works fine with all the data, and that the JavaRDD<MyTable> will contain
all the data and won't lose any previous data because it is a local
JavaRDD<MyTable>?

It is like calling a lambda function within a lambda function. How do I make
sure the local JavaRDD variable ends up containing all the RDDs?
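
An alternative shape I have been considering, so that there is no local
JavaRDD variable at all, is to let the DStream API do the lifting and only
touch RDDs when saving each batch; MyTable, Record and the parsing below are
my own placeholders:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaDStream;

public class DStreamMapPartitions {
    public static void process(JavaDStream<String> lines) {
        // mapPartitions on the DStream itself yields a new JavaDStream<MyTable>, one batch at a time
        JavaDStream<MyTable> tables = lines.mapPartitions(
                new FlatMapFunction<Iterator<String>, MyTable>() {
                    @Override
                    public Iterable<MyTable> call(Iterator<String> it) {
                        List<MyTable> out = new ArrayList<>();
                        while (it.hasNext()) {
                            String[] fields = it.next().split(",");
                            MyTable table = new MyTable();
                            table.append(new Record(fields)); // assumed constructor
                            out.add(table);
                        }
                        return out;
                    }
                });

        // each batch is still its own RDD; persist it here rather than keeping a reference around
        tables.foreachRDD(new Function<JavaRDD<MyTable>, Void>() {
            @Override
            public Void call(JavaRDD<MyTable> rdd) {
                rdd.saveAsObjectFile("/tmp/mytables-" + System.currentTimeMillis()); // placeholder sink
                return null;
            }
        });
    }
}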



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-maintain-multiple-JavaRDD-created-within-another-method-like-javaStreamRDD-forEachRDD-tp23832.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Does Spark Streaming support streaming from a database table?

2015-07-13 Thread unk1102
Hi, I did Kafka streaming through Spark Streaming. I have a use case where I
would like to stream data from a database table. I see that JdbcRDD is there,
but that is not what I am looking for: I need continuous streaming, like the
Kafka-based Spark Streaming job, which continuously runs, listens to changes
in a database table, and gives me the changes to process and store in HDFS.
Please guide, I am new to Spark. Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-Streaming-support-streaming-from-a-database-table-tp23801.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org