Spark Streaming : requirement failed: numRecords must not be negative

2016-01-22 Thread Afshartous, Nick

Hello,


We have a streaming job that consistently fails with the trace below.  This is 
on an AWS EMR 4.2/Spark 1.5.2 cluster.


This ticket looks related


SPARK-8112 Received block event count through the StreamingListener can be 
negative


although it appears to have been fixed in 1.5.


Thanks for any suggestions,


--

Nick



Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed: numRecords must not be negative
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.streaming.scheduler.StreamInputInfo.<init>(InputInfoTracker.scala:38)
at 
org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:165)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
at scala.Option.orElse(Option.scala:257)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:120)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:247)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:245)
at scala.util.Try$.apply(Try.scala:161)
at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:245)
at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)



Re: Spark Streaming : requirement failed: numRecords must not be negative

2016-01-22 Thread Afshartous, Nick

This seems to be a problem with Kafka brokers being in a bad state.  We're 
restarting Kafka to resolve.

--

Nick
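
(A rough illustration of why a broker in a bad state trips this particular check; this is a sketch, not Spark's internal code. The direct Kafka stream sizes each batch per partition as untilOffset - fromOffset, and StreamInputInfo's constructor requires that count to be non-negative, so a broker whose reported latest offset has moved backwards produces exactly the failure in the trace.)

// Sketch only: stand-in for the per-partition batch sizing.
case class OffsetRangeSketch(topic: String, partition: Int,
                             fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset
}

object NegativeCountDemo extends App {
  // Broker "went backwards": latest reported offset is below what was already consumed.
  val range = OffsetRangeSketch("events", 0, fromOffset = 500L, untilOffset = 420L)
  require(range.count >= 0, "numRecords must not be negative") // fails, as in the trace
}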



From: Ted Yu 
Sent: Friday, January 22, 2016 10:38 AM
To: Afshartous, Nick
Cc: user@spark.apache.org
Subject: Re: Spark Streaming : requirement failed: numRecords must not be 
negative

Is it possible to reproduce the condition below with test code ?

Thanks

On Fri, Jan 22, 2016 at 7:31 AM, Afshartous, Nick 
> wrote:


Hello,


We have a streaming job that consistently fails with the trace below.  This is 
on an AWS EMR 4.2/Spark 1.5.2 cluster.


This ticket looks related


SPARK-8112 Received block event count through the StreamingListener can be 
negative


although it appears to have been fixed in 1.5.


Thanks for any suggestions,


--

Nick



Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed: numRecords must not be negative
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.streaming.scheduler.StreamInputInfo.<init>(InputInfoTracker.scala:38)
at 
org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:165)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
at scala.Option.orElse(Option.scala:257)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:120)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:247)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:245)
at scala.util.Try$.apply(Try.scala:161)
at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:245)
at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)




Re: StackOverflow when computing MatrixFactorizationModel.recommendProductsForUsers

2016-01-22 Thread Ram VISWANADHA
Any help? Not sure what I am doing wrong.

Best Regards,
Ram

From: Ram VISWANADHA 
>
Date: Friday, January 22, 2016 at 10:25 AM
To: user >
Subject: StackOverflow when computing 
MatrixFactorizationModel.recommendProductsForUsers

Hi,
I am getting this StackOverflowError when fetching recommendations from ALS. 
Any help is much appreciated

int features = 100;
double alpha = 0.1;
double lambda = 0.001;
boolean implicit = true;
int iterations = 10;
ALS als = new ALS()
.setCheckpointInterval(2)
.setImplicitPrefs(true)
.setAlpha(alpha)
.setLambda(lambda)
.setIterations(iterations)
.setRank(features)
.setBlocks(1000)

.setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK_SER_2());
MatrixFactorizationModel matrixFactorizationModel = 
als.run(ratingJavaRDD);
matrixFactorizationModel.save(jsc.sc(), 
outDir+"/matrixFactorizationModel");
LOGGER.info("Fetching the recommendations for all users");
JavaRDD<Tuple2<Object, Rating[]>> recos = matrixFactorizationModel
.recommendProductsForUsers(100)
.toJavaRDD()
.coalesce(100)
.cache();

recos.checkpoint();
LOGGER.info("Number of users {} who have recommendations", recos.count());

16/01/22 18:10:21 WARN yarn.ApplicationMaster: Reporter thread fails 1 time(s) 
in a row.
java.lang.StackOverflowError
at 
scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 

Re: Use KafkaRDD to Batch Process Messages from Kafka

2016-01-22 Thread Cody Koeninger
Yes, you should query Kafka if you want to know the latest available
offsets.

There's code to make this straightforward in KafkaCluster.scala, but the
interface isn't public.  There's an outstanding pull request to expose the
api at

https://issues.apache.org/jira/browse/SPARK-10963

but frankly it appears unlikely that a committer will merge it.

Your options are:
 - use that api from java (since javac apparently doesn't respect scala
privacy)
- apply that patch or its equivalent and rebuild (just the
spark-streaming-kafka subproject, you don't have to redeploy spark)
- write / find equivalent code yourself

If you want to build a patched version of the subproject and need a hand,
just ask on the list.
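
(For what it's worth, a sketch of the hourly batch pattern this enables. latestOffsetFor is a placeholder for however you end up asking Kafka for the latest offsets -- KafkaCluster once the patch is applied, a plain SimpleConsumer, etc. -- and the output path is only an example.)

import kafka.serializer.StringDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object HourlyKafkaBatch {
  // Placeholder: fetch the latest available offset for a topic/partition.
  def latestOffsetFor(topic: String, partition: Int): Long = ???

  def run(sc: SparkContext,
          kafkaParams: Map[String, String],
          topic: String,
          partitions: Seq[Int],
          lastProcessed: Map[Int, Long]): Unit = {
    // One OffsetRange per partition: from where the previous run stopped up to "latest".
    val ranges: Array[OffsetRange] = partitions.map { p =>
      OffsetRange.create(topic, p, lastProcessed(p), latestOffsetFor(topic, p))
    }.toArray

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)

    // Reuse the same per-record logic as the streaming job, then write out.
    rdd.map(_._2).saveAsTextFile(s"/output/kafka/$topic/${System.currentTimeMillis()}")
  }
}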


On Fri, Jan 22, 2016 at 1:30 PM, Charles Chao 
wrote:

> Hi,
>
> I have been using DirectKafkaInputDStream in Spark Streaming to consume
> kafka messages and it’s been working very well. Now I have the need to
> batch process messages from Kafka, for example, retrieve all messages every
> hour and process them, output to destinations like Hive or HDFS. I would
> like to use KafkaRDD and normal Spark job to achieve this, so that many of
> the logics in my streaming code can be reused.
>
> In the excellent blog post *Exactly-Once Spark Streaming from Apache
> Kafka*, there are code examples about using KafkaRDD. However, it
> requires an array of OffsetRange, which needs to specify the start and end
> offset.
>
> My question is, should I write additional code to talk to Kafka and
> retrieve the latest offset for each partition every time this hourly job is
> run? Or is there any way to let KafkaUtils know to “read till latest” when
> creating the KafkaRDD?
>
> Thanks,
>
> Charles
>
>


First job is extremely slow due to executor heartbeat timeout (yarn-client)

2016-01-22 Thread Zhong Wang
Hi,

I am deploying Spark 1.6.0 using yarn-client mode in our yarn cluster.
Everything works fine, except the first job is extremely slow due to
executor heartbeat RPC timeout:

WARN netty.NettyRpcEndpointRef: Error sending message [message = Heartbeat


I think this might be related to our cluster's network/firewall
configuration, because the issue disappears if I use yarn-cluster mode to
deploy spark. However, I am still wondering why the first job can continue
after this timeout, and the later jobs run great without any issues.

Thanks,

Zhong


Re: Use KafkaRDD to Batch Process Messages from Kafka

2016-01-22 Thread Charles Chao
Thanks a lot for the help! I'll definitely check out
KafkaCluster.scala. I'll probably first try using that api from Java, and later
try to build the subproject.

thanks,

Charles

On Fri, Jan 22, 2016 at 12:26 PM, Cody Koeninger  wrote:

> Yes, you should query Kafka if you want to know the latest available
> offsets.
>
> There's code to make this straightforward in KafkaCluster.scala, but the
> interface isn't public.  There's an outstanding pull request to expose the
> api at
>
> https://issues.apache.org/jira/browse/SPARK-10963
>
> but frankly it appears unlikely that a committer will merge it.
>
> Your options are:
>  - use that api from java (since javac apparently doesn't respect scala
> privacy)
> - apply that patch or its equivalent and rebuild (just the
> spark-streaming-kafka subproject, you don't have to redeploy spark)
> - write / find equivalent code yourself
>
> If you want to build a patched version of the subproject and need a hand,
> just ask on the list.
>
>
> On Fri, Jan 22, 2016 at 1:30 PM, Charles Chao 
> wrote:
>
>> Hi,
>>
>> I have been using DirectKafkaInputDStream in Spark Streaming to consume
>> kafka messages and it’s been working very well. Now I have the need to
>> batch process messages from Kafka, for example, retrieve all messages every
>> hour and process them, output to destinations like Hive or HDFS. I would
>> like to use KafkaRDD and normal Spark job to achieve this, so that many of
>> the logics in my streaming code can be reused.
>>
>> In the excellent blog post *Exactly-Once Spark Streaming from Apache
>> Kafka*, there are code examples about using KafkaRDD. However, it
>> requires an array of OffsetRange, which needs to specify the start and end
>> offset.
>>
>> My question is, should I write additional code to talk to Kafka and
>> retrieve the latest offset for each partition every time this hourly job is
>> run? Or is there any way to let KafkaUtils know to “read till latest” when
>> creating the KafkaRDD?
>>
>> Thanks,
>>
>> Charles
>>
>>
>


Re: 10hrs of Scheduler Delay

2016-01-22 Thread Darren Govoni


Thanks for the tip. I will try it. But this is the kind of thing spark is 
supposed to figure out and handle. Or at least not get stuck forever.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Muthu Jayakumar  
Date: 01/22/2016  3:50 PM  (GMT-05:00) 
To: Darren Govoni , "Sanders, Isaac B" 
, Ted Yu  
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 

Does increasing the number of partitions help? You could try out something 3 
times what you currently have. Another trick I used was to partition the 
problem into multiple dataframes, run them sequentially, persist the 
results, and then run a union on the results. 
Hope this helps. 

On Fri, Jan 22, 2016, 3:48 AM Darren Govoni  wrote:


Me too. I had to shrink my dataset to get it to work. For us at least Spark 
seems to have scaling issues.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: "Sanders, Isaac B"  
Date: 01/21/2016  11:18 PM  (GMT-05:00) 
To: Ted Yu  
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 


I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly 
and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am 
using more resources on this one.



- Isaac






On Jan 21, 2016, at 11:06 PM, Ted Yu  wrote:



You may have seen the following on github page:


Latest commit 50fdf0e  on Feb 22, 2015






That was 11 months ago.



Can you search for similar algorithm which runs on Spark and is newer ?



If nothing found, consider running the tests coming from the project to 
determine whether the delay is intrinsic.



Cheers



On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
 wrote:



That thread seems to be moving, it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.



This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.



I linked it below. I don’t know a lot about spark, so it might be fine, but I 
have my suspicions.



https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala



- Isaac




On Jan 21, 2016, at 10:08 PM, Ted Yu  wrote:



You may have noticed the following - did this indicate prolonged computation in 
your code ?


org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)




On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B 
 wrote:



Hadoop is: HDP 2.3.2.0-2950



Here is a gist (pastebin) of my versions en masse and a stacktrace: 
https://gist.github.com/isaacsanders/2e59131758469097651b



Thanks







On Jan 21, 2016, at 7:44 PM, Ted Yu  wrote:



Looks like you were running on YARN.



What hadoop version are you using ?



Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?



Thanks



On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B 
 wrote:



The Spark Version is 1.4.1



The logs are full of standard fare, nothing like an exception or even 
interesting [INFO] lines.



Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2



Thanks
Isaac




On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:



Can you provide a bit more information ?



command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?



Thanks 







On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
 wrote:


Hey all,



I am a CS student in the United States working on my senior thesis.



My thesis uses Spark, and I am encountering some trouble.



I am using 
https://github.com/alitouka/spark_dbscan, and to determine parameters, I am 
using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.



I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.



I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.



It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.



I have tried:

- Increasing 

Re: 10hrs of Scheduler Delay

2016-01-22 Thread Muthu Jayakumar
Does increasing the number of partitions help? You could try out something
3 times what you currently have.
Another trick I used was to partition the problem into multiple dataframes,
run them sequentially, persist the results, and then run a union on
the results.

Hope this helps.
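
(A sketch of the "split, run sequentially, persist, then union" trick described above. The even split via randomSplit is just one way to carve up the work; any partitioning into smaller DataFrames would do.)

import org.apache.spark.sql.DataFrame

def splitPersistUnion(df: DataFrame, pieces: Int): DataFrame = {
  val parts = df.randomSplit(Array.fill(pieces)(1.0 / pieces))
  val persisted = parts.map { p =>
    p.persist()
    p.count()   // force each piece to be materialized before moving on
    p
  }
  persisted.reduce((a, b) => a.unionAll(b))
}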

On Fri, Jan 22, 2016, 3:48 AM Darren Govoni  wrote:

> Me too. I had to shrink my dataset to get it to work. For us at least
> Spark seems to have scaling issues.
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: "Sanders, Isaac B" 
> Date: 01/21/2016 11:18 PM (GMT-05:00)
> To: Ted Yu 
> Cc: user@spark.apache.org
> Subject: Re: 10hrs of Scheduler Delay
>
> I have run the driver on a smaller dataset (k=2, n=5000) and it worked
> quickly and didn’t hang like this. This dataset is closer to k=10, n=4.4m,
> but I am using more resources on this one.
>
> - Isaac
>
> On Jan 21, 2016, at 11:06 PM, Ted Yu  wrote:
>
> You may have seen the following on github page:
>
> Latest commit 50fdf0e  on Feb 22, 2015
>
> That was 11 months ago.
>
> Can you search for similar algorithm which runs on Spark and is newer ?
>
> If nothing found, consider running the tests coming from the project to
> determine whether the delay is intrinsic.
>
> Cheers
>
> On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B <
> sande...@rose-hulman.edu> wrote:
>
>> That thread seems to be moving, it oscillates between a few different
>> traces… Maybe it is working. It seems odd that it would take that long.
>>
>> This is 3rd party code, and after looking at some of it, I think it might
>> not be as Spark-y as it could be.
>>
>> I linked it below. I don’t know a lot about spark, so it might be fine,
>> but I have my suspicions.
>>
>>
>> https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala
>>
>> - Isaac
>>
>> On Jan 21, 2016, at 10:08 PM, Ted Yu  wrote:
>>
>> You may have noticed the following - did this indicate prolonged
>> computation in your code ?
>>
>> org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
>> org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
>> org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
>> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)
>>
>>
>> On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B <
>> sande...@rose-hulman.edu> wrote:
>>
>>> Hadoop is: HDP 2.3.2.0-2950
>>>
>>> Here is a gist (pastebin) of my versions en masse and a stacktrace:
>>> https://gist.github.com/isaacsanders/2e59131758469097651b
>>>
>>> Thanks
>>>
>>> On Jan 21, 2016, at 7:44 PM, Ted Yu  wrote:
>>>
>>> Looks like you were running on YARN.
>>>
>>> What hadoop version are you using ?
>>>
>>> Can you capture a few stack traces of the AppMaster during the delay and
>>> pastebin them ?
>>>
>>> Thanks
>>>
>>> On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B <
>>> sande...@rose-hulman.edu> wrote:
>>>
 The Spark Version is 1.4.1

 The logs are full of standard fare, nothing like an exception or even
 interesting [INFO] lines.

 Here is the script I am using:
 https://gist.github.com/isaacsanders/660f480810fbc07d4df2

 Thanks
 Isaac

 On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:

 Can you provide a bit more information ?

 command line for submitting Spark job
 version of Spark
 anything interesting from driver / executor logs ?

 Thanks

 On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B <
 sande...@rose-hulman.edu> wrote:

> Hey all,
>
> I am a CS student in the United States working on my senior thesis.
>
> My thesis uses Spark, and I am encountering some trouble.
>
> I am using https://github.com/alitouka/spark_dbscan, and to determine
> parameters, I am using the utility class they supply,
> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.
>
> I am on a 10 node cluster with one machine with 8 cores and 32G of
> memory and nine machines with 6 cores and 16G of memory.
>
> I have 442M of data, which seems like it would be a joke, but the job
> stalls at the last stage.
>
> It was stuck in Scheduler Delay for 10 hours overnight, and I have
> tried a number of things for the last couple days, but nothing seems to be
> helping.
>
> I have tried:
> - Increasing heap sizes and numbers of cores
> - More/less executors with different amounts of resources.
> - Kryo Serialization
> - FAIR Scheduling
>
> It doesn’t seem like it should require 

Use KafkaRDD to Batch Process Messages from Kafka

2016-01-22 Thread Charles Chao
Hi,

I have been using DirectKafkaInputDStream in Spark Streaming to consume kafka 
messages and it's been working very well. Now I have the need to batch process 
messages from Kafka, for example, retrieve all messages every hour and process 
them, output to destinations like Hive or HDFS. I would like to use KafkaRDD 
and normal Spark job to achieve this, so that many of the logics in my 
streaming code can be reused.

In the excellent blog post Exactly-Once Spark Streaming from Apache Kafka, 
there are code examples about using KafkaRDD. However, it requires an array of 
OffsetRange, which needs to specify the start and end offset.

My question is, should I write additional code to talk to Kafka and retrieve 
the latest offset for each partition every time this hourly job is run? Or is 
there any way to let KafkaUtils know to "read till latest" when creating the 
KafkaRDD?

Thanks,

Charles



Re: Disable speculative retry only for specific stages?

2016-01-22 Thread Ted Yu
Looked at:
https://spark.apache.org/docs/latest/configuration.html

I don't think Spark supports per stage speculation.

On Fri, Jan 22, 2016 at 10:15 AM, Adam McElwee  wrote:

> I've used speculative execution a couple times in the past w/ good
> results, but I have one stage in my job with a non-idempotent operation in
> a `forEachPartition` block. I don't see a way to disable speculative retry
> on certain stages, but does anyone know of any tricks to help out here?
>
> Spark version: 1.6.0
>
> Thanks,
> -Adam
>
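
(One workaround worth considering, sketched below under the assumption that your sink can deduplicate -- writeIfAbsent is hypothetical, e.g. an upsert keyed on a deterministic id: leave speculation on globally but make the foreachPartition side effect idempotent, so a speculative duplicate attempt is harmless.)

import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD

object IdempotentSink {
  // Hypothetical dedup-aware write: a real implementation would upsert into the
  // target store keyed on `key`, ignoring keys it has already written.
  def writeIfAbsent(key: String, value: String): Unit = ()

  def write(rdd: RDD[(String, String)]): Unit =
    rdd.foreachPartition { iter =>
      val pid = TaskContext.get().partitionId()
      iter.foreach { case (id, payload) =>
        // The key is deterministic, so a retry or speculative attempt writes
        // the same key and the duplicate becomes a no-op at the sink.
        writeIfAbsent(s"$pid-$id", payload)
      }
    }
}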


Disable speculative retry only for specific stages?

2016-01-22 Thread Adam McElwee
I've used speculative execution a couple times in the past w/ good results,
but I have one stage in my job with a non-idempotent operation in a
`forEachPartition` block. I don't see a way to disable speculative retry on
certain stages, but does anyone know of any tricks to help out here?

Spark version: 1.6.0

Thanks,
-Adam


StackOverflow when computing MatrixFactorizationModel.recommendProductsForUsers

2016-01-22 Thread Ram VISWANADHA
Hi,
I am getting this StackOverflowError when fetching recommendations from ALS. 
Any help is much appreciated

int features = 100;
double alpha = 0.1;
double lambda = 0.001;
boolean implicit = true;
int iterations = 10;
ALS als = new ALS()
.setCheckpointInterval(2)
.setImplicitPrefs(true)
.setAlpha(alpha)
.setLambda(lambda)
.setIterations(iterations)
.setRank(features)
.setBlocks(1000)

.setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK_SER_2());
MatrixFactorizationModel matrixFactorizationModel = 
als.run(ratingJavaRDD);
matrixFactorizationModel.save(jsc.sc(), 
outDir+"/matrixFactorizationModel");
LOGGER.info("Fetching the recommendations for all users");
JavaRDD<Tuple2<Object, Rating[]>> recos = matrixFactorizationModel
.recommendProductsForUsers(100)
.toJavaRDD()
.coalesce(100)
.cache();

recos.checkpoint();
LOGGER.info("Number of users {} who have recommendations", recos.count());

16/01/22 18:10:21 WARN yarn.ApplicationMaster: Reporter thread fails 1 time(s) 
in a row.
java.lang.StackOverflowError
at 
scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
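
(A general suggestion rather than a confirmed diagnosis of the trace above: with 10 ALS iterations the lineage gets deep, and both setCheckpointInterval and the later recos.checkpoint() only do their job if a checkpoint directory has been set on the context first. Minimal sketch; the path is an example.)

import org.apache.spark.api.java.JavaSparkContext

// Example only: the checkpoint directory path is illustrative. Without it,
// ALS's periodic checkpointing is effectively skipped, and the growing lineage
// is a common source of StackOverflowError.
def enableCheckpointing(jsc: JavaSparkContext): Unit =
  jsc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")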

SparkR works from command line but not from rstudio

2016-01-22 Thread Sandeep Khurana
Hello

I installed Spark in a folder. I start bin/sparkR on the console. Then I
execute the commands below and everything works fine. I can see the data as well.

hivecontext <<- sparkRHive.init(sc) ;
df <- loadDF(hivecontext, "/someHdfsPath", "orc")
showDF(df)


But when I run the same in RStudio, it throws the error mentioned below

rstudio code

Sys.setenv(SPARK_HOME="/home/myname/spark-1.6.0-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

sc <- sparkR.init(master="local")
hivecontext <<- sparkRHive.init(sc) ;
df <- loadDF(hivecontext, "/someHdfsPath", "orc")
print("showing df now")
showDF(df)

Error thrown from rstudio
===

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/22 06:00:12 ERROR RBackendHandler: createSparkContext on org.apache.spark.api.r.RRDD failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :



What is different in RStudio compared to the sparkR shell? Should I change any
setting to make it work in RStudio?


spark-streaming with checkpointing: error with sparkOnHBase lib

2016-01-22 Thread vinay gupta
Hi,

I have a spark-streaming application which uses the sparkOnHBase lib to do 
streamBulkPut().
Without checkpointing everything works fine. But recently, upon enabling 
checkpointing, I got the following exception:

16/01/22 01:32:35 ERROR executor.Executor: Exception in task 0.0 in stage 39.0 (TID 134)
java.lang.ClassCastException: [B cannot be cast to org.apache.spark.SerializableWritable
        at com.cloudera.spark.hbase.HBaseContext.applyCreds(HBaseContext.scala:225)
        at com.cloudera.spark.hbase.HBaseContext.com$cloudera$spark$hbase$HBaseContext$$hbaseForeachPartition(HBaseContext.scala:633)
        at com.cloudera.spark.hbase.HBaseContext$$anonfun$com$cloudera$spark$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:460)
        at com.cloudera.spark.hbase.HBaseContext$$anonfun$com$cloudera$spark$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:460)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Any pointers from previous users of the sparkOnHBase lib?

Thanks,
-Vinay


Re: storing query object

2016-01-22 Thread Ted Yu
There have been optimizations in this area, such as:
https://issues.apache.org/jira/browse/SPARK-8125

You can also look at parent issue. 

Which Spark release are you using ?

> On Jan 22, 2016, at 1:08 AM, Gourav Sengupta  
> wrote:
> 
> 
> Hi,
> 
> I have a SPARK table (created from hiveContext) with couple of hundred 
> partitions and few thousand files. 
> 
> When I run query on the table then spark spends a lot of time (as seen in the 
> pyspark output) to collect this files from the several partitions. After this 
> the query starts running. 
> 
> Is there a way to store the object which has collected all these partitions 
> and files so that every time I restart the job I load this object instead of 
> taking  50 mins to just collect the files before starting to run the query?
> 
> 
> Please do let me know in case the question is not quite clear.
> 
> Regards,
> Gourav Sengupta 
> 
> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: storing query object

2016-01-22 Thread Gourav Sengupta
Hi Ted,

I am using SPARK 1.5.2 as available currently in AWS EMR 4x. The data is in
TSV format.

I do not see any affect of the work already done on this for the data
stored in HIVE as it takes around 50 mins just to collect the table
metadata over a 40 node cluster and the time is much the same for smaller
clusters of size 20.

Spending 50 mins just to collect the meta-data is fine for once, but we
should be then able to store the object (which is in memory after reading
the meta-data for the first time) so that next time we can just restore the
object instead of reading the meta-data once again. Or we should be able to
parallelize the collection of meta-data so that it does not take such a
long time.

Please advise.

Regards,
Gourav




On Fri, Jan 22, 2016 at 10:15 AM, Ted Yu  wrote:

> There have been optimizations in this area, such as:
> https://issues.apache.org/jira/browse/SPARK-8125
>
> You can also look at parent issue.
>
> Which Spark release are you using ?
>
> > On Jan 22, 2016, at 1:08 AM, Gourav Sengupta 
> wrote:
> >
> >
> > Hi,
> >
> > I have a SPARK table (created from hiveContext) with couple of hundred
> partitions and few thousand files.
> >
> > When I run query on the table then spark spends a lot of time (as seen
> in the pyspark output) to collect this files from the several partitions.
> After this the query starts running.
> >
> > Is there a way to store the object which has collected all these
> partitions and files so that every time I restart the job I load this
> object instead of taking  50 mins to just collect the files before starting
> to run the query?
> >
> >
> > Please do let me know in case the question is not quite clear.
> >
> > Regards,
> > Gourav Sengupta
> >
> >
> >
>


Re: Spark Streaming - Custom ReceiverInputDStream ( Custom Source) In java

2016-01-22 Thread Nagu Kothapalli
Hi

Does anyone have any idea about *ClassTag* in the Spark context?

On Fri, Jan 22, 2016 at 12:42 PM, Nagu Kothapalli  wrote:

> Hi All
>
> Facing an issue with a CustomInputDStream object in Java
>
>
>
> public CustomInputDStream(StreamingContext ssc_, ClassTag classTag)
> {
>     super(ssc_, classTag);
> }
> Can you help me create an instance of the above class with a *ClassTag* in
> Java?
>
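
(A sketch of how to obtain the ClassTag that the constructor above needs. In Scala it is normally supplied implicitly or via ClassTag(classOf[...]); the equivalent call from Java is typically scala.reflect.ClassTag$.MODULE$.apply(String.class). The CustomInputDStream call is left commented out because that class is not shown here.)

import scala.reflect.ClassTag
import org.apache.spark.streaming.StreamingContext

def makeStream(ssc: StreamingContext): Unit = {
  val tag: ClassTag[String] = ClassTag(classOf[String])
  // new CustomInputDStream(ssc, tag)   // constructor from the question above
}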


Trouble dropping columns from a DataFrame that has other columns with dots in their names

2016-01-22 Thread Joshua TAYLOR
(Apologies if this comes through twice;  I sent it once before I'd
confirmed by mailing list subscription.)

I've been having lots of trouble with DataFrames whose columns have dots in
their names today.  I know that in many places, backticks can be used to
quote column names, but the problem I'm running into now is that I can't
drop a column that has *no* dots in its name when there are *other* columns
in the table that do.  Here's some code that tries four ways of dropping
the column.  One throws a weird exception, one is a semi-expected no-op,
and the other two work.

public class SparkExample {
    public static void main(String[] args) {
        /* Get the spark and sql contexts.  Setting spark.ui.enabled to false
         * keeps Spark from using its built in dependency on Jersey. */
        SparkConf conf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("test")
                .set("spark.ui.enabled", "false");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sparkContext);

        /* Create a schema with two columns, one of which has no dots (a_b),
         * and the other of which does (a.c). */
        StructType schema = new StructType(new StructField[] {
                DataTypes.createStructField("a_b", DataTypes.StringType, false),
                DataTypes.createStructField("a.c", DataTypes.IntegerType, false)
        });

        /* Create an RDD of Rows, and then convert it into a DataFrame. */
        List<Row> rows = Arrays.asList(
                RowFactory.create("t", 2),
                RowFactory.create("u", 4));
        JavaRDD<Row> rdd = sparkContext.parallelize(rows);
        DataFrame df = sqlContext.createDataFrame(rdd, schema);

        /* Four ways to attempt dropping a_b from the DataFrame.
         * We'll try calling each one of these and looking at
         * the results (or the resulting exception). */
        Function<DataFrame, DataFrame> x1 = d -> d.drop("a_b");          // exception
        Function<DataFrame, DataFrame> x2 = d -> d.drop("`a_b`");        // no-op
        Function<DataFrame, DataFrame> x3 = d -> d.drop(d.col("a_b"));   // works
        Function<DataFrame, DataFrame> x4 = d -> d.drop(d.col("`a_b`")); // works

        int i = 0;
        for (Function<DataFrame, DataFrame> x : Arrays.asList(x1, x2, x3, x4)) {
            System.out.println("Case " + i++);
            try {
                x.apply(df).show();
            } catch (Exception e) {
                e.printStackTrace(System.out);
            }
        }
    }
}

Here's the output.  Case 1 is a no-op, which I think I can understand,
because DataFrame.drop(String) doesn't do any resolution (it doesn't need
to), so d.drop("`a_b`") doesn't do anything because there's no column whose
name is literally "`a_b`".  The third and fourth cases work, because
DataFrame.col() does do resolution, and both "a_b" and "`a_b`" resolve
correctly.  But why does the first case fail?  And why with the message
that it does?  Why is it trying to resolve "a.c" at all in this case?

Case 0
org.apache.spark.sql.AnalysisException: cannot resolve 'a.c' given input
columns a_b, a.c;
at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
at
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org
$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
at
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org
$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
at
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
at 
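
(The two variants that work, written out in Scala for reference: resolving through Column objects side-steps the parser, and backticks let you refer to the dotted column explicitly.)

import org.apache.spark.sql.DataFrame

def dropPlainColumn(df: DataFrame): DataFrame = df.drop(df.col("a_b"))
def selectDottedColumn(df: DataFrame): DataFrame = df.select(df.col("`a.c`"))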


Re: Spark Streaming: BatchDuration and Processing time

2016-01-22 Thread Lin Zhao
Hi Silvio,

Can you go into a little detail about how the back pressure works? Does it block
the receiver? Or does it temporarily save the incoming messages in
mem/disk? I have a custom actor receiver that uses store() to save data
to Spark. Would the back pressure make the store() call block?
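
(For reference, a minimal sketch of the configuration Silvio mentions below; the maxRate value is only an example.)

import org.apache.spark.SparkConf

// Back pressure (Spark 1.5+) is enabled via configuration; the rate at which
// receivers are allowed to push data is then adjusted dynamically per batch,
// which is why store() can end up being throttled.
val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional cap used while the rate estimator warms up (value is an example).
  .set("spark.streaming.receiver.maxRate", "10000")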

On 1/17/16, 10:15 AM, "Silvio Fiorito" 
wrote:

>It will just queue up the subsequent batches, however if this delay is
>constant you may start losing batches. It can handle spikes in processing
>time, but if you know you're consistently running over your batch
>duration you either need to increase the duration or look at enabling
>back pressure support. See:
>http://spark.apache.org/docs/latest/configuration.html#spark-streaming
>(1.5+).
>
>
>From: pyspark2555 
>Sent: Sunday, January 17, 2016 11:32 AM
>To: user@spark.apache.org
>Subject: Spark Streaming: BatchDuration and Processing time
>
>Hi,
>
>If BatchDuration is set to 1 second in StreamingContext and the actual
>processing time is longer than one second, then how does Spark handle
>that?
>
>For example, I am receiving a continuous Input stream. Every 1 second
>(batch
>duration), the RDDs will be processed. What if this processing time is
>longer than 1 second? What happens in the next batch duration?
>
>Thanks.
>Amit
>
>
>
>--
>View this message in context:
>http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-BatchD
>uration-and-Processing-time-tp25986.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Fwd: storing query object

2016-01-22 Thread Gourav Sengupta
Hi,

I have a SPARK table (created from hiveContext) with couple of hundred
partitions and few thousand files.

When I run query on the table then spark spends a lot of time (as seen in
the pyspark output) to collect this files from the several partitions.
After this the query starts running.

Is there a way to store the object which has collected all these partitions
and files so that every time I restart the job I load this object instead
of taking  50 mins to just collect the files before starting to run the
query?


Please do let me know in case the question is not quite clear.

Regards,
Gourav Sengupta


[Streaming-Kafka] How to start from topic offset when streamcontext is using checkpoint

2016-01-22 Thread Raju Bairishetti
Hi,


   I am very new to spark & spark-streaming. I am planning to use spark
streaming for real time processing.

   I have created a streaming context and checkpointing to hdfs directory
for recovery purposes in case of executor failures & driver failures.

I am creating a DStream with an offset map for getting the data from kafka. I am
simply ignoring the offsets to understand the behavior. Whenever I restart the
application, the driver is restored from the checkpoint as expected, but the DStream
does not start from the initial offsets. The DStream was created with the last
consumed offsets instead of starting from offset 0 for each topic
partition, even though I am not storing the offsets anywhere.

def main : Unit = {

var sparkStreamingContext =
StreamingContext.getOrCreate(SparkConstants.CHECKPOINT_DIR_LOCATION,
  () => creatingFunc())

...


}

def creatingFunc(): Unit = {

...

var offsets:Map[TopicAndPartition, Long] =
Map(TopicAndPartition("sample_sample3_json",0) -> 0)

KafkaUtils.createDirectStream[String,String, StringDecoder,
StringDecoder,
String](sparkStreamingContext, kafkaParams, offsets, messageHandler)

...
}

I want to get control over offset management at the event level instead of the RDD
level, to make sure of at-least-once delivery to the end system.

As per my understanding, every RDD or RDD partition will be stored in HDFS as
a file if I choose to use HDFS as output. If I use 1 sec as the batch interval
then I will end up having a huge number of small files in HDFS. Having
small files in HDFS leads to lots of other issues.
Is there any way to write multiple RDDs into a single file (see the sketch below)? I don't have much
idea about *coalesce* usage. In the worst case, I can merge all small files
in HDFS at regular intervals.

Thanks...

--
Thanks
Raju Bairishetti
www.lazada.com
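
(A sketch of one way to do this with window() plus coalesce(), since the question above asks about it; the window size, partition count and output path are illustrative only.)

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Fold many 1-second batches into a larger window and write each window as a
// single file, so HDFS sees far fewer small files.
def writeWindows(stream: DStream[String]): Unit =
  stream
    .window(Seconds(300), Seconds(300))            // one output every 5 minutes
    .foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty())
        rdd.coalesce(1).saveAsTextFile(s"/data/out/batch-${time.milliseconds}")
    }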


Re: 10hrs of Scheduler Delay

2016-01-22 Thread Darren Govoni


Me too. I had to shrink my dataset to get it to work. For us at least Spark 
seems to have scaling issues.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: "Sanders, Isaac B"  
Date: 01/21/2016  11:18 PM  (GMT-05:00) 
To: Ted Yu  
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 


I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly 
and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am 
using more resources on this one.



- Isaac






On Jan 21, 2016, at 11:06 PM, Ted Yu  wrote:



You may have seen the following on github page:


Latest commit 50fdf0e  on Feb 22, 2015






That was 11 months ago.



Can you search for similar algorithm which runs on Spark and is newer ?



If nothing found, consider running the tests coming from the project to 
determine whether the delay is intrinsic.



Cheers



On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
 wrote:



That thread seems to be moving, it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.



This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.



I linked it below. I don’t know a lot about spark, so it might be fine, but I 
have my suspicions.



https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala



- Isaac




On Jan 21, 2016, at 10:08 PM, Ted Yu  wrote:



You may have noticed the following - did this indicate prolonged computation in 
your code ?


org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)




On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B 
 wrote:



Hadoop is: HDP 2.3.2.0-2950



Here is a gist (pastebin) of my versions en masse and a stacktrace: 
https://gist.github.com/isaacsanders/2e59131758469097651b



Thanks







On Jan 21, 2016, at 7:44 PM, Ted Yu  wrote:



Looks like you were running on YARN.



What hadoop version are you using ?



Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?



Thanks



On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B 
 wrote:



The Spark Version is 1.4.1



The logs are full of standard fare, nothing like an exception or even 
interesting [INFO] lines.



Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2



Thanks
Isaac




On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:



Can you provide a bit more information ?



command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?



Thanks 







On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
 wrote:


Hey all,



I am a CS student in the United States working on my senior thesis.



My thesis uses Spark, and I am encountering some trouble.



I am using 
https://github.com/alitouka/spark_dbscan, and to determine parameters, I am 
using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.



I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.



I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.



It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.



I have tried:

- Increasing heap sizes and numbers of cores

- More/less executors with different amounts of resources.

- Kryo Serialization

- FAIR Scheduling



It doesn’t seem like it should require this much. Any ideas?



- Isaac


spark streaming input rate strange

2016-01-22 Thread patcharee

Hi,

I have a streaming application with
- 1 sec interval
- accept data from a simulation through MulticastSocket

The simulation sends out data using multiple clients/threads every 1 sec 
interval. The input rate accepted by the streaming application looks strange.
- When clients = 10,000 the event rate rises up to 10,000, stays at 
10,000 for a while and drops to about 7000-8000.
- When clients = 20,000 the event rate rises up to 20,000, stays at 
20,000 for a while and drops to about 15000-17000. The same pattern.


Processing time is just about 400 ms.

Any ideas/suggestions?

Thanks,
Patcharee


Re: storing query object

2016-01-22 Thread Ted Yu
In SQLConf.scala , I found this:

  val PARALLEL_PARTITION_DISCOVERY_THRESHOLD = intConf(
    key = "spark.sql.sources.parallelPartitionDiscovery.threshold",
    defaultValue = Some(32),
    doc = "The degree of parallelism for schema merging and partition discovery of " +
      "Parquet data sources.")

But looks like it may not help your case.

FYI
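
For what it's worth, a minimal sketch of how that knob would be set from code on Spark
1.5.x (the value is illustrative only and, as noted above, it may not help this case):

    // sqlContext is the existing SQLContext/HiveContext
    sqlContext.setConf("spark.sql.sources.parallelPartitionDiscovery.threshold", "64")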

On Fri, Jan 22, 2016 at 3:09 AM, Gourav Sengupta 
wrote:

> Hi Ted,
>
> I am using SPARK 1.5.2 as available currently in AWS EMR 4x. The data is
> in TSV format.
>
> I do not see any effect of the work already done on this for the data
> stored in Hive, as it takes around 50 mins just to collect the table
> metadata over a 40-node cluster, and the time is much the same for smaller
> clusters of size 20.
>
> Spending 50 mins just to collect the meta-data is fine for once, but we
> should be then able to store the object (which is in memory after reading
> the meta-data for the first time) so that next time we can just restore the
> object instead of reading the meta-data once again. Or we should be able to
> parallelize the collection of meta-data so that it does not take such a
> long time.
>
> Please advise.
>
> Regards,
> Gourav
>
>
>
>
> On Fri, Jan 22, 2016 at 10:15 AM, Ted Yu  wrote:
>
>> There have been optimizations in this area, such as:
>> https://issues.apache.org/jira/browse/SPARK-8125
>>
>> You can also look at parent issue.
>>
>> Which Spark release are you using ?
>>
>> > On Jan 22, 2016, at 1:08 AM, Gourav Sengupta 
>> wrote:
>> >
>> >
>> > Hi,
>> >
>> > I have a SPARK table (created from hiveContext) with couple of hundred
>> partitions and few thousand files.
>> >
>> > When I run query on the table then spark spends a lot of time (as seen
>> in the pyspark output) to collect this files from the several partitions.
>> After this the query starts running.
>> >
>> > Is there a way to store the object which has collected all these
>> partitions and files so that every time I restart the job I load this
>> object instead of taking  50 mins to just collect the files before starting
>> to run the query?
>> >
>> >
>> > Please do let me know in case the question is not quite clear.
>> >
>> > Regards,
>> > Gourav Sengupta
>> >
>> >
>> >
>>
>
>


Re: retrieve cell value from a rowMatrix.

2016-01-22 Thread zhangjp
Hi Srini,
   If you want to get a value, you can do it like the following example in Scala (other 
languages work similarly):

  val row  = mat.rows.collect().apply(i)   // row i of the RowMatrix as a local Vector
  val cell = row.apply(j)                  // element (i, j)
  val cov  = mat.computeCovariance()
  val c    = cov.apply(i, j)

   mat is of RowMatrix type and cov is of Matrix type.
  

 

 ------------------ Original Message ------------------
 From: "Srivathsan Srinivas";
 Date: Friday, Jan 22, 2016, 1:12
 To: "zhangjp" <592426...@qq.com>;
 Cc: "user";
 Subject: Re: retrieve cell value from a rowMatrix.

 

 Hi Zhang,
 I am new to Scala and Spark. I am not a Java guy (more of a Python and R guy). I just 
began playing with matrices in MLlib and it looks painful to do simple things. If you can 
show me a small example, it would help. The apply function is not available in RowMatrix. 
 

 For eg.,
 

 import org.apache.spark.mllib.linalg.distributed.RowMatrix
 

 /* retrieve a cell value */
 def getValue(m: RowMatrix): Double = {
???
 }
 

 

 Likewise, I have trouble adding two RowMatrices
 

 /* add two RowMatrices */
 def addRowMatrices(a: RowMatrix, b: RowMatrix): RowMatrix = {
 

 }
 

 

 From what I have read on Stack Overflow and other places, such simple 
things are not exposed in MLlib, but they are heavily used in the underlying 
Breeze libraries. Hence, one should convert the RowMatrix to its Breeze 
equivalent, do the required operations, and convert it back to a RowMatrix. I am 
still learning how to do this kind of conversion back and forth. If you have 
small examples, it would be very helpful.
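
A minimal sketch of that back-and-forth conversion, assuming both matrices fit in driver
memory and that the order of mat.rows is meaningful (a RowMatrix does not index its rows,
so that is an assumption):

    import breeze.linalg.{DenseMatrix => BDM}
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // copy a (small) RowMatrix into a local Breeze dense matrix
    def toBreeze(m: RowMatrix): BDM[Double] = {
      val rows = m.rows.collect()
      val bm = BDM.zeros[Double](rows.length, m.numCols().toInt)
      for (i <- rows.indices; j <- 0 until bm.cols) bm(i, j) = rows(i)(j)
      bm
    }

    // add two RowMatrices by adding their Breeze equivalents, then rebuild a RowMatrix
    def addRowMatrices(sc: SparkContext, a: RowMatrix, b: RowMatrix): RowMatrix = {
      val sum = toBreeze(a) + toBreeze(b)
      val rowVectors = (0 until sum.rows).map { i =>
        Vectors.dense((0 until sum.cols).map(j => sum(i, j)).toArray)
      }
      new RowMatrix(sc.parallelize(rowVectors))
    }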
 

 

 Thanks!
 Srini.

 
 On Wed, Jan 20, 2016 at 10:08 PM, zhangjp <592426...@qq.com> wrote:
   
  Use the apply(i, j) function.
 Do you know how to save a matrix to a file using Java?

 

 ------------------ Original Message ------------------
 From: "Srivathsan Srinivas";
 Date: Thursday, Jan 21, 2016, 9:04
 To: "user";

 Subject: retrieve cell value from a rowMatrix.

   

 Hi, Is there a way to retrieve the cell value of a rowMatrix? Like m(i,j)? 
The docs say that the indices are long. Maybe I am doing something wrong...but, 
there doesn't seem to be any such direct method.
 

 Any suggestions?
 

-- 
 Thanks,
Srini. 


 






 

-- 
 Thanks,
Srini.

Application SUCCESS/FAILURE status using spark API

2016-01-22 Thread Raghvendra Singh
Hi,

Does anybody know how we can get the application status of a Spark app
using the API?

Currently it's giving only the completed status as true/false.

I am trying to build an application-manager kind of thing where one can see
the apps deployed and their status, and take some actions based on the status
of these apps.
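
For reference, a minimal sketch of polling Spark's monitoring REST API (available since
1.4 on the driver UI, port 4040 by default, or on the history server); the host name and
application id below are placeholders:

    import scala.io.Source

    // each job in the response carries a "status": RUNNING, SUCCEEDED or FAILED
    val apps = Source.fromURL("http://driver-host:4040/api/v1/applications").mkString
    val jobs = Source.fromURL(
      "http://driver-host:4040/api/v1/applications/app-20160122120000-0001/jobs").mkString
    println(jobs)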


Please suggest something


Thanks & Regards
Raghvendra


Re: Concurrent Spark jobs

2016-01-22 Thread Eugene Morozov
Emlyn,

Have you considered using pools?
http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools

I haven't tried it myself, but it looks like the pool setting is applied
per thread, which means it's possible to configure the fair scheduler so that
more than one job runs at a time. Although each of them would probably use
fewer workers...
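
For reference, a rough sketch of that per-thread wiring (the app name, pool names and
thread body are made up; pool weights and minShare would come from a fairscheduler.xml):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("concurrent-outputs").set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    def runInPool(pool: String)(body: => Unit): Thread = {
      val t = new Thread(new Runnable {
        def run(): Unit = {
          sc.setLocalProperty("spark.scheduler.pool", pool)  // applies to jobs from this thread
          body
        }
      })
      t.start()
      t
    }

    // e.g. runInPool("redshift-1") { /* first write action */ }
    //      runInPool("redshift-2") { /* second write action */ }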

Hope this helps.
--
Be well!
Jean Morozov

On Thu, Jan 21, 2016 at 3:23 PM, emlyn  wrote:

> Thanks for the responses (not sure why they aren't showing up on the list).
>
> Michael wrote:
> > The JDBC wrapper for Redshift should allow you to follow these
> > instructions. Let me know if you run into any more issues.
> >
> http://apache-spark-user-list.1001560.n3.nabble.com/best-practices-for-pushing-an-RDD-into-a-database-td2681.html
>
> I'm not sure that this solves my problem - if I understand it correctly,
> this is to split a database write over multiple concurrent connections (one
> from each partition), whereas what I want is to allow other tasks to
> continue running on the cluster while the the write to Redshift is taking
> place.
> Also I don't think it's good practice to load data into Redshift with
> INSERT
> statements over JDBC - it is recommended to use the bulk load commands that
> can analyse the data and automatically set appropriate compression etc on
> the table.
>
>
> Rajesh wrote:
> > Just a thought. Can we use Spark Job Server and trigger jobs through rest
> > apis. In this case, all jobs will share same context and run the jobs
> > parallel.
> > If any one has other thoughts please share
>
> I'm not sure this would work in my case as they are not completely separate
> jobs, but just different outputs to Redshift, that share intermediate
> results. Running them as completely separate jobs would mean recalculating
> the intermediate results for each output. I suppose it might be possible to
> persist the intermediate results somewhere, and then delete them once all
> the jobs have run, but that is starting to add a lot of complication which
> I'm not sure is justified.
>
>
> Maybe some pseudocode would help clarify things, so here is a very
> simplified view of our Spark application:
>
> // load and transform data, then cache the result
> df1 = transform1(sqlCtx.read().options(...).parquet('path/to/data'))
> df1.cache()
>
> // perform some further transforms of the cached data
> df2 = transform2(df1)
> df3 = transform3(df1)
>
> // write the final data out to Redshift
> df2.write().options(...).(format "com.databricks.spark.redshift").save()
> df3.write().options(...).(format "com.databricks.spark.redshift").save()
>
>
> When the application runs, the steps are executed in the following order:
> - scan parquet folder
> - transform1 executes
> - df1 stored in cache
> - transform2 executes
> - df2 written to Redshift (while cluster sits idle)
> - transform3 executes
> - df3 written to Redshift
>
> I would like transform3 to begin executing as soon as the cluster has
> capacity, without having to wait for df2 to be written to Redshift, so I
> tried rewriting the last two lines as (again pseudocode):
>
> f1 = future{df2.write().options(...).(format
> "com.databricks.spark.redshift").save()}.execute()
> f2 = future{df3.write().options(...).(format
> "com.databricks.spark.redshift").save()}.execute()
> f1.get()
> f2.get()
>
> In the hope that the first write would no longer block the following steps,
> but instead it fails with a TimeoutException (see stack trace in previous
> message). Is there a way to start the different writes concurrently, or is
> that not possible in Spark?
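
For comparison, a minimal Scala sketch of the same idea using scala.concurrent futures
and an unbounded wait (df2/df3 and the Redshift options are placeholders from the
pseudocode above); a short Await timeout is one common source of that TimeoutException:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val opts = Map("url" -> "jdbc:redshift://...", "tempdir" -> "s3n://...")  // placeholders
    val w1 = Future { df2.write.format("com.databricks.spark.redshift").options(opts).save() }
    val w2 = Future { df3.write.format("com.databricks.spark.redshift").options(opts).save() }
    Await.result(w1, Duration.Inf)
    Await.result(w2, Duration.Inf)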
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Concurrent-Spark-jobs-tp26011p26030.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-22 Thread Ted Yu
The classpaths are formed differently on the driver and the executors.

Cheers

On Fri, Jan 22, 2016 at 3:25 PM, Ajinkya Kale  wrote:

> Is this issue only when the computations are in distributed mode ?
> If I do (pseudo code) :
> rdd.collect.call_to_hbase  I dont get this error,
>
> but if I do :
> rdd.call_to_hbase.collect it throws this error.
>
> On Wed, Jan 20, 2016 at 6:50 PM Ajinkya Kale 
> wrote:
>
>> Unfortunately I cannot at this moment (not a decision I can make) :(
>>
>> On Wed, Jan 20, 2016 at 6:46 PM Ted Yu  wrote:
>>
>>> I am not aware of a workaround.
>>>
>>> Can you upgrade to 0.98.4+ release ?
>>>
>>> Cheers
>>>
>>> On Wed, Jan 20, 2016 at 6:26 PM, Ajinkya Kale 
>>> wrote:
>>>
 Hi Ted,

 Thanks for responding.
 Is there a work around for 0.98.0 ? Adding the hbase-protocol jar to
 HADOOP_CLASSPATH didnt work for me.

 On Wed, Jan 20, 2016 at 6:14 PM Ted Yu  wrote:

> 0.98.0 didn't have fix from HBASE-8
>
> Please upgrade your hbase version and try again.
>
> If still there is problem, please pastebin the stack trace.
>
> Thanks
>
> On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale 
> wrote:
>
>>
>> I have posted this on hbase user list but i thought makes more sense
>> on spark user list.
>> I am able to read the table in yarn-client mode from spark-shell but
>> I have exhausted all online forums for options to get it working in the
>> yarn-cluster mode through spark-submit.
>>
>> I am using this code-example
>> http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
>>  to
>> read a hbase table using Spark with the only change of adding the
>> hbase.zookeeper.quorum through code as it is not picking it from the
>> hbase-site.xml.
>>
>> Spark 1.5.3
>>
>> HBase 0.98.0
>>
>>
>> Facing this error -
>>
>>  16/01/20 12:56:59 WARN 
>> client.ConnectionManager$HConnectionImplementation: Encountered problems 
>> when prefetch hbase:meta table:
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
>> attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>> java.lang.IllegalAccessError: class 
>> com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass 
>> com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 GMT-07:00 2016, 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>> java.lang.IllegalAccessError: 
>> com/google/protobuf/HBaseZeroCopyByteStringWed Jan 20 12:56:59 GMT-07:00 
>> 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>> java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
>>
>> at 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
>> at 
>> org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
>> at 
>> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
>> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:201)
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
>> at 
>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>> at 
>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:111)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>> at scala.Option.getOrElse(Option.scala:120)
>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1281)
>> at 
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>> at 
>> 

Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-22 Thread Ajinkya Kale
I tried --jars which supposedly does that but that did not work.
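
One thing that has helped with similar protobuf classpath problems, as a hedged sketch
(the jar path and file name are placeholders and have to exist on every node for
extraClassPath to take effect):

    # spark-defaults.conf, or the equivalent --conf flags on spark-submit
    spark.driver.extraClassPath    /opt/hbase/lib/hbase-protocol-0.98.0-hadoop2.jar
    spark.executor.extraClassPath  /opt/hbase/lib/hbase-protocol-0.98.0-hadoop2.jar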

On Fri, Jan 22, 2016 at 4:33 PM Ajinkya Kale  wrote:

> Hi Ted,
> Is there a way for the executors to have the hbase-protocol jar on their
> classpath ?
>
> On Fri, Jan 22, 2016 at 4:00 PM Ted Yu  wrote:
>
>> The class path formations on driver and executors are different.
>>
>> Cheers
>>
>> On Fri, Jan 22, 2016 at 3:25 PM, Ajinkya Kale 
>> wrote:
>>
>>> Is this issue only when the computations are in distributed mode ?
>>> If I do (pseudo code) :
>>> rdd.collect.call_to_hbase  I dont get this error,
>>>
>>> but if I do :
>>> rdd.call_to_hbase.collect it throws this error.
>>>
>>> On Wed, Jan 20, 2016 at 6:50 PM Ajinkya Kale 
>>> wrote:
>>>
 Unfortunately I cannot at this moment (not a decision I can make) :(

 On Wed, Jan 20, 2016 at 6:46 PM Ted Yu  wrote:

> I am not aware of a workaround.
>
> Can you upgrade to 0.98.4+ release ?
>
> Cheers
>
> On Wed, Jan 20, 2016 at 6:26 PM, Ajinkya Kale 
> wrote:
>
>> Hi Ted,
>>
>> Thanks for responding.
>> Is there a work around for 0.98.0 ? Adding the hbase-protocol jar to
>> HADOOP_CLASSPATH didnt work for me.
>>
>> On Wed, Jan 20, 2016 at 6:14 PM Ted Yu  wrote:
>>
>>> 0.98.0 didn't have fix from HBASE-8
>>>
>>> Please upgrade your hbase version and try again.
>>>
>>> If still there is problem, please pastebin the stack trace.
>>>
>>> Thanks
>>>
>>> On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale >> > wrote:
>>>

 I have posted this on hbase user list but i thought makes more
 sense on spark user list.
 I am able to read the table in yarn-client mode from spark-shell
 but I have exhausted all online forums for options to get it working 
 in the
 yarn-cluster mode through spark-submit.

 I am using this code-example
 http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
  to
 read a hbase table using Spark with the only change of adding the
 hbase.zookeeper.quorum through code as it is not picking it from the
 hbase-site.xml.

 Spark 1.5.3

 HBase 0.98.0


 Facing this error -

  16/01/20 12:56:59 WARN 
 client.ConnectionManager$HConnectionImplementation: Encountered 
 problems when prefetch hbase:meta table:
 org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
 attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
 org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
 java.lang.IllegalAccessError: class 
 com.google.protobuf.HBaseZeroCopyByteString cannot access its 
 superclass com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 
 GMT-07:00 2016, 
 org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
 java.lang.IllegalAccessError: 
 com/google/protobuf/HBaseZeroCopyByteStringWed Jan 20 12:56:59 
 GMT-07:00 2016, 
 org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
 java.lang.IllegalAccessError: 
 com/google/protobuf/HBaseZeroCopyByteString

 at 
 org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
 at 
 org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
 at 
 org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
 at 
 org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
 at 
 org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
 at 
 org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
 at 
 org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
 at 
 org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
 at 
 org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
 at org.apache.hadoop.hbase.client.HTable.(HTable.java:201)
 at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
 at 
 org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
 at 
 

Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-22 Thread Ajinkya Kale
Is this issue only when the computations are in distributed mode?
If I do (pseudo code):
rdd.collect.call_to_hbase  I don't get this error,

but if I do:
rdd.call_to_hbase.collect it throws this error.

On Wed, Jan 20, 2016 at 6:50 PM Ajinkya Kale  wrote:

> Unfortunately I cannot at this moment (not a decision I can make) :(
>
> On Wed, Jan 20, 2016 at 6:46 PM Ted Yu  wrote:
>
>> I am not aware of a workaround.
>>
>> Can you upgrade to 0.98.4+ release ?
>>
>> Cheers
>>
>> On Wed, Jan 20, 2016 at 6:26 PM, Ajinkya Kale 
>> wrote:
>>
>>> Hi Ted,
>>>
>>> Thanks for responding.
>>> Is there a work around for 0.98.0 ? Adding the hbase-protocol jar to
>>> HADOOP_CLASSPATH didnt work for me.
>>>
>>> On Wed, Jan 20, 2016 at 6:14 PM Ted Yu  wrote:
>>>
 0.98.0 didn't have fix from HBASE-8

 Please upgrade your hbase version and try again.

 If still there is problem, please pastebin the stack trace.

 Thanks

 On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale 
 wrote:

>
> I have posted this on hbase user list but i thought makes more sense
> on spark user list.
> I am able to read the table in yarn-client mode from spark-shell but I
> have exhausted all online forums for options to get it working in the
> yarn-cluster mode through spark-submit.
>
> I am using this code-example
> http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
>  to
> read a hbase table using Spark with the only change of adding the
> hbase.zookeeper.quorum through code as it is not picking it from the
> hbase-site.xml.
>
> Spark 1.5.3
>
> HBase 0.98.0
>
>
> Facing this error -
>
>  16/01/20 12:56:59 WARN 
> client.ConnectionManager$HConnectionImplementation: Encountered problems 
> when prefetch hbase:meta table:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
> java.lang.IllegalAccessError: class 
> com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass 
> com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 GMT-07:00 2016, 
> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
> java.lang.IllegalAccessError: 
> com/google/protobuf/HBaseZeroCopyByteStringWed Jan 20 12:56:59 GMT-07:00 
> 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
> java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
>
> at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
> at 
> org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
> at 
> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
> at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
> at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
> at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
> at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
> at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
> at org.apache.hadoop.hbase.client.HTable.(HTable.java:201)
> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
> at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
> at 
> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:111)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1281)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
> at org.apache.spark.rdd.RDD.take(RDD.scala:1276)
>
> I tried adding the hbase protocol jar on spar-defaults.conf and in the
> driver-classpath as suggested here
> 

Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-22 Thread Ajinkya Kale
Hi Ted,
Is there a way for the executors to have the hbase-protocol jar on their
classpath ?

On Fri, Jan 22, 2016 at 4:00 PM Ted Yu  wrote:

> The class path formations on driver and executors are different.
>
> Cheers
>
> On Fri, Jan 22, 2016 at 3:25 PM, Ajinkya Kale 
> wrote:
>
>> Is this issue only when the computations are in distributed mode ?
>> If I do (pseudo code) :
>> rdd.collect.call_to_hbase  I dont get this error,
>>
>> but if I do :
>> rdd.call_to_hbase.collect it throws this error.
>>
>> On Wed, Jan 20, 2016 at 6:50 PM Ajinkya Kale 
>> wrote:
>>
>>> Unfortunately I cannot at this moment (not a decision I can make) :(
>>>
>>> On Wed, Jan 20, 2016 at 6:46 PM Ted Yu  wrote:
>>>
 I am not aware of a workaround.

 Can you upgrade to 0.98.4+ release ?

 Cheers

 On Wed, Jan 20, 2016 at 6:26 PM, Ajinkya Kale 
 wrote:

> Hi Ted,
>
> Thanks for responding.
> Is there a work around for 0.98.0 ? Adding the hbase-protocol jar to
> HADOOP_CLASSPATH didnt work for me.
>
> On Wed, Jan 20, 2016 at 6:14 PM Ted Yu  wrote:
>
>> 0.98.0 didn't have fix from HBASE-8
>>
>> Please upgrade your hbase version and try again.
>>
>> If still there is problem, please pastebin the stack trace.
>>
>> Thanks
>>
>> On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale 
>> wrote:
>>
>>>
>>> I have posted this on hbase user list but i thought makes more sense
>>> on spark user list.
>>> I am able to read the table in yarn-client mode from spark-shell but
>>> I have exhausted all online forums for options to get it working in the
>>> yarn-cluster mode through spark-submit.
>>>
>>> I am using this code-example
>>> http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
>>>  to
>>> read a hbase table using Spark with the only change of adding the
>>> hbase.zookeeper.quorum through code as it is not picking it from the
>>> hbase-site.xml.
>>>
>>> Spark 1.5.3
>>>
>>> HBase 0.98.0
>>>
>>>
>>> Facing this error -
>>>
>>>  16/01/20 12:56:59 WARN 
>>> client.ConnectionManager$HConnectionImplementation: Encountered 
>>> problems when prefetch hbase:meta table:
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
>>> attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>> java.lang.IllegalAccessError: class 
>>> com.google.protobuf.HBaseZeroCopyByteString cannot access its 
>>> superclass com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 
>>> GMT-07:00 2016, 
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>> java.lang.IllegalAccessError: 
>>> com/google/protobuf/HBaseZeroCopyByteStringWed Jan 20 12:56:59 
>>> GMT-07:00 2016, 
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>> java.lang.IllegalAccessError: 
>>> com/google/protobuf/HBaseZeroCopyByteString
>>>
>>> at 
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
>>> at 
>>> org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
>>> at 
>>> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
>>> at 
>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
>>> at 
>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
>>> at 
>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
>>> at 
>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
>>> at 
>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
>>> at 
>>> org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
>>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:201)
>>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
>>> at 
>>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>>> at 
>>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:111)
>>> at 
>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>> at 
>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>> at scala.Option.getOrElse(Option.scala:120)

RE: Spark Cassandra clusters

2016-01-22 Thread Mohammed Guller
Vivek,

By default, Cassandra uses ¼ of the system memory, so in your case, it will be 
around 8GB, which is fine.

If you have more Cassandra related question, it is better to post it on the 
Cassandra mailing list. Also feel free to email me directly.

Mohammed
Author: Big Data Analytics with 
Spark

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, January 22, 2016 6:37 PM
To: vivek.meghanat...@wipro.com
Cc: user
Subject: Re: Spark Cassandra clusters

I am not a Cassandra developer :-)

Can you use http://search-hadoop.com/ or ask on Cassandra mailing list.

Cheers

On Fri, Jan 22, 2016 at 6:35 PM, 
> wrote:

Thanks Ted, also what is the suggested memory setting for Cassandra process?

Regards
Vivek
On Sat, Jan 23, 2016 at 7:57 am, Ted Yu 
> wrote:

From your description, putting the Cassandra daemon on the Spark cluster should be 
feasible.

One aspect to be measured is how much locality can be achieved in this setup, since 
Cassandra is a distributed NoSQL store.

Cheers

On Fri, Jan 22, 2016 at 6:13 PM, 
> wrote:

+ spark standalone cluster
On Sat, Jan 23, 2016 at 7:33 am, Vivek Meghanathan (WT01 - NEP) 
> wrote:


We have the setup on the Google Cloud Platform. Each node has 8 CPUs + 30GB memory: 
10 nodes for Spark, another 9 nodes for Cassandra.
We are using Spark 1.3.0 and the DataStax bundle 4.5.9 (which has 2.0.x Cassandra).
The Spark master and worker daemons use Xmx & Xms of 4G. We have not changed the 
default settings of Cassandra; should we be increasing the JVM memory?

We have 9 streaming jobs; core usage varies from 2-6 and memory usage from 1-4 GB.

We have the budget to use higher-CPU or higher-memory systems, hence we were planning to 
have them together on more efficient nodes.

Regards
Vivek
On Sat, Jan 23, 2016 at 7:13 am, Ted Yu 
> wrote:

Can you give us a bit more information ?

How much memory does each node have ?
What's the current heap allocation for Cassandra process and executor ?
Spark / Cassandra release you are using

Thanks

On Fri, Jan 22, 2016 at 5:37 PM, 
> wrote:

Hi All,
What is the right Spark/Cassandra cluster setup: having the Cassandra cluster and the 
Spark cluster on different nodes, or should they be on the same nodes?
We have them on different nodes, and performance tests show very bad results for the 
Spark streaming jobs.
Please let us know.

Regards
Vivek



How to send a file to database using spark streaming

2016-01-22 Thread Sree Eedupuganti
New to Spark Streaming. My question is: I want to load XML files into a
database [Cassandra] using Spark Streaming. Any suggestions please? Thanks in
advance.
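
A rough sketch of one way to wire this up, assuming the spark-cassandra-connector is on
the classpath, one XML record per line, and placeholder keyspace/table/column names;
parseRecord below stands in for real XML parsing:

    import com.datastax.spark.connector.SomeColumns
    import com.datastax.spark.connector.streaming._
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Record(id: String, xml: String)

    // placeholder parser: a real one would deserialize the XML payload
    def parseRecord(line: String): Record = Record(line.hashCode.toString, line)

    val ssc = new StreamingContext(sc, Seconds(10))   // sc is the existing SparkContext
    ssc.textFileStream("hdfs:///incoming/xml")        // watches the directory for new files
       .map(parseRecord)
       .saveToCassandra("my_keyspace", "xml_records", SomeColumns("id", "xml"))
    ssc.start()
    ssc.awaitTermination()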

-- 
Best Regards,
Sreeharsha Eedupuganti
Data Engineer
innData Analytics Private Limited


looking for a spark admin consultant/contractor

2016-01-22 Thread Andy Davidson
I am working on a proof of concept using spark. I set up a small cluster on
AWS. I looking for some part time help with administration.

Kind Regards

Andy





Re: SparkR works from command line but not from rstudio

2016-01-22 Thread Sandeep Khurana
This problem is fixed by restarting R from RStudio. Now I see

16/01/22 08:08:38 INFO HiveMetaStore: No user is added in admin role,
since config is empty16/01/22 08:08:38 ERROR RBackendHandler: 
on org.apache.spark.sql.hive.HiveContext failedError in
value[[3L]](cond) : Spark SQL is not built with Hive support


 in RStudio while running the same code, and hive-site.xml is present in the .
It works in the sparkR shell.

Any ideas?

On Fri, Jan 22, 2016 at 4:35 PM, Sandeep Khurana 
wrote:

> Hello
>
> I installed spark in a folder. I start bin/sparkR on console. Then I
> execute below command and all work fine. I can see the data as well.
>
> hivecontext <<- sparkRHive.init(sc) ;
> df <- loadDF(hivecontext, "/someHdfsPath", "orc")
> showDF(df)
>
>
> But when I give same to rstudio, it throws the error mentioned below
>
> rstudio code
> 
> Sys.setenv(SPARK_HOME="/home/myname/spark-1.6.0-bin-hadoop2.6")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
>
> sc <- sparkR.init(master="local")
> hivecontext <<- sparkRHive.init(sc) ;
> df <- loadDF(hivecontext, "/someHdfsPath", "orc")
> print("showing df now")
> showDF(df)
>
> Error thrown from rstudio
> ===
>
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties16/01/22 06:00:12 ERROR 
> RBackendHandler: createSparkContext on org.apache.spark.api.r.RRDD 
> failedError in invokeJava(isStatic = TRUE, className, methodName, ...) :
>
>
>
>  What is different in rstudio than sparkR shell ? Should I change any
> setting to make it work in rstudio ?
>
>
>


-- 
Architect
Infoworks.io
http://Infoworks.io


Re: TaskCommitDenied (Driver denied task commit)

2016-01-22 Thread Arun Luthra
Correction. I have to use spark.yarn.am.memoryOverhead because I'm in Yarn
client mode. I set it to 13% of the executor memory.

Also quite helpful was increasing the total overall executor memory.
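
Pulling the settings from this thread together as a sketch (the executor memory figure is
a placeholder; 2000 MB and coalesce(1000) are the values mentioned in this thread):

    import org.apache.spark.SparkConf

    val executorMemoryMb = 16384                        // placeholder
    val conf = new SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "2000")
      .set("spark.yarn.am.memoryOverhead", (executorMemoryMb * 0.13).toInt.toString)  // yarn-client mode
    // and early in the pipeline, cap the partition count before the heavy stages:
    // val coalesced = rdd.coalesce(1000)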

It will be great when the Tungsten enhancements make their way into RDDs.

Thanks!

Arun

On Thu, Jan 21, 2016 at 6:19 PM, Arun Luthra  wrote:

> Two changes I made that appear to be keeping various errors at bay:
>
> 1) bumped up spark.yarn.executor.memoryOverhead to 2000 in the spirit of
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3ccacbyxkld8qasymj2ghk__vttzv4gejczcqfaw++s1d5te1d...@mail.gmail.com%3E
> . Even though I couldn't find the same error in my yarn log.
>
> 2) very important: I ran coalesce(1000) on the RDD at the start of the
> DAG. I know keeping the # of partitions lower is helpful, based on past
> experience with groupByKey. I haven't run this pipeline in a bit so that
> rule of thumb was not forefront in my mind.
>
> On Thu, Jan 21, 2016 at 5:35 PM, Arun Luthra 
> wrote:
>
>> Looking into the yarn logs for a similar job where an executor was
>> associated with the same error, I find:
>>
>> ...
>> 16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive
>> connection to (SERVER), creating a new one.
>> 16/01/22 01:17:18 *ERROR shuffle.RetryingBlockFetcher: Exception while
>> beginning fetch of 46 outstanding blocks*
>> *java.io.IOException: Failed to connect to (SERVER)*
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97)
>> at
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
>> at
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:265)
>> at
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.(ShuffleBlockFetcherIterator.scala:112)
>> at
>> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:43)
>> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>> *Caused by: java.net.ConnectException: Connection refused:* (SERVER)
>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> at
>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>> at
>> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>> at
>> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>> at
>> 

RE: Date / time stuff with spark.

2016-01-22 Thread Spencer, Alex (Santander)
Hi Andy,

Sorry this is in Scala, but you may be able to do something similar. I use 
Joda's DateTime class. I ran into a lot of difficulties with the serializer, 
but if you are an admin on the box you'll have fewer issues by adding in some 
Kryo serializers.

import org.joda.time._

val dateFormat = format.DateTimeFormat.forPattern("yyyy-MM-dd");
val tranDate = dateFormat.parseDateTime(someDateString)
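
A sketch of doing the same at the DataFrame level, assuming the ISO-8601 strings from the
sample data; if the built-in string-to-timestamp cast on your Spark version does not
accept the "+00:00" offset, a small UDF around the Joda parser above is a fallback:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.TimestampType

    val df = sqlContext.read.json("hdfs:///path/to/data")            // path is a placeholder
    val withTs = df.withColumn("ts", col("time").cast(TimestampType))
    withTs.select(col("ts"), hour(col("ts")), dayofmonth(col("ts"))).show()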


Alex
 
-Original Message-
From: Andrew Holway [mailto:andrew.hol...@otternetworks.de] 
Sent: 21 January 2016 19:25
To: user@spark.apache.org
Subject: Date / time stuff with spark.

Hello,

I am importing this data from HDFS into a data frame with 
sqlContext.read.json().

{"a": 42, "a": 56, "Id": "621368e2f829f230", "smunkId":
"CKm26sDMucoCFReRGwodbHAAgw", "popsicleRange": "17610", "time":
"2016-01-20T23:59:53+00:00"}

I want to do some date/time operations on this json data but I cannot find 
clear documentation on how to

A) specify the "time" field as a date/time in the schema.
B) what format the date should be in within the raw data so that it imports cleanly.

Cheers,

Andrew

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org



Re: [Streaming-Kafka] How to start from topic offset when streamcontext is using checkpoint

2016-01-22 Thread Cody Koeninger
Offsets are stored in the checkpoint.  If you want to manage offsets
yourself, don't restart from the checkpoint, specify the starting offsets
when you create the stream.

Have you read / watched the materials linked from

https://github.com/koeninger/kafka-exactly-once

Regarding the small files problem, either don't use HDFS, or use something
like filecrush for merging.
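
A minimal sketch of the "specify the starting offsets yourself" approach (ssc, kafkaParams,
the topic/partition map and wherever you persist the ranges are all placeholders):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // offsets would normally be loaded from your own store, not hard-coded
    val fromOffsets = Map(TopicAndPartition("sample_sample3_json", 0) -> 0L)
    val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
      (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // process rdd, then persist each (topic, partition, untilOffset) atomically with the output
    }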

On Fri, Jan 22, 2016 at 3:03 AM, Raju Bairishetti  wrote:

> Hi,
>
>
>I am very new to spark & spark-streaming. I am planning to use spark
> streaming for real time processing.
>
>I have created a streaming context and checkpointing to hdfs directory
> for recovery purposes in case of executor failures & driver failures.
>
> I am creating Dstream with offset map for getting the data from kafka. I
> am simply ignoring the offsets to understand the behavior. Whenver I
> restart application driver restored from checkpoint as expected but Dstream
> is not getting started from the initial offsets. Dstream was created with
> the last consumed offsets instead of startign from 0 offsets for each topic
> partition as I am not storing the offsets any where.
>
> def main : Unit = {
>
> var sparkStreamingContext = 
> StreamingContext.getOrCreate(SparkConstants.CHECKPOINT_DIR_LOCATION,
>   () => creatingFunc())
>
> ...
>
>
> }
>
> def creatingFunc(): Unit = {
>
> ...
>
> var offsets:Map[TopicAndPartition, Long] = 
> Map(TopicAndPartition("sample_sample3_json",0) -> 0)
>
> KafkaUtils.createDirectStream[String,String, StringDecoder, 
> StringDecoder,
> String](sparkStreamingContext, kafkaParams, offsets, messageHandler)
>
> ...
> }
>
> I want to get control over offset management at event level instead of RDD
> level to make sure that at least once delivery to end system.
>
> As per my understanding, every RDD or RDD partition will stored in hdfs as
> a file If I choose to use HDFS as output. If I use 1sec as batch interval
> then it will be ended up having huge number of small files in HDFS. Having
> small files in HDFS will leads to lots of other issues.
> Is there any way to write multiple RDDs into single file? Don't have muh
> idea about *coalesce* usage. In the worst case, I can merge all small files
> in HDFS in regular intervals.
>
> Thanks...
>
> --
> Thanks
> Raju Bairishetti
> www.lazada.com
>
>
>
>


Re: Date / time stuff with spark.

2016-01-22 Thread Durgesh Verma
Option B is good too, have date as timestamp and format later.

Thanks,
-Durgesh

> On Jan 22, 2016, at 9:50 AM, Spencer, Alex (Santander) 
>  wrote:
> 
> Hi Andy,
> 
> Sorry this is in Scala but you may be able to do something similar? I use 
> Joda's DateTime class. I ran into a lot of difficulties with the serializer, 
> but if you are an admin on the box you'll have less issues by adding in some 
> Kryo serializers.
> 
> import org.joda.time
> 
>val dateFormat = format.DateTimeFormat.forPattern("-MM-dd");
>val tranDate = dateFormat.parseDateTime(someDateString)
> 
> 
> Alex
> 
> -Original Message-
> From: Andrew Holway [mailto:andrew.hol...@otternetworks.de] 
> Sent: 21 January 2016 19:25
> To: user@spark.apache.org
> Subject: Date / time stuff with spark.
> 
> Hello,
> 
> I am importing this data from HDFS into a data frame with 
> sqlContext.read.json().
> 
> {“a": 42, “a": 56, "Id": "621368e2f829f230", “smunkId":
> "CKm26sDMucoCFReRGwodbHAAgw", “popsicleRange": "17610", "time":
> "2016-01-20T23:59:53+00:00”}
> 
> I want to do some date/time operations on this json data but I cannot find 
> clear documentation on how to
> 
> A) specify the “time” field as a date/time in the schema.
> B) the format the date should be in to be correctly in the raw data for an 
> easy import.
> 
> Cheers,
> 
> Andrew
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
> commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming : requirement failed: numRecords must not be negative

2016-01-22 Thread Ted Yu
Is it possible to reproduce the condition below with test code ?

Thanks

On Fri, Jan 22, 2016 at 7:31 AM, Afshartous, Nick 
wrote:

>
> Hello,
>
>
> We have a streaming job that consistently fails with the trace below.
> This is on an AWS EMR 4.2/Spark 1.5.2 cluster.
>
>
> This ticket looks related
>
>
> SPARK-8112 Received block event count through the StreamingListener
> can be negative
>
>
> although it appears to have been fixed in 1.5.
>
>
> Thanks for any suggestions,
>
>
> --
>
> Nick
>
>
>
> Exception in thread "main" java.lang.IllegalArgumentException: requirement
> failed: numRecords must not be negative
> at scala.Predef$.require(Predef.scala:233)
> at
> org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
> at
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:165)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
> at scala.Option.orElse(Option.scala:257)
> at
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
> at
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:120)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:247)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:245)
> at scala.util.Try$.apply(Try.scala:161)
> at
> org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:245)
> at org.apache.spark.streaming.scheduler.JobGenerator.org
> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
>


Re: Date / time stuff with spark.

2016-01-22 Thread Ted Yu
Related thread:

http://search-hadoop.com/m/q3RTtSfi342nveex1=RE+NPE+when+using+Joda+DateTime

FYI

On Fri, Jan 22, 2016 at 6:50 AM, Spencer, Alex (Santander) <
alex.spen...@santander.co.uk.invalid> wrote:

> Hi Andy,
>
> Sorry this is in Scala but you may be able to do something similar? I use
> Joda's DateTime class. I ran into a lot of difficulties with the
> serializer, but if you are an admin on the box you'll have less issues by
> adding in some Kryo serializers.
>
> import org.joda.time
>
> val dateFormat = format.DateTimeFormat.forPattern("-MM-dd");
> val tranDate = dateFormat.parseDateTime(someDateString)
>
>
> Alex
>
> -Original Message-
> From: Andrew Holway [mailto:andrew.hol...@otternetworks.de]
> Sent: 21 January 2016 19:25
> To: user@spark.apache.org
> Subject: Date / time stuff with spark.
>
> Hello,
>
> I am importing this data from HDFS into a data frame with
> sqlContext.read.json().
>
> {“a": 42, “a": 56, "Id": "621368e2f829f230", “smunkId":
> "CKm26sDMucoCFReRGwodbHAAgw", “popsicleRange": "17610", "time":
> "2016-01-20T23:59:53+00:00”}
>
> I want to do some date/time operations on this json data but I cannot find
> clear documentation on how to
>
> A) specify the “time” field as a date/time in the schema.
> B) the format the date should be in to be correctly in the raw data for an
> easy import.
>
> Cheers,
>
> Andrew
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>


Re: has any one implemented TF_IDF using ML transformers?

2016-01-22 Thread Andy Davidson
Hi Yanbo

I recently coded up the trivial example from
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
but I do not get the same results. I'll put my code up on github
over the weekend if anyone is interested.
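
As a small worked check of the MLlib formula quoted below, here is the smoothed IDF
computed by hand for the four-document Chinese/Tokyo training set; if I recall the
textbook example correctly, it uses raw term counts with add-one smoothing rather than
TF-IDF weights, which by itself would explain numbers that do not line up:

    val m = 4.0                                   // number of training documents
    val idfChinese = math.log((m + 1) / (4 + 1))  // "Chinese" appears in all 4 docs -> 0.0
    val idfTokyo   = math.log((m + 1) / (1 + 1))  // "Tokyo" appears in 1 doc -> ~0.916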

Andy

From:  Yanbo Liang 
Date:  Tuesday, January 19, 2016 at 1:11 AM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: has any one implemented TF_IDF using ML transformers?

> Hi Andy,
> 
> The equation to calculate IDF is:
> idf = log((m + 1) / (d(t) + 1))
> you can refer here:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/sp
> ark/mllib/feature/IDF.scala#L150
> 
> The equation to calculate TFIDF is:
> TFIDF=TF * IDF
> you can refer: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/sp
> ark/mllib/feature/IDF.scala#L226
> 
> 
> Thanks
> Yanbo
> 
> 2016-01-19 7:05 GMT+08:00 Andy Davidson :
>> Hi Yanbo
>> 
>> I am using 1.6.0. I am having a hard time trying to figure out what the
>> exact equation is. I do not know Scala.
>> 
>> I took a look a the source code URL  you provide. I do not know Scala
>> 
>>   override def transform(dataset: DataFrame): DataFrame = {
>>     transformSchema(dataset.schema, logging = true)
>>     val idf = udf { vec: Vector => idfModel.transform(vec) }
>>     dataset.withColumn($(outputCol), idf(col($(inputCol))))
>>   }
>> 
>> 
>> You mentioned the doc is out of date.
>> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
>> 
>> Based on my understanding of the subject matter the equations in the java doc
>> are correct. I could not find anything like the equations in the source code?
>> 
>> IDF(t,D) = log((|D| + 1) / (DF(t,D) + 1)),
>> 
>> TFIDF(t,d,D) = TF(t,d) * IDF(t,D).
>> 
>> 
>> I found the Spark unit test org.apache.spark.mllib.feature.JavaTfIdfSuite; the
>> results do not match the equation. (In general the unit test asserts seem
>> incomplete.)
>> 
>> 
>> I have created several small test examples to try and figure out how to use
>> NaiveBayes, HashingTF, and IDF. The values of TFIDF, theta, probabilities, …
>> The results produced by Spark do not match the published results at
>> http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
>> 
>> 
>> Kind regards
>> 
>> Andy 
>> 
>> private DataFrame createTrainingData() {
>> 
>> // make sure we only use dictionarySize words
>> 
>> JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(
>> 
>> // 0 is Chinese
>> 
>> // 1 in notChinese
>> 
>> RowFactory.create(0, 0.0, Arrays.asList("Chinese", "Beijing",
>> "Chinese")),
>> 
>> RowFactory.create(1, 0.0, Arrays.asList("Chinese", "Chinese",
>> "Shanghai")),
>> 
>> RowFactory.create(2, 0.0, Arrays.asList("Chinese", "Macao")),
>> 
>> RowFactory.create(3, 1.0, Arrays.asList("Tokyo", "Japan",
>> "Chinese"))));
>> 
>>
>> 
>> return createData(rdd);
>> 
>> }
>> 
>> 
>> 
>> private DataFrame createData(JavaRDD<Row> rdd) {
>> 
>> StructField id = null;
>> 
>> id = new StructField("id", DataTypes.IntegerType, false,
>> Metadata.empty());
>> 
>> 
>> 
>> StructField label = null;
>> 
>> label = new StructField("label", DataTypes.DoubleType, false,
>> Metadata.empty());
>> 
>>
>> 
>> StructField words = null;
>> 
>> words = new StructField("words",
>> DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty());
>> 
>> 
>> 
>> StructType schema = new StructType(new StructField[] { id, label,
>> words });
>> 
>> DataFrame ret = sqlContext.createDataFrame(rdd, schema);
>> 
>> 
>> 
>> return ret;
>> 
>> }
>> 
>> 
>> 
>>private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
>> 
>> HashingTF hashingTF = new HashingTF()
>> 
>> .setInputCol("words")
>> 
>> .setOutputCol("tf")
>> 
>> .setNumFeatures(dictionarySize);
>> 
>> 
>> 
>> DataFrame termFrequenceDF = hashingTF.transform(rawDF);
>> 
>> 
>> 
>> termFrequenceDF.cache(); // idf needs to make 2 passes over data set
>> 
>> //val idf = new IDF(minDocFreq = 2).fit(tf)
>> 
>> IDFModel idf = new IDF()
>> 
>> //.setMinDocFreq(1) // our vocabulary has 6 words we
>> hash into 7
>> 
>> .setInputCol(hashingTF.getOutputCol())
>> 
>> .setOutputCol("idf")
>> 
>> .fit(termFrequenceDF);
>> 
>> 
>> 
>> DataFrame ret = idf.transform(termFrequenceDF);
>> 
>> 
>> 
>> return ret;
>> 
>> }
>> 
>> 
>> 
>> |-- id: integer (nullable = false)
>> 
>>  |-- label: double (nullable = false)
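
For reference, here is a small, self-contained Scala sketch (mine, not taken from the thread) that applies the smoothed IDF formula quoted above, idf = log((m + 1) / (d(t) + 1)), to the four-document toy corpus from the Java code in this thread. The "+ 1" smoothing is why a term that occurs in every document, such as "Chinese", gets an IDF of exactly 0.

// Plain Scala, no Spark required; the corpus is copied from the Java example above.
val docs = Seq(
  Seq("Chinese", "Beijing", "Chinese"),
  Seq("Chinese", "Chinese", "Shanghai"),
  Seq("Chinese", "Macao"),
  Seq("Tokyo", "Japan", "Chinese"))

val m = docs.size  // number of documents, |D|

// document frequency d(t): in how many documents does each term occur?
val docFreq: Map[String, Int] =
  docs.flatMap(_.distinct).groupBy(identity).mapValues(_.size).toMap

// smoothed IDF, matching the formula quoted earlier in this thread
val idf: Map[String, Double] =
  docFreq.map { case (term, dt) => term -> math.log((m + 1.0) / (dt + 1.0)) }

idf.toSeq.sortBy(_._1).foreach { case (term, value) => println(f"$term%-10s $value%.4f") }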

Tool for Visualization /Plotting of K means cluster

2016-01-22 Thread Ashutosh Kumar
I am looking for an easy-to-use visualization tool for a KMeansModel
produced as a result of clustering.

Thanks
Ashutosh
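
No specific tool is named in this thread; purely as a sketch, one common workaround is to export the model's cluster centres and per-point assignments and plot them with an external tool (matplotlib, R, a notebook, and so on). In the hypothetical Scala snippet below, the data RDD and output directory are assumptions on my part; clusterCenters and predict are part of MLlib's KMeansModel API.

// Hedged sketch only: dump assignments and centres as CSV for external plotting.
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def exportForPlotting(model: KMeansModel, data: RDD[Vector], outDir: String): Unit = {
  // one line per point: clusterId,feature1,feature2,...
  data.map(v => (model.predict(v).toDouble +: v.toArray).mkString(","))
      .saveAsTextFile(outDir + "/assignments")

  // one line per cluster centre: centreId,feature1,feature2,...
  model.clusterCenters.zipWithIndex.foreach { case (c, i) =>
    println((i.toDouble +: c.toArray).mkString(","))
  }
}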


Re: 10hrs of Scheduler Delay

2016-01-22 Thread Muthu Jayakumar
If you turn on GC logging (for example "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
in the executor JVM options) you would be able to see why some jobs run for a long time.
The tuning guide (http://spark.apache.org/docs/latest/tuning.html) provides
some insight on this. Setting an explicit partition count helped in my case when
I was using RDDs.

Hope this helps.
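
As an illustration only (not code from this thread), the two suggestions above could look roughly like the following; the option names are standard Spark settings, while the application name, input path and partition count are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("scheduler-delay-diagnostics")           // placeholder name
  .set("spark.executor.extraJavaOptions",
       "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")   // GC details show up in executor stdout

val sc = new SparkContext(conf)

// Set an explicit partition count instead of relying on the default parallelism.
val input = sc.textFile("hdfs:///path/to/input")       // placeholder path
  .repartition(200)                                    // placeholder count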

On Fri, Jan 22, 2016 at 1:51 PM, Darren Govoni  wrote:

> Thanks for the tip. I will try it. But this is the kind of thing spark is
> supposed to figure out and handle. Or at least not get stuck forever.
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: Muthu Jayakumar 
> Date: 01/22/2016 3:50 PM (GMT-05:00)
> To: Darren Govoni , "Sanders, Isaac B" <
> sande...@rose-hulman.edu>, Ted Yu 
> Cc: user@spark.apache.org
> Subject: Re: 10hrs of Scheduler Delay
>
> Does increasing the number of partition helps? You could try out something
> 3 times what you currently have.
> Another trick I used was to partition the problem into multiple dataframes,
> run them sequentially and persist the results, and then run a union on
> the results.
>
> Hope this helps.
>
> On Fri, Jan 22, 2016, 3:48 AM Darren Govoni  wrote:
>
>> Me too. I had to shrink my dataset to get it to work. For us at least
>> Spark seems to have scaling issues.
>>
>>
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>>
>>  Original message 
>> From: "Sanders, Isaac B" 
>> Date: 01/21/2016 11:18 PM (GMT-05:00)
>> To: Ted Yu 
>> Cc: user@spark.apache.org
>> Subject: Re: 10hrs of Scheduler Delay
>>
>> I have run the driver on a smaller dataset (k=2, n=5000) and it worked
>> quickly and didn’t hang like this. This dataset is closer to k=10, n=4.4m,
>> but I am using more resources on this one.
>>
>> - Isaac
>>
>> On Jan 21, 2016, at 11:06 PM, Ted Yu  wrote:
>>
>> You may have seen the following on github page:
>>
>> Latest commit 50fdf0e  on Feb 22, 2015
>>
>> That was 11 months ago.
>>
>> Can you search for a similar algorithm which runs on Spark and is newer ?
>>
>> If nothing found, consider running the tests coming from the project to
>> determine whether the delay is intrinsic.
>>
>> Cheers
>>
>> On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B <
>> sande...@rose-hulman.edu> wrote:
>>
>>> That thread seems to be moving; it oscillates between a few different
>>> traces… Maybe it is working. It seems odd that it would take that long.
>>>
>>> This is 3rd party code, and after looking at some of it, I think it
>>> might not be as Spark-y as it could be.
>>>
>>> I linked it below. I don’t know a lot about spark, so it might be fine,
>>> but I have my suspicions.
>>>
>>>
>>> https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala
>>>
>>> - Isaac
>>>
>>> On Jan 21, 2016, at 10:08 PM, Ted Yu  wrote:
>>>
>>> You may have noticed the following - did this indicate prolonged
>>> computation in your code ?
>>>
>>> org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
>>> org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
>>> org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
>>> org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)
>>>
>>>
>>> On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B <
>>> sande...@rose-hulman.edu> wrote:
>>>
 Hadoop is: HDP 2.3.2.0-2950

 Here is a gist (pastebin) of my versions en masse and a stacktrace:
 https://gist.github.com/isaacsanders/2e59131758469097651b

 Thanks

 On Jan 21, 2016, at 7:44 PM, Ted Yu  wrote:

 Looks like you were running on YARN.

 What hadoop version are you using ?

 Can you capture a few stack traces of the AppMaster during the delay
 and pastebin them ?

 Thanks

 On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B <
 sande...@rose-hulman.edu> wrote:

> The Spark Version is 1.4.1
>
> The logs are full of standard fare, nothing like an exception or even
> interesting [INFO] lines.
>
> Here is the script I am using:
> https://gist.github.com/isaacsanders/660f480810fbc07d4df2
>
> Thanks
> Isaac
>
> On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:
>
> Can you provide a bit more information ?
>
> command line for submitting Spark job
> version of Spark
> anything interesting from driver / executor logs ?
>
> Thanks
>
> On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B <
> sande...@rose-hulman.edu> wrote:
>
>> Hey all,

Re: Spark Cassandra clusters

2016-01-22 Thread Ted Yu
Vivek:
I searched for 'cassandra gc pause' and found a few hits.
e.g. :
http://search-hadoop.com/m/qZFqM1c5nrn1Ihwf6=Re+GC+pauses+affecting+entire+cluster+

Keep in mind the effect of GC on shared nodes.

FYI

On Fri, Jan 22, 2016 at 7:09 PM, Mohammed Guller 
wrote:

> For data locality, it is recommended to run the Spark workers and
> Cassandra on the same nodes.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* vivek.meghanat...@wipro.com [mailto:vivek.meghanat...@wipro.com]
> *Sent:* Friday, January 22, 2016 5:38 PM
> *To:* user@spark.apache.org
> *Subject:* Spark Cassandra clusters
>
>
>
> Hi All,
> What is the right spark Cassandra cluster setup - having Cassandra cluster
> and spark cluster in different nodes or they should be on same nodes.
> We are having them in different nodes and performance test shows very bad
> result for the spark streaming jobs.
> Please let us know.
>
> Regards
> Vivek
>
>


Spark LDA

2016-01-22 Thread Ilya Ganelin
Hi all - I'm running the Spark LDA algorithm on a dataset of roughly 3
million terms with a resulting RDD of approximately 20 GB on a 5-node
cluster with 10 executors (3 cores each) and 14 GB of memory per executor.

As the application runs, I'm seeing progressively longer execution times
for the mapPartitions stage (18s - 56s - 3.4min) being caused by
progressively longer shuffle read times. Is there any way to speed this up or
tune it out? My configs are below.

screen spark-shell --driver-memory 15g --num-executors 10 --executor-cores
3
--conf "spark.executor.memory=14g"
--conf "spark.io.compression.codec=lz4"
--conf "spark.shuffle.consolidateFiles=true"
--conf "spark.dynamicAllocation.enabled=false"
--conf "spark.shuffle.manager=tungsten-sort"
--conf "spark.akka.frameSize=1028"
--conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m
-XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+AggressiveOpts -XX:+UseCompressedOops" --master yarn-client

-Ilya Ganelin
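
Not from this thread, but one knob that sometimes helps when iterations of an iterative MLlib algorithm get progressively slower is checkpointing, which periodically truncates the growing lineage. A minimal sketch follows, with placeholder paths and values and a toy corpus standing in for the real 20 GB one; `sc` is the spark-shell SparkContext.

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

sc.setCheckpointDir("hdfs:///tmp/lda-checkpoints")   // placeholder path

// toy stand-in corpus of (docId, termCountVector) pairs
val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 0.0, 2.0)),
  (1L, Vectors.dense(0.0, 3.0, 1.0))))

val ldaModel = new LDA()
  .setK(2)                        // placeholder topic count
  .setMaxIterations(50)
  .setCheckpointInterval(10)      // checkpoint every 10 iterations
  .run(corpus)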


Re: Spark Cassandra clusters

2016-01-22 Thread Durgesh Verma
This may be useful; you can try the connectors:
https://academy.datastax.com/demos/getting-started-apache-spark-and-cassandra

https://spark-summit.org/2015/events/cassandra-and-spark-optimizing-for-data-locality/

Thanks,
-Durgesh

> On Jan 22, 2016, at 8:37 PM,  
>  wrote:
> 
> Hi All,
> What is the right spark Cassandra cluster setup - having Cassandra cluster 
> and spark cluster in different nodes or they should be on same nodes.
> We are having them in different nodes and performance test shows very bad 
> result for the spark streaming jobs.
> Please let us know.
> 
> Regards
> Vivek
> 


Re: Spark Cassandra clusters

2016-01-22 Thread vivek.meghanathan
+ spark standalone cluster

On Sat, Jan 23, 2016 at 7:33 am, Vivek Meghanathan (WT01 - NEP) 
> wrote:


We have the setup on Google cloud platform. Each node has 8 CPU + 30GB memory. 
10 nodes for spark another 9nodes for Cassandra.
We are using spark 1.3.0 and Datastax bundle 4.5.9(which has 2.0.x Cassandra).
Spark master and worker daemon uses Xmx & Xms 4G. We have not changed the 
default setting of Cassandra, should we be increasing the JVM memory?

we have 9 streaming jobs the core usage varies from 2-6 and memory usage from 1 
- 4 gb.

We have budget to use higher CPU or higher memory systems hence was planning to 
have them together on more efficient nodes.

Regards
Vivek
On Sat, Jan 23, 2016 at 7:13 am, Ted Yu 
> wrote:

Can you give us a bit more information ?

How much memory does each node have ?
What's the current heap allocation for Cassandra process and executor ?
Spark / Cassandra release you are using

Thanks

On Fri, Jan 22, 2016 at 5:37 PM, 
> wrote:

Hi All,
What is the right spark Cassandra cluster setup - having Cassandra cluster and 
spark cluster in different nodes or they should be on same nodes.
We are having them in different nodes and performance test shows very bad 
result for the spark streaming jobs.
Please let us know.

Regards
Vivek




Re: Spark Cassandra clusters

2016-01-22 Thread Ted Yu
From your description, putting the Cassandra daemon on the Spark cluster should be
feasible.

One aspect to be measured is how much locality can be achieved in this
setup - Cassandra is a distributed NoSQL store.

Cheers
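
Purely as a sketch (not from this thread): with the DataStax spark-cassandra-connector, reads are split by Cassandra token range and the connector reports preferred locations for each partition, so when a Spark worker shares a host with the replica owning a range the task can run NODE_LOCAL. Host, keyspace and table names below are placeholders.

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-locality-check")
  .set("spark.cassandra.connection.host", "10.0.0.1")   // placeholder seed node

val sc = new SparkContext(conf)

// Check the "Locality Level" column in the Spark UI to see how many tasks ran NODE_LOCAL.
val rdd = sc.cassandraTable("my_keyspace", "my_table")
println("row count: " + rdd.count())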

On Fri, Jan 22, 2016 at 6:13 PM,  wrote:

> + spark standalone cluster
> On Sat, Jan 23, 2016 at 7:33 am, Vivek Meghanathan (WT01 - NEP) <
> vivek.meghanat...@wipro.com> wrote:
>
> We have the setup on Google cloud platform. Each node has 8 CPU + 30GB
> memory. 10 nodes for spark another 9nodes for Cassandra.
> We are using spark 1.3.0 and Datastax bundle 4.5.9(which has 2.0.x
> Cassandra).
> Spark master and worker daemon uses Xmx & Xms 4G. We have not changed the
> default setting of Cassandra, should we be increasing the JVM memory?
>
> we have 9 streaming jobs the core usage varies from 2-6 and memory usage
> from 1 - 4 gb.
>
> We have budget to use higher CPU or higher memory systems hence was
> planning to have them together on more efficient nodes.
>
> Regards
> Vivek
> On Sat, Jan 23, 2016 at 7:13 am, Ted Yu  wrote:
>
> Can you give us a bit more information ?
>
> How much memory does each node have ?
> What's the current heap allocation for Cassandra process and executor ?
> Spark / Cassandra release you are using
>
> Thanks
>
> On Fri, Jan 22, 2016 at 5:37 PM,  wrote:
>
>> Hi All,
>> What is the right spark Cassandra cluster setup - having Cassandra
>> cluster and spark cluster in different nodes or they should be on same
>> nodes.
>> We are having them in different nodes and performance test shows very bad
>> result for the spark streaming jobs.
>> Please let us know.
>>
>> Regards
>> Vivek
>>
>>
>
>


Re: Spark Cassandra clusters

2016-01-22 Thread vivek.meghanathan
Thanks Ted. Also, what is the suggested memory setting for the Cassandra process?

Regards
Vivek

On Sat, Jan 23, 2016 at 7:57 am, Ted Yu 
> wrote:

From your description, putting Cassandra daemon on Spark cluster should be
feasible.

One aspect to be measured is how much locality can be achieved in this setup - 
Cassandra is distributed NoSQL store.

Cheers

On Fri, Jan 22, 2016 at 6:13 PM, 
> wrote:

+ spark standalone cluster

On Sat, Jan 23, 2016 at 7:33 am, Vivek Meghanathan (WT01 - NEP) 
> wrote:


We have the setup on Google cloud platform. Each node has 8 CPU + 30GB memory. 
10 nodes for spark another 9nodes for Cassandra.
We are using spark 1.3.0 and Datastax bundle 4.5.9(which has 2.0.x Cassandra).
Spark master and worker daemon uses Xmx & Xms 4G. We have not changed the 
default setting of Cassandra, should we be increasing the JVM memory?

we have 9 streaming jobs the core usage varies from 2-6 and memory usage from 1 
- 4 gb.

We have budget to use higher CPU or higher memory systems hence was planning to 
have them together on more efficient nodes.

Regards
Vivek
On Sat, Jan 23, 2016 at 7:13 am, Ted Yu 
> wrote:

Can you give us a bit more information ?

How much memory does each node have ?
What's the current heap allocation for Cassandra process and executor ?
Spark / Cassandra release you are using

Thanks

On Fri, Jan 22, 2016 at 5:37 PM, 
> wrote:

Hi All,
What is the right spark Cassandra cluster setup - having Cassandra cluster and 
spark cluster in different nodes or they should be on same nodes.
We are having them in different nodes and performance test shows very bad 
result for the spark streaming jobs.
Please let us know.

Regards
Vivek




Re: Spark Cassandra clusters

2016-01-22 Thread Ted Yu
I am not a Cassandra developer :-)

Can you use http://search-hadoop.com/ or ask on the Cassandra mailing list?

Cheers

On Fri, Jan 22, 2016 at 6:35 PM,  wrote:

> Thanks Ted, also what is the suggested memory setting for Cassandra
> process?
>
> Regards
> Vivek
> On Sat, Jan 23, 2016 at 7:57 am, Ted Yu  wrote:
>
> From your description, putting Cassandra daemon on Spark cluster should
> be feasible.
>
> One aspect to be measured is how much locality can be achieved in this
> setup - Cassandra is distributed NoSQL store.
>
> Cheers
>
> On Fri, Jan 22, 2016 at 6:13 PM,  wrote:
>
>> + spark standalone cluster
>> On Sat, Jan 23, 2016 at 7:33 am, Vivek Meghanathan (WT01 - NEP) <
>> vivek.meghanat...@wipro.com> wrote:
>>
>> We have the setup on Google cloud platform. Each node has 8 CPU + 30GB
>> memory. 10 nodes for spark another 9nodes for Cassandra.
>> We are using spark 1.3.0 and Datastax bundle 4.5.9(which has 2.0.x
>> Cassandra).
>> Spark master and worker daemon uses Xmx & Xms 4G. We have not changed the
>> default setting of Cassandra, should we be increasing the JVM memory?
>>
>> we have 9 streaming jobs the core usage varies from 2-6 and memory usage
>> from 1 - 4 gb.
>>
>> We have budget to use higher CPU or higher memory systems hence was
>> planning to have them together on more efficient nodes.
>>
>> Regards
>> Vivek
>> On Sat, Jan 23, 2016 at 7:13 am, Ted Yu  wrote:
>>
>> Can you give us a bit more information ?
>>
>> How much memory does each node have ?
>> What's the current heap allocation for Cassandra process and executor ?
>> Spark / Cassandra release you are using
>>
>> Thanks
>>
>> On Fri, Jan 22, 2016 at 5:37 PM,  wrote:
>>
>>> Hi All,
>>> What is the right spark Cassandra cluster setup - having Cassandra
>>> cluster and spark cluster in different nodes or they should be on same
>>> nodes.
>>> We are having them in different nodes and performance test shows very
>>> bad result for the spark streaming jobs.
>>> Please let us know.
>>>
>>> Regards
>>> Vivek
>>>
>


RE: Date / time stuff with spark.

2016-01-22 Thread Mohammed Guller
Hi Andrew,

Here is another option.

You can define a custom schema to specify the correct type for the time column as 
shown below:

import org.apache.spark.sql.types._

val customSchema =
  StructType(
StructField("a", IntegerType, false) ::
StructField("b", LongType, false) ::
StructField("Id", StringType, false) ::
StructField("smunkId", StringType, false) ::
StructField("popsicleRange", LongType, false) ::
StructField("time", TimestampType, false) :: Nil
)

val df = sqlContext.read.schema(customSchema).json("...")

You can also use the built-in date/time functions to manipulate the time column 
as shown below.

import org.apache.spark.sql.functions._   // dayofmonth, month, year, ...
import sqlContext.implicits._             // enables the $"column" syntax

val day = df.select(dayofmonth($"time"))
val mth = df.select(month($"time"))
val prevYear = df.select(year($"time") - 1)

You can get the complete list of the date/time functions here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
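
One more option, offered only as a sketch (not part of the reply above): if the time column is kept as a string, Spark 1.5+ can convert it with unix_timestamp() given an explicit pattern, reusing the df defined earlier. The column name and pattern are assumptions based on the sample record in this thread.

import org.apache.spark.sql.functions._

// assumes ISO-8601 values like 2016-01-20T23:59:53+00:00 in a string column "time"
val withTs = df.withColumn(
  "ts", unix_timestamp(col("time"), "yyyy-MM-dd'T'HH:mm:ssXXX").cast("timestamp"))

withTs.select(year(col("ts")), month(col("ts")), dayofmonth(col("ts"))).show()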

Mohammed
Author: Big Data Analytics with 
Spark

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, January 22, 2016 8:01 AM
To: Spencer, Alex (Santander)
Cc: Andrew Holway; user@spark.apache.org
Subject: Re: Date / time stuff with spark.

Related thread:

http://search-hadoop.com/m/q3RTtSfi342nveex1=RE+NPE+when+using+Joda+DateTime

FYI

On Fri, Jan 22, 2016 at 6:50 AM, Spencer, Alex (Santander) 
>
 wrote:
Hi Andy,

Sorry this is in Scala, but you may be able to do something similar. I use 
Joda's DateTime class. I ran into a lot of difficulties with the serializer, 
but if you are an admin on the box you'll have fewer issues if you add in some 
Kryo serializers.

import org.joda.time._

val dateFormat = format.DateTimeFormat.forPattern("yyyy-MM-dd")
val tranDate = dateFormat.parseDateTime(someDateString)


Alex
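
For what it's worth, "adding in some Kryo serializers" could look roughly like the sketch below. This is my own assumption of what that involves, not Alex's code, and it relies on the third-party de.javakaffee:kryo-serializers artifact for JodaDateTimeSerializer.

import com.esotericsoftware.kryo.Kryo
import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer
import org.apache.spark.serializer.KryoRegistrator

// Registrator class referenced from the Spark config below.
class JodaKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[org.joda.time.DateTime], new JodaDateTimeSerializer())
  }
}

// And in the SparkConf:
//   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//   .set("spark.kryo.registrator", "JodaKryoRegistrator")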

-Original Message-
From: Andrew Holway 
[mailto:andrew.hol...@otternetworks.de]
Sent: 21 January 2016 19:25
To: user@spark.apache.org
Subject: Date / time stuff with spark.

Hello,

I am importing this data from HDFS into a data frame with 
sqlContext.read.json().

{"a": 42, "a": 56, "Id": "621368e2f829f230", "smunkId":
"CKm26sDMucoCFReRGwodbHAAgw", "popsicleRange": "17610", "time":
"2016-01-20T23:59:53+00:00"}

I want to do some date/time operations on this json data but I cannot find 
clear documentation on how to

A) specify the "time" field as a date/time in the schema.
B) know what format the date should take in the raw data for an easy import.

Cheers,

Andrew


Re: Spark Cassandra clusters

2016-01-22 Thread Ted Yu
Can you give us a bit more information ?

How much memory does each node have ?
What's the current heap allocation for Cassandra process and executor ?
Spark / Cassandra release you are using

Thanks

On Fri, Jan 22, 2016 at 5:37 PM,  wrote:

> Hi All,
> What is the right spark Cassandra cluster setup - having Cassandra cluster
> and spark cluster in different nodes or they should be on same nodes.
> We are having them in different nodes and performance test shows very bad
> result for the spark streaming jobs.
> Please let us know.
>
> Regards
> Vivek
>
>


Spark Cassandra clusters

2016-01-22 Thread vivek.meghanathan
Hi All,
What is the right Spark-Cassandra cluster setup - should the Cassandra cluster and 
the Spark cluster be on different nodes, or on the same nodes?
We currently have them on different nodes, and our performance tests show very bad 
results for the Spark streaming jobs.
Please let us know.

Regards
Vivek


The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments. WARNING: Computer viruses can be transmitted via email. The 
recipient should check this email and any attachments for the presence of 
viruses. The company accepts no liability for any damage caused by any virus 
transmitted by this email. www.wipro.com