Re: RDD boundaries and triggering processing using tags in the data

2015-06-01 Thread Akhil Das
Maybe you can make use of the Window operations.
Another approach would be to keep your incoming data in an
HBase/Redis/Cassandra kind of database, and then whenever you need the
average you just query the database and compute it.
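
For readers unfamiliar with them, a hedged Scala sketch of the window operations being suggested; the stream source, durations, and key format are made up, and as David notes below, windows group by time rather than by the "10" tag, so they only approximate his requirement:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("SensorWindows").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

// Assume readings arrive as "sensorId,value" lines on a socket (placeholder source).
val readings = ssc.socketTextStream("localhost", 9999)
  .map(_.split(","))
  .map(parts => (parts(0), parts(1).toDouble))

// Average each sensor's values over a sliding 30-second window, every 10 seconds.
// Note: this groups by time, not by the "10" tag in the data.
val windowedAvg = readings
  .groupByKeyAndWindow(Seconds(30), Seconds(10))
  .mapValues(vs => vs.sum / vs.size)

windowedAvg.print()
ssc.start()
ssc.awaitTermination()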

Thanks
Best Regards

On Thu, May 28, 2015 at 1:22 AM, David Webber 
wrote:

> Hi All,
>
> I'm new to Spark and I'd like some help understanding if a particular use
> case would be a good fit for Spark Streaming.
>
> I have an imaginary stream of sensor data consisting of integers 1-10.
> Every time the sensor reads "10" I'd like to average all the numbers that
> were received since the last "10"
>
> example input: 10 5 8 4 6 2 1 2 8 8 8 1 6 9 1 3 10 1 3 10 ...
> desired output: 4.8, 2.0
>
> I'm confused about what happens if sensor readings fall into different
> RDDs.
>
> RDD1:  10 5 8 4 6 2 1 2 8 8 8
> RDD2:  1 6 9 1 3 10 1 3 10
> output: ???, 2.0
>
> My imaginary sensor doesn't read at fixed time intervals, so breaking the
> stream into RDDs by time interval won't ensure the data is packaged
> properly.  Additionally, multiple sensors are writing to the same stream
> (though I think flatMap can parse the origin stream into streams for
> individual sensors, correct?).
>
> My best guess for processing goes like
> 1) flatMap() to break out individual sensor streams
> 2) Custom parser to accumulate input data until "10" is found, then create
> a
> new output RDD for each sensor and data grouping
> 3) average the values from step 2
>
> I would greatly appreciate pointers to some specific documentation or
> examples if you have seen something like this before.
>
> Thanks,
> David
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-boundaries-and-triggering-processing-using-tags-in-the-data-tp23060.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SparkSQL can't read S3 path for hive external table

2015-06-01 Thread Akhil Das
This thread has various methods on accessing S3 from Spark; it might help you.
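
The linked thread is not reproduced in this digest; as a hedged illustration, one common way to make an s3n:// path readable from Spark is to put the credentials on the Hadoop configuration (bucket and keys below are placeholders). Note that the actual error in this thread turned out to be SPARK-6351, fixed in Spark 1.3.1, as a later reply in this digest points out.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("S3Read"))
// Placeholder credentials; standard Hadoop s3n configuration keys.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// Reads directly from the bucket once credentials are in place.
val lines = sc.textFile("s3n://your-bucket/path/to/data")
println(lines.count())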

Thanks
Best Regards

On Sun, May 24, 2015 at 8:03 AM, ogoh  wrote:

>
> Hello,
> I am using Spark1.3 in AWS.
> SparkSQL can't recognize Hive external table on S3.
> The following is the error message.
> I appreciate any help.
> Thanks,
> Okehee
> --
> 15/05/24 01:02:18 ERROR thriftserver.SparkSQLDriver: Failed in [select
> count(*) from api_search where pdate='2015-05-08']
> java.lang.IllegalArgumentException: Wrong FS:
>
> s3://test-emr/datawarehouse/api_s3_perf/api_search/pdate=2015-05-08/phour=00,
> expected: hdfs://10.128.193.211:9000
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647)
> at
> org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:467)
> at
>
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
> at
>
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-can-t-read-S3-path-for-hive-external-table-tp23002.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Windows of windowed streams not displaying the expected results

2015-06-01 Thread DMiner
Yes, I also ran into this issue. I wanted to check whether you fixed it or
found another solution for the same goal.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Windows-of-windowed-streams-not-displaying-the-expected-results-tp466p23096.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Don't understand "schedule jobs within an Application

2015-06-01 Thread bit1...@163.com
Hi, sparks,

Following is copied from the spark online document 
http://spark.apache.org/docs/latest/job-scheduling.html. 

Basically, I have two questions on it:

1. If two jobs in an application have dependencies, that is, one job depends on 
the result of the other job, then I think they will have to run sequentially.
2. Since job scheduling happens within one application, I don't think job 
scheduling will benefit multiple users as the last sentence says. In my 
opinion, multiple users can benefit only from cross-application scheduling.

Maybe I haven't understood job scheduling well; could someone 
elaborate on this? Thanks very much.






By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided 
into “stages” (e.g. map and reduce phases), and the first job gets priority on 
all available resources while its stages have tasks to launch, then the second 
job gets priority, etc. If the jobs at the head of the queue don’t need to use 
the whole cluster, later jobs can start to run right away, but if the jobs at 
the head of the queue are large, then later jobs may be delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between 
jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” 
fashion, so that all jobs get a roughly equal share of cluster resources. This 
means that short jobs submitted while a long job is running can start receiving 
resources right away and still get good response times, without waiting for the 
long job to finish. This mode is best for multi-user settings
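
For context, an illustrative sketch (not from the quoted docs) of what per-application scheduling buys you: with spark.scheduler.mode set to FAIR, independent jobs submitted from separate threads of the same driver share the cluster instead of queueing behind each other. Names and data are made up.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("FairSchedulingDemo")
  .set("spark.scheduler.mode", "FAIR")   // default is FIFO
val sc = new SparkContext(conf)

val a = sc.parallelize(1 to 1000000)
val b = sc.parallelize(1 to 1000000)

// Two unrelated jobs; running them from different threads lets the scheduler
// interleave their tasks instead of finishing one before starting the other.
val t1 = new Thread(new Runnable { def run(): Unit = println(a.sum()) })
val t2 = new Thread(new Runnable { def run(): Unit = println(b.count()) })
t1.start(); t2.start()
t1.join(); t2.join()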




bit1...@163.com


RE: Spark Executor Memory Usage

2015-06-01 Thread HuS . Andy
#1 I'm not sure I got your point. As far as I know, Xmx is not turned into physical 
memory as soon as the process starts running; it is first mapped into virtual memory, 
and if your heap needs more, physical memory gradually increases up to the max heap.
#2 Physical memory contains not only the heap, but also the stack, direct memory, 
shared libraries, and perm space. There are also the VSS, RSS, PSS, and USS concepts, 
which you can google. Simply put:
Vss = virtual set size
Rss = resident set size
Pss = proportional set size
Uss = unique set size

Best Regards,
Andy Hu(胡 珊)

Date: Fri, 29 May 2015 07:41:41 -0700
Subject: Re: Spark Executor Memory Usage
From: yuzhih...@gmail.com
To: valeramoisee...@gmail.com
CC: user@spark.apache.org

For #2, see 
http://unix.stackexchange.com/questions/65835/htop-reporting-much-higher-memory-usage-than-free-or-top
Cheers
On Fri, May 29, 2015 at 6:56 AM, Valerii Moisieienko 
 wrote:
Hello!

My name is Valerii. I have noticed strange memory behaviour of Spark's

executor on my cluster. Cluster works in standalone mode with 3 workers.

Application runs in cluster mode.

From topology configuration

spark.executor.memory  1536m

I checked heap usage via JVisualVM:

http://joxi.ru/Q2KqBMdSvYpDrj

and via htop:

http://joxi.ru/Vm63RWeCvG6L2Z



I have 2 questions regarding Spark's executors memory usage:

1. Why does Max Heap Size change during executor work?

2. Why is memory usage reported by htop greater than the executor's heap size?



Thank you!









--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Executor-Memory-Usage-tp23083.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.



-

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org




  

Cassanda example

2015-06-01 Thread Yasemin Kaya
Hi,

I want to write my RDD to a Cassandra database, and I took an example from
this site.
I added that to my project, but I have errors. Here is my project in a gist.

Errors:

   - At line 40 (cannot recognize Session)
   - At line 106 (flatMap is not applicable)

Have a nice day
yasemin

-- 
hiç ender hiç


RE: FW: Websphere MQ as a data source for Apache Spark Streaming

2015-06-01 Thread Chaudhary, Umesh
Thanks for your suggestion.
Yes, by DStream.saveAsTextFile().
I was making a mistake by writing StorageLevel.NULL while overriding the 
storageLevel method in my custom receiver.
When I changed it to StorageLevel.MEMORY_AND_DISK_2(), the data started being saved 
to disk.
Now it's running without any issue.


From: Tathagata Das [mailto:t...@databricks.com]
Sent: Friday, May 29, 2015 3:30 AM
To: Chaudhary, Umesh
Cc: Arush Kharbanda; user@spark.apache.org
Subject: Re: FW: Websphere MQ as a data source for Apache Spark Streaming

Are you sure that the data can be saved as strings?
Another, more controlled approach is to use DStream.foreachRDD, which takes a 
Function2 parameter of RDD and Time. There you can explicitly do stuff with the 
RDD, save it to separate files (separated by time), or whatever. That might help 
you debug what is going on.
It might also help if you show the streaming program in a pastebin.

TD
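
A hedged Scala sketch of the foreachRDD approach described above; the original program is Java, and the stream and output path here are placeholders.

import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

def saveBatches(lines: DStream[String]): Unit = {
  lines.foreachRDD { (rdd, time: Time) =>
    if (!rdd.isEmpty()) {
      // One output directory per batch, separated by batch time as suggested.
      rdd.saveAsTextFile(s"/tmp/mq-output/batch-${time.milliseconds}")
    }
  }
}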


On Fri, May 29, 2015 at 12:55 AM, Chaudhary, Umesh 
mailto:umesh.chaudh...@searshc.com>> wrote:
Hi,
I have written a manual receiver for Websphere MQ and it is working fine.
If I do JavaDStream.saveAsTextFile("/home/user/out.txt"), then it 
generates a directory named out.txt with its timestamp appended.
In this directory only a _SUCCESS file is present. I can see data on the console 
while running in local mode, but I am not able to save it as a text file.
Is there any other way of saving streaming data?

From: Chaudhary, Umesh
Sent: Tuesday, May 26, 2015 2:39 AM
To: 'Arush Kharbanda'; user@spark.apache.org
Subject: RE: Websphere MQ as a data source for Apache Spark Streaming

Thanks for the suggestion, I will try and post the outcome.

From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com]
Sent: Monday, May 25, 2015 12:24 PM
To: Chaudhary, Umesh; user@spark.apache.org
Subject: Re: Websphere MQ as a data source for Apache Spark Streaming

Hi Umesh,

You can write a custom receiver for Websphere MQ, using the Websphere MQ API.

https://spark.apache.org/docs/latest/streaming-custom-receivers.html

Thanks
Arush

On Mon, May 25, 2015 at 8:04 PM, Chaudhary, Umesh 
mailto:umesh.chaudh...@searshc.com>> wrote:
I have seen it, but it has a different configuration for connecting to the MQ.
I mean, for Websphere MQ we need Host, Queue Manager, Channel and Queue Name, but 
here, according to the MQTT protocol,

client = new MqttClient(brokerUrl, MqttClient.generateClientId(), persistence)

it only expects a Broker URL, which is inappropriate for establishing a connection 
with Websphere MQ.

Please suggest!


From: Arush Kharbanda 
[mailto:ar...@sigmoidanalytics.com]
Sent: Monday, May 25, 2015 6:29 AM
To: Chaudhary, Umesh
Cc: user@spark.apache.org
Subject: Re: Websphere MQ as a data source for Apache Spark Streaming

Hi Umesh,

You can connect to Spark Streaming with MQTT  refer to the example.

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala



Thanks
Arush



On Mon, May 25, 2015 at 3:43 PM, umesh9794 
mailto:umesh.chaudh...@searshc.com>> wrote:
I was digging into the possibilities for Websphere MQ as a data source for
Spark Streaming because it is needed in one of our use cases. I got to know
that MQTT is the protocol that supports the
communication from MQ data structures, but since I am a newbie to Spark
Streaming I need some working examples for the same. Has anyone tried to
connect MQ with Spark Streaming? Please suggest the best way of doing
so.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Websphere-MQ-as-a-data-source-for-Apache-Spark-Streaming-tp23013.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org



--


Arush Kharbanda || Technical Teamlead

ar...@sigmoidanalytics.com || 
www.sigmoidanalytics.com
This message, including any attachments, is the property of Sears Holdings 
Corporation and/or one of its subsidiaries. It is confidential and may contain 
proprietary or legally privileged information. If you are not the intended 
recipient, please delete it without reading the contents. Thank you.



--


Arush Kharbanda || Technical Teamlead

ar...@sigmoidanalytics.com || 
www.sigmoidanalytics.com

Re: Cassanda example

2015-06-01 Thread Akhil Das
Here's more detailed documentation from Datastax. You
can also shoot an email directly to their mailing list,
since it's more related to their code.

Thanks
Best Regards
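
For readers without the linked example, a minimal hedged sketch of writing an RDD with the DataStax spark-cassandra-connector; the keyspace, table, and columns are placeholders, not Yasemin's schema.

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraWrite")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Column order must match the column names passed to saveToCassandra.
val rows = sc.parallelize(Seq((1, "alice"), (2, "bob")))
rows.saveToCassandra("test_keyspace", "users", SomeColumns("id", "name"))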

On Mon, Jun 1, 2015 at 2:18 PM, Yasemin Kaya  wrote:

> Hi,
>
> I want to write my RDD to a Cassandra database, and I took an example from
> this site.
> I added that to my project, but I have errors. Here is my project in a gist.
>
> Errors:
>
>- At line 40 (cannot recognize Session)
>- At line 106 (flatMap is not applicable)
>
> Have a nice day
> yasemin
>
> --
> hiç ender hiç
>


Re: Execption writing on two cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-06-01 Thread Helena Edelson
Hi Antonio,
 
First, what version of the Spark Cassandra Connector are you using? You are 
using Spark 1.3.1, which the Cassandra connector today supports in builds from 
the master branch only - the release with public artifacts supporting Spark 
1.3.1 is coming soon ;)
Please see 
https://github.com/datastax/spark-cassandra-connector#version-compatibility 
 
Try the version change and LMK.

What does your cassandra log say?

Note that you can read from a Spark stream like Flume, for instance in your 
flumeStreamNavig.map(..) code (in scala at least, with a lot less code - I have 
not used java)
(here it’s kafka) 
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L39
 

And write inline to Cassandra 
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L45
 

 
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L64
 

 

Helena
tw: @helenaedelson
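
For reference, a hedged Scala sketch of the "write inline to Cassandra" pattern Helena points to, using the connector's streaming support; the keyspace, table, and columns are taken from the exception quoted below but trimmed, and the stream type is assumed.

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.dstream.DStream

def saveNavigation(events: DStream[(String, String, Long)]): Unit = {
  // Each batch's records are written as they arrive; no manual session handling.
  events.saveToCassandra("cassandrasink", "navigation",
    SomeColumns("event_id", "session_id", "time"))
}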


> On May 29, 2015, at 6:11 AM, Antonio Giambanco  wrote:
> 
> Hi all,
> I have in a single server installed spark 1.3.1 and cassandra 2.0.14
> I'm coding a simple java class for Spark Streaming as follow:
> reading header events from flume sink
> based on header I write the event body on navigation or transaction table 
> (cassandra)
> unfortunately I get NoHostAvailableException; if I comment out the code for saving 
> to one of the two tables, everything works
> 
> 
> here the code
> 
>  public static void main(String[] args) {
>  
> // Create a local StreamingContext with two working thread and batch 
> interval of 1 second
>  SparkConf conf = new 
> SparkConf().setMaster("local[2]").setAppName("DWXNavigationApp");
>  
>  conf.set("spark.cassandra.connection.host", "127.0.0.1");
>  conf.set("spark.cassandra.connection.native.port","9042");
>  conf.set("spark.cassandra.output.batch.size.rows", "1");
>  conf.set("spark.cassandra.output.concurrent.writes", "1");
>  
>  
>  final JavaStreamingContext jssc = new JavaStreamingContext(conf, 
> Durations.seconds(1));
>  
>  JavaReceiverInputDStream flumeStreamNavig = 
> FlumeUtils.createPollingStream(jssc, "127.0.0.1", ); 
>  
> 
>  JavaDStream logRowsNavig = flumeStreamNavig.map(
>  new Function(){
> 
> @Override
> public String call(SparkFlumeEvent arg0) throws Exception 
> {
> // TODO Auto-generated method stub0.
> 
> Map headers = 
> arg0.event().getHeaders();
> 
> ByteBuffer bytePayload = arg0.event().getBody(); 
> String s = headers.get("source_log").toString() + "#" 
> + new String(bytePayload.array());
> System.out.println("RIGA: " + s);
> return s;
> }
>  });
>  
> 
>  logRowsNavig.foreachRDD(
>  new Function,Void>(){
> @Override
> public Void call(JavaRDD rows) throws Exception {
> 
> if(!rows.isEmpty()){
>  
>  //String header = 
> getHeaderFronRow(rows.collect());
>  
>  List listNavigation = new 
> ArrayList();
>  List listTransaction = new 
> ArrayList();
>  
>  for(String row : rows.collect()){
>  
>  String header = row.substring(0, 
> row.indexOf("#"));
>  
>  if(header.contains("controller_log")){
>  
> listNavigation.add(createNavigation(row));
>  System.out.println("Added Element in 
> Navigation List");
>  
>  }else if(header.contains("business_log")){
>  
> li

Re: SparkSQL can't read S3 path for hive external table

2015-06-01 Thread Michael Armbrust
This sounds like a problem that was fixed in Spark 1.3.1.

https://issues.apache.org/jira/browse/SPARK-6351

On Mon, Jun 1, 2015 at 5:44 PM, Akhil Das 
wrote:

> This thread
> 
> has various methods on accessing S3 from spark, it might help you.
>
> Thanks
> Best Regards
>
> On Sun, May 24, 2015 at 8:03 AM, ogoh  wrote:
>
>>
>> Hello,
>> I am using Spark1.3 in AWS.
>> SparkSQL can't recognize Hive external table on S3.
>> The following is the error message.
>> I appreciate any help.
>> Thanks,
>> Okehee
>> --
>> 15/05/24 01:02:18 ERROR thriftserver.SparkSQLDriver: Failed in [select
>> count(*) from api_search where pdate='2015-05-08']
>> java.lang.IllegalArgumentException: Wrong FS:
>>
>> s3://test-emr/datawarehouse/api_s3_perf/api_search/pdate=2015-05-08/phour=00,
>> expected: hdfs://10.128.193.211:9000
>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647)
>> at
>> org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:467)
>> at
>>
>> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
>> at
>>
>> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
>> at
>>
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-can-t-read-S3-path-for-hive-external-table-tp23002.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


RE: Don't understand "schedule jobs within an Application

2015-06-01 Thread yana
1. Yes, if two tasks depend on each other they can't be parallelized.
2. Imagine something like a web application driver. You only get to have one 
Spark context, but now you want to run many concurrent jobs. They have nothing to 
do with each other; there is no reason to keep them sequential. 

Hope this helps

 Original message From: bit1...@163.com 
Date:06/01/2015  4:14 AM  (GMT-05:00) To: user 
 Subject: Don't understand "schedule jobs 
within an Application 
Hi, sparks,

Following is copied from the spark online document 
http://spark.apache.org/docs/latest/job-scheduling.html. 

Basically, I have two questions on it:

1. If two jobs in an application have dependencies, that is, one job depends on 
the result of the other job, then I think they will have to run sequentially.
2. Since job scheduling happens within one application, I don't think job 
scheduling will benefit multiple users as the last sentence says. In my 
opinion, multiple users can benefit only from cross-application scheduling.

Maybe I haven't understood job scheduling well; could someone 
elaborate on this? Thanks very much.






By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided 
into “stages” (e.g. map and reduce phases), and the first job gets priority on 
all available resources while its stages have tasks to launch, then the second 
job gets priority, etc. If the jobs at the head of the queue don’t need to use 
the whole cluster, later jobs can start to run right away, but if the jobs at 
the head of the queue are large, then later jobs may be delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between 
jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” 
fashion, so that all jobs get a roughly equal share of cluster resources. This 
means that short jobs submitted while a long job is running can start receiving 
resources right away and still get good response times, without waiting for the 
long job to finish. This mode is best for multi-user settings


bit1...@163.com

Spark stages very slow to complete

2015-06-01 Thread Karlson

Hi,

In all (PySpark) Spark jobs that become somewhat more involved, I am 
experiencing the issue that some stages take a very long time to 
complete, and sometimes don't complete at all. This clearly correlates with the 
size of my input data. Looking at the stage details for one such stage, 
I am wondering where Spark spends all this time. Take this table of the 
stage's task metrics for example:


Metric                      Min          25th pct     Median       75th pct     Max
Duration                    1.4 min      1.5 min      1.7 min      1.9 min      2.3 min
Scheduler Delay             1 ms         3 ms         4 ms         5 ms         23 ms
Task Deserialization Time   1 ms         2 ms         3 ms         8 ms         22 ms
GC Time                     0 ms         0 ms         0 ms         0 ms         0 ms
Result Serialization Time   0 ms         0 ms         0 ms         0 ms         1 ms
Getting Result Time         0 ms         0 ms         0 ms         0 ms         0 ms
Input Size / Records        23.9 KB / 1  24.0 KB / 1  24.1 KB / 1  24.1 KB / 1  24.3 KB / 1


Why is the overall duration almost 2min? Where is all this time spent, 
when no progress of the stages is visible? The progress bar simply 
displays 0 succeeded tasks for a very long time before sometimes slowly 
progressing.


Also, the name of the stage displayed above is `javaToPython at 
null:-1`, which I find very uninformative. I don't even know which 
action exactly is responsible for this stage. Does anyone experience 
similar issues or have any advice for me?


Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-06-01 Thread ๏̯͡๏
I am seeing the same issue with Spark 1.3.1.

I see this issue when reading sequence file stored in Sequence File format
(SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v?
)

All I do is
sc.sequenceFile(dwTable, classOf[Text], classOf[Text]).partitionBy(new
org.apache.spark.HashPartitioner(2053))
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb",
arguments.get("buffersize").get)
  .set("spark.kryoserializer.buffer.max.mb",
arguments.get("maxbuffersize").get)
  .set("spark.driver.maxResultSize", arguments.get("maxResultSize").get)
  .set("spark.yarn.maxAppAttempts", "0")
  //.set("spark.akka.askTimeout", arguments.get("askTimeout").get)
  //.set("spark.akka.timeout", arguments.get("akkaTimeout").get)
  //.set("spark.worker.timeout", arguments.get("workerTimeout").get)

.registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))


and values are
buffersize=128 maxbuffersize=1068 maxResultSize=200G
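
A hedged sketch of where the settings quoted above would normally live: on a SparkConf, separate from the RDD transformations. The codec switch at the end is only a possible direction for the SPARK-4105 question in the quoted thread below, not a confirmed fix; the path and app name are placeholders.

import org.apache.hadoop.io.Text
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SequenceFileRead")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "128")
  .set("spark.kryoserializer.buffer.max.mb", "1068")
  .set("spark.driver.maxResultSize", "200g")
  .set("spark.yarn.maxAppAttempts", "0")
  // .registerKryoClasses(Array(classOf[SpsLevelMetricSum]))  // app-specific class from the post
  .set("spark.io.compression.codec", "lzf")  // try a codec other than the default snappy

val sc = new SparkContext(conf)
val dwTable = "/path/to/sequence/files"  // placeholder
// Hadoop Text objects are reused by the record reader, so convert to String
// before shuffling with partitionBy.
val pairs = sc.sequenceFile(dwTable, classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) }
  .partitionBy(new HashPartitioner(2053))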

On Thu, May 7, 2015 at 8:04 AM, Jianshi Huang 
wrote:

> I'm using the default settings.
>
> Jianshi
>
> On Wed, May 6, 2015 at 7:05 PM, twinkle sachdeva <
> twinkle.sachd...@gmail.com> wrote:
>
>> Hi,
>>
>> Can you please share your compression etc settings, which you are using.
>>
>> Thanks,
>> Twinkle
>>
>> On Wed, May 6, 2015 at 4:15 PM, Jianshi Huang 
>> wrote:
>>
>>> I'm facing this error in Spark 1.3.1
>>>
>>>   https://issues.apache.org/jira/browse/SPARK-4105
>>>
>>> Anyone knows what's the workaround? Change the compression codec for
>>> shuffle output?
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Deepak


java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread ๏̯͡๏
Any suggestions ?

I am using Spark 1.3.1 to read a sequence file stored in Sequence File format
(SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v?
)

with this code and settings
sc.sequenceFile(dwTable, classOf[Text], classOf[Text]).partitionBy(new
org.apache.spark.HashPartitioner(2053))
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb",
arguments.get("buffersize").get)
  .set("spark.kryoserializer.buffer.max.mb",
arguments.get("maxbuffersize").get)
  .set("spark.driver.maxResultSize", arguments.get("maxResultSize").get)
  .set("spark.yarn.maxAppAttempts", "0")
  //.set("spark.akka.askTimeout", arguments.get("askTimeout").get)
  //.set("spark.akka.timeout", arguments.get("akkaTimeout").get)
  //.set("spark.worker.timeout", arguments.get("workerTimeout").get)

.registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))


and values are
buffersize=128 maxbuffersize=1068 maxResultSize=200G


And I see this exception in each executor task

FetchFailed(BlockManagerId(278, phxaishdc9dn1830.stratus.phx.ebay.com,
54757), shuffleId=6, mapId=2810, reduceId=1117, message=

org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)

at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)

at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)

at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)

at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

at
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)

at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)

at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)

at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:172)

at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)

at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

at org.apache.spark.scheduler.Task.run(Task.scala:64)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

*Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)*

at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)

at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)

at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)

at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)

at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135)

at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92)

at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)

at
org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:160)

at
org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1165)

at
org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:301)

at
org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:300)

at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)

at scala.util.Try$.apply(Try.scala:161)

at scala.util.Success.map(Try.scala:206)

at
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)

at
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53)

... 18 more


Streaming K-medoids

2015-06-01 Thread Marko Dinic

Hello everyone,

I have an idea and I would like to get some validation from the community about 
it.


In Mahout there is an implementation of Streaming K-means. I'm 
interested in your opinion: would it make sense to make a similar 
implementation of Streaming K-medoids?


K-medoids has even bigger problems than K-means because it's not 
scalable, but can be useful in some cases (e.g. It allows more 
sophisticated distance measures).


What is your opinion about such an approach? Does anyone see problems 
with it?


Best regards,
Marko

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



UNSUBSCRIBE

2015-06-01 Thread Rivera, Dario
UNSUBSCRIBE

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark stages very slow to complete

2015-06-01 Thread ayan guha
Would you mind posting the code?
On 2 Jun 2015 00:53, "Karlson"  wrote:

> Hi,
>
> In all (pyspark) Spark jobs, that become somewhat more involved, I am
> experiencing the issue that some stages take a very long time to complete
> and sometimes don't at all. This clearly correlates with the size of my
> input data. Looking at the stage details for one such stage, I am wondering
> where Spark spends all this time. Take this table of the stages task
> metrics for example:
>
> Metric                      Min          25th pct     Median       75th pct     Max
> Duration                    1.4 min      1.5 min      1.7 min      1.9 min      2.3 min
> Scheduler Delay             1 ms         3 ms         4 ms         5 ms         23 ms
> Task Deserialization Time   1 ms         2 ms         3 ms         8 ms         22 ms
> GC Time                     0 ms         0 ms         0 ms         0 ms         0 ms
> Result Serialization Time   0 ms         0 ms         0 ms         0 ms         1 ms
> Getting Result Time         0 ms         0 ms         0 ms         0 ms         0 ms
> Input Size / Records        23.9 KB / 1  24.0 KB / 1  24.1 KB / 1  24.1 KB / 1  24.3 KB / 1
>
> Why is the overall duration almost 2min? Where is all this time spent,
> when no progress of the stages is visible? The progress bar simply displays
> 0 succeeded tasks for a very long time before sometimes slowly progressing.
>
> Also, the name of the stage displayed above is `javaToPython at null:-1`,
> which I find very uninformative. I don't even know which action exactly is
> responsible for this stage. Does anyone experience similar issues or have
> any advice for me?
>
> Thanks!
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Event Logging to HDFS on Standalone Cluster "In Progress"

2015-06-01 Thread Richard Marscher
Hi,

In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
folder on a Standalone cluster. This is generally working, all the logs are
being written. However, from the Master Web UI, the vast majority of
completed applications are labeled as not having a history:
http://xxx.xxx.xxx.xxx:8080/history/not-found/?msg=Application+App+is+still+in+progress.&title=Application%20history%20not%20found%20(app-20150601160846-1914)

The log does exists though:

# hdfs dfs -ls -R /eventLogs/app-20150601160846-1914

-rw-rw   3 user group    1027848 2015-06-01 16:09 /eventLogs/app-20150601160846-1914
/eventLogs/app-20150601160846-1914

and `cat` the file ends with:

{"Event":"SparkListenerApplicationEnd","Timestamp":1433174955077}

This seems to indicate it saw and logged the application end.

Is there a known issue here or a workaround? Looking at the source code I
might have expected these files to end in `.inprogress` given the UI error
message, but they don't.

Thanks,
Richard
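
For reference, a minimal sketch of the event-log settings behind the setup described above; these can equally go in spark-defaults.conf, and the directory is a placeholder.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("EventLogDemo")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///eventLogs")  // must already exist

val sc = new SparkContext(conf)
// ... run jobs ...
sc.stop()  // writes the SparkListenerApplicationEnd event the history UI looks for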


Re: Streaming K-medoids

2015-06-01 Thread Erik Erlandson

I haven't given any thought to streaming it, but in case it's useful I do have 
a k-medoids implementation for Spark:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.cluster.KMedoids

Also a blog post about multi-threading it:
http://erikerlandson.github.io/blog/2015/05/06/parallel-k-medoids-using-scala-parseq/



- Original Message -
> Hello everyone,
> 
> I have an idea and I would like to get a validation from community about
> it.
> 
> In Mahout there is an implementation of Streaming K-means. I'm
> interested in your opinion would it make sense to make a similar
> implementation of Streaming K-medoids?
> 
> K-medoids has even bigger problems than K-means because it's not
> scalable, but can be useful in some cases (e.g. It allows more
> sophisticated distance measures).
> 
> What is your opinion about such an approach? Does anyone see problems
> with it?
> 
> Best regards,
> Marko
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Execption writing on two cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-06-01 Thread Antonio Giambanco
Hi Helena,
thanks for answering me . . .
I didn't realize it could be the connector version; unfortunately I haven't
tried it yet.
I know Scala is better, but I'm using Drools and I'm forced to use Java.
In my project I'm using spark-cassandra-connector-java_2.10.

from cassandra I have only this log

INFO [ScheduledTasks:1] 2015-06-01 15:49:03,260 ColumnFamilyStore.java
(line 795) Enqueuing flush of Memtable-sstable_activity@361028148(455/4550
serialized/live bytes, 180 ops)
 INFO [FlushWriter:76] 2015-06-01 15:49:03,261 Memtable.java (line 362)
Writing Memtable-sstable_activity@361028148(455/4550 serialized/live bytes,
180 ops)
 INFO [FlushWriter:76] 2015-06-01 15:49:03,273 Memtable.java (line 402)
Completed flushing /var/lib/cassandra/data/system/sstable_activity/
system-sstable_activity-jb-103-Data.db (248 bytes) for commitlog position
ReplayPosition(segmentId=1432896540485, position=1217022)



also on spark I found this exception



15/06/01 16:43:30 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID
81)
java.io.IOException: Failed to prepare statement INSERT INTO
"cassandrasink"."transaction" ("event_id", "isin", "security_type",
"security_name", "date", "time", "price", "currency", "user_id",
"quantity", "amount", "session_id") VALUES (:"event_id", :"isin",
:"security_type", :"security_name", :"date", :"time", :"price",
:"currency", :"user_id", :"quantity", :"amount", :"session_id"): All
host(s) tried for query failed (no host was tried)
at com.datastax.spark.connector.writer.TableWriter.com
$
datastax$spark$connector$writer$TableWriter$$prepareStatement(TableWriter.
scala:96)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$
write$1.apply(TableWriter.scala:122)
at
com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:120)
at
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:100)
at
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:99)
at
com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:151)
at
com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:99)
at
com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:120)
at
com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at
com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
All host(s) tried for query failed (no host was tried)
at com.datastax.driver.core.exceptions.
NoHostAvailableException.copy(NoHostAvailableException.java:84)
at com.datastax.driver.core.DefaultResultSetFuture.
extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
at com.datastax.driver.core.AbstractSession.prepare(
AbstractSession.java:91)
at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at
com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:33)
at $Proxy17.prepare(Unknown Source)
at com.datastax.spark.connector.cql.PreparedStatementCache$.
prepareStatement(PreparedStatementCache.scala:45)
at com.datastax.spark.connector.cql.SessionProxy.invoke(
SessionProxy.scala:28)
at $Proxy17.prepare(Unknown Source)
at com.datastax.spark.connector.writer.TableWriter.com
$
datastax$spark$connector$writer$TableWriter$$prepareStatement(TableWriter.
scala:92)
... 15 more
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
All host(s) tried for query failed (no host was tried)
 at
com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:107)
at
com.datastax.driver.core.SessionManager.execute(SessionManager.java:538)
at com.datastax.driver.core.SessionManager.prepareAsync(
SessionManager.java:124)
at com.datastax.driver.core.AbstractSession.prepareAsync(
AbstractSession.java:103)
at com.datastax.driver.core.AbstractSession.prepare(
AbstractSession.java:89)
... 24 more
15/06/0

flatMap output on disk / flatMap memory overhead

2015-06-01 Thread octavian.ganea
Hi,

Is there any way to force the output RDD of a flatMap op to be stored in
both memory and disk as it is computed? My RAM would not be able to fit the
entire output of flatMap, so it really needs to start using disk after the
RAM gets full. I didn't find any way to force this. 

Also, what is the memory overhead of flatMap? From my computations, the
output RDD should fit in memory, but I get the following error after a while
(and I know it's because of memory issues, since running the program with
1/3 of the input data finishes successfully):

15/06/01 19:02:49 ERROR BlockFetcherIterator$BasicBlockFetcherIterator:
Could not get block(s) from
ConnectionManagerId(dco-node036-mgt.dco.ethz.ch,57478)
java.io.IOException: sendMessageReliably failed because ack was not received
within 60 sec
at
org.apache.spark.network.ConnectionManager$$anon$10$$anonfun$run$15.apply(ConnectionManager.scala:866)
at
org.apache.spark.network.ConnectionManager$$anon$10$$anonfun$run$15.apply(ConnectionManager.scala:865)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.network.ConnectionManager$$anon$10.run(ConnectionManager.scala:865)
at
io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
at
io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
at java.lang.Thread.run(Thread.java:745)


Also, I've seen this:
https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
but my understanding is that one should apply something like
rdd.flatMap(...).persist(MEMORY_AND_DISK), which assumes that the entire
output of flatMap is first stored in memory (which is not possible in my
case) and, only when that's done, is stored on disk. Please correct me if
I'm wrong. Anyway, I've tried using this, but I got the same error.
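
For reference, a minimal self-contained sketch of that persistence call; the data and expansion function are made up. Note that MEMORY_AND_DISK is applied per partition as each partition is computed, so Spark does not need to hold the entire flatMap output in memory before anything spills to disk.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("FlatMapPersist"))
val input = sc.parallelize(1 to 1000000)

// persist() only marks the storage level; partitions are computed on the first
// action and spilled to disk when they don't fit in executor memory.
val expanded = input
  .flatMap(x => Seq(x, x * 2, x * 3))   // stand-in for the real expansion
  .persist(StorageLevel.MEMORY_AND_DISK)

expanded.count()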

My config:

conf.set("spark.cores.max", "128")
conf.set("spark.akka.frameSize", "1024")
conf.set("spark.executor.memory", "125g")
conf.set("spark.shuffle.file.buffer.kb", "1000")
conf.set("spark.shuffle.consolidateFiles", "true")



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/flatMap-output-on-disk-flatMap-memory-overhead-tp23098.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Execption writing on two cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-06-01 Thread Helena Edelson
Hi Antonio,

It’s your lucky day ;) We just released Spark Cassandra Connector 1.3.0-M1 for 
Spark 1.3 and DataSources API
Give it a little while to propagate to 
http://search.maven.org/#search%7Cga%7C1%7Cspark-cassandra-connector 
 

'spark-cassandra-connector-java_2.10' just tells me you are using the version 
that is compiled against Scala 2.10. What is the actual connector version 
itself?

Thanks,
Helena
@helenaedelson


> On Jun 1, 2015, at 1:08 PM, Antonio Giambanco  wrote:
> 
> Hi Helena,
> thanks for answering me . . . 
> I didn't realize it could be the connector version, unfortunately i didn't 
> try yet.
> I know scala is better but i'm using drools and i'm forced to use java
> in my project i'm using spark-cassandra-connector-java_2.10
> 
> from cassandra I have only this log
> 
> INFO [ScheduledTasks:1] 2015-06-01 15:49:03,260 ColumnFamilyStore.java (line 
> 795) Enqueuing flush of Memtable-sstable_activity@361028148(455/4550 
> serialized/live bytes, 180 ops)
>  INFO [FlushWriter:76] 2015-06-01 15:49:03,261 Memtable.java (line 362) 
> Writing Memtable-sstable_activity@361028148(455/4550 serialized/live bytes, 
> 180 ops)
>  INFO [FlushWriter:76] 2015-06-01 15:49:03,273 Memtable.java (line 402) 
> Completed flushing 
> /var/lib/cassandra/data/system/sstable_activity/system-sstable_activity-jb-103-Data.db
>  (248 bytes) for commitlog position ReplayPosition(segmentId=1432896540485, 
> position=1217022)
> 
> 
> 
> also on spark I found this exception
> 
> 
> 
> 15/06/01 16:43:30 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID 81)
> java.io.IOException: Failed to prepare statement INSERT INTO 
> "cassandrasink"."transaction" ("event_id", "isin", "security_type", 
> "security_name", "date", "time", "price", "currency", "user_id", "quantity", 
> "amount", "session_id") VALUES (:"event_id", :"isin", :"security_type", 
> :"security_name", :"date", :"time", :"price", :"currency", :"user_id", 
> :"quantity", :"amount", :"session_id"): All host(s) tried for query failed 
> (no host was tried)
> at com.datastax.spark.connector.writer.TableWriter.com 
> $datastax$spark$connector$writer$TableWriter$$prepareStatement(TableWriter.scala:96)
> at 
> com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:122)
> at 
> com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:120)
> at 
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:100)
> at 
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:99)
> at 
> com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:151)
> at 
> com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:99)
> at 
> com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:120)
> at 
> com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
> at 
> com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All 
> host(s) tried for query failed (no host was tried)
> at 
> com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
> at 
> com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
> at 
> com.datastax.driver.core.AbstractSession.prepare(AbstractSession.java:91)
> at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at 
> com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:33)
> at $Proxy17.prepare(Unknown Source)
> at 
> com.datastax.spark.connector.cql.PreparedStatementCache$.prepareStatement(PreparedStatementCache.scala:45)
> at 
> com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:28)
> at $Proxy17.prepare(Unknown Source)
> at com.datastax.spark.connector.writer.TableWriter.com 
> <

using pyspark with standalone cluster

2015-06-01 Thread AlexG
I've followed the instructions for setting up a standalone spark cluster (on
EC2):
- install spark on all the machines
- enabled passwordless ssh
- setup the conf/slaves file
- start the master and slaves with the provided scripts

The status on the 8080 port of the master tells me that the master and
executors are all running. I can successfully use pyspark from the master. 

However, if I try to call pyspark remotely from my laptop, with 
MASTER=spark://:7077 pyspark,
I get these errors:
15/06/01 10:02:14 INFO AppClient$ClientActor: Connecting to master
akka.tcp://sparkMaster@:7077/user/Master...
15/06/01 10:02:34 INFO AppClient$ClientActor: Connecting to master
akka.tcp://sparkMaster@:7077/user/Master...
15/06/01 10:02:54 INFO AppClient$ClientActor: Connecting to master
akka.tcp://sparkMaster@:7077/user/Master...
15/06/01 10:03:14 ERROR SparkDeploySchedulerBackend: Application has been
killed. Reason: All masters are unresponsive! Giving up.
15/06/01 10:03:14 ERROR TaskSchedulerImpl: Exiting due to error from cluster
scheduler: All masters are unresponsive! Giving up.

Any idea what's going on here? I set port 7077 to be publicly accessible...




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/using-pyspark-with-standalone-cluster-tp23099.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Execption writing on two cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-06-01 Thread Antonio Giambanco
:D very happy Helena I'll check tomorrow morning

A G

Il giorno 01/giu/2015, alle ore 19:45, Helena Edelson 
 ha scritto:

> Hi Antonio,
> 
> It’s your lucky day ;) We just released Spark Cassandra Connector 1.3.0-M1 
> for Spark 1.3 and DataSources API
> Give it a little while to propagate to 
> http://search.maven.org/#search%7Cga%7C1%7Cspark-cassandra-connector 
> 
> 'spark-cassandra-connector-java_2.10’ just tells me you are using the version 
> that is compiled against scala 2.10, what is the actual connector version 
> itself?
> 
> Thanks,
> Helena
> @helenaedelson
> 
> 
>> On Jun 1, 2015, at 1:08 PM, Antonio Giambanco  wrote:
>> 
>> Hi Helena,
>> thanks for answering me . . . 
>> I didn't realize it could be the connector version, unfortunately i didn't 
>> try yet.
>> I know scala is better but i'm using drools and i'm forced to use java
>> in my project i'm using spark-cassandra-connector-java_2.10
>> 
>> from cassandra I have only this log
>> 
>> INFO [ScheduledTasks:1] 2015-06-01 15:49:03,260 ColumnFamilyStore.java (line 
>> 795) Enqueuing flush of Memtable-sstable_activity@361028148(455/4550 
>> serialized/live bytes, 180 ops)
>>  INFO [FlushWriter:76] 2015-06-01 15:49:03,261 Memtable.java (line 362) 
>> Writing Memtable-sstable_activity@361028148(455/4550 serialized/live bytes, 
>> 180 ops)
>>  INFO [FlushWriter:76] 2015-06-01 15:49:03,273 Memtable.java (line 402) 
>> Completed flushing 
>> /var/lib/cassandra/data/system/sstable_activity/system-sstable_activity-jb-103-Data.db
>>  (248 bytes) for commitlog position ReplayPosition(segmentId=1432896540485, 
>> position=1217022)
>> 
>> 
>> 
>> also on spark I found this exception
>> 
>> 
>> 
>> 15/06/01 16:43:30 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID 
>> 81)
>> java.io.IOException: Failed to prepare statement INSERT INTO 
>> "cassandrasink"."transaction" ("event_id", "isin", "security_type", 
>> "security_name", "date", "time", "price", "currency", "user_id", "quantity", 
>> "amount", "session_id") VALUES (:"event_id", :"isin", :"security_type", 
>> :"security_name", :"date", :"time", :"price", :"currency", :"user_id", 
>> :"quantity", :"amount", :"session_id"): All host(s) tried for query failed 
>> (no host was tried)
>> at 
>> com.datastax.spark.connector.writer.TableWriter.com$datastax$spark$connector$writer$TableWriter$$prepareStatement(TableWriter.scala:96)
>> at 
>> com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:122)
>> at 
>> com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:120)
>> at 
>> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:100)
>> at 
>> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:99)
>> at 
>> com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:151)
>> at 
>> com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:99)
>> at 
>> com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:120)
>> at 
>> com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
>> at 
>> com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> at 
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> at java.lang.Thread.run(Thread.java:722)
>> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All 
>> host(s) tried for query failed (no host was tried)
>> at 
>> com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
>> at 
>> com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
>> at 
>> com.datastax.driver.core.AbstractSession.prepare(AbstractSession.java:91)
>> at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
>> at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:601)
>> at 
>> com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:33)
>> at $Proxy17.prepare(Unknown Source)
>> at 
>> com.datastax.spark.connector.cql.PreparedStatementCache$.prepareStatement(PreparedStatementCache.scala:45)
>> at 
>> com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:28

RE: Anybody using Spark SQL JDBC server with DSE Cassandra?

2015-06-01 Thread Mohammed Guller
Nobody using Spark SQL JDBC/Thrift server with DSE Cassandra?

Mohammed

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Friday, May 29, 2015 11:49 AM
To: user@spark.apache.org
Subject: Anybody using Spark SQL JDBC server with DSE Cassandra?

Hi -

We have successfully integrated Spark SQL with Cassandra. We have a backend 
that provides a REST API that allows users to execute SQL queries on data in 
C*. Now we would like to also support JDBC/ODBC connectivity , so that user can 
use tools like Tableau to query data in C* through the Spark SQL JDBC server.

However, I have been unable to find a driver that would allow the Spark SQL 
Thrift/JDBC server to connect with Cassandra. DataStax provides a closed-source 
driver that comes only with the DSE version of Cassandra.

I would like to find out how many people are using the Spark SQL JDBC server + 
DSE Cassandra combination. If you do use Spark SQL JDBC server + DSE, I would 
appreciate if you could share your experience. For example, what kind of issues 
you have run into? How is the performance? What reporting tools you are using?

Thank  you.

Mohammed



Re: SparkSQL can't read S3 path for hive external table

2015-06-01 Thread Okehee Goh
Thanks, Michael and Akhil.
Yes, it worked with Spark 1.3.1 along with AWS EMR AMI 3.7.
Sorry I didn't update the status.


On Mon, Jun 1, 2015 at 5:17 AM, Michael Armbrust  wrote:
> This sounds like a problem that was fixed in Spark 1.3.1.
>
> https://issues.apache.org/jira/browse/SPARK-6351
>
> On Mon, Jun 1, 2015 at 5:44 PM, Akhil Das 
> wrote:
>>
>> This thread has various methods on accessing S3 from spark, it might help
>> you.
>>
>> Thanks
>> Best Regards
>>
>> On Sun, May 24, 2015 at 8:03 AM, ogoh  wrote:
>>>
>>>
>>> Hello,
>>> I am using Spark1.3 in AWS.
>>> SparkSQL can't recognize Hive external table on S3.
>>> The following is the error message.
>>> I appreciate any help.
>>> Thanks,
>>> Okehee
>>> --
>>> 15/05/24 01:02:18 ERROR thriftserver.SparkSQLDriver: Failed in [select
>>> count(*) from api_search where pdate='2015-05-08']
>>> java.lang.IllegalArgumentException: Wrong FS:
>>>
>>> s3://test-emr/datawarehouse/api_s3_perf/api_search/pdate=2015-05-08/phour=00,
>>> expected: hdfs://10.128.193.211:9000
>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647)
>>> at
>>> org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:467)
>>> at
>>>
>>> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
>>> at
>>>
>>> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
>>> at
>>>
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-can-t-read-S3-path-for-hive-external-table-tp23002.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Need some Cassandra integration help

2015-06-01 Thread Mohammed Guller
Hi Yana,
Not sure whether you already solved this issue. As far as I know, the DataFrame 
support in Spark Cassandra connector was added in version 1.3. The first 
milestone release of SCC v1.3 was just announced.

Mohammed
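
A hedged sketch of what that DataFrame route might look like with the 1.3 connector milestone, using the Spark 1.3-era load() API. The option key names follow the connector's documented DataSources integration, the keyspace/table come from the error below, an existing SparkContext `sc` is assumed, and whether this resolves the plan error is not confirmed in the thread.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Read the Cassandra table through the connector's DataSources support.
val cassDF = sqlContext.load(
  "org.apache.spark.sql.cassandra",
  Map("keyspace" -> "yana_test", "table" -> "test_connector"))

val dfHDFS = sqlContext.parquetFile("foo.parquet")
dfHDFS.unionAll(cassDF).count()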

From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Tuesday, May 26, 2015 1:31 PM
To: user@spark.apache.org
Subject: Need some Cassandra integration help

Hi folks, for those of you working with Cassandra, wondering if anyone has been 
successful processing a mix of Cassandra and hdfs data. I have a dataset which 
is stored partially in HDFS and partially in Cassandra (schema is the same in 
both places)

I am trying to do the following:

val dfHDFS = sqlContext.parquetFile("foo.parquet")

val cassDF = cassandraContext.sql("SELECT * FROM keyspace.user")



 dfHDFS.unionAll(cassDF).count

This is failing for me with the following -

Exception in thread "main" java.lang.AssertionError: assertion failed: No plan 
for CassandraRelation 
TableDef(yana_test,test_connector,ArrayBuffer(ColumnDef(key,PartitionKeyColumn,IntType,false)),ArrayBuff

er(),ArrayBuffer(ColumnDef(value,RegularColumn,VarCharType,false))), None, None



at scala.Predef$.assert(Predef.scala:179)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)

at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:278)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$anonfun$1.apply(QueryPlanner.scala:58)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$anonfun$1.apply(QueryPlanner.scala:58)

at scala.collection.Iterator$anon$13.hasNext(Iterator.scala:371)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)

at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$anonfun$14.apply(SparkStrategies.scala:292)

at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$anonfun$14.apply(SparkStrategies.scala:292)

at 
scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:244)

at 
scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:244)

at scala.collection.immutable.List.foreach(List.scala:318)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

at scala.collection.AbstractTraversable.map(Traversable.scala:105)

at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:292)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$anonfun$1.apply(QueryPlanner.scala:58)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$anonfun$1.apply(QueryPlanner.scala:58)

at scala.collection.Iterator$anon$13.hasNext(Iterator.scala:371)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)

at 
org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:152)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$anonfun$1.apply(QueryPlanner.scala:58)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$anonfun$1.apply(QueryPlanner.scala:58)

at scala.collection.Iterator$anon$13.hasNext(Iterator.scala:371)

at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)

at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:1092)

at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:1090)

at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:1096)

at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:1096)

at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:887)

at org.apache.spark.sql.DataFrame.count(DataFrame.scala:899)

at com.akamai.rum.test.unit.SparkTest.sparkPerfTest(SparkTest.java:123)

at com.akamai.rum.test.unit.SparkTest.main(SparkTest.java:37)

Is there a way to pull the data out of Cassandra on each executor but not try 
to push logic down into Cassandra?




Re: Restricting the number of iterations in Mllib Kmeans

2015-06-01 Thread Joseph Bradley
Hi Suman & Meethu,
Apologies---I was wrong about KMeans supporting an initial set of
centroids!  JIRA created: https://issues.apache.org/jira/browse/SPARK-8018
If you're interested in submitting a PR, please do!
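In the meantime, capping the iteration count works through the KMeans builder API
(a minimal sketch, not the DenseKMeans example itself; the file path and k are
placeholders, and it assumes an existing SparkContext):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// setMaxIterations caps the number of iterations even if the centroids
// have not converged yet.
val data = sc.textFile("data/mllib/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = new KMeans()
  .setK(3)
  .setMaxIterations(50)
  .run(data)
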
Thanks,
Joseph

On Mon, Jun 1, 2015 at 2:25 AM, MEETHU MATHEW 
wrote:

> Hi Joseph,
> I was unable to find any function in Kmeans.scala where the initial
> centroids could be specified by the user. Kindly help.
>
> Thanks & Regards,
> Meethu M
>
>
>
>   On Tuesday, 19 May 2015 6:54 AM, Joseph Bradley 
> wrote:
>
>
> Hi Suman,
>
> For maxIterations, are you using the DenseKMeans.scala example code?  (I'm
> guessing yes since you mention the command line.)  If so, then you should
> be able to specify maxIterations via an extra parameter like
> "--numIterations 50" (note the example uses "numIterations" in the current
> master instead of "maxIterations," which is sort of a bug in the example).
> If that does not cap the max iterations, then please report it as a bug.
>
> To specify the initial centroids, you will need to modify the DenseKMeans
> example code.  Please see the KMeans API docs for details.
>
> Good luck,
> Joseph
>
> On Mon, May 18, 2015 at 3:22 AM, MEETHU MATHEW 
> wrote:
>
> Hi,
> I think you cant supply an initial set of centroids to kmeans
>
> Thanks & Regards,
> Meethu M
>
>
>
>   On Friday, 15 May 2015 12:37 AM, Suman Somasundar <
> suman.somasun...@oracle.com> wrote:
>
>
> Hi,,
>
> I want to run a definite number of iterations in Kmeans.  There is a
> command line argument to set maxIterations, but even if I set it to a
> number, Kmeans runs until the centroids converge.
> Is there a specific way to specify it in command line?
>
> Also, I wanted to know if we can supply the initial set of centroids to
> the program instead of it choosing the centroids in random?
>
> Thanks,
> Suman.
>
>
>
>
>
>


RE: Migrate Relational to Distributed

2015-06-01 Thread Mohammed Guller
Brant,

You should be able to migrate most of your existing SQL code to Spark SQL, but 
remember that Spark SQL does not yet support the full ANSI standard. So you may 
need to rewrite some of your existing queries. 

Another thing to keep in mind is that Spark SQL is not real-time.  The response 
time for Spark SQL + Cassandra will not be the same as that of a 
properly-indexed database table (up to a certain size). On the other hand, the 
Spark SQL + Cassandra solution will scale better and provide higher throughput 
and availability more economically than an Oracle based solution. 
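
As a rough illustration of your Discussion 2 (a hypothetical sketch: the data source 
string is the Spark Cassandra Connector's, while the keyspace, table, and query are 
made up), existing SQL can often be reused once the Cassandra table is registered 
with the SQLContext:

// Hypothetical sketch: expose the Cassandra table under its old name so that
// existing SQL runs against it with minimal changes. Assumes the Spark
// Cassandra Connector is on the classpath and sqlContext already exists.
val patients = sqlContext.load(
  "org.apache.spark.sql.cassandra",
  Map("keyspace" -> "healthcare", "table" -> "patients"))
patients.registerTempTable("patients")

val visitsPerProvider = sqlContext.sql(
  "SELECT provider_id, COUNT(*) AS visits FROM patients GROUP BY provider_id")

Queries that lean on Oracle-specific SQL (analytic functions, hints, PL/SQL) would 
still need rewriting.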

Mohammed

-Original Message-
From: Brant Seibert [mailto:brantseib...@hotmail.com] 
Sent: Friday, May 22, 2015 3:23 PM
To: user@spark.apache.org
Subject: Migrate Relational to Distributed

Hi,  The healthcare industry can do wonderful things with Apache Spark.  But 
there is already a very large base of data and applications firmly rooted in 
the relational paradigm, and they are resistant to change - stuck on Oracle.  

**
QUESTION 1 - Migrate legacy relational data (plus new transactions) to 
distributed storage?  

DISCUSSION 1 - The primary advantage I see is not having to engage in the 
lengthy (1+ years) process of creating a relational data warehouse and cubes.  
Just store the data in a distributed system and "analyze first" in memory with 
Spark.

**
QUESTION 2 - Will we have to re-write the enormous amount of logic that is 
already built for the old relational system?

DISCUSSION 2 - If we move the data to distributed, can we simply run that 
existing relational logic as SparkSQL queries?  [existing SQL --> Spark Context 
--> Cassandra --> process in SparkSQL --> display in existing UI]. 
Can we create an RDD that uses existing SQL?  Or do we need to rewrite all our 
SQL?

**
DATA SIZE - We are adding many new data sources to a system that already 
manages health care data for over a million people.  The number of rows may not 
be enormous right now compared to the advertising industry, for example, but 
the number of dimensions runs well into the thousands.  If we add to this IoT 
data for each health care patient, which creates billions of events per day, 
the number of rows then grows exponentially.  We would like to be prepared to 
handle that huge data scenario.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Migrate-Relational-to-Distributed-tp22999.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkSQL's performance gets degraded depending on number of partitions of Hive tables..is it normal?

2015-06-01 Thread ogoh

Hello, 
I posted this question a while back but am posting it again to get your
attention.

I am using SparkSQL 1.3.1 and Hive 0.13.1 on AWS & YARN (tested under both
1.3.0 & 1.3.1). 
My hive table is partitioned.
I noticed that the query response time gets worse as the number of partitions
grows, even though the query targets a small subset of the partitions. TRACE-level
logs (the ThriftServer's) showed that it runs commands like getFileInfo,
getListing, and getBlockLocation for each and every partition (it also runs
getBlockLocation for each and every file) even though they are not part of the
queried partitions.

I don't know why it is necessary. Is it a bug of SparkSql? Is there a way to
avoid that?
Below is the detail of reporting this issue including logs.

Thanks,


--

My Hive table, an external table, is partitioned by date and hour. 
I expected that a query over certain partitions would read only the data
files of those partitions. 
I turned on TRACE level logging for ThriftServer since the query response
time even for narrowed partitions was very long. 
And I found that all the available partitions are checked during some steps. 

The logs showed an execution flow such as: 
== 
Step 1: Contacted HiveMetastore to get partition info  (cmd :
get_partitions) 

Step 2: Came up with an execution rule 

Step 3: Contact the namenode to make at least 4 calls (data is in HDFS) for all
available partitions of the table: 
   getFileInfo once, getListing once, and then repeat them for each
partition. 

Step 4: Contact NameNode to find blocklocation of all the partitions 

Step 5: Contact DataNode for each file of all the partitions 

Step 6:  Contact NameNode  again for all the partitions 

Step 7: SparkSQL generated some optimal plan 

Step 8: Contacted corresponding datanodes for the narrowed partitions (it
seems) 
And more. 
=== 

Why should Steps 3, 4, 5, and 6 check all partitions? 
After removing partitions from the table, the query was much quicker while
processing the same volume of data. 

I don't know if this is normal, or a Hive issue, a SparkSQL issue, or my
configuration issue. 
I added some logs below for some steps. 

I appreciate any of your advice. 

Thanks a lot, 
Okehee 

 some logs of some steps 

Query: select count(*) from api_search where pdate='2015-05-23'; 
( 

Step 2: 

2015-05-25 16:37:43 TRACE HiveContext$$anon$3:67 - 

=== Applying Rule
org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates === 

!'Project [COUNT(1) AS _c0#25L]Aggregate [],
[COUNT(1) AS _c0#25L] 

  Filter (pdate#26 = 2015-05-23)Filter (pdate#26 =
2015-05-23) 

   MetastoreRelation api_hdfs_perf, api_search, None MetastoreRelation
api_hdfs_perf, api_search, None 
.. 

Step 3: 

2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call ->
/10.128.193.211:9000: getFileInfo {src:
"/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"} 

2015-05-25 16:37:44 DEBUG Client:424 - The ping interval is 6 ms. 

2015-05-25 16:37:44 DEBUG Client:693 - Connecting to /10.128.193.211:9000 

2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh sending #151 

2015-05-25 16:37:44 DEBUG Client:944 - IPC Client (2100771791) connection to
/10.128.193.211:9000 from ogoh: starting, having connections 2 

2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh got value #151 

2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took
13ms 

2015-05-25 16:37:44 TRACE ProtobufRpcEngine:250 - 84: Response <-
/10.128.193.211:9000: getFileInfo {fs { fileType: IS_DIR path: "" length: 0
permission { perm: 493 } owner: "hadoop" group: "supergroup"
modification_time: 1432364487906 access_time: 0 block_replication: 0
blocksize: 0 fileId: 100602 childrenNum: 2 }} 

2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call ->
/10.128.193.211:9000: getFileInfo {src:
"/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"} 

2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh sending #152 

2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection
to /10.128.193.211:9000 from ogoh got value #152 

2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took
2ms. 
.. 


Step 4: 

2015-05-25 16:37:47 TRACE ProtobufRpcEngine:206 - 89: Call ->
/10.128.193.211:9000: getBlockLocations {src:
"/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-1.parquet"
offset: 0 length: 1342177280} 
... 


Step 5: 

2015-05-25 16:37:48 DEBUG DFSClient:951 - Connecting to datanode
10.191.137.197:9200 

2015-05-25 16:37:48 TRACE BlockReaderFactory:653 -
BlockReaderFactory(fileName=/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00/part-r-2015050800-2.parquet,
block=BP-1843960649-10.128.193.211-1427923845046:blk_10737

Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Igor Berman
Switching to simple POJOs instead of Avro for Spark serialization solved the
problem (I mean reading Avro from S3 and then mapping each Avro object to its
serializable POJO counterpart with the same fields; the POJO is registered
with Kryo).
Any thoughts on where to look for a problem/misconfiguration?
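
For reference, a minimal sketch of that setup (the class and field names are
placeholders):

import org.apache.spark.SparkConf

// Sketch only: a plain serializable case class mirroring the Avro record's
// fields, registered with Kryo. MyRecordPojo is a placeholder name.
case class MyRecordPojo(id: String, value: Long)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecordPojo]))

// After reading the Avro records, map each one to its POJO counterpart and
// run the reduce on the POJOs instead of the Avro objects.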

On 31 May 2015 at 22:48, Igor Berman  wrote:

> Hi
> We are using spark 1.3.1
> Avro-chill (tomorrow will check if its important) we register avro classes
> from java
> Avro 1.7.6
> On May 31, 2015 22:37, "Josh Rosen"  wrote:
>
>> Which Spark version are you using?  I'd like to understand whether this
>> change could be caused by recent Kryo serializer re-use changes in master /
>> Spark 1.4.
>>
>> On Sun, May 31, 2015 at 11:31 AM, igor.berman 
>> wrote:
>>
>>> after investigation the problem is somehow connected to avro
>>> serialization
>>> with kryo + chill-avro(mapping avro object to simple scala case class and
>>> running reduce on these case class objects solves the problem)
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092p23093.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>


Re: Event Logging to HDFS on Standalone Cluster "In Progress"

2015-06-01 Thread Richard Marscher
It looks like it is possibly a race condition between removing the
IN_PROGRESS suffix and building the history UI for the application.

`AppClient` sends an `UnregisterApplication(appId)` message to the `Master`
actor, which triggers the process to look for the app's eventLogs. If they
are suffixed with `.inprogress` then it will not build out the history UI
and instead build the error page I've seen.

Tying this together, calling SparkContext.stop() has the following block:


    if (_dagScheduler != null) {
      _dagScheduler.stop()
      _dagScheduler = null
    }
    if (_listenerBusStarted) {
      listenerBus.stop()
      _listenerBusStarted = false
    }
    _eventLogger.foreach(_.stop())
Dag Scheduler has a TaskScheduler which has a SparkDeploySchedulerBackend
which has an AppClient. AppClient sends itself a message to stop itself,
and, as mentioned above, it then sends a message to the Master where it
tries to build the history UI.

Meanwhile, EventLoggingListener.stop() is where the `.inprogress` suffix is
removed in the file system. It seems like the race in the Akka message passing
that triggers the Master's building of the history UI may be the only reason
the history UI ever gets properly set up in the first place. If the ordering
of calls were strict in the SparkContext.stop method, you would expect the
Master to always see the event logs as in progress.

Maybe I have missed something in tracing through the code? Is there a
reason that the eventLogger cannot be closed before the dagScheduler?

Thanks,
Richard

On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher 
wrote:

> Hi,
>
> In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
> folder on a Standalone cluster. This is generally working, all the logs are
> being written. However, from the Master Web UI, the vast majority of
> completed applications are labeled as not having a history:
> http://xxx.xxx.xxx.xxx:8080/history/not-found/?msg=Application+App+is+still+in+progress.&title=Application%20history%20not%20found%20(app-20150601160846-1914)
>
> The log does exists though:
>
> # hdfs dfs -ls -R /eventLogs/app-20150601160846-1914
>
> -rw-rw   3 user group1027848 2015-06-01 16:09
> /eventLogs/app-20150601160846-1914
>
> and `cat` the file ends with:
>
> {"Event":"SparkListenerApplicationEnd","Timestamp":1433174955077}
>
> This seems to indicate it saw and logged the application end.
>
> Is there a known issue here or a workaround? Looking at the source code I
> might have expected these files to end in `.inprogress` given the UI error
> message, but they don't.
>
> Thanks,
> Richard
>


Dataframe random permutation?

2015-06-01 Thread Cesar Flores
I would like to know what would be the best approach to randomly permuting a
DataFrame. I have tried:

df.sample(false,1.0,x).show(100)

where x is the seed. However, it gives the same result no matter the value
of x (it only gives different values when the fraction is smaller than 1.0).
I have also tried:

hc.createDataFrame(df.rdd.repartition(100),df.schema)

which appears to be a random permutation. Can someone confirm that the
last line is in fact a random permutation, or point me to a better
approach?


Thanks
-- 
Cesar Flores


Spark 1.3.1 On Mesos Issues.

2015-06-01 Thread John Omernik
All -

I am facing an odd issue and I am not really sure where to go for support
at this point.  I am running MapR, which complicates things as it relates to
Mesos; however, this HAS worked in the past with no issues, so I am stumped
here.

So for starters, here is what I am trying to run. This is a simple show
tables using the Hive Context:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row, HiveContext
sparkhc = HiveContext(sc)
test = sparkhc.sql("show tables")
for r in test.collect():
  print r

When I run it on 1.3.1 using ./bin/pyspark --master local, this works with
no issues.

When I run it using Mesos with all the settings configured (as they had
worked in the past), I get lost tasks, and when I zoom in on them, the error
being reported is below.  Basically it's a NullPointerException on
the com.mapr.fs.ShimLoader.  What's weird to me is that when I took each instance
and compared them, the classpath and everything else is exactly the same.
Yet running in local mode works, and running on Mesos fails.  Also of note:
when the task is scheduled to run on the same node as the one I run locally on,
it fails too! (Baffling).

Ok, for comparison, how I configured Mesos was to download the mapr4
package from spark.apache.org, using the exact same configuration file as
for 1.2.0 (except for changing the executor tgz from 1.2.0 to 1.3.1).
When I run this example with the mapr4 package for 1.2.0 there is no issue in
Mesos; everything runs as intended. Using the same package for 1.3.1,
it fails.

(Also of note, 1.2.1 gives a 404 error, 1.2.2 fails, and 1.3.0 fails as
well).

So basically, when I used 1.2.0 and followed a set of steps, it worked on
Mesos, and 1.3.1 fails.  Since this is a "current" version of Spark, MapR
supports 1.2.1 only.  (Still working on that.)

I guess I am at a loss right now on why this would be happening, any
pointers on where I could look or what I could tweak would be greatly
appreciated. Additionally, if there is something I could specifically draw
to the attention of MapR on this problem please let me know, I am perplexed
on the change from 1.2.0 to 1.3.1.

Thank you,

John




Full Error on 1.3.1 on Mesos:
15/05/19 09:31:26 INFO MemoryStore: MemoryStore started with capacity
1060.3 MB java.lang.NullPointerException at
com.mapr.fs.ShimLoader.getRootClassLoader(ShimLoader.java:96) at
com.mapr.fs.ShimLoader.injectNativeLoader(ShimLoader.java:232) at
com.mapr.fs.ShimLoader.load(ShimLoader.java:194) at
org.apache.hadoop.conf.CoreDefaultProperties.(CoreDefaultProperties.java:60)
at java.lang.Class.forName0(Native Method) at
java.lang.Class.forName(Class.java:274) at
org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1847)
at
org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2062)
at
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2272)
at
org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2224)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2141)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:992) at
org.apache.hadoop.conf.Configuration.set(Configuration.java:966) at
org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:98)
at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:43) at
org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:220) at
org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala) at
org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1959) at
org.apache.spark.storage.BlockManager.(BlockManager.scala:104) at
org.apache.spark.storage.BlockManager.(BlockManager.scala:179) at
org.apache.spark.SparkEnv$.create(SparkEnv.scala:310) at
org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:186) at
org.apache.spark.executor.MesosExecutorBackend.registered(MesosExecutorBackend.scala:70)
java.lang.RuntimeException: Failure loading MapRClient. at
com.mapr.fs.ShimLoader.injectNativeLoader(ShimLoader.java:283) at
com.mapr.fs.ShimLoader.load(ShimLoader.java:194) at
org.apache.hadoop.conf.CoreDefaultProperties.(CoreDefaultProperties.java:60)
at java.lang.Class.forName0(Native Method) at
java.lang.Class.forName(Class.java:274) at
org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1847)
at
org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2062)
at
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2272)
at
org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2224)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2141)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:992) at
org.apache.hadoop.conf.Configuration.set(Configuration.java:966) at
org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:98)
at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:43) at
org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala

RE: SparkSQL's performance gets degraded depending on number of partitions of Hive tables..is it normal?

2015-06-01 Thread java8964
Take a look about this Jira: https://issues.apache.org/jira/browse/SPARK-6910
Yong

> Date: Mon, 1 Jun 2015 12:26:16 -0700
> From: oke...@gmail.com
> To: user@spark.apache.org
> Subject: SparkSQL's performance gets degraded depending on number of 
> partitions of Hive tables..is it normal?
> 
> 
> Hello, 
> I posted this question a while back but am posting it again to get your
> attention.
> 
> I am using SparkSQL 1.3.1 and Hive 0.13.1 on AWS & YARN (tested under both
> 1.3.0 & 1.3.1). 
> My hive table is partitioned.
> I noticed that the query response time is bad depending on the number of
> partitions though the query targets a small subset of the partitions. TRACE
> level logs (ThriftServer's) showed that it runs commands like getFileInfo,
> getListing, getBlockLocation for each every partitions ( also runs
> getBlockLocation for each every files) though they are not part of the
> queried partitions.
> 
> I don't know why it is necessary. Is it a bug of SparkSql? Is there a way to
> avoid that?
> Below is the detail of reporting this issue including logs.
> 
> Thanks,
> 
> 
> --
> 
> My Hive table as an external table is partitioned with date and hour. 
> I expected that a query with certain partitions will read only the data
> files of the partitions. 
> I turned on TRACE level logging for ThriftServer since the query response
> time even for narrowed partitions was very long. 
> And I found that all the available partitions are checked during some steps. 
> 
> The logs showed as a execution flow  such as: 
> == 
> Step 1: Contacted HiveMetastore to get partition info  (cmd :
> get_partitions) 
> 
> Step 2: Came up with an execution rule 
> 
> Step 3: Contact namenode to make at least 4 calls (data is in HDFS) for all
> available partitions of the table : 
>getFileInfo once, getListing once, and the repeat them again for each
> partition. 
> 
> Step 4: Contact NameNode to find blocklocation of all the partitions 
> 
> Step 5: Contact DataNode for each file of all the partitions 
> 
> Step 6:  Contact NameNode  again for all the partitions 
> 
> Step 7: SparkSQL generated some optimal plan 
> 
> Step 8: Contacted corresponding datanodes for the narrowed partitions (it
> seems) 
> And more. 
> === 
> 
> Why Step3, 4, 5, and 6 should check all partitions? 
> After removing partitions from the table, the query was much quicker while
> processing same volume of data. 
> 
> I don't know if it is normal or Hive issue or SparkSQL issue or my
> configuration issue. 
> I added some logs below for some steps. 
> 
> I appreciate any of your advice. 
> 
> Thanks a lot, 
> Okehee 
> 
>  some logs of some steps 
> 
> Query: select count(*) from api_search where pdate='2015-05-23'; 
> ( 
> 
> Step 2: 
> 
> 2015-05-25 16:37:43 TRACE HiveContext$$anon$3:67 - 
> 
> === Applying Rule
> org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates === 
> 
> !'Project [COUNT(1) AS _c0#25L]Aggregate [],
> [COUNT(1) AS _c0#25L] 
> 
>   Filter (pdate#26 = 2015-05-23)Filter (pdate#26 =
> 2015-05-23) 
> 
>MetastoreRelation api_hdfs_perf, api_search, None MetastoreRelation
> api_hdfs_perf, api_search, None 
> .. 
> 
> Step 3: 
> 
> 2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call ->
> /10.128.193.211:9000: getFileInfo {src:
> "/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"} 
> 
> 2015-05-25 16:37:44 DEBUG Client:424 - The ping interval is 6 ms. 
> 
> 2015-05-25 16:37:44 DEBUG Client:693 - Connecting to /10.128.193.211:9000 
> 
> 2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection
> to /10.128.193.211:9000 from ogoh sending #151 
> 
> 2015-05-25 16:37:44 DEBUG Client:944 - IPC Client (2100771791) connection to
> /10.128.193.211:9000 from ogoh: starting, having connections 2 
> 
> 2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection
> to /10.128.193.211:9000 from ogoh got value #151 
> 
> 2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took
> 13ms 
> 
> 2015-05-25 16:37:44 TRACE ProtobufRpcEngine:250 - 84: Response <-
> /10.128.193.211:9000: getFileInfo {fs { fileType: IS_DIR path: "" length: 0
> permission { perm: 493 } owner: "hadoop" group: "supergroup"
> modification_time: 1432364487906 access_time: 0 block_replication: 0
> blocksize: 0 fileId: 100602 childrenNum: 2 }} 
> 
> 2015-05-25 16:37:44 TRACE ProtobufRpcEngine:206 - 84: Call ->
> /10.128.193.211:9000: getFileInfo {src:
> "/user/datawarehouse/api_hdfs_perf/api_search/pdate=2015-05-08/phour=00"} 
> 
> 2015-05-25 16:37:44 DEBUG Client:1007 - IPC Client (2100771791) connection
> to /10.128.193.211:9000 from ogoh sending #152 
> 
> 2015-05-25 16:37:44 DEBUG Client:1064 - IPC Client (2100771791) connection
> to /10.128.193.211:9000 from ogoh got value #152 
> 
> 2015-05-25 16:37:44 DEBUG ProtobufRpcEngine:235 - Call: getFileInfo took
> 2ms. 
> .. 
> 
> 
> Step 4: 
> 
> 2015-05-25 16:37:47 T

Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Josh Rosen
How much work is it to produce a small standalone reproduction?  Can you
create an Avro file with some mock data, maybe 10 or so records, then
reproduce this locally?

On Mon, Jun 1, 2015 at 12:31 PM, Igor Berman  wrote:

> switching to use simple pojos instead of using avro for spark
> serialization solved the problem(I mean reading avro from s3 and than
> mapping each avro object to it's pojo serializable counterpart with same
> fields, pojo is registered withing kryo)
> Any thought where to look for a problem/misconfiguration?
>
> On 31 May 2015 at 22:48, Igor Berman  wrote:
>
>> Hi
>> We are using spark 1.3.1
>> Avro-chill (tomorrow will check if its important) we register avro
>> classes from java
>> Avro 1.7.6
>> On May 31, 2015 22:37, "Josh Rosen"  wrote:
>>
>>> Which Spark version are you using?  I'd like to understand whether this
>>> change could be caused by recent Kryo serializer re-use changes in master /
>>> Spark 1.4.
>>>
>>> On Sun, May 31, 2015 at 11:31 AM, igor.berman 
>>> wrote:
>>>
 after investigation the problem is somehow connected to avro
 serialization
 with kryo + chill-avro(mapping avro object to simple scala case class
 and
 running reduce on these case class objects solves the problem)




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092p23093.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>


Re: Event Logging to HDFS on Standalone Cluster "In Progress"

2015-06-01 Thread Richard Marscher
Ah, apologies, I found an existing issue, and a fix has already gone out for
this in 1.3.1 and up: https://issues.apache.org/jira/browse/SPARK-6036.

On Mon, Jun 1, 2015 at 3:39 PM, Richard Marscher 
wrote:

> It looks like it is possibly a race condition between removing the
> IN_PROGRESS and building the history UI for the application.
>
> `AppClient` sends an `UnregisterApplication(appId)` message to the
> `Master` actor, which triggers the process to look for the app's eventLogs.
> If they are suffixed with `.inprogress` then it will not build out the
> history UI and instead build the error page I've seen.
>
> Tying this together, calling SparkContext.stop() has the following block:
>
>
> if (_dagScheduler != null) { _dagScheduler.stop() _dagScheduler = null }
> if (_listenerBusStarted) { listenerBus.stop() _listenerBusStarted = false
> } _eventLogger.foreach(_.stop())
> Dag Scheduler has a TaskScheduler which has a SparkDeploySchedulerBackend
> which has an AppClient. AppClient sends itself a message to stop itself,
> and like mentioned above, it then sends a message to the Master where it
> tries to build the history UI.
>
> Meanwhile, EventLoggingListener.stop() is where the `.inprogress` suffix
> is removed in the file-system. It seems like the race condition of the Akka
> message passing to trigger the Master's building of the history UI may be
> the only reason the history UI ever gets properly setup in the first place.
> Because if the ordering of calls were all strict in the SparkContext.stop
> method then you would expect the Master to always see the event logs as in
> in progress.
>
> Maybe I have missed something in tracing through the code? Is there a
> reason that the eventLogger cannot be closed before the dagScheduler?
>
> Thanks,
> Richard
>
> On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher <
> rmarsc...@localytics.com> wrote:
>
>> Hi,
>>
>> In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
>> folder on a Standalone cluster. This is generally working, all the logs are
>> being written. However, from the Master Web UI, the vast majority of
>> completed applications are labeled as not having a history:
>> http://xxx.xxx.xxx.xxx:8080/history/not-found/?msg=Application+App+is+still+in+progress.&title=Application%20history%20not%20found%20(app-20150601160846-1914)
>>
>> The log does exists though:
>>
>> # hdfs dfs -ls -R /eventLogs/app-20150601160846-1914
>>
>> -rw-rw   3 user group1027848 2015-06-01 16:09
>> /eventLogs/app-20150601160846-1914
>>
>> and `cat` the file ends with:
>>
>> {"Event":"SparkListenerApplicationEnd","Timestamp":1433174955077}
>>
>> This seems to indicate it saw and logged the application end.
>>
>> Is there a known issue here or a workaround? Looking at the source code I
>> might have expected these files to end in `.inprogress` given the UI error
>> message, but they don't.
>>
>> Thanks,
>> Richard
>>
>
>


Re: Dataframe random permutation?

2015-06-01 Thread Peter Rudenko

Hi Cesar,
try to do:

hc.createDataFrame(df.rdd.coalesce(NUM_PARTITIONS, shuffle = true), df.schema)
It's a bit inefficient, but should shuffle the whole dataframe.
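
Another rough option (untested sketch; it pays for the permutation with a sort):
key each row with a seeded random double, sort by that key, and rebuild the
DataFrame. A different seed then gives a different permutation.

import scala.util.Random

// Untested sketch: rows get a per-partition seeded random key, are sorted by
// it, and the values are wrapped back into a DataFrame with the old schema.
val seed = 42L
val shuffled = hc.createDataFrame(
  df.rdd
    .mapPartitionsWithIndex { (i, iter) =>
      val rng = new Random(seed + i)
      iter.map(row => (rng.nextDouble(), row))
    }
    .sortByKey()
    .values,
  df.schema)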

Thanks,
Peter Rudenko
On 2015-06-01 22:49, Cesar Flores wrote:


I would like to know what will be the best approach to randomly 
permute a Data Frame. I have tried:


df.sample(false,1.0,x).show(100)

where x is the seed. However, it gives the same result no matter the 
value of x (it only gives different values when the fraction is 
smaller than 1.0) . I have tried also:


hc.createDataFrame(df.rdd.repartition(100),df.schema)

which appears to be a random permutation. Can some one confirm me that 
the last line is in fact a random permutation, or point me out to a 
better approach?



Thanks
--
Cesar Flores




Re: PySpark with OpenCV causes python worker to crash

2015-06-01 Thread Davies Liu
Could you run the single-threaded version on the worker machine to make sure
that OpenCV is installed and configured correctly?

On Sat, May 30, 2015 at 6:29 AM, Sam Stoelinga  wrote:
> I've verified the issue lies within Spark running OpenCV code and not within
> the sequence file BytesWritable formatting.
>
> This is the code which can reproduce that spark is causing the failure by
> not using the sequencefile as input at all but running the same function
> with same input on spark but fails:
>
> def extract_sift_features_opencv(imgfile_imgbytes):
> imgfilename, discardsequencefile = imgfile_imgbytes
> imgbytes = bytearray(open("/tmp/img.jpg", "rb").read())
> nparr = np.fromstring(buffer(imgbytes), np.uint8)
> img = cv2.imdecode(nparr, 1)
> gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
> sift = cv2.xfeatures2d.SIFT_create()
> kp, descriptors = sift.detectAndCompute(gray, None)
> return (imgfilename, "test")
>
> And corresponding tests.py:
> https://gist.github.com/samos123/d383c26f6d47d34d32d6
>
>
> On Sat, May 30, 2015 at 8:04 PM, Sam Stoelinga 
> wrote:
>>
>> Thanks for the advice! The following line causes spark to crash:
>>
>> kp, descriptors = sift.detectAndCompute(gray, None)
>>
>> But I do need this line to be executed and the code does not crash when
>> running outside of Spark but passing the same parameters. You're saying
>> maybe the bytes from the sequencefile got somehow transformed and don't
>> represent an image anymore causing OpenCV to crash the whole python
>> executor.
>>
>> On Fri, May 29, 2015 at 2:06 AM, Davies Liu  wrote:
>>>
>>> Could you try to comment out some lines in
>>> `extract_sift_features_opencv` to find which line cause the crash?
>>>
>>> If the bytes came from sequenceFile() is broken, it's easy to crash a
>>> C library in Python (OpenCV).
>>>
>>> On Thu, May 28, 2015 at 8:33 AM, Sam Stoelinga 
>>> wrote:
>>> > Hi sparkers,
>>> >
>>> > I am working on a PySpark application which uses the OpenCV library. It
>>> > runs
>>> > fine when running the code locally but when I try to run it on Spark on
>>> > the
>>> > same Machine it crashes the worker.
>>> >
>>> > The code can be found here:
>>> > https://gist.github.com/samos123/885f9fe87c8fa5abf78f
>>> >
>>> > This is the error message taken from STDERR of the worker log:
>>> > https://gist.github.com/samos123/3300191684aee7fc8013
>>> >
>>> > Would like pointers or tips on how to debug further? Would be nice to
>>> > know
>>> > the reason why the worker crashed.
>>> >
>>> > Thanks,
>>> > Sam Stoelinga
>>> >
>>> >
>>> > org.apache.spark.SparkException: Python worker exited unexpectedly
>>> > (crashed)
>>> > at
>>> > org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:172)
>>> > at
>>> >
>>> > org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:176)
>>> > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
>>> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>> > at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>> > at
>>> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>> > at
>>> >
>>> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> > at
>>> >
>>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> > at java.lang.Thread.run(Thread.java:745)
>>> > Caused by: java.io.EOFException
>>> > at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>> > at
>>> > org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108)
>>> >
>>> >
>>> >
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



map - reduce only with disk

2015-06-01 Thread octavian.ganea
Dear all,

Does anyone know how I can force Spark to use only the disk when doing a
simple flatMap(..).groupByKey.reduce(_ + _)? Thank you!
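
(A rough sketch of the knobs usually reached for; the values and the pipeline
shape below are illustrative only, and as far as I know there is no setting
that avoids memory entirely - this only keeps the RDD itself on disk and makes
the shuffle spill early.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Illustrative only: shrink the in-memory fractions so shuffles spill to disk
// early, and persist intermediate data with DISK_ONLY.
val conf = new SparkConf()
  .setAppName("disk-heavy-aggregation")
  .set("spark.shuffle.memoryFraction", "0.05")
  .set("spark.storage.memoryFraction", "0.1")
val sc = new SparkContext(conf)

val pairs = sc.textFile("hdfs:///input")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .persist(StorageLevel.DISK_ONLY)

// reduceByKey combines map-side, so full groups never have to fit in memory.
val counts = pairs.reduceByKey(_ + _)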



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/map-reduce-only-with-disk-tp23102.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.3.1 bundle does not build - unresolved dependency

2015-06-01 Thread Stephen Boesch
I downloaded the 1.3.1 distro tarball

$ll ../spark-1.3.1.tar.gz
-rw-r-@ 1 steve  staff  8500861 Apr 23 09:58 ../spark-1.3.1.tar.gz

However the build on it is failing with an unresolved dependency:
*configuration
not public*

$ build/sbt   assembly -Dhadoop.version=2.5.2 -Pyarn -Phadoop-2.4

[error] (network-shuffle/*:update) sbt.ResolveException: *unresolved
dependency: *org.apache.spark#spark-network-common_2.10;1.3.1: *configuration
not public* in org.apache.spark#spark-network-common_2.10;1.3.1: 'test'. It
was required from org.apache.spark#spark-network-shuffle_2.10;1.3.1 test

Is there a known workaround for this?

thanks


How to monitor Spark Streaming from Kafka?

2015-06-01 Thread dgoldenberg
Hi,

What are some of the good/adopted approaches to monitoring Spark Streaming
from Kafka?  I see that there are things like
http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they all
assume that Receiver-based streaming is used?

Then "Note that one disadvantage of this approach (Receiverless Approach,
#2) is that it does not update offsets in Zookeeper, hence Zookeeper-based
Kafka monitoring tools will not show progress. However, you can access the
offsets processed by this approach in each batch and update Zookeeper
yourself".

The code sample, however, seems sparse. What do you need to do here? -
 directKafkaStream.foreachRDD(
     new Function<JavaPairRDD<String, String>, Void>() {
         @Override
         public Void call(JavaPairRDD<String, String> rdd) throws IOException {
             OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd).offsetRanges();
             // offsetRanges.length = # of Kafka partitions being consumed
             ...
             return null;
         }
     }
 );

and if these are updated, will KafkaOffsetMonitor work?

Monitoring seems to center around the notion of a consumer group.  But in
the receiverless approach, code on the Spark consumer side doesn't seem to
expose a consumer group parameter.  Where does it go?  Can I/should I just
pass in group.id as part of the kafkaParams HashMap?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Tathagata Das
In the receiver-less "direct" approach, there is no concept of consumer
group as we dont use the Kafka High Level consumer (that uses ZK). Instead
Spark Streaming manages offsets on its own, giving tighter guarantees. If
you want to monitor the progress of the processing of offsets, you will
have to update ZK yourself. With the code snippet you posted, you can get
the range of offsets that were processed in each batch, and accordingly
update Zookeeper using some consumer group name of your choice.
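
Roughly, in Scala (sketch only; the actual commit call depends on whichever
ZK/Kafka client you use, so it is left as a placeholder):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Sketch only: capture the offset ranges processed in each batch and hand
// them to your own client under a consumer group name of your choice.
// myZkClient and "my-group" are placeholders.
directKafkaStream.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { range =>
    // myZkClient.commitOffset("my-group", range.topic, range.partition, range.untilOffset)
  }
}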

TD

On Mon, Jun 1, 2015 at 2:23 PM, dgoldenberg 
wrote:

> Hi,
>
> What are some of the good/adopted approached to monitoring Spark Streaming
> from Kafka?  I see that there are things like
> http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they all
> assume that Receiver-based streaming is used?
>
> Then "Note that one disadvantage of this approach (Receiverless Approach,
> #2) is that it does not update offsets in Zookeeper, hence Zookeeper-based
> Kafka monitoring tools will not show progress. However, you can access the
> offsets processed by this approach in each batch and update Zookeeper
> yourself".
>
> The code sample, however, seems sparse. What do you need to do here? -
>  directKafkaStream.foreachRDD(
>  new Function, Void>() {
>  @Override
>  public Void call(JavaPairRDD rdd) throws
> IOException {
>  OffsetRange[] offsetRanges =
> ((HasOffsetRanges)rdd).offsetRanges
>  // offsetRanges.length = # of Kafka partitions being consumed
>  ...
>  return null;
>  }
>  }
>  );
>
> and if these are updated, will KafkaOffsetMonitor work?
>
> Monitoring seems to center around the notion of a consumer group.  But in
> the receiverless approach, code on the Spark consumer side doesn't seem to
> expose a consumer group parameter.  Where does it go?  Can I/should I just
> pass in group.id as part of the kafkaParams HashMap?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark updateStateByKey fails with class leak when using case classes - resend

2015-06-01 Thread Tathagata Das
Interesting, only in local[*]! In the GitHub repo you pointed to, what is the
main class that you were running?

TD


On Mon, May 25, 2015 at 9:23 AM, rsearle  wrote:

> Further experimentation indicates these problems only occur when master is
> local[*].
>
> There are no issues if a standalone cluster is used.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-updateStateByKey-fails-with-class-leak-when-using-case-classes-resend-tp22793p23020.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Cody Koeninger
KafkaCluster.scala in the spark/external/kafka project has a bunch of API
code, including code for updating Kafka-managed ZK offsets.  Look at
setConsumerOffsets.

Unfortunately all of that code is private, but you can either write your
own, copy it, or do what I do (sed out private[spark] and rebuild spark...)

On Mon, Jun 1, 2015 at 4:51 PM, Tathagata Das  wrote:

> In the receiver-less "direct" approach, there is no concept of consumer
> group as we dont use the Kafka High Level consumer (that uses ZK). Instead
> Spark Streaming manages offsets on its own, giving tighter guarantees. If
> you want to monitor the progress of the processing of offsets, you will
> have to update ZK yourself. With the code snippet you posted, you can get
> the range of offsets that were processed in each batch, and accordingly
> update Zookeeper using some consumer group name of your choice.
>
> TD
>
> On Mon, Jun 1, 2015 at 2:23 PM, dgoldenberg 
> wrote:
>
>> Hi,
>>
>> What are some of the good/adopted approached to monitoring Spark Streaming
>> from Kafka?  I see that there are things like
>> http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they all
>> assume that Receiver-based streaming is used?
>>
>> Then "Note that one disadvantage of this approach (Receiverless Approach,
>> #2) is that it does not update offsets in Zookeeper, hence Zookeeper-based
>> Kafka monitoring tools will not show progress. However, you can access the
>> offsets processed by this approach in each batch and update Zookeeper
>> yourself".
>>
>> The code sample, however, seems sparse. What do you need to do here? -
>>  directKafkaStream.foreachRDD(
>>  new Function, Void>() {
>>  @Override
>>  public Void call(JavaPairRDD rdd) throws
>> IOException {
>>  OffsetRange[] offsetRanges =
>> ((HasOffsetRanges)rdd).offsetRanges
>>  // offsetRanges.length = # of Kafka partitions being consumed
>>  ...
>>  return null;
>>  }
>>  }
>>  );
>>
>> and if these are updated, will KafkaOffsetMonitor work?
>>
>> Monitoring seems to center around the notion of a consumer group.  But in
>> the receiverless approach, code on the Spark consumer side doesn't seem to
>> expose a consumer group parameter.  Where does it go?  Can I/should I just
>> pass in group.id as part of the kafkaParams HashMap?
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread Andrew Or
Hi Deepak,

This is a notorious bug that is being tracked at
https://issues.apache.org/jira/browse/SPARK-4105. We have fixed one source
of this bug (it turns out Snappy had a bug in buffer reuse that caused data
corruption). There are other known sources that are being addressed in
outstanding patches currently.

Since you're using 1.3.1 my guess is that you don't have this patch:
https://github.com/apache/spark/pull/6176, which I believe should fix the
issue in your case. It's merged for 1.3.2 (not yet released) but not in
time for 1.3.1, so feel free to patch it yourself and see if it works.

-Andrew


2015-06-01 8:00 GMT-07:00 ÐΞ€ρ@Ҝ (๏̯͡๏) :

> Any suggestions ?
>
> I using Spark 1.3.1 to read   sequence file stored in Sequence File format
> (SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v?
> )
>
> with this code and settings
> sc.sequenceFile(dwTable, classOf[Text], classOf[Text]).partitionBy(new
> org.apache.spark.HashPartitioner(2053))
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .set("spark.kryoserializer.buffer.mb",
> arguments.get("buffersize").get)
>   .set("spark.kryoserializer.buffer.max.mb",
> arguments.get("maxbuffersize").get)
>   .set("spark.driver.maxResultSize",
> arguments.get("maxResultSize").get)
>   .set("spark.yarn.maxAppAttempts", "0")
>   //.set("spark.akka.askTimeout", arguments.get("askTimeout").get)
>   //.set("spark.akka.timeout", arguments.get("akkaTimeout").get)
>   //.set("spark.worker.timeout", arguments.get("workerTimeout").get)
>
> .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
>
>
> and values are
> buffersize=128 maxbuffersize=1068 maxResultSize=200G
>
>
> And i see this exception in each executor task
>
> FetchFailed(BlockManagerId(278, phxaishdc9dn1830.stratus.phx.ebay.com,
> 54757), shuffleId=6, mapId=2810, reduceId=1117, message=
>
> org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)
>
> at
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>
> at
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>
> at
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>
> at
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)
>
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:172)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
> *Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)*
>
> at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
>
> at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>
> at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
>
> at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
>
> at
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135)
>
> at
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92)
>
> at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>
> at
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:160)
>
> at
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1165)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:301)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:300)
>
> at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
>
> at scala.util.Try$.apply(Try.scala:161)
>
> at scala.util.Success.map(Try.scala:206)
>
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(Shuffle

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread Josh Rosen
If you can't run a patched Spark version, then you could also consider
using LZF compression instead, since that codec isn't affected by this bug.
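
For example (this is the standard codec setting; it has to be set before the
SparkContext is created):

import org.apache.spark.SparkConf

// Switch the codec used for shuffle/broadcast/RDD block compression from
// snappy (the default) to lzf.
val conf = new SparkConf().set("spark.io.compression.codec", "lzf")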

On Mon, Jun 1, 2015 at 3:32 PM, Andrew Or  wrote:

> Hi Deepak,
>
> This is a notorious bug that is being tracked at
> https://issues.apache.org/jira/browse/SPARK-4105. We have fixed one
> source of this bug (it turns out Snappy had a bug in buffer reuse that
> caused data corruption). There are other known sources that are being
> addressed in outstanding patches currently.
>
> Since you're using 1.3.1 my guess is that you don't have this patch:
> https://github.com/apache/spark/pull/6176, which I believe should fix the
> issue in your case. It's merged for 1.3.2 (not yet released) but not in
> time for 1.3.1, so feel free to patch it yourself and see if it works.
>
> -Andrew
>
>
> 2015-06-01 8:00 GMT-07:00 ÐΞ€ρ@Ҝ (๏̯͡๏) :
>
>> Any suggestions ?
>>
>> I using Spark 1.3.1 to read   sequence file stored in Sequence File
>> format
>> (SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v?
>> )
>>
>> with this code and settings
>> sc.sequenceFile(dwTable, classOf[Text], classOf[Text]).partitionBy(new
>> org.apache.spark.HashPartitioner(2053))
>> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>   .set("spark.kryoserializer.buffer.mb",
>> arguments.get("buffersize").get)
>>   .set("spark.kryoserializer.buffer.max.mb",
>> arguments.get("maxbuffersize").get)
>>   .set("spark.driver.maxResultSize",
>> arguments.get("maxResultSize").get)
>>   .set("spark.yarn.maxAppAttempts", "0")
>>   //.set("spark.akka.askTimeout", arguments.get("askTimeout").get)
>>   //.set("spark.akka.timeout", arguments.get("akkaTimeout").get)
>>   //.set("spark.worker.timeout", arguments.get("workerTimeout").get)
>>
>> .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
>>
>>
>> and values are
>> buffersize=128 maxbuffersize=1068 maxResultSize=200G
>>
>>
>> And i see this exception in each executor task
>>
>> FetchFailed(BlockManagerId(278, phxaishdc9dn1830.stratus.phx.ebay.com,
>> 54757), shuffleId=6, mapId=2810, reduceId=1117, message=
>>
>> org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)
>>
>> at
>> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>>
>> at
>> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>>
>> at
>> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>>
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>
>> at
>> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>>
>> at
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>
>> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>>
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>
>> at
>> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)
>>
>> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:172)
>>
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>>
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>
>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>> *Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)*
>>
>> at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
>>
>> at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>>
>> at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
>>
>> at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
>>
>> at
>> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135)
>>
>> at
>> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92)
>>
>> at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>>
>> at
>> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:160)
>>
>> at
>> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1165)
>>
>> at
>> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:3

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Otis Gospodnetic
I think you can use SPM - http://sematext.com/spm - it will give you all
Spark and all Kafka metrics, including offsets broken down by topic, etc.
out of the box.  I see more and more people using it to monitor various
components in data processing pipelines, a la
http://blog.sematext.com/2015/04/22/monitoring-stream-processing-tools-cassandra-kafka-and-spark/

Otis

On Mon, Jun 1, 2015 at 5:23 PM, dgoldenberg 
wrote:

> Hi,
>
> What are some of the good/adopted approached to monitoring Spark Streaming
> from Kafka?  I see that there are things like
> http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they all
> assume that Receiver-based streaming is used?
>
> Then "Note that one disadvantage of this approach (Receiverless Approach,
> #2) is that it does not update offsets in Zookeeper, hence Zookeeper-based
> Kafka monitoring tools will not show progress. However, you can access the
> offsets processed by this approach in each batch and update Zookeeper
> yourself".
>
> The code sample, however, seems sparse. What do you need to do here? -
>  directKafkaStream.foreachRDD(
>  new Function<JavaPairRDD<String, String>, Void>() {
>  @Override
>  public Void call(JavaPairRDD<String, String> rdd) throws
> IOException {
>  OffsetRange[] offsetRanges =
> ((HasOffsetRanges)rdd).offsetRanges
>  // offsetRanges.length = # of Kafka partitions being consumed
>  ...
>  return null;
>  }
>  }
>  );
>
> and if these are updated, will KafkaOffsetMonitor work?
>
> Monitoring seems to center around the notion of a consumer group.  But in
> the receiverless approach, code on the Spark consumer side doesn't seem to
> expose a consumer group parameter.  Where does it go?  Can I/should I just
> pass in group.id as part of the kafkaParams HashMap?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
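
A minimal Scala sketch of the direct-stream offset access discussed above. The
streaming context (ssc), the broker address and the topic name are placeholder
assumptions; whether the direct approach makes any use of group.id is exactly
what the question asks, so the sketch only shows where such a parameter would
be passed:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",       // placeholder broker
  "group.id"             -> "my-consumer-group")  // passed like any other Kafka parameter

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))              // placeholder topic

stream.foreachRDD { rdd =>
  // Each RDD produced by the direct stream carries the Kafka offset ranges it covers.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    // Writing these values back to ZooKeeper yourself is what would let
    // ZooKeeper-based monitors such as KafkaOffsetMonitor show progress.
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}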


Re: Spark 1.3.1 On Mesos Issues.

2015-06-01 Thread Dean Wampler
It would be nice to see the code for MapR FS Java API, but my google foo
failed me (assuming it's open source)...

So, shooting in the dark ;) there are a few things I would check, if you
haven't already:

1. Could there be 1.2 versions of some Spark jars that get picked up at run
time (but apparently not in local mode) on one or more nodes? (Side
question: Does your node experiment fail on all nodes?) Put another way,
are the classpaths good for all JVM tasks?
2. Can you use just MapR and Spark 1.3.1 successfully, bypassing Mesos?

Incidentally, how are you combining Mesos and MapR? Are you running Spark
in Mesos, but accessing data in MapR-FS?

Perhaps the MapR "shim" library doesn't support Spark 1.3.1.

HTH,

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Mon, Jun 1, 2015 at 2:49 PM, John Omernik  wrote:

> All -
>
> I am facing an odd issue and I am not really sure where to go for support
> at this point.  I am running MapR which complicates things as it relates to
> Mesos, however this HAS worked in the past with no issues so I am stumped
> here.
>
> So for starters, here is what I am trying to run. This is a simple show
> tables using the Hive Context:
>
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext, Row, HiveContext
> sparkhc = HiveContext(sc)
> test = sparkhc.sql("show tables")
> for r in test.collect():
>   print r
>
> When I run it on 1.3.1 using ./bin/pyspark --master local  This works with
> no issues.
>
> When I run it using Mesos with all the settings configured (as they had
> worked in the past) I get lost tasks, and when I zoom in on them, the error
> that is being reported is below.  Basically it's a NullPointerException on
> the com.mapr.fs.ShimLoader.  What's weird to me is that when I took each instance
> and compared both together, the classpath and everything is exactly the same.
> Yet running in local mode works, and running in mesos fails.  Also of note,
> when the task is scheduled to run on the same node as when I run locally,
> that fails too! (Baffling).
>
> Ok, for comparison, how I configured Mesos was to download the mapr4
> package from spark.apache.org, using the exact same configuration file
> as the 1.2.0 setup (except for changing the executor tgz from 1.2.0 to 1.3.1).
> When I run this example with the mapr4 for 1.2.0 there is no issue in
> Mesos, everything runs as intended. Using the same package for 1.3.1 then
> it fails.
>
> (Also of note, 1.2.1 gives a 404 error, 1.2.2 fails, and 1.3.0 fails as
> well).
>
> So basically, when I used 1.2.0 and followed a set of steps, it worked on
> Mesos, and 1.3.1 fails.  Since this is a "current" version of Spark, MapR
> supports 1.2.1 only.  (Still working on that).
>
> I guess I am at a loss right now on why this would be happening, any
> pointers on where I could look or what I could tweak would be greatly
> appreciated. Additionally, if there is something I could specifically draw
> to the attention of MapR on this problem please let me know, I am perplexed
> on the change from 1.2.0 to 1.3.1.
>
> Thank you,
>
> John
>
>
>
>
> Full Error on 1.3.1 on Mesos:
> 15/05/19 09:31:26 INFO MemoryStore: MemoryStore started with capacity
> 1060.3 MB java.lang.NullPointerException at
> com.mapr.fs.ShimLoader.getRootClassLoader(ShimLoader.java:96) at
> com.mapr.fs.ShimLoader.injectNativeLoader(ShimLoader.java:232) at
> com.mapr.fs.ShimLoader.load(ShimLoader.java:194) at
> org.apache.hadoop.conf.CoreDefaultProperties.(CoreDefaultProperties.java:60)
> at java.lang.Class.forName0(Native Method) at
> java.lang.Class.forName(Class.java:274) at
> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1847)
> at
> org.apache.hadoop.conf.Configuration.getProperties(Configuration.java:2062)
> at
> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2272)
> at
> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2224)
> at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2141)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:992) at
> org.apache.hadoop.conf.Configuration.set(Configuration.java:966) at
> org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:98)
> at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:43) at
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:220) at
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala) at
> org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1959) at
> org.apache.spark.storage.BlockManager.(BlockManager.scala:104) at
> org.apache.spark.storage.BlockManager.(BlockManager.scala:179) at
> org.apache.spark.SparkEnv$.create(SparkEnv.scala:310) at
> org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:186) at
> org.apache.spar

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Dmitry Goldenberg
Thank you, Tathagata, Cody, Otis.

- Dmitry


On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic  wrote:

> I think you can use SPM - http://sematext.com/spm - it will give you all
> Spark and all Kafka metrics, including offsets broken down by topic, etc.
> out of the box.  I see more and more people using it to monitor various
> components in data processing pipelines, a la
> http://blog.sematext.com/2015/04/22/monitoring-stream-processing-tools-cassandra-kafka-and-spark/
>
> Otis
>
> On Mon, Jun 1, 2015 at 5:23 PM, dgoldenberg 
> wrote:
>
>> Hi,
>>
>> What are some of the good/adopted approaches to monitoring Spark Streaming
>> from Kafka?  I see that there are things like
>> http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they all
>> assume that Receiver-based streaming is used?
>>
>> Then "Note that one disadvantage of this approach (Receiverless Approach,
>> #2) is that it does not update offsets in Zookeeper, hence Zookeeper-based
>> Kafka monitoring tools will not show progress. However, you can access the
>> offsets processed by this approach in each batch and update Zookeeper
>> yourself".
>>
>> The code sample, however, seems sparse. What do you need to do here? -
>>  directKafkaStream.foreachRDD(
>>  new Function<JavaPairRDD<String, String>, Void>() {
>>  @Override
>>  public Void call(JavaPairRDD<String, String> rdd) throws
>> IOException {
>>  OffsetRange[] offsetRanges =
>> ((HasOffsetRanges)rdd).offsetRanges
>>  // offsetRanges.length = # of Kafka partitions being consumed
>>  ...
>>  return null;
>>  }
>>  }
>>  );
>>
>> and if these are updated, will KafkaOffsetMonitor work?
>>
>> Monitoring seems to center around the notion of a consumer group.  But in
>> the receiverless approach, code on the Spark consumer side doesn't seem to
>> expose a consumer group parameter.  Where does it go?  Can I/should I just
>> pass in group.id as part of the kafkaParams HashMap?
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


HDFS Rest Service not available

2015-06-01 Thread Su She
Hello All,

A bit scared I did something stupid... I killed a few PIDs that were
listening to ports 2183 (Kafka) and 4042 (Spark app); some of the PIDs
didn't even seem to be stopped, as they are still running when I do

lsof -i:[port number]

I'm not sure if the problem started after or before I did these kill
commands, but I now can't connect to HDFS or start Spark. I can't seem
to access Hue. I am afraid I accidentally killed an important process
related to HDFS. But, I am not sure what it would be as I couldn't
even kill the PIDs.

Is it a coincidence that HDFS failed randomly, or is it likely that I killed
an important PID? How can I restart HDFS?

Thanks a lot!

Error on Hue:

Cannot access: /user/ec2-user. The HDFS REST service is not available.
Note: You are a Hue admin but not a HDFS superuser (which is "hdfs").

HTTPConnectionPool(host='ec2-ip-address.us-west-1.compute.amazonaws.com',
port=50070): Max retries exceeded with url:
/webhdfs/v1/user/ec2-user?op=GETFILESTATUS&user.name=hue&doas=ec2-user
(Caused by : [Errno 111] Connection refused)

Error when I try to open spark-shell or a spark app:
java.net.ConnectException: Call From
ip-10-0-2-216.us-west-1.compute.internal/10.0.2.216 to
ip-10-0-2-216.us-west-1.compute.internal:8020 failed on connection
exception: java.net.ConnectException: Connection refused; For more
details see:  http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1415)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy20.getFileInfo(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:744)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy21.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1921)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1089)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:123)
at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
at 
org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
at org.apache.spark.SparkContext.(SparkContext.scala:353)
at 
org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986)
at $iwC$$iwC.(:9)
at $iwC.(:18)
at (:20)
at .(:24)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
at org.apache.spark.repl.SparkILoop.command(SparkIL

Re: Best strategy for Pandas -> Spark

2015-06-01 Thread Davies Liu
The second one sounds reasonable, I think.

On Thu, Apr 30, 2015 at 1:42 AM, Olivier Girardot
 wrote:
> Hi everyone,
> Let's assume I have a complex workflow with more than 10 data sources as input
> and 20 computations (some creating intermediary datasets and some merging
> everything for the final computation), some taking on average 1 minute to
> complete and some taking more than 30 minutes.
>
> What would be, for you, the best strategy to port this to Apache Spark?
>
> 1. Transform the whole flow into a Spark job (PySpark or Scala)
> 2. Transform only part of the flow (the heavy-lifting ~30 min parts) using the
> same language (PySpark)
> 3. Transform only part of the flow and pipe the rest from Scala to Python
>
> Regards,
>
> Olivier.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: does randomSplit return a copy or a reference to the original rdd? [Python]

2015-06-01 Thread Davies Liu
No, all of the RDDs (including those returned from randomSplit()) are read-only.

On Mon, Apr 27, 2015 at 11:28 AM, Pagliari, Roberto
 wrote:
> Suppose I have something like the code below
>
>
> for idx in xrange(0, 10):
>     train_test_split = training.randomSplit(weights=[0.75, 0.25])
>     train_cv = train_test_split[0]
>     test_cv = train_test_split[1]
>     # scale train_cv and test_cv
>
>
> by scaling train_cv and test_cv, will the original data be affected?
>
> Thanks,
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Building Spark for Hadoop 2.6.0

2015-06-01 Thread Mulugeta Mammo
Does this build Spark for hadoop version 2.6.0?

build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean
package

Thanks!


Re: Building Spark for Hadoop 2.6.0

2015-06-01 Thread Ted Yu
Looks good.

-Dhadoop.version is not needed because the profile already defines it.


<profile>
  <id>hadoop-2.6</id>
  <properties>
    <hadoop.version>2.6.0</hadoop.version>

On Mon, Jun 1, 2015 at 5:51 PM, Mulugeta Mammo 
wrote:

> Does this build Spark for hadoop version 2.6.0?
>
> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean
> package
>
> Thanks!
>


GroupBy on RDD returns empty collection

2015-06-01 Thread Malte
I noticed that my Spark jobs suddenly return empty data and tried to find out
why. It seems as if a groupBy operation is the cause of it. When I run 

val original:RDD[Data]
val x = original.cache().groupBy(x=>(x.first,x.last,x.date))

and then try
println(s"${x.first()}")

I get an
Exception in thread "main" java.lang.UnsupportedOperationException: empty
collection

original definitely is not empty. 

I use Spark 1.2.1 on Mesos

any ideas?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GroupBy-on-RDD-returns-empty-collection-tp23105.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: map - reduce only with disk

2015-06-01 Thread Matei Zaharia
As long as you don't use cache(), these operations will go from disk to disk, 
and will only use a fixed amount of memory to build some intermediate results. 
However, note that because you're using groupByKey, that needs the values for 
each key to all fit in memory at once. In this case, if you're going to reduce 
right after, you should use reduceByKey, which will be more efficient.

Matei
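
A minimal sketch of the reduceByKey suggestion, assuming a hypothetical
parseToPairs function that turns each input line into (key, count) pairs:

val pairs = input.flatMap(parseToPairs)          // RDD[(String, Long)]

// groupByKey must hold all values of a key in memory at once before the reduce:
val viaGroupBy = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side, so no key ever needs all of its values at once:
val viaReduceBy = pairs.reduceByKey(_ + _)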

> On Jun 1, 2015, at 2:21 PM, octavian.ganea  wrote:
> 
> Dear all,
> 
> Does anyone know how I can force Spark to use only the disk when doing a
> simple flatMap(..).groupByKey.reduce(_ + _) ? Thank you!
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/map-reduce-only-with-disk-tp23102.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.3.0: how to let Spark history load old records?

2015-06-01 Thread Haopu Wang
When I start the Spark master process, the old records are not shown in
the monitoring UI.

How to show the old records? Thank you very much!
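
One common setup, assuming applications write their event logs to a shared
directory (the paths below are placeholders), is to enable event logging and
point a history server at the same directory:

# spark-defaults.conf
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-events
spark.history.fs.logDirectory    hdfs:///spark-events

# then start the history server and browse past applications on port 18080
./sbin/start-history-server.sh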


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread ๏̯͡๏
Hello Josh,
Are you suggesting storing the source data with LZF compression and using the
same Spark code as is?
Currently it's stored in SequenceFile format and compressed with GZIP.

First line of the data:

(SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'
org.apache.hadoop.io.compress.GzipCodec?v?
)

Regards,
Deepak
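
If Josh's suggestion (quoted below) refers to Spark's internal block/shuffle
compression, which is where the failing snappy call in the stack trace lives,
rather than to the input files, then the change is a one-line configuration
switch; a sketch, where sparkConf stands for the application's existing
SparkConf:

// Spark compresses shuffle and cached blocks with snappy by default.
// Switching the internal codec to LZF sidesteps the snappy decompression failure:
sparkConf.set("spark.io.compression.codec", "lzf")

If that reading is right, the GZIP compression of the input sequence files
would not need to change.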

On Tue, Jun 2, 2015 at 4:16 AM, Josh Rosen  wrote:

> If you can't run a patched Spark version, then you could also consider
> using LZF compression instead, since that codec isn't affected by this bug.
>
> On Mon, Jun 1, 2015 at 3:32 PM, Andrew Or  wrote:
>
>> Hi Deepak,
>>
>> This is a notorious bug that is being tracked at
>> https://issues.apache.org/jira/browse/SPARK-4105. We have fixed one
>> source of this bug (it turns out Snappy had a bug in buffer reuse that
>> caused data corruption). There are other known sources that are being
>> addressed in outstanding patches currently.
>>
>> Since you're using 1.3.1 my guess is that you don't have this patch:
>> https://github.com/apache/spark/pull/6176, which I believe should fix
>> the issue in your case. It's merged for 1.3.2 (not yet released) but not in
>> time for 1.3.1, so feel free to patch it yourself and see if it works.
>>
>> -Andrew
>>
>>
>> 2015-06-01 8:00 GMT-07:00 ÐΞ€ρ@Ҝ (๏̯͡๏) :
>>
>>> Any suggestions ?
>>>
>>> I am using Spark 1.3.1 to read a sequence file stored in SequenceFile
>>> format
>>> (SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v?
>>> )
>>>
>>> with this code and settings
>>> sc.sequenceFile(dwTable, classOf[Text], classOf[Text]).partitionBy(new
>>> org.apache.spark.HashPartitioner(2053))
>>> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>>   .set("spark.kryoserializer.buffer.mb",
>>> arguments.get("buffersize").get)
>>>   .set("spark.kryoserializer.buffer.max.mb",
>>> arguments.get("maxbuffersize").get)
>>>   .set("spark.driver.maxResultSize",
>>> arguments.get("maxResultSize").get)
>>>   .set("spark.yarn.maxAppAttempts", "0")
>>>   //.set("spark.akka.askTimeout", arguments.get("askTimeout").get)
>>>   //.set("spark.akka.timeout", arguments.get("akkaTimeout").get)
>>>   //.set("spark.worker.timeout", arguments.get("workerTimeout").get)
>>>
>>> .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
>>>
>>>
>>> and values are
>>> buffersize=128 maxbuffersize=1068 maxResultSize=200G
>>>
>>>
>>> And i see this exception in each executor task
>>>
>>> FetchFailed(BlockManagerId(278, phxaishdc9dn1830.stratus.phx.ebay.com,
>>> 54757), shuffleId=6, mapId=2810, reduceId=1117, message=
>>>
>>> org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5)
>>>
>>> at
>>> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>>>
>>> at
>>> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>>>
>>> at
>>> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>>>
>>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>
>>> at
>>> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>>>
>>> at
>>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>>
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>
>>> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>>>
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>
>>> at
>>> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)
>>>
>>> at
>>> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:172)
>>>
>>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)
>>>
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>>>
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>>
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>
>>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>>
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> *Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)*
>>>
>>> at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
>>>
>>> at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>>>
>>> at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
>>>
>>> at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
>>>
>>> at
>>> org.xerial.snappy.SnappyInputStream.read

Re: GroupBy on RDD returns empty collection

2015-06-01 Thread Malte
I just ran the same app with limited data on my personal machine - no error.
Seems to be a mesos issue. Will investigate further. If anyone knows
anything, let me know :)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GroupBy-on-RDD-returns-empty-collection-tp23105p23107.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Join highly skewed datasets

2015-06-01 Thread ๏̯͡๏
We use Scoobi + MR to perform joins, and in particular we use the blockJoin() API
of Scoobi:


/** Perform an equijoin with another distributed list where this list is
 * considerably smaller than the right (but too large to fit in memory),
 * and where the keys of right may be particularly skewed. */
def blockJoin[B : WireFormat](right: DList[(K, B)]): DList[(K, (A, B))] =
  Relational.blockJoin(left, right)


I am trying to do a POC, and what Spark join API(s) would be recommended to
achieve something similar?

Please suggest.

-- 
Deepak
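
One pattern that approximates a block join in plain Spark is to replicate the
smaller side and salt the keys of the skewed side, so that a hot key is spread
over several reducers. A rough sketch, assuming left is the smaller pair RDD
and right is the pair RDD with skewed keys (the replication factor is a tuning
knob, and the random salt makes recomputed partitions nondeterministic):

val replication = 16  // tune to the degree of skew

// Replicate every record of the smaller side once per salt value, and tag each
// record of the skewed side with a random salt in the same range.
val saltedLeft  = left.flatMap { case (k, a) => (0 until replication).map(i => ((k, i), a)) }
val saltedRight = right.map    { case (k, b) => ((k, scala.util.Random.nextInt(replication)), b) }

// The join now spreads a hot key across `replication` partitions;
// strip the salt afterwards to recover the original key.
val joined = saltedLeft.join(saltedRight).map { case ((k, _), pair) => (k, pair) }

If the smaller side did fit in memory, a map-side join against a broadcast
variable would avoid the shuffle entirely.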


SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-01 Thread Haopu Wang
Hi,

I'm trying to save SparkSQL DataFrame to a persistent Hive table using
the default parquet data source.

I don't know how to change the replication factor of the generated
parquet files on HDFS.

I tried to set "dfs.replication" on HiveContext but that didn't work.
Any suggestions are appreciated very much!


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Igor Berman
Hi,
Small mock data doesn't reproduce the problem. IMHO the problem is reproduced
when we make the shuffle big enough to spill data to disk.
We will work on it to understand and reproduce the problem (not first
priority though...)


On 1 June 2015 at 23:02, Josh Rosen  wrote:

> How much work is it to produce a small standalone reproduction?  Can you
> create an Avro file with some mock data, maybe 10 or so records, then
> reproduce this locally?
>
> On Mon, Jun 1, 2015 at 12:31 PM, Igor Berman 
> wrote:
>
>> switching to simple pojos instead of avro for spark
>> serialization solved the problem (I mean reading avro from s3 and then
>> mapping each avro object to its pojo serializable counterpart with the same
>> fields; the pojo is registered within kryo)
>> Any thought where to look for a problem/misconfiguration?
>>
>> On 31 May 2015 at 22:48, Igor Berman  wrote:
>>
>>> Hi
>>> We are using spark 1.3.1
>>> Avro-chill (tomorrow I will check if it's important); we register avro
>>> classes from java
>>> Avro 1.7.6
>>> On May 31, 2015 22:37, "Josh Rosen"  wrote:
>>>
 Which Spark version are you using?  I'd like to understand whether this
 change could be caused by recent Kryo serializer re-use changes in master /
 Spark 1.4.

 On Sun, May 31, 2015 at 11:31 AM, igor.berman 
 wrote:

> after investigation the problem is somehow connected to avro
> serialization
> with kryo + chill-avro (mapping avro object to simple scala case class
> and
> running reduce on these case class objects solves the problem)
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092p23093.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>
>
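
A rough sketch of the workaround Igor describes, with a hypothetical Avro
record (avroRdd) and hypothetical field names:

import org.apache.spark.SparkConf

// Plain Scala counterpart of the Avro-generated record, with the same fields.
case class EventPojo(id: String, count: Long)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[EventPojo]))

// Convert each Avro object to the pojo before any shuffle, so only the
// registered case class is serialized by Kryo across the wire.
val events  = avroRdd.map(r => EventPojo(r.getId.toString, r.getCount))
val reduced = events.map(e => (e.id, e.count)).reduceByKey(_ + _)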


Re: flatMap output on disk / flatMap memory overhead

2015-06-01 Thread Akhil Das
You could try rdd.persist(MEMORY_AND_DISK/DISK_ONLY).flatMap(...). I think
StorageLevel MEMORY_AND_DISK means Spark will try to keep the data in
memory, and if there isn't sufficient space then it will be spilled to
disk.

Thanks
Best Regards
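
A concrete form of the suggestion (whether it is the input or the flatMap
output that needs the disk-backed level depends on where memory runs out;
input and expand are placeholders):

import org.apache.spark.storage.StorageLevel

// Persist the input with a level that can spill to disk before expanding it:
val expanded = input.persist(StorageLevel.MEMORY_AND_DISK).flatMap(expand)

// ...or, if it is the flatMap output that overflows, persist that instead:
expanded.persist(StorageLevel.DISK_ONLY)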

On Mon, Jun 1, 2015 at 11:02 PM, octavian.ganea 
wrote:

> Hi,
>
> Is there any way to force the output RDD of a flatMap op to be stored in
> both memory and disk as it is computed? My RAM would not be able to fit
> the
> entire output of flatMap, so it really needs to start using disk after the
> RAM gets full. I didn't find any way to force this.
>
> Also, what is the memory overhead of flatMap? From my computations, the
> output RDD should fit in memory, but I get the following error after a
> while
> (and I know it's because of memory issues, since running the program with
> 1/3 of the input data finishes successfully)
>
> 15/06/01 19:02:49 ERROR BlockFetcherIterator$BasicBlockFetcherIterator:
> Could not get block(s) from
> ConnectionManagerId(dco-node036-mgt.dco.ethz.ch,57478)
> java.io.IOException: sendMessageReliably failed because ack was not
> received
> within 60 sec
> at
>
> org.apache.spark.network.ConnectionManager$$anon$10$$anonfun$run$15.apply(ConnectionManager.scala:866)
> at
>
> org.apache.spark.network.ConnectionManager$$anon$10$$anonfun$run$15.apply(ConnectionManager.scala:865)
> at scala.Option.foreach(Option.scala:236)
> at
>
> org.apache.spark.network.ConnectionManager$$anon$10.run(ConnectionManager.scala:865)
> at
>
> io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
> at
>
> io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
> at
> io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Also, I've seen also this:
> https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
> but my understanding is that one should apply something like:
> rdd.flatMap(...).persist(MEMORY_AND_DISK) which assumes that the entire
> output of flatMap is first stored in memory (which is not possible in my
> case) and, only when it's done, is stored on the disk. Please correct me if
> I'm wrong.  Anyway, I've tried using this, but I got the same error.
>
> My config:
>
> conf.set("spark.cores.max", "128")
> conf.set("spark.akka.frameSize", "1024")
> conf.set("spark.executor.memory", "125g")
> conf.set("spark.shuffle.file.buffer.kb", "1000")
> conf.set("spark.shuffle.consolidateFiles", "true")
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/flatMap-output-on-disk-flatMap-memory-overhead-tp23098.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>