[SPARK-5100][SQL] Spark Thrift server monitor page

2015-01-06 Thread Yi Tian

Hi, all

I have created a JIRA ticket about adding a monitor page for the Thrift server.

https://issues.apache.org/jira/browse/SPARK-5100

Could anyone review the design doc and give some advice?

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Guava 11 dependency issue in Spark 1.2.0

2015-01-06 Thread Niranda Perera
Hi,

I have been running a simple Spark app on a local Spark cluster and came
across this error.

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
    at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
    at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
    at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
    at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
    at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
    at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
    at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
    at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:136)
    at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:114)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:787)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:992)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:945)
    at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:695)
    at com.databricks.spark.avro.AvroRelation.buildScan$lzycompute(AvroRelation.scala:45)
    at com.databricks.spark.avro.AvroRelation.buildScan(AvroRelation.scala:44)
    at org.apache.spark.sql.sources.DataSourceStrategy$.apply(DataSourceStrategy.scala:56)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
    at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
    at org.apache.spark.sql.api.java.JavaSchemaRDD.collect(JavaSchemaRDD.scala:114)


While looking into this I found out that Guava was downgraded to version 11
in this PR:
https://github.com/apache/spark/pull/1610

In this PR, the hashInt call at OpenHashSet.scala:261 was changed to hashLong.
But when I actually run my app, the java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt error still occurs,
which is understandable in itself, because hashInt is not available before Guava 12.

So I'm wondering why this occurs?
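
For what it's worth, here is a minimal standalone reproduction of this kind of
mismatch (the class name is made up; it assumes code compiled against Guava 12+
but run with Guava 11 on the classpath):

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

public class GuavaHashIntRepro {
  public static void main(String[] args) {
    HashFunction hf = Hashing.murmur3_32();
    // hashInt(int) only exists in Guava 12+, so with Guava 11 on the runtime
    // classpath this line throws the same NoSuchMethodError, even though the
    // code compiles cleanly against a newer Guava.
    System.out.println(hf.hashInt(42).asInt());
  }
}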

Cheers
-- 
Niranda Perera


Re: Spark Streaming Data flow graph

2015-01-06 Thread François Garillot
Thanks a LOT for your answer! I've updated the diagram, at the same
address:
https://www.dropbox.com/s/q79taoce2ywdmf1/SparkStreaming.pdf?dl=0

I've addressed your more straightforward remarks directly in the diagram. A
couple of questions:

- the location of instances (Executor, Master, Driver) is now marked; I
hope I didn't make too many mistakes there, did I?

- Given that the communication between instances and their members (e.g.
ReceiverSupervisor / ReceivedBlockHandler) is willingly omitted, have I
forgotten any communication channels?

- I've represented some queues / buffers using a red trapezoid. I'm thus
starting an inventory of queues or buffers, and I'm interested in adding
the 'implicit' ones as well (e.g. jobSets in JobScheduler, which is indexed
by time in ms); a rough sketch of what I mean by an 'implicit' buffer is
below. I'd be happy with pointers on where to look: ideally I'm trying to
see any place in the data flow where data is sitting idle for any length of
time, waiting to be chunked somehow (whether it's at the RDD or block level
doesn't really matter to me, I'm interested in all types of 'chunking').
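
For concreteness, here is the shape of buffer I mean, keyed by batch time in
milliseconds (the names here are mine for illustration, not Spark's actual
internals):

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: per-batch work sits in this map, keyed by batch time
// (ms), from the moment it is submitted until the batch finishes.
public class PendingJobSets {
  private final ConcurrentHashMap<Long, List<Runnable>> jobSets =
      new ConcurrentHashMap<>();

  public void submit(long batchTimeMs, List<Runnable> jobs) {
    jobSets.put(batchTimeMs, jobs);   // data starts "sitting idle" here
  }

  public void completed(long batchTimeMs) {
    jobSets.remove(batchTimeMs);      // removed once the batch is done
  }
}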

Naturally, this is intended to be a developer document exclusively (which is,
in particular, why I'm not publicising this on the user ML).


On Mon, Jan 5, 2015 at 10:57 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 Hey François,

 Well, at a high level, here is what I thought about the diagram.

 - ReceiverSupervisor handles only one Receiver.
 - BlockGenerator is part of ReceiverSupervisor, not ReceivedBlockHandler.
 - The blocks are inserted into BlockManager and, if activated, into
 WriteAheadLogManager in parallel, not through BlockManager as the
 diagram seems to imply.
 - It would be good to have a clean visual separation of what runs in
 Executor (a better term than Worker) and what is in Driver ... Driver
 stuff on the left and Executor stuff on the right, or vice versa.

 More importantly, a word of caution: all the internal stuff
 like ReceivedBlockHandler, Supervisor, etc. is subject to change at any
 time as we keep refactoring. So highlighting these internal
 details too much too publicly may lead to future confusion.

 TD

 On Thu, Dec 18, 2014 at 11:04 AM,  francois.garil...@typesafe.com wrote:
  I’ve been trying to produce an updated box diagram to refresh:
 
 http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617/26
 
 
  … after SPARK-3129 and other switches (a surprising number of
 comments still mention NetworkReceiver).
 
 
  Here’s what I have so far:
  https://www.dropbox.com/s/q79taoce2ywdmf1/SparkStreaming.pdf?dl=0
 
 
  This is not supposed to respect any particular convention (ER, ORM, …).
 Data flow up to right before RDD creation is in bold arrows, metadata flow
 is in normal width arrows.
 
 
  This diagram is still very much a WIP (see the todo below), but I wanted
 to share it to ask:
  - what’s wrong?
  - what are the glaring omissions?
  - how can I make this better (i.e. what should I add first to the
 Todo-list below)?
 
 
  I’ll be happy to share this (including sources) with whoever asks for it.
 
 
  Todo:
  - mark private/public classes
  - mark queues in Receiver, ReceivedBlockHandler, BlockManager
  - mark type of info on transport : e.g. Actor message, ReceivedBlockInfo
 
 
 
  —
  François Garillot




-- 
François Garillot


Reading Data Using TextFileStream

2015-01-06 Thread Jeniba Johnson

Hi Hari,

I am trying to read data from a file which is stored in HDFS. Using Flume the
data is tailed and stored in HDFS.
Now I want to read this data using textFileStream. Using the code below I am
not able to fetch the
data from a file which is stored in HDFS. Can anyone help me with this issue?

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import com.google.common.collect.Lists;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public final class Test1 {
  public static void main(String[] args) throws Exception {

    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
    JavaStreamingContext ssc =
        new JavaStreamingContext("local[4]", "JavaWordCount", new Duration(2));

    JavaDStream<String> textStream =
        ssc.textFileStream("user/huser/user/huser/flume"); // Data Directory Path in HDFS

    JavaDStream<String> suspectedStream = textStream.flatMap(
        new FlatMapFunction<String, String>() {
          public Iterable<String> call(String line) throws Exception {
            // return Arrays.asList(line.toString().toString());
            return Lists.newArrayList(line.toString().toString());
          }
        });

    suspectedStream.foreach(new Function<JavaRDD<String>, Void>() {
      public Void call(JavaRDD<String> rdd) throws Exception {
        List<String> output = rdd.collect();
        System.out.println("Sentences Collected from Flume " + output);
        return null;
      }
    });

    suspectedStream.print();

    System.out.println("Welcome TO Flume Streaming");
    ssc.start();
    ssc.awaitTermination();
  }
}

The command I use is:
./bin/spark-submit --verbose --jars 
lib/spark-examples-1.1.0-hadoop1.0.4.jar,lib/mysql.jar --master local[*] 
--deploy-mode client --class xyz.Test1 bin/filestream3.jar





Regards,
Jeniba Johnson



The contents of this e-mail and any attachment(s) may contain confidential or
privileged information for the intended recipient(s). Unintended recipients are
prohibited from taking action on the basis of information in this e-mail and
using or disseminating the information, and must notify the sender and delete
it from their system. L&T Infotech will not accept responsibility or liability
for the accuracy or completeness of, or the presence of any virus or disabling
code in, this e-mail.


Re: [SPARK-5100][SQL] Spark Thrift server monitor page

2015-01-06 Thread Cheng Lian
I talked with Yi offline. Personally I think this feature is pretty
useful, the design makes sense, and he's already got a running
prototype.


Yi, would you mind opening a PR for this? Thanks!

Cheng

On 1/6/15 5:25 PM, Yi Tian wrote:

Hi, all

I have created a JIRA ticket about adding a monitor page for the Thrift
server.


https://issues.apache.org/jira/browse/SPARK-5100

Could anyone review the design doc and give some advice?






-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Reading Data Using TextFileStream

2015-01-06 Thread Akhil Das
I think you need to start your streaming job first, and then put the files
there to get them read; textFileStream doesn't read pre-existing files, I
believe.

Also, are you sure the path is not the following (i.e. no missing / at the
beginning)?

JavaDStream<String> textStream =
    ssc.textFileStream("/user/huser/user/huser/flume");
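
A minimal sketch of the suggested setup (the fully qualified HDFS URI and the
batch interval here are illustrative assumptions on my part; also note that
Duration is in milliseconds, so new Duration(2) in your code is a 2 ms batch):

// Start the context, then drop new files into the directory;
// textFileStream only picks up files that appear after the job has started.
JavaStreamingContext ssc =
    new JavaStreamingContext("local[4]", "JavaWordCount", new Duration(2000));
JavaDStream<String> textStream =
    ssc.textFileStream("hdfs://namenode:8020/user/huser/user/huser/flume");
ssc.start();
ssc.awaitTermination();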


Thanks
Best Regards

On Wed, Jan 7, 2015 at 9:16 AM, Jeniba Johnson 
jeniba.john...@lntinfotech.com wrote:





Re: jenkins redirect down (but jenkins is up!), lots of potential

2015-01-06 Thread shane knapp
the regular url is working now, thanks for your patience.

On Mon, Jan 5, 2015 at 2:25 PM, Josh Rosen rosenvi...@gmail.com wrote:

 The pull request builder and SCM-polling builds appear to be working fine,
 but the links in pull request comments won't work because the AMP Lab
 webserver is still down.  In the meantime, though, you can continue to
 access Jenkins through https://hadrian.ist.berkeley.edu/jenkins/

 On Mon, Jan 5, 2015 at 10:37 AM, shane knapp skn...@berkeley.edu wrote:

 UC Berkeley had some major maintenance done this past weekend, and long
 story short, not everything came back.  our primary webserver's NFS is
 down
 and that means we're not serving websites, meaning that the redirect to
 jenkins is failing.

 jenkins is still up, and building some jobs, but we will probably see pull
 request builder failures, and other transient issues.  SCM-polling builds
 should be fine.

 there is no ETA on when this will be fixed, but once our
 amplab.cs.berkeley.edu/jenkins redir is working, i will let everyone
 know.
  i'm trying to get more status updates as they come.

 i'm really sorry about the inconvenience.

 shane