[SPARK-5100][SQL] Spark Thrift server monitor page
Hi all, I have created a JIRA ticket about adding a monitor page for the Thrift server: https://issues.apache.org/jira/browse/SPARK-5100 Could anyone review the design doc and give some advice?
Guava 11 dependency issue in Spark 1.2.0
Hi, I have been running a simple Spark app on a local Spark cluster and I came across this error:

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
    at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
    at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
    at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
    at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
    at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
    at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
    at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
    at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:136)
    at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:114)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:787)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:992)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:98)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:945)
    at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:695)
    at com.databricks.spark.avro.AvroRelation.buildScan$lzycompute(AvroRelation.scala:45)
    at com.databricks.spark.avro.AvroRelation.buildScan(AvroRelation.scala:44)
    at org.apache.spark.sql.sources.DataSourceStrategy$.apply(DataSourceStrategy.scala:56)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
    at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
    at org.apache.spark.sql.api.java.JavaSchemaRDD.collect(JavaSchemaRDD.scala:114)

While looking into this I found out that Guava was downgraded to version 11 in this PR:
https://github.com/apache/spark/pull/1610 In this PR, the hashInt call at OpenHashSet.scala:261 was changed to hashLong. But when I actually run my app, the java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt error occurs, which is understandable because hashInt is not available before Guava 12 -- but since that call was supposedly replaced, I'm wondering why this error still occurs. Cheers -- Niranda Perera
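[Not from the original thread, but one quick way to confirm the suspicion is to check which jar actually supplies the Guava class at runtime; if a Guava 11 jar (e.g. one pulled in transitively by Hadoop) shadows a newer one, hashInt(int) simply isn't there, which yields exactly this NoSuchMethodError. A minimal probe, run with the same classpath as the failing app; the class name GuavaProbe is illustrative:]

import com.google.common.hash.HashFunction;

public final class GuavaProbe {
    public static void main(String[] args) {
        // Prints the jar the JVM loaded com.google.common.hash.HashFunction
        // from. If this points at a Guava 11 jar, that jar wins on the
        // classpath and Spark's hashInt call will fail at runtime.
        System.out.println(HashFunction.class
                .getProtectionDomain().getCodeSource().getLocation());
    }
}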
Re: Spark Streaming Data flow graph
Thanks a LOT for your answer! I've updated the diagram, at the same address: https://www.dropbox.com/s/q79taoce2ywdmf1/SparkStreaming.pdf?dl=0

I've addressed your more straightforward remarks directly in the diagram. A couple of questions:
- The location of instances (Executor, Master, Driver) is now marked; I hope I didn't make too many mistakes there, did I?
- Given that the communication between instances and their members (e.g. ReceiverSupervisor / ReceivedBlockHandler) is willingly omitted, have I forgotten any communication channels?
- I've represented some queues/buffers using a red trapezoid, and I'm starting an inventory of queues and buffers, including the 'implicit' ones (e.g. jobSets in JobScheduler, which is indexed by time in ms). I'd be happy with pointers on where to look: ideally I'm trying to see any place in the data flow where data sits idle for any length of time, waiting to be chunked somehow (whether at the RDD or block level doesn't really matter to me; I'm interested in all types of 'chunking').

Naturally, this is intended to be a developer document exclusively (hence, in particular, why I'm not publicising this on the user ML).

On Mon, Jan 5, 2015 at 10:57 PM, Tathagata Das tathagata.das1...@gmail.com wrote:

Hey François, at a high level, here is what I thought about the diagram:
- ReceiverSupervisor handles only one Receiver.
- BlockGenerator is part of ReceiverSupervisor, not ReceivedBlockHandler.
- Blocks are inserted into the BlockManager and, if activated, the WriteAheadLogManager in parallel, not through the BlockManager as the diagram seems to imply.
- It would be good to have a clean visual separation of what runs in the Executor (a better term than Worker) and what runs in the Driver: Driver stuff on the left and Executor stuff on the right, or vice versa.

More importantly, a word of caution: internals like ReceiverBlockHandler, Supervisor, etc. are subject to change at any time as we keep refactoring, so highlighting these internal details too much, too publicly, may lead to future confusion. TD

On Thu, Dec 18, 2014 at 11:04 AM, francois.garil...@typesafe.com wrote:

I've been trying to produce an updated box diagram to refresh http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617/26 after SPARK-3129 and other switches (a surprising number of comments still mention NetworkReceiver). Here's what I have so far: https://www.dropbox.com/s/q79taoce2ywdmf1/SparkStreaming.pdf?dl=0

This is not supposed to respect any particular convention (ER, ORM, ...). Data flow up to right before RDD creation is in bold arrows; metadata flow is in normal-width arrows. This diagram is still very much a WIP (see the todo list below), but I wanted to share it to ask:
- What's wrong?
- What are the glaring omissions?
- How can I make this better (i.e. what should I add first to the todo list below)?

I'll be happy to share this (including sources) with whoever asks for it.

Todo:
- mark private/public classes
- mark queues in Receiver, ReceivedBlockHandler, BlockManager
- mark the type of info on transport: e.g. Actor message, ReceivedBlockInfo

-- François Garillot
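[For readers mapping the diagram's boxes onto code, a minimal custom receiver sketch (1.2-era Java API; the class name and poll interval are illustrative, and blockInterval default per the docs) shows where records enter the flow discussed above: each store() call hands a record to the ReceiverSupervisor, whose BlockGenerator cuts the buffered records into blocks, every spark.streaming.blockInterval (200 ms by default), bound for the BlockManager and, if enabled, the write-ahead log.]

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

// Illustrative only: emits a constant record ten times a second.
public class DemoReceiver extends Receiver<String> {

    public DemoReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }

    @Override
    public void onStart() {
        // Receiving must not block onStart(), so poll from a separate thread.
        new Thread(new Runnable() {
            public void run() {
                while (!isStopped()) {
                    store("tick"); // enters the BlockGenerator's current buffer
                    try {
                        Thread.sleep(100);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        }).start();
    }

    @Override
    public void onStop() {
        // Nothing to clean up: the polling thread exits once isStopped() is true.
    }
}

[Wiring it in with ssc.receiverStream(new DemoReceiver()) yields a DStream whose records have traveled exactly the Receiver -> ReceiverSupervisor -> BlockGenerator -> BlockManager path shown in the diagram.]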
Reading Data Using TextFileStream
Hi Hari, I am trying to read data from a file which is stored in HDFS. Using Flume, the data is tailed and stored in HDFS. Now I want to read this data using textFileStream. With the code below, I am not able to fetch the data from the file stored in HDFS. Can anyone help me with this issue?

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import com.google.common.collect.Lists;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public final class Test1 {
    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
        JavaStreamingContext ssc = new JavaStreamingContext("local[4]", "JavaWordCount", new Duration(2));

        // Data directory path in HDFS
        JavaDStream<String> textStream = ssc.textFileStream("user/huser/user/huser/flume");

        JavaDStream<String> suspectedStream = textStream.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) throws Exception {
                // return Arrays.asList(line.toString());
                return Lists.newArrayList(line.toString());
            }
        });

        suspectedStream.foreach(new Function<JavaRDD<String>, Void>() {
            public Void call(JavaRDD<String> rdd) throws Exception {
                List<String> output = rdd.collect();
                System.out.println("Sentences Collected from Flume " + output);
                return null;
            }
        });

        suspectedStream.print();
        System.out.println("Welcome TO Flume Streaming");
        ssc.start();
        ssc.awaitTermination();
    }
}

The command I use is:

./bin/spark-submit --verbose --jars lib/spark-examples-1.1.0-hadoop1.0.4.jar,lib/mysql.jar --master local[*] --deploy-mode client --class xyz.Test1 bin/filestream3.jar

Regards,
Jeniba Johnson
Re: [SPARK-5100][SQL] Spark Thrift server monitor page
Talked with Yi offline; personally I think this feature is pretty useful, the design makes sense, and he's already got a running prototype. Yi, would you mind opening a PR for this? Thanks! Cheng

On 1/6/15 5:25 PM, Yi Tian wrote:
Hi all, I have created a JIRA ticket about adding a monitor page for the Thrift server: https://issues.apache.org/jira/browse/SPARK-5100 Could anyone review the design doc and give some advice?
Re: Reading Data Using TextFileStream
I think you need to start your streaming job first and then put the files there to get them read; textFileStream doesn't read pre-existing files, I believe. Also, are you sure the path is not the following (no missing / at the beginning)?

JavaDStream<String> textStream = ssc.textFileStream("/user/huser/user/huser/flume");

Thanks
Best Regards

On Wed, Jan 7, 2015 at 9:16 AM, Jeniba Johnson jeniba.john...@lntinfotech.com wrote:
...
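[A hedged addendum to the advice above: textFileStream resolves relative paths against the default filesystem, and it only picks up files that appear in the monitored directory (e.g. are moved in atomically) after the context has started. Assuming the Flume sink writes under /user/huser/flume on HDFS, a fully qualified URI removes the ambiguity; the NameNode host and port below are placeholders:]

// "namenode:8020" is a placeholder for the actual NameNode address; the
// directory must receive new files after ssc.start() for them to be read.
JavaDStream<String> textStream =
        ssc.textFileStream("hdfs://namenode:8020/user/huser/flume");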
Re: jenkins redirect down (but jenkins is up!), lots of potential
the regular url is working now, thanks for your patience.

On Mon, Jan 5, 2015 at 2:25 PM, Josh Rosen rosenvi...@gmail.com wrote:

The pull request builder and SCM-polling builds appear to be working fine, but the links in pull request comments won't work because the AMP Lab webserver is still down. In the meantime, though, you can continue to access Jenkins through https://hadrian.ist.berkeley.edu/jenkins/

On Mon, Jan 5, 2015 at 10:37 AM, shane knapp skn...@berkeley.edu wrote:

UC Berkeley had some major maintenance done this past weekend, and long story short, not everything came back. our primary webserver's NFS is down, and that means we're not serving websites, meaning that the redirect to jenkins is failing. jenkins is still up and building some jobs, but we will probably see pull request builder failures and other transient issues. SCM-polling builds should be fine.

there is no ETA on when this will be fixed, but once our amplab.cs.berkeley.edu/jenkins redirect is working, i will let everyone know. i'm trying to get more status updates as they come. i'm really sorry about the inconvenience.

shane