Apache Kafka Docker official image
Hi,
I'm currently proposing an official image for Apache Kafka in the Docker library (https://github.com/docker-library/official-images/pull/2627). I wanted to know whether someone from Kafka upstream is interested in taking it over, or whether you are OK with me being the maintainer of the image. Let me know, so I can speed up the approval process for the image.

Thanks,
Gianluca Privitera
Re: hive.contrib.serde2.RegexSerDe not found
Try using org.apache.hadoop.hive.serde2.RegexSerDe instead.

GP

On 27 Jul 2015, at 09:35, ZhuGe t...@outlook.com wrote:

Hi all:
I am testing the performance of Hive on Spark SQL. The existing table was created with:

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)',
  'output.format.string' = '%1$s %2$s %3$s %4$s %5$s %16$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s'
)
STORED AS TEXTFILE
LOCATION '/data/BaseData/wx/xx/xx/xx/xx';

When I query the existing table with Spark SQL (spark-shell), I get an exception like this:

Caused by: MetaException(message:java.lang.ClassNotFoundException Class org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:382)
    at org.apache.hadoop.hive.ql.metadata.Partition.getDeserializer(Partition.java:249)

I added the jar dependency to the spark-shell command, but it still does not work:

SPARK_SUBMIT_OPTS=-XX:MaxPermSize=256m ./bin/spark-shell --jars /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/hive-contrib-0.13.1-cdh5.2.0.jar,postgresql-9.2-1004-jdbc41.jar

How should I fix the problem?
Cheers
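A minimal sketch of GP's suggestion applied to the table above, assuming Spark 1.4 with a HiveContext bound to sqlContext and a hypothetical table name my_table; note that, unlike the contrib class, the built-in RegexSerDe only honours input.regex, so output.format.string would no longer apply:

// switch the table to the built-in serde, which is on the classpath by default
sqlContext.sql("ALTER TABLE my_table SET SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'")

The same ALTER TABLE statement could also be run from the Hive CLI if the HiveContext refuses the DDL.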
Spark History Server pointing to S3
The Spark website states, in the "View After the Fact" section (https://spark.apache.org/docs/latest/monitoring.html), that you can point the start-history-server.sh script at a directory in order to view the Web UI using the logs as its data source. Is it possible to point that script to S3? Maybe from an EC2 instance?

Thanks,
Gianluca
Re: Spark History Server pointing to S3
It gives me an exception from org.apache.spark.deploy.history.FsHistoryProvider, a problem with the file system; I can reproduce the exception if you want. It works perfectly if I give it a local path. I tested this on version 1.3.0.

Gianluca

On 16 Jun 2015, at 15:08, Akhil Das ak...@sigmoidanalytics.com wrote:

Not quite sure, but try pointing spark.history.fs.logDirectory to your S3 location.

Thanks
Best Regards

On Tue, Jun 16, 2015 at 6:26 PM, Gianluca Privitera gianluca.privite...@studio.unibo.it wrote:

The Spark website states, in the "View After the Fact" section (https://spark.apache.org/docs/latest/monitoring.html), that you can point the start-history-server.sh script at a directory in order to view the Web UI using the logs as its data source. Is it possible to point that script to S3? Maybe from an EC2 instance?

Thanks,
Gianluca
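A rough sketch of Akhil's suggestion, assuming Spark 1.3+, a hypothetical bucket my-bucket, and that the Hadoop S3 filesystem jars and AWS credentials are already on the history server's classpath (without them FsHistoryProvider fails with exactly this kind of file system error):

# append to conf/spark-defaults.conf, then (re)start the history server
echo "spark.history.fs.logDirectory s3n://my-bucket/spark-event-logs" >> conf/spark-defaults.conf
./sbin/start-history-server.sh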
Spark Streaming w/ tshark exception problem on EC2
Hi,
I've got a problem with Spark Streaming and tshark. While running locally I have no problems with this code, but when I run it on an EC2 cluster I get the exception shown just below it.

def dissection(s: String): Seq[String] = {
  try {
    Process("hadoop …").!                       // calls hadoop to copy a file from S3 locally, to ./localcopy.tmp
    val pb = Process("tshark … localcopy.tmp")  // calls tshark to turn the local copy into a sequence of strings
    pb.lines_!.toSeq
  } catch {
    case e: Exception =>
      System.err.println("ERROR")
      new MutableList[String]()
  }
}

(Main.scala line 2051 in the trace below points to the function "dissection".)

WARN scheduler.TaskSetManager: Loss was due to java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
    at Main$$anonfun$11.apply(Main.scala:2051)
    at Main$$anonfun$11.apply(Main.scala:2051)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Has anyone got an idea why that may happen? I'm pretty sure the hadoop call works perfectly.

Thanks
Gianluca
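Not a fix, just a hedged debugging step: an ExceptionInInitializerError inside a closure often means a static initializer failed on the worker JVMs, for instance because tshark is not installed there even though it is locally. Assuming a running SparkContext sc, this sketch checks whichever executors pick up tasks:

import scala.sys.process._
// spread trivial tasks across the cluster and shell out on each of them
val allHaveTshark = sc.parallelize(1 to 100, 100)
  .map(_ => Seq("which", "tshark").! == 0)  // exit code 0 means the binary was found
  .reduce(_ && _)
println("tshark found on every executor that ran a task: " + allHaveTshark)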
Re: Where Can I find the full documentation for Spark SQL?
You can find something in the API docs; nothing more than that for now, I think.

Gianluca

On 25 Jun 2014, at 23:36, guxiaobo1982 guxiaobo1...@qq.com wrote:

Hi,
I want to know the full list of functions, syntax, and features that Spark SQL supports. Is there any documentation?

Regards,
Xiaobo Gu
Re: Access DStream content?
You can use foreachRDD and then access the RDD's data. Hope this works for you.

Gianluca

On 12 Jun 2014, at 10:06, Wolfinger, Fred fwolfin...@cyberpointllc.com wrote:

Good morning. I have a question related to Spark Streaming. I have reduced some data down to a simple count value (by window), and would like to take immediate action on that value before storing it in HDFS. I don't see any DStream member functions that would allow me to access its contents. Is what I am trying to do not in the spirit of Spark Streaming? If it is not, what would be the best practice for doing so?

Thanks so much!
Fred
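A minimal sketch of the foreachRDD approach, assuming a DStream[Long] named counts that holds one count value per window (the name is an assumption):

counts.foreachRDD { rdd =>
  // collect() is safe here because each window reduces to a single small value
  rdd.collect().foreach { c =>
    println("count for this window: " + c)
    // take immediate action on c here, then store it in HDFS as before
  }
}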
Re: Increase storage.MemoryStore size
If you are launching your application with spark-submit, you can manually edit the spark-class file to make 1g the baseline. It's pretty easy to do, and easy to figure out how once you open the file. This worked for me, even if it's not a final solution of course.

Gianluca

On 12 Jun 2014, at 15:16, ericjohnston1989 ericjohnston1...@gmail.com wrote:

Hey everyone,
I'm having some trouble increasing the default storage size for a broadcast variable. It looks like it defaults to a little less than 512MB every time, and I can't figure out which configuration to change to increase this.

INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 426.5 MB, free 64.2 MB)

(I'm seeing this in the terminal on my driver computer.) I can change spark.executor.memory, and that seems to increase the amount of RAM available on my nodes, but it doesn't seem to adjust this storage size for my broadcast variables. Any ideas?

Thanks,
Eric
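A hedged alternative to editing spark-class, assuming Spark 1.0's spark-submit and a hypothetical application class and jar: the ~512 MB MemoryStore reported on the driver is sized from the driver JVM's heap, so raising the driver (and executor) memory is what moves that number.

# flag values here are examples only
./bin/spark-submit --driver-memory 2g --executor-memory 2g --class MyApp my-app.jar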
Spark Streaming application not working on EC2 Cluster
Hi,
I think I may have encountered some kind of bug that at the moment prevents my application from running correctly on an EC2 cluster. I say that because the exact same code works wonderfully locally but behaves really strangely on the cluster.

val uri = ssc.textFileStream(args(1) + "/inputData/newData/")
uri.print()                                            // prints the data perfectly
uri.saveAsTextFiles(args(1) + "/uri/textFiles/", "")   // saves the files as intended

val downloaded = uri.map(s => download(s))
  .flatMap(t => t)
  .map(t => createKey(t))
val downloadedAndFiltered = downloaded.filter(t => filterEcho(t))

downloadedAndFiltered.map(t => t._2)
  .saveAsTextFiles(args(1) + "/dissected/textFiles/", "")  // saves an empty file (why???)
downloadedAndFiltered.print()                              // prints perfectly
// from now on all data gets lost: any further call on the
// downloadedAndFiltered DStream receives empty input

I have no idea what breaks my application. Does anyone know if this is a known bug?

Gianluca
Re: Spark Streaming, download a s3 file to run a script shell on it
Where are the APIs for QueueStream and RDDQueue? In my solution I cannot open a DStream directly on the S3 location, because I need to run a script on the file (a script that unluckily doesn't accept stdin as input), so I somehow have to download it to my disk and handle it from there before creating the stream.

Thanks
Gianluca

On 06/06/2014 02:19, Mayur Rustagi wrote:

You can look at creating a DStream directly from the S3 location using a file stream. If you want to apply any specific logic, you can rely on QueueStream: read the data yourself from S3, process it, and push it into the RDD queue.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi (https://twitter.com/mayur_rustagi)

On Fri, Jun 6, 2014 at 3:00 AM, Gianluca Privitera gianluca.privite...@studio.unibo.it wrote:

Hi,
I've got a weird question, but maybe someone has already dealt with it. My Spark Streaming application needs to:
- download a file from an S3 bucket,
- run a script with the file as input,
- create a DStream from this script's output.

I've already got the second part done with the rdd.pipe() API, which really fits my request, but I have no idea how to manage the first part. How can I download a file and run a script on it inside a Spark Streaming application? Should I use process() from Scala, or won't that work?

Thanks
Gianluca
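The queue-based API lives on StreamingContext as queueStream. A minimal sketch of Mayur's suggestion, assuming an existing SparkContext sc and a hypothetical download(path): Seq[String] helper that fetches the S3 object and runs the script on it:

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val rddQueue = new mutable.Queue[RDD[String]]()
ssc.queueStream(rddQueue).print()  // the DStream is backed by whatever RDDs you enqueue
ssc.start()
// feed the queue from the driver as new S3 objects arrive (bucket/key are assumptions)
rddQueue += sc.parallelize(download("s3://my-bucket/my-key"))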
Spark Streaming window functions bug 1.0.0
Is anyone experiencing problems with windows?

dstream1.print()
val dstream2 = dstream1.groupByKeyAndWindow(Seconds(60))
dstream2.print()

In my application the first print() prints out all the strings and their keys, but after the window function everything is lost and nothing gets printed. I'm using Spark version 1.0.0 on an EC2 cluster.

Thanks
Gianluca
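Not a confirmed fix, just a hedged sanity check: window and slide durations must be multiples of the source DStream's batch interval, and setting a checkpoint directory is a cheap defensive addition (the 10-second batch interval and the path below are assumptions):

// batch interval assumed to be Seconds(10), so Seconds(60) divides evenly
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")
val dstream2 = dstream1.groupByKeyAndWindow(Seconds(60), Seconds(60))
dstream2.print()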
Spark Streaming, download a s3 file to run a script shell on it
Hi,
I've got a weird question, but maybe someone has already dealt with it. My Spark Streaming application needs to:
- download a file from an S3 bucket,
- run a script with the file as input,
- create a DStream from this script's output.

I've already got the second part done with the rdd.pipe() API, which really fits my request, but I have no idea how to manage the first part. How can I download a file and run a script on it inside a Spark Streaming application? Should I use process() from Scala, or won't that work?

Thanks
Gianluca
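A rough sketch of the download-then-run step, assuming the hadoop CLI is installed on the workers, an RDD[String] of S3 paths named paths, and a hypothetical script (/path/to/script.sh) that takes a local file path as its argument:

import scala.sys.process._
val output = paths.flatMap { s3Path =>
  val local = "/tmp/" + s3Path.split("/").last
  Seq("hadoop", "fs", "-copyToLocal", s3Path, local).!  // fetch the S3 object to local disk
  Seq("/path/to/script.sh", local).lines_!              // run the script on the local copy
}
output.take(5).foreach(println)  // peek at the first few script output lines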
EC2 Simple Cluster
Hi everyone,
I would like to set up a very simple Spark cluster on EC2 (specifically using only 2 micro instances) and make it run a simple Spark Streaming application I created. Has anyone actually managed to do that? After launching the scripts from this page: http://spark.apache.org/docs/0.9.1/ec2-scripts.html and logging into the master node, I cannot find the spark folder the page talks about, so I suppose the launch didn't go well.

Thank you
Gianluca
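For comparison, a hedged sketch of the 0.9.1 launch commands, assuming a hypothetical key pair named my-key; the spark-ec2 script installs Spark under /root/spark on the master, so that is where the folder should appear after a successful launch (the explicit -t flag is an assumption, and micro instances may be too small for the default AMI):

# run from the ec2/ directory of the Spark distribution
./spark-ec2 -k my-key -i my-key.pem -s 2 -t t1.micro launch my-cluster
./spark-ec2 -k my-key -i my-key.pem login my-cluster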