Apache Kafka Docker official image

2017-02-16 Thread Gianluca Privitera
Hi, 
I’m currently proposing an official image for Apache Kafka in the Docker
library (https://github.com/docker-library/official-images/pull/2627).
I wanted to know if someone from Kafka upstream is interested in taking over, or if
you are OK with me being the maintainer of the image.

Let me know, so I can speed up the image approval process.

Thanks

Gianluca Privitera

Re: hive.contrib.serde2.RegexSerDe not found

2015-07-28 Thread Gianluca Privitera
Try using org.apache.hadoop.hive.serde2.RegexSerDe instead.
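A rough sketch of switching the existing table over from the Spark shell (the table
name is a placeholder; this assumes a HiveContext connected to your metastore):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// the built-in RegexSerDe ships with hive-serde/hive-exec, so it does not need the
// hive-contrib jar on the classpath
hiveContext.sql(
  "ALTER TABLE my_table SET SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'")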

GP

On 27 Jul 2015, at 09:35, ZhuGe t...@outlook.com wrote:

Hi all:
I am testing the performance of Hive on Spark SQL.
The existing table is created with
ROW FORMAT
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
( 'input.regex' = 
'(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)\\|\\^\\|(.*?)'
,'output.format.string' = '%1$s %2$s %3$s %4$s %5$s %16$s %7$s %8$s %9$s %10$s 
%11$s %12$s %13$s %14$s %15$s %16$s %17$s ')
STORED AS TEXTFILE
location '/data/BaseData/wx/xx/xx/xx/xx';

When I use Spark SQL (spark-shell) to query the existing table, I get an exception
like this:
Caused by: MetaException(message:java.lang.ClassNotFoundException Class 
org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:382)
at 
org.apache.hadoop.hive.ql.metadata.Partition.getDeserializer(Partition.java:249)

I added the jar dependency in the spark-shell command, but it still does not work.
SPARK_SUBMIT_OPTS=-XX:MaxPermSize=256m ./bin/spark-shell --jars 
/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/hive-contrib-0.13.1-cdh5.2.0.jar,postgresql-9.2-1004-jdbc41.jar

How should I fix the problem?
Cheers



Spark History Server pointing to S3

2015-06-16 Thread Gianluca Privitera
On the Spark website, the View After the Fact section
(https://spark.apache.org/docs/latest/monitoring.html) states that you can point the
start-history-server.sh script to a directory in order to view the Web UI using
the logs as the data source.

Is it possible to point that script to S3? Maybe from an EC2 instance?

Thanks,

Gianluca



Re: Spark History Server pointing to S3

2015-06-16 Thread Gianluca Privitera
It gives me an exception from org.apache.spark.deploy.history.FsHistoryProvider,
a problem with the file system. I can reproduce the exception if you want.
It works perfectly if I give it a local path; I tested it on version 1.3.0.

Gianluca

On 16 Jun 2015, at 15:08, Akhil Das ak...@sigmoidanalytics.com wrote:

Not quite sure, but try pointing spark.history.fs.logDirectory to your S3 location
(for example something like s3n://your-bucket/spark-event-logs, assuming the S3
filesystem jars and credentials are available to the history server).

Thanks
Best Regards

On Tue, Jun 16, 2015 at 6:26 PM, Gianluca Privitera
gianluca.privite...@studio.unibo.it wrote:
On the Spark website, the View After the Fact section
(https://spark.apache.org/docs/latest/monitoring.html) states that you can point the
start-history-server.sh script to a directory in order to view the Web UI using
the logs as the data source.

Is it possible to point that script to S3? Maybe from an EC2 instance?

Thanks,

Gianluca





Spark Streaming w/ tshark exception problem on EC2

2014-07-15 Thread Gianluca Privitera
Hi,
I’ve got a problem with Spark Streaming and tshark.
While running locally I have no problems with this code, but when I run it on an
EC2 cluster I get the exception shown just below the code.

import scala.collection.mutable.MutableList
import scala.sys.process._

def dissection(s: String): Seq[String] = {
  try {
    // calls hadoop to copy a file from S3 locally
    Process("hadoop command to create ./localcopy.tmp").!
    // calls tshark to turn the local copy into a sequence of strings
    val pb = Process("tshark ... localcopy.tmp")
    pb.lines_!.toSeq
  } catch {
    case e: Exception =>
      System.err.println("ERROR")
      new MutableList[String]()
  }
}

(line 2051 points to the function “dissection”)

WARN scheduler.TaskSetManager: Loss was due to 
java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
at Main$$anonfun$11.apply(Main.scala:2051)
at Main$$anonfun$11.apply(Main.scala:2051)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Has anyone got an idea why that may happen? I’m pretty sure the Hadoop call works
perfectly.

Thanks
Gianluca


Re: Where Can I find the full documentation for Spark SQL?

2014-06-25 Thread Gianluca Privitera
You can find something in the API docs; nothing more than that for now, I think.

Gianluca 

On 25 Jun 2014, at 23:36, guxiaobo1982 guxiaobo1...@qq.com wrote:

 Hi,
 
 I want to know the full list of functions, syntax, and features that Spark SQL 
 supports. Is there any documentation?
 
 
 Regards,
 
 Xiaobo Gu



Re: Access DStream content?

2014-06-12 Thread Gianluca Privitera
You can use foreachRDD and then access the RDD data directly.
Hope this works for you.
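A minimal sketch of what I mean, assuming your reduced counts live in a DStream[Long]
(the name countsByWindow is hypothetical):

// countsByWindow stands for your per-window count DStream
countsByWindow.foreachRDD { rdd =>
  // each window has already been reduced to a single value, so collect() is cheap here
  rdd.collect().foreach { count =>
    // take your immediate action on the value before (or instead of) writing it to HDFS
    println(s"count for this window: $count")
  }
}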

Gianluca

On 12 Jun 2014, at 10:06, Wolfinger, Fred fwolfin...@cyberpointllc.com wrote:

Good morning.

I have a question related to Spark Streaming. I have reduced some data down to 
a simple count value (by window), and would like to take immediate action on 
that value before storing in HDFS. I don't see any DStream member functions 
that would allow me to access its contents. Is what I am trying to do not in 
the spirit of Spark Streaming? If it's not, what would be the best practice for 
doing so?

Thanks so much!

Fred




Re: Increase storage.MemoryStore size

2014-06-12 Thread Gianluca Privitera
If you are launching your application with spark-submit, you can manually edit the
spark-class file to raise the baseline to 1g.
It’s pretty easy to figure out how once you open the file.
This worked for me, even though it’s not a final solution of course.
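If editing spark-class feels too hacky, a configuration sketch along these lines may
also work (Spark 1.x property names; the values are placeholders to tune for your
cluster):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("BroadcastExample")              // hypothetical app name
  .set("spark.executor.memory", "2g")          // heap per executor
  .set("spark.storage.memoryFraction", "0.6")  // share of that heap the MemoryStore may use
val sc = new SparkContext(conf)

// Note: the driver-side MemoryStore is sized from the JVM heap that is already
// running, so for the driver the memory has to be raised before launch (for example
// with spark-submit's --driver-memory), which is what the spark-class edit achieves.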

Gianluca

On 12 Jun 2014, at 15:16, ericjohnston1989 ericjohnston1...@gmail.com wrote:

 Hey everyone,
 
 I'm having some trouble increasing the default storage size for a broadcast
 variable. It looks like it defaults to a little less than 512MB every time,
 and I can't figure out which configuration to change to increase this.
 
 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory
 (estimated size 426.5 MB, free 64.2 MB)
 
 (I'm seeing this in the terminal on my driver computer)
 
 I can change spark.executor.memory, and that seems to increase the amount
 of RAM available on my nodes, but it doesn't seem to adjust this storage
 size for my broadcast variables. Any ideas?
 
 Thanks,
 
 Eric
 
 
 



Spark Streaming application not working on EC2 Cluster

2014-06-09 Thread Gianluca Privitera
Hi,
I think I may have encountered some kind of bug that currently prevents my
application from running correctly on an EC2 cluster.
I'm saying that because the exact same code works wonderfully locally but behaves
really strangely on the cluster.
val uri = ssc.textFileStream(args(1) + "/inputData/newData/")
uri.print() // prints the data perfectly
uri.saveAsTextFiles(args(1) + "/uri/textFiles/", "") // saves the files as intended
val downloaded = uri.map(s => download(s))
  .flatMap(t => t).map(t => createKey(t))
val downloadedAndFiltered = downloaded.filter(t => filterEcho(t))
downloadedAndFiltered.map(t => t._2).saveAsTextFiles(args(1) + "/dissected/textFiles/", "")
// saves an empty file (why???)
downloadedAndFiltered.print() // prints perfectly
// from here on all data gets lost: any further call on the downloadedAndFiltered
// DStream receives empty input

I have no idea what could be breaking my application. Does anyone know if there is
some known bug?

Gianluca


Re: Spark Streaming, download a s3 file to run a script shell on it

2014-06-06 Thread Gianluca Privitera

Where are the APIs for queueStream and the RDD queue?
In my solution I cannot open a DStream on an S3 location, because I need 
to run a script on the file (a script that unfortunately doesn't accept 
stdin as input), so I have to download it to my disk somehow and then handle 
it from there before creating the stream.


Thanks
Gianluca

On 06/06/2014 02:19, Mayur Rustagi wrote:
You can look into creating a DStream directly from the S3 location using a file 
stream. If you want to apply any specific logic, you can rely on 
queueStream: read the data yourself from S3, process it and push it into 
the RDD queue.
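A rough sketch of that approach (the batch interval, the script step and all names
are placeholders):

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val rddQueue = new mutable.Queue[RDD[String]]()

// the resulting DStream pulls one RDD from the queue per batch interval
val stream = ssc.queueStream(rddQueue)
stream.print()
ssc.start()

// in your own loop: download the file from S3, run the script on it,
// then push the script output into the queue as an RDD
val scriptOutput: Seq[String] = Seq("line1", "line2") // placeholder for the real output
rddQueue += sc.parallelize(scriptOutput)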


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Fri, Jun 6, 2014 at 3:00 AM, Gianluca Privitera 
gianluca.privite...@studio.unibo.it wrote:


Hi,
I've got a weird question, but maybe someone has already dealt with it.
My Spark Streaming application needs to:
- download a file from an S3 bucket,
- run a script with the file as input,
- create a DStream from this script's output.
I've already got the second part done with the rdd.pipe() API, which
really fits my needs, but I have no idea how to manage the first part.
How can I manage to download a file and run a script on it
inside a Spark Streaming application?
Should I use process() from Scala, or won't that work?

Thanks
Gianluca






Spark Streaming window functions bug 1.0.0

2014-06-06 Thread Gianluca Privitera

Is anyone else experiencing problems with window operations?

dstream1.print()
val dstream2 = dstream1.groupByKeyAndWindow(Seconds(60))
dstream2.print()

In my application the first print() prints out all the strings and 
their keys, but after the window function everything is lost and 
nothing gets printed.

I'm using Spark version 1.0.0 on an EC2 cluster.

Thanks
Gianluca


Spark Streaming, download a s3 file to run a script shell on it

2014-06-05 Thread Gianluca Privitera

Hi,
I've got a weird question, but maybe someone has already dealt with it.
My Spark Streaming application needs to:
- download a file from an S3 bucket,
- run a script with the file as input,
- create a DStream from this script's output.
I've already got the second part done with the rdd.pipe() API, which really 
fits my needs (a rough sketch of that part is below), but I have no idea how 
to manage the first part.
How can I manage to download a file and run a script on it inside a 
Spark Streaming application?

Should I use process() from Scala, or won't that work?
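For reference, the rdd.pipe() part looks roughly like this (the script path and the
input RDD are placeholders):

import org.apache.spark.rdd.RDD

// inputRdd stands for the records I want to feed to the script
val inputRdd: RDD[String] = sc.parallelize(Seq("record1", "record2"))

// pipe() runs the external command once per partition, writes the partition's
// elements to its stdin and turns its stdout lines into a new RDD[String]
val scriptOutput: RDD[String] = inputRdd.pipe("/path/to/my_script.sh")
scriptOutput.take(5).foreach(println)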

Thanks
Gianluca



EC2 Simple Cluster

2014-06-02 Thread Gianluca Privitera

Hi everyone,
I would like to set up a very simple Spark cluster on EC2 (specifically, using 
only 2 micro instances) and run a simple Spark Streaming application I created 
on it.

Has anyone actually managed to do that?
I ask because, after launching the scripts from this page: 
http://spark.apache.org/docs/0.9.1/ec2-scripts.html and logging into the 
master node, I cannot find the spark folder the page talks about, 
so I suppose the launch didn't go well.


Thank you
Gianluca