Re: S3 Zip File Loading Advice

2016-03-08 Thread Hemant Bhanawat
https://issues.apache.org/jira/browse/SPARK-3586 talks about creating a file DStream which can monitor for new files recursively, but this functionality has not yet been added. I don't see an easy way out. You will have to create your folders based on timeline (looks like you are already doing that) and

Re: updating the Books section on the Spark documentation page

2016-03-08 Thread Jan Štěrba
You could try creating a pull-request on github. -Jan -- Jan Sterba https://twitter.com/honzasterba | http://flickr.com/honzasterba | http://500px.com/honzasterba On Wed, Mar 9, 2016 at 2:45 AM, Mohammed Guller wrote: > Hi - > > > > The Spark documentation page

Re: Saving multiple outputs in the same job

2016-03-08 Thread Jan Štěrba
Hi Andy, it's nice to see that we are not the only ones with the same issues. So far we have not gone as far as you have. What we have done is that we cache whatever dataframes/RDDs are shared for computing different outputs. This has brought us quite a speedup, but we still see that saving some

RE: How to add a custom jar file to the Spark driver?

2016-03-08 Thread Wang, Daoyuan
Hi Gerhard, How does EMR set its conf for Spark? I think if you set SPARK_CLASSPATH and spark.driver.extraClassPath, Spark would ignore SPARK_CLASSPATH. I think you can do this by reading the configuration from SparkConf, then adding your custom settings to the corresponding key, and using the
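
A sketch of the read-then-append approach described above (the jar path is illustrative; spark.driver.extraClassPath is the documented Spark key):

    import org.apache.spark.SparkConf

    // Read whatever EMR already placed on the driver classpath and append to
    // it instead of overwriting it. The jar path below is a placeholder.
    val conf = new SparkConf()
    val existing = conf.get("spark.driver.extraClassPath", "")
    conf.set("spark.driver.extraClassPath", existing + ":/home/hadoop/custom.jar")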

reading the parquet file

2016-03-08 Thread Angel Angel
Hello Sir/Madam, I am writing a Spark application in Spark 1.4.0. I have one text file with a size of 8 GB. I save that file in parquet format: val df2 = sc.textFile("/root/Desktop/database_200/database_200.txt").map(_.split(",")).map(p => Table(p(0),p(1).trim.toInt, p(2).trim.toInt,
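
A hedged completion of the truncated snippet above, assuming a three-column Table case class and an illustrative output path (sc is the usual shell SparkContext):

    import org.apache.spark.sql.SQLContext

    case class Table(col1: String, col2: Int, col3: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df2 = sc.textFile("/root/Desktop/database_200/database_200.txt")
      .map(_.split(","))
      .map(p => Table(p(0), p(1).trim.toInt, p(2).trim.toInt))
      .toDF()
    df2.write.parquet("/root/Desktop/database_200/database_200.parquet")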

Re: Spark Scheduler creating Straggler Node

2016-03-08 Thread Prabhu Joseph
I don't just want to replicate all cached blocks. I am trying to find a way to solve the issue which I mentioned in the mail above. Having replicas for all cached blocks will add more cost for customers. On Wed, Mar 9, 2016 at 9:50 AM, Reynold Xin wrote: > You just want to be

Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread Saurabh Bajaj
You can call *foreachRDD*(*func*) on the output from the final stage, then check the time; if it's the 15th minute of an hour you flush the output to the DB, else you don't. Let me know if that approach works. On Tue, Mar 8, 2016 at 2:10 PM, ayan guha wrote: > Yes if it falls
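
A minimal sketch of that approach (reducedStream and writeToDb are hypothetical names):

    // Flush to the database only when the batch time lands on the 15th minute
    // of the hour; otherwise do nothing.
    reducedStream.foreachRDD { (rdd, time) =>
      val minuteOfHour = (time.milliseconds / 60000) % 60
      if (minuteOfHour == 15) {
        rdd.foreachPartition(_.foreach(writeToDb)) // writeToDb is a placeholder sink
      }
    }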

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Ted Yu
From cassandra.yaml: native_transport_port: 9042 FYI On Tue, Mar 8, 2016 at 9:13 PM, Saurabh Bajaj wrote: > Hi Andy, > > I believe you need to set the host and port settings separately > spark.cassandra.connection.host > spark.cassandra.connection.port > >

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Saurabh Bajaj
Hi Andy, I believe you need to set the host and port settings separately spark.cassandra.connection.host spark.cassandra.connection.port https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-connection-parameters Looking at the logs, it seems your port
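
For reference, the two documented connector keys expressed as SparkConf entries (the host and port values mirror the ones in this thread):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "192.168.1.126")
      .set("spark.cassandra.connection.port", "9042")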

S3 Zip File Loading Advice

2016-03-08 Thread Benjamin Kim
I am wondering if anyone can help. Our company stores zipped CSV files in S3, which has been a big headache from the start. I was wondering if anyone has created a way to iterate through several subdirectories (s3n://events/2016/03/01/00, s3n://events/2016/03/01/01, etc.) in S3 to find the newest
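
One possible workaround (my suggestion, not something proposed in the thread): Hadoop glob patterns let a single textFile call cover many time-partitioned prefixes at once:

    // All hours of one day, or a selected subset of hours.
    val day   = sc.textFile("s3n://events/2016/03/01/*")
    val hours = sc.textFile("s3n://events/2016/03/01/{00,01}")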

Re: Spark Scheduler creating Straggler Node

2016-03-08 Thread Reynold Xin
You just want to be able to replicate hot cached blocks right? On Tuesday, March 8, 2016, Prabhu Joseph wrote: > Hi All, > > When a Spark Job is running, and one of the Spark Executor on Node A > has some partitions cached. Later for some other stage, Scheduler

Spark Scheduler creating Straggler Node

2016-03-08 Thread Prabhu Joseph
Hi All, When a Spark job is running, one of the Spark executors on Node A has some partitions cached. Later, for some other stage, the scheduler tries to assign a task to Node A to process a cached partition (PROCESS_LOCAL). But meanwhile Node A is occupied with some other tasks and got
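
One knob commonly tuned for exactly this symptom (an assumption on my part, not a fix proposed in the thread): lowering spark.locality.wait so the scheduler gives up on PROCESS_LOCAL sooner and runs the task elsewhere:

    import org.apache.spark.SparkConf

    // Default is 3s; a lower value trades data locality for less queueing
    // behind busy executors that hold the cached blocks.
    val conf = new SparkConf()
      .set("spark.locality.wait", "1s")
      .set("spark.locality.wait.process", "1s")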

Re: Confusing RDD function

2016-03-08 Thread Hemminger Jeff
Thank you, yes that makes sense. I was aware of transformations and actions, but did not realize foreach was an action. I've found the exhaustive list here http://spark.apache.org/docs/latest/programming-guide.html#actions and it's clear to me again. Thank you for your help! On Wed, Mar 9, 2016

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Andy Davidson
Hi Ted, I believe by default Cassandra listens on 9042. From: Ted Yu Date: Tuesday, March 8, 2016 at 6:11 PM To: Andrew Davidson Cc: "user @spark" Subject: Re: pyspark spark-cassandra-connector java.io.IOException:

Re: pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Ted Yu
Have you contacted the spark-cassandra-connector mailing list? I wonder where the port 9042 came from. Cheers On Tue, Mar 8, 2016 at 6:02 PM, Andy Davidson wrote: > > I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a python > notebook that reads

pyspark spark-cassandra-connector java.io.IOException: Failed to open native connection to Cassandra at {192.168.1.126}:9042

2016-03-08 Thread Andy Davidson
I am using spark-1.6.0-bin-hadoop2.6. I am trying to write a python notebook that reads a data frame from Cassandra. I connect to Cassandra using an SSH tunnel running on port 9043. CQLSH works; however, I cannot figure out how to configure my notebook. I have tried various hacks; any idea what I

updating the Books section on the Spark documentation page

2016-03-08 Thread Mohammed Guller
Hi - The Spark documentation page (http://spark.apache.org/documentation.html) has links to books covering Spark. What is the process for adding a new book to that list? Thanks, Mohammed Author: Big Data Analytics with

Confusing RDD function

2016-03-08 Thread Hemminger Jeff
I'm currently developing a Spark Streaming application. I have a function that receives an RDD and an object instance as a parameter, and returns an RDD: def doTheThing(a: RDD[A], b: B): RDD[C] Within the function, I do some processing within a map of the RDD. Like this: def doTheThing(a:
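
A reduced sketch of the shape being described, with the types simplified to Int; the point made in the reply earlier in this digest is that map is a lazy transformation, and nothing runs until an action such as foreach is invoked:

    import org.apache.spark.rdd.RDD

    def doTheThing(a: RDD[Int], b: Int): RDD[Int] =
      a.map(_ + b) // transformation: recorded, not executed

    // val out = doTheThing(sc.parallelize(1 to 10), 5)
    // out.foreach(println) // action: the job actually runs here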

Saving multiple outputs in the same job

2016-03-08 Thread Andy Sloane
We have a somewhat complex pipeline which has multiple output files on HDFS, and we'd like the materialization of those outputs to happen concurrently. Internal to Spark, any "save" call creates a new "job", which runs synchronously -- that is, the line of code after your save() executes once the
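
One common workaround (an assumption, not necessarily what this team settled on): submit each save from its own thread, since the scheduler can run independent jobs concurrently. df1/df2 and the output paths are placeholders:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Each write is a separate job; launched from separate threads, they overlap.
    val f1 = Future { df1.write.parquet("hdfs:///out/a") }
    val f2 = Future { df2.write.parquet("hdfs:///out/b") }
    Await.result(f1, Duration.Inf) // wait for both before stopping the context
    Await.result(f2, Duration.Inf)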

Re: Hive Context: Hive Metastore Client

2016-03-08 Thread Alex
I agree it is a useful layer. During my investigations into individual user connections from a Spark application, I ran some tests with HiveServer2; using Beeline I was able to authenticate the users passed in correctly, but when it came down to authorizing the queries on the

Re: Hive Context: Hive Metastore Client

2016-03-08 Thread Mich Talebzadeh
The current scenario resembles a three-tier architecture, but without the security of the second tier. In a typical three-tier setup, users connecting to the application server (read HiveServer2) are independently authenticated and, if OK, the second tier creates new .NET-type or JDBC threads to

SparkML. RandomForest scalability question.

2016-03-08 Thread Eugene Morozov
Hi, I have a 4-node cluster: one master (which also has the HDFS namenode) and 3 workers (with 3 colocated HDFS datanodes). Each worker has only 2 cores, and spark.executor.memory is 2.3g. The input file is two HDFS blocks, with one block configured as 64 MB. I train random forest regression with numTrees=50 and
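
For reference, the MLlib 1.x call being described looks roughly like this; the input path and every parameter other than numTrees=50 are assumptions:

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///input/train.libsvm")
    val model = RandomForest.trainRegressor(
      data,
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 50,
      featureSubsetStrategy = "auto",
      impurity = "variance",
      maxDepth = 5,
      maxBins = 32)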

Re: Hive Context: Hive Metastore Client

2016-03-08 Thread Alex
Yes, when creating a Hive Context, a Hive Metastore client should be created with the user that the Spark application will use to talk to the *remote* Hive Metastore. We would like to add a custom authorization plugin to our remote Hive Metastore to authorize the query requests that the spark

Re: Hive Context: Hive Metastore Client

2016-03-08 Thread Mich Talebzadeh
Hi, What do you mean by Hive Metastore client? Are you referring to the Hive server login, much like beeline? Spark uses hive-site.xml to get the details of the Hive metastore and the login to the metastore, which could be any database. Mine is Oracle, and as far as I know even in Hive 2, hive-site.xml

Re: Installing Spark on Mac

2016-03-08 Thread Jakob Odersky
I've had some issues myself with the user-provided-Hadoop version. If you just want to get started, I would recommend downloading Spark (pre-built, with any of the Hadoop versions) as Cody suggested. A simple step-by-step guide: 1. curl

Re: Installing Spark on Mac

2016-03-08 Thread Cody Koeninger
http://spark.apache.org/downloads.html Make sure you selected Choose a package type: something that says pre-built In my case, spark-1.6.0-bin-hadoop2.4.tgz bash-3.2$ cd ~/Downloads/ bash-3.2$ tar -xzvf spark-1.6.0-bin-hadoop2.4.tgz bash-3.2$ cd spark-1.6.0-bin-hadoop2.4/ bash-3.2$

Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread jleaniz
You've got to start the shuffle service on all your workers. There's a script for that in the 'sbin' directory. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-dynamic-allocation-and-shuffle-service-in-Standalone-Mode-tp26430p26434.html Sent from the

Streaming job delays

2016-03-08 Thread jleaniz
Hi, I have a streaming application that reads batches from Flume, does some transformations, and then writes parquet files to HDFS. The problem I have right now is that the scheduling delays are really high and get even higher as time goes on. I have seen them go up to 24 hours. The processing

Hive Context: Hive Metastore Client

2016-03-08 Thread Alex F
As of Spark 1.6.0 it is now possible to create new Hive Context sessions sharing various components, but right now the Hive Metastore client is shared among all new Hive Context sessions. Are there any plans to create individual metastore clients for each Hive Context? Related to the question

Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread ayan guha
Yes, if it falls within the batch. But if the requirement is to flush everything up to the 15th minute of the hour, then it should work. On 9 Mar 2016 04:01, "Ted Yu" wrote: > That may miss the 15th minute of the hour (with non-trivial deviation), > right? > > On Tue, Mar 8, 2016 at

Re: Installing Spark on Mac

2016-03-08 Thread Aida Tefera
OK, once I downloaded the pre-built version, I created a directory for it and named it Spark. When I try ./bin/start-all.sh, it comes back with: no such file or directory. When I try ./bin/spark-shell --master local[2], I get: no such file or directory. Failed to find Spark assembly; you need to

Re: Installing Spark on Mac

2016-03-08 Thread Cody Koeninger
That's what I'm saying, there is no "installing" necessary for pre-built packages. Just unpack it and change directory into it. What happens when you do ./bin/spark-shell --master local[2] or ./bin/start-all.sh On Tue, Mar 8, 2016 at 3:45 PM, Aida Tefera wrote: >

Re: Installing Spark on Mac

2016-03-08 Thread Aida Tefera
Hi Cody, thanks for your reply. I tried "sbt/sbt clean assembly" in the Terminal; somehow I still end up with errors. I have looked at the links below; they don't give much detail on how to install it before executing "./sbin/start-master.sh". Thanks, Aida Sent from my iPhone > On 8 Mar 2016, at

How to add a custom jar file to the Spark driver?

2016-03-08 Thread Gerhard Fiedler
We're running Spark 1.6.0 on EMR, in YARN client mode. We run Python code, but we want to add a custom jar file to the driver. When running on a local one-node standalone cluster, we just use spark.driver.extraClassPath and everything works: spark-submit --conf

Re: Best practices of maintaining a long running SparkContext

2016-03-08 Thread Mich Talebzadeh
Hi, I have recently started experimenting with Zeppelin and run it on TCP port 21999 (configurable in zeppelin-env.sh). The daemon seems to be stable. However, I have noticed that it goes stale from time to time, and killing the UI does not stop the job properly. Sometimes it is also necessary

Re: Spark ML - Scaling logistic regression for many features

2016-03-08 Thread Daniel Siegmann
Just for the heck of it I tried the old MLlib implementation, but it had the same scalability problem. Anyone familiar with the logistic regression implementation who could weigh in? On Mon, Mar 7, 2016 at 5:35 PM, Michał Zieliński < zielinski.mich...@gmail.com> wrote: > We're using

Re: Best practices of maintaining a long running SparkContext

2016-03-08 Thread Zhong Wang
+spark-users We are using Zeppelin (http://zeppelin.incubator.apache.org) as our UI to run spark jobs. Zeppelin maintains a long running SparkContext, and we run into a couple of issues: -- 1. Dynamic resource allocation keeps removing and registering executors, even though no jobs are running 2.

Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Yuval Itzchakov
Great. Thanks a lot Silvio. On Tue, Mar 8, 2016, 21:39 Andrew Or wrote: > Hi Yuval, if you start the Workers with `spark.shuffle.service.enabled = > true` then the workers will each start a shuffle service automatically. No > need to start the shuffle services yourself

Re: No event log in /tmp/spark-events

2016-03-08 Thread Andrew Or
Hi Patrick, I think he means just write `/tmp/sparkserverlog` instead of `file:/tmp/sparkserverlog`. However, I think both should work. What mode are you running in, client mode (the default) or cluster mode? If the latter, your driver will be run on the cluster, and so your event logs won't be on

Re: SparkFiles.get() returns with driver path Instead of Worker Path

2016-03-08 Thread Tristan Nixon
Based on your code: sparkContext.addFile("/home/files/data.txt"); List file = sparkContext.textFile(SparkFiles.get("data.txt")).collect(); I’m assuming the file in “/home/files/data.txt” exists and is readable in the driver’s filesystem. Did you try just doing this: List file

RE: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Silvio Fiorito
There’s a script to start it up under sbin, start-shuffle-service.sh. Run that on each of your worker nodes. From: Yuval Itzchakov Sent: Tuesday, March 8, 2016 2:17 PM To: Silvio Fiorito;

Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Yuval Itzchakov
Actually, I assumed that setting the flag in the spark job would turn on the shuffle service in the workers. I now understand that assumption was wrong. Is there any way to set the flag via the driver? Or must I manually set it via spark-env.sh on each worker? On Tue, Mar 8, 2016, 20:14 Silvio

Re: Installing Spark on Mac

2016-03-08 Thread Cody Koeninger
You said you downloaded a prebuilt version. You shouldn't have to mess with maven or building spark at all. All you need is a jvm, which it looks like you already have installed. You should be able to follow the instructions at http://spark.apache.org/docs/latest/ and

Not able to save data after running fpgrowth in pyspark

2016-03-08 Thread goutham koneru
Hello, I am using fpgrowth to generate frequent item sets and the model is working fine. If I select n rows, I am able to see the data. When I try to save the data using any of the methods like write.orc, saveAsTable, or saveAsParquet, it takes an unusual amount of time to save the data. If I

Re: Installing Spark on Mac

2016-03-08 Thread Eduardo Costa Alfaia
Hi Aida, the installation detected Maven version 3.0.3. Update to 3.3.3 and try again. On 08/Mar/2016 14:06, "Aida" wrote: > Hi all, > > Thanks everyone for your responses; really appreciate it. > > Eduardo - I tried your suggestions but ran into some issues,

Re: Spark structured streaming

2016-03-08 Thread Michael Armbrust
This is in active development, so there is not much that can be done from an end user perspective. In particular the only sink that is available in apache/master is a testing sink that just stores the data in memory. We are working on a parquet based file sink and will eventually support all the

Re: Installing Spark on Mac

2016-03-08 Thread Aida
I tried sbt/sbt package; it seemed to run fine until it didn't. I was wondering whether the error below has to do with my JVM version. Any thoughts? Thanks ukdrfs01:~ aidatefera$ cd Spark ukdrfs01:Spark aidatefera$ cd spark-1.6.0 ukdrfs01:spark-1.6.0 aidatefera$ sbt/sbt package NOTE: The sbt/sbt script

Re: Analyzing json Data streams using sparkSQL in spark streaming returns java.lang.ClassNotFoundException

2016-03-08 Thread Tristan Nixon
this is a bit strange, because you’re trying to create an RDD inside of a foreach function (the jsonElements). This executes on the workers, and so will actually produce a different instance in each JVM on each worker, not one single RDD referenced by the driver, which is what I think you’re

Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Suniti Singh
Please check the document for the configuration - http://spark.apache.org/docs/latest/job-scheduling.html#configuration-and-setup On Tue, Mar 8, 2016 at 10:14 AM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > You’ve started the external shuffle service on all worker nodes, correct? >

RE: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Silvio Fiorito
You’ve started the external shuffle service on all worker nodes, correct? Can you confirm they’re still running and haven’t exited? From: Yuval.Itzchakov Sent: Tuesday, March 8, 2016 12:41 PM To: user@spark.apache.org Subject: Using

Re: Installing Spark on Mac

2016-03-08 Thread Aida
Hi all, Thanks everyone for your responses; really appreciate it. Eduardo - I tried your suggestions but ran into some issues, please see below: ukdrfs01:Spark aidatefera$ cd spark-1.6.0 ukdrfs01:spark-1.6.0 aidatefera$ build/mvn -DskipTests clean package Using `mvn` from path: /usr/bin/mvn

Analyzing json Data streams using sparkSQL in spark streaming returns java.lang.ClassNotFoundException

2016-03-08 Thread Nesrine BEN MUSTAPHA
Hello, I tried to use sparkSQL to analyse json data streams within a standalone application. Here is the code snippet that receives the streaming data: *final JavaReceiverInputDStream lines = streamCtx.socketTextStream("localhost", Integer.parseInt(args[0]), StorageLevel.MEMORY_AND_DISK_SER_2());*
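
The usual pattern for this situation (a sketch in Scala for brevity; the Java version is analogous, and the table name is illustrative) is to build the DataFrame inside foreachRDD on the driver with a singleton SQLContext, rather than creating RDDs inside worker-side closures:

    import org.apache.spark.sql.SQLContext

    lines.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      val df = sqlContext.read.json(rdd) // rdd is an RDD[String] of JSON documents
      df.registerTempTable("events")
      sqlContext.sql("SELECT COUNT(*) FROM events").show()
    }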

Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Yuval.Itzchakov
Hi, I'm using Spark 1.6.0, and according to the documentation, dynamic allocation and spark shuffle service should be enabled. When I submit a spark job via the following: spark-submit \ --master \ --deploy-mode cluster \ --executor-cores 3 \ --conf
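
For reference, the settings involved (documented keys; the values are illustrative), expressed as SparkConf entries:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true") // each worker must also run the service
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "10")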

Re: PySpark/SQL Octet Length

2016-03-08 Thread Ross.Cramblit
Meant to include: I have this function which seems to work, but I am not sure if it is always correct: def octet_length(s): return len(s.encode('utf8')) sqlContext.registerFunction('octet_length', lambda x: octet_length(x)) > On Mar 8, 2016, at 12:30 PM, Cramblit, Ross (Reuters News) >

PySpark/SQL Octet Length

2016-03-08 Thread Ross.Cramblit
I am trying to define a UDF to calculate octet_length of a string but I am having some trouble getting it right. Does anyone have a working version of this already/any pointers? I am using Spark 1.5.2/Python 2.7. Thanks - To

Re: Spark on RAID

2016-03-08 Thread Mark Hamstra
One issue is that RAID levels providing data replication are not necessary since HDFS already replicates blocks on multiple nodes. On Tue, Mar 8, 2016 at 8:45 AM, Alex Kozlov wrote: > Parallel disk IO? But the effect should be less noticeable compared to > Hadoop which

Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread Ted Yu
That may miss the 15th minute of the hour (with non-trivial deviation), right? On Tue, Mar 8, 2016 at 8:50 AM, ayan guha wrote: > Why not compare current time in every batch and it meets certain condition > emit the data? > On 9 Mar 2016 00:19, "Abhishek Anand"

Re: SparkFiles.get() returns with driver path Instead of Worker Path

2016-03-08 Thread Tristan Nixon
My understanding of the model is that you’re supposed to execute SparkFiles.get(…) on each worker node, not on the driver. Since you already know where the files are on the driver, if you want to load these into an RDD with SparkContext.textFile, then this will distribute it out to the
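
A sketch of the two options described, using the path from the thread:

    import org.apache.spark.SparkFiles

    // Option 1: resolve the distributed copy on each executor, not the driver.
    sc.addFile("/home/files/data.txt")
    val paths = sc.parallelize(1 to 4)
      .map(_ => SparkFiles.get("data.txt")) // executor-local path
      .collect()

    // Option 2: if the file lives on the driver (and local mode is in play),
    // let textFile read and distribute it directly.
    val lines = sc.textFile("file:///home/files/data.txt").collect()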

Re: Output the data to external database at particular time in spark streaming

2016-03-08 Thread ayan guha
Why not compare the current time in every batch and, if it meets a certain condition, emit the data? On 9 Mar 2016 00:19, "Abhishek Anand" wrote: > I have a spark streaming job where I am aggregating the data by doing > reduceByKeyAndWindow with inverse function. > > I am keeping

Re: how to implement and deploy robust streaming apps

2016-03-08 Thread Xinh Huynh
If you would like an overview of Spark Stream and fault tolerance, these slides are great (Slides 24+ focus on fault tolerance; Slide 52 is on resilience to traffic spikes): http://www.lightbend.com/blog/four-things-to-know-about-reliable-spark-streaming-typesafe-databricks This recent Spark

Re: Questions about Actor model of Computation.

2016-03-08 Thread Minglei Zhang
Thanks, Ted, for the quick reply. I will look at what you mentioned. Best Regards. 2016-03-08 23:26 GMT+08:00 Ted Yu : > This seems related: > > the second paragraph under Implementation and theory > https://en.wikipedia.org/wiki/Closure_(computer_programming) > > On Tue, Mar 8,

Re: Spark on RAID

2016-03-08 Thread Alex Kozlov
Parallel disk IO? But the effect should be less noticeable compared to Hadoop, which reads/writes a lot. Much depends on how often Spark persists to disk, and on the specifics of the RAID controller as well. If you write to HDFS as opposed to the local file system, this may be a big factor as

Spark on RAID

2016-03-08 Thread Eddie Esquivel
Hello All, In the Spark documentation under "Hardware Requirements" it very clearly states: We recommend having *4-8 disks* per node, configured *without* RAID (just as separate mount points) My question is: why not RAID? What is the argument/reason for not using RAID? Thanks! -Eddie

FileAlreadyExistsException and Streaming context

2016-03-08 Thread Peter Halliday
I’m getting a FileAlreadyExistsException. I’ve tried setting the save to SaveMode.Overwrite, and setting spark.hadoop.validateOutputSpecs to false. However, I wonder if these settings are being ignored, because I’m using Spark Streaming. We aren’t using checkpointing, though. Here’s the

Re: Spark structured streaming

2016-03-08 Thread Jacek Laskowski
Hi Praveen, I don't really know. I think TD or Michael should know, as they are personally involved in the task (as far as I could figure out from the JIRA and the changes). Ping people on the JIRA so they notice your question(s). Regards, Jacek Laskowski

Re: Questions about Actor model of Computation.

2016-03-08 Thread Ted Yu
This seems related: the second paragraph under Implementation and theory https://en.wikipedia.org/wiki/Closure_(computer_programming) On Tue, Mar 8, 2016 at 4:49 AM, Minglei Zhang wrote: > hello, experts. > > I am a student. and recently, I read a paper about *Actor

Re: Use cases for kafka direct stream messageHandler

2016-03-08 Thread Cody Koeninger
No, looks like you'd have to catch them in the serializer and have the serializer return option or something. The new consumer builds a buffer full of records, not one at a time. On Mar 8, 2016 4:43 AM, "Marius Soutier" wrote: > > > On 04.03.2016, at 22:39, Cody Koeninger
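
For context, the messageHandler hook under discussion belongs to the 0.8 direct stream API. A sketch of the per-message catch (ssc, kafkaParams, and fromOffsets are assumed to be defined elsewhere):

    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // message() applies the decoder lazily, so a failed decode becomes None
    // here instead of killing the task.
    val handler = (mmd: MessageAndMetadata[String, String]) =>
      scala.util.Try(mmd.message()).toOption

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, Option[String]](
      ssc, kafkaParams, fromOffsets, handler)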

Re: Spark ML Interaction

2016-03-08 Thread Nick Pentreath
Could you create a JIRA to add an example and documentation? Thanks On Tue, 8 Mar 2016 at 16:18, amarouni wrote: > Hi, > > Did anyone here manage to write an example of the following ML feature > transformer > >

Spark ML Interaction

2016-03-08 Thread amarouni
Hi, Did anyone here manage to write an example of the following ML feature transformer http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/Interaction.html ? It's not documented on the official Spark ML features pages but it can be found in the package API javadocs. Thanks,

[Streaming + MLlib] Is it only Linear regression supported by online learning?

2016-03-08 Thread diplomatic Guru
Hello all, I'm using Random Forest for my machine learning (batch), and I would like to do online prediction in a Streaming job. However, the documentation only mentions a linear algorithm for regression jobs. Could we not use other algorithms?
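
For reference, the online learner the docs do cover (numFeatures, trainingStream, and testStream are assumptions):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))

    model.trainOn(trainingStream)       // DStream[LabeledPoint]
    model.predictOn(testStream).print() // DStream[Vector] in, predictions out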

Output the data to external database at particular time in spark streaming

2016-03-08 Thread Abhishek Anand
I have a Spark Streaming job where I am aggregating the data by doing reduceByKeyAndWindow with an inverse function. I am keeping the data in memory for up to 2 hours, and in order to output the reduced data to external storage I conditionally need to push the data to the DB, say at every 15th minute of
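
A minimal sketch of the windowed aggregation described (pairStream and the durations other than the 2-hour window are assumptions):

    import org.apache.spark.streaming.Minutes

    // Inverse-function form: data sliding out of the window is subtracted
    // instead of recomputing the whole window from scratch.
    val windowed = pairStream.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, // add records entering the window
      (a: Long, b: Long) => a - b, // subtract records leaving the window
      Minutes(120),                // keep up to 2 hours of data
      Minutes(1))                  // slide every minute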

Questions about Actor model of Computation.

2016-03-08 Thread Minglei Zhang
Hello, experts. I am a student, and recently I read a paper about *Actor Model of Computation: Scalable Robust Information Systems*. I am trying to understand its concepts, but the following sentence confuses me: # Messages are the unit of communication *1* Reference

Re: Spark structured streaming

2016-03-08 Thread Praveen Devarao
Thanks Jacek for the pointer. Any idea which package can be used in .format()? The test cases seem to work off the DefaultSource class defined within the DataFrameReaderWriterSuite [org.apache.spark.sql.streaming.test.DefaultSource]. Thanking You

SparkFiles.get() returns with driver path Instead of Worker Path

2016-03-08 Thread ashikvc
I am trying to play a little bit with apache-spark cluster mode. My cluster consists of a driver on my machine and a worker and manager on a host machine (a separate machine). I send a text file using `sparkContext.addFile(filepath)`, where filepath is the path of my text file on the local machine

Re: How to compile Spark with private build of Hadoop

2016-03-08 Thread Steve Loughran
On 8 Mar 2016, at 07:23, Lu, Yingqi wrote: Thank you for the quick reply. I am very new to Maven and always use the default settings. Can you please be a little more specific in the instructions? I think all the jar files from the Hadoop build are

Re: Spark structured streaming

2016-03-08 Thread Jacek Laskowski
Hi Praveen, I've spent a few hours on the changes related to streaming DataFrames (included in SPARK-8360) and concluded that it's currently only possible to read.stream(), but not write.stream(), since there are no streaming Sinks yet. Regards, Jacek Laskowski

Re: Use cases for kafka direct stream messageHandler

2016-03-08 Thread Marius Soutier
> On 04.03.2016, at 22:39, Cody Koeninger wrote: > > The only other valid use of messageHandler that I can think of is > catching serialization problems on a per-message basis. But with the > new Kafka consumer library, that doesn't seem feasible anyway, and > could be

Re: how to implement ALS with csv file? getting error while calling Rating class

2016-03-08 Thread Nick Pentreath
As I mentioned, using that *train* method returns the user and item factor RDDs, as opposed to an ALSModel instance. You first need to construct a model manually yourself. This is exactly why it's marked as *DeveloperApi*, since it is not user-friendly and not strictly part of the ML pipeline

Spark structured streaming

2016-03-08 Thread Praveen Devarao
Hi, I would like to get my hands on the structured streaming feature coming out in Spark 2.0. I have tried looking around for code samples to get started but am not able to find any. The only things I could look into are the test cases that have been committed under the JIRA umbrella

Re: RE: How to compile Spark with private build of Hadoop

2016-03-08 Thread fightf...@163.com
Hi there, you may try to use Nexus to establish a local Maven repository. I think this link would be helpful: http://www.sonatype.org/nexus/2015/02/27/setup-local-nexus-repository-and-deploy-war-file-from-maven/ After you have set up the repository, you may use the maven-deploy-plugin to deploy your

Re: How to compile Spark with private build of Hadoop

2016-03-08 Thread Saisai Shao
I think the first step is to publish your in-house-built Hadoop-related jars to your local Maven or Ivy repo, and then change the Spark build profiles like -Phadoop-2.x (you could use 2.7, or you may have to change the pom file if you meet jar conflicts) -Dhadoop.version=3.0.0-SNAPSHOT to build

Re: SSL support for Spark Thrift Server

2016-03-08 Thread Sumedh Wale
On Saturday 05 March 2016 02:46 AM, Sourav Mazumder wrote: Hi All, While starting the Spark Thrift Server I don't see any option to start it with SSL support. Is that support currently there ? It uses HiveServer2 so the SSL settings in hive-site.xml should work:

Re: Best way to merge files from streaming jobs

2016-03-08 Thread Sumedh Wale
On Saturday 05 March 2016 02:39 AM, Jelez Raditchkov wrote: My streaming job is creating files on S3. The problem is that those files end up very small if I just write them to S3 directly. This is why I use coalesce() to reduce the

Re: OOM exception during Broadcast

2016-03-08 Thread Olivier Girardot
Java's default serialization is not the best/most efficient way of handling ser/deser. Did you try switching to Kryo serialization? Cf. https://ogirardot.wordpress.com/2015/01/09/changing-sparks-default-java-serialization-to-kryo/ if you need a tutorial. This should help in terms of both CPU
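
The switch being suggested, expressed as SparkConf settings (the registered class is a placeholder for your own types):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes avoids serializing full class names with every object.
      .registerKryoClasses(Array(classOf[String]))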

Re: Spark Twitter streaming

2016-03-08 Thread Imre Nagi
Do you mean listening to the twitter stream data? Maybe you can use the Twitter Stream API or Twitter Search API for this purpose. Imre On Tue, Mar 8, 2016 at 2:54 PM, Soni spark wrote: > Hallo friends, > > I need a urgent help. > > I am using spark streaming to get

RE: How to compile Spark with private build of Hadoop

2016-03-08 Thread Lu, Yingqi
Thank you for the quick reply. I am very new to Maven and always use the default settings. Can you please be a little more specific in the instructions? I think all the jar files from the Hadoop build are located at Hadoop-3.0.0-SNAPSHOT/share/hadoop. Which ones do I need to use to compile Spark, and