Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread M Singh
Thanks Tathagata for your answer. The reason I was asking about controlling data size is that the javadoc indicates you can use foreach or collect on the dataframe. If the data is very large then a collect may result in OOM. From your answer it appears that the only way to control the size (in

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-03 Thread Tathagata Das
1. It is all the result data in that trigger. Note that it takes a DataFrame which is a purely logical representation of data and has no association with partitions, etc. which are physical representations. 2. If you want to limit the amount of data that is processed in a trigger, then you should
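
A minimal sketch of the rate-limiting knobs being referred to here, assuming a file source and a Kafka source respectively (paths, hosts and the schema are placeholders); maxFilesPerTrigger and maxOffsetsPerTrigger cap how much data each trigger picks up:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("rate-limited-stream").getOrCreate()
    val schema = new StructType().add("id", StringType).add("value", DoubleType)

    // File source: cap the number of new files read per micro-batch.
    val fileStream = spark.readStream
      .format("json")
      .schema(schema)
      .option("maxFilesPerTrigger", 10)    // at most 10 files per trigger
      .load("/data/in")

    // Kafka source: cap the total number of offsets consumed per trigger.
    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "events")
      .option("maxOffsetsPerTrigger", 100000)
      .load()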

Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-03 Thread M Singh
Hi: The documentation for Sink.addBatch is as follows:   /**   * Adds a batch of data to this sink. The data for a given `batchId` is deterministic and if   * this method is called more than once with the same batchId (which will happen in the case of   * failures), then `data` should only be
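
A rough sketch of what a custom sink can do with that contract, assuming the internal (Spark 2.x) org.apache.spark.sql.execution.streaming.Sink trait; the batchId check is what makes redelivery after a failure safe, and writing via foreach keeps the trigger's result on the executors instead of collecting it onto the driver:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Sink

    class IdempotentSink extends Sink {
      // Highest batch id already written; replayed batches are skipped.
      @volatile private var lastCommittedBatchId = -1L

      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        if (batchId <= lastCommittedBatchId) return   // same batchId delivered again after a failure

        // foreach runs on the executors, so the whole result is never pulled into the driver.
        data.foreach { row =>
          println(row)   // stand-in for writing the row to the external system
        }
        lastCommittedBatchId = batchId
      }
    }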

Apache Spark - Using withWatermark for DataSets

2017-12-30 Thread M Singh
Hi: I am working with DataSets so that I can use mapGroupsWithState for business logic and then use dropDuplicates over a set of fields. I would like to use withWatermark so that I can restrict how much state is stored. From the API it looks like withWatermark takes a string -
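
For reference, withWatermark takes the event-time column name and the delay threshold, both as plain strings, and it preserves the Dataset's type, so it composes with dropDuplicates; a small sketch assuming a made-up Event case class with an eventTime timestamp column:

    import java.sql.Timestamp
    import org.apache.spark.sql.{Encoders, SparkSession}

    case class Event(id: String, eventTime: Timestamp, payload: String)

    val spark = SparkSession.builder.appName("watermark-dedup").getOrCreate()
    import spark.implicits._

    val events = spark.readStream
      .schema(Encoders.product[Event].schema)
      .json("/data/events")
      .as[Event]

    // Column name and delay are both strings; state older than
    // (max event time seen - 10 minutes) becomes eligible for cleanup.
    val deduped = events
      .withWatermark("eventTime", "10 minutes")
      .dropDuplicates("id", "eventTime")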

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-30 Thread M Singh
Thanks Eyal - it appears that these are the same patterns used for spark DStreams. On Wednesday, December 27, 2017 1:15 AM, Eyal Zituny wrote: Hi, if you're interested in stopping your spark application externally, you will probably need a way to communicate

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-27 Thread Eyal Zituny
Hi, if you're interested in stopping your spark application externally, you will probably need a way to communicate with the spark driver (which starts and holds a ref to the spark context). This can be done by adding some code to the driver app, for example: - you can expose a rest api that
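
One way to sketch the marker-file variant of what Eyal describes (paths and the toy rate source are placeholders): the driver polls for an externally created file instead of blocking forever in awaitTermination, and calls query.stop() once the file appears:

    import java.nio.file.{Files, Paths}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("graceful-shutdown").getOrCreate()

    val query = spark.readStream
      .format("rate")                        // toy source; replace with the real one
      .load()
      .writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/chk")
      .start()

    // An operator (or another job) creates this file to request shutdown.
    val stopMarker = Paths.get("/shared/stop-streaming-app")

    while (query.isActive && !Files.exists(stopMarker)) {
      query.awaitTermination(10000)          // wait up to 10 s, then re-check the marker
    }
    if (query.isActive) {
      query.stop()                           // checkpointing lets the next run resume from here
    }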

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-26 Thread M Singh
Thanks Diogo.  My question is how to gracefully call the stop method while the streaming application is running in a cluster. On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira wrote: Hi M Singh! Here I'm using query.stop() Em 25 de dez de 2017

Re: Apache Spark - (2.2.0) - window function for DataSet

2017-12-25 Thread Diogo Munaro Vieira
Window function requires a timestamp column because you will apply a function for each window (like an aggregation). You can still use a UDF for customized tasks. On 25 Dec 2017 20:15, "M Singh" wrote: > Hi: > I would like to use window function on a DataSet
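
To illustrate the point: window() from org.apache.spark.sql.functions just needs a timestamp Column, so it can be applied to a typed Dataset by referring to one of its columns; a small batch sketch with a made-up Event case class:

    import java.sql.Timestamp
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{count, window}

    case class Event(id: String, eventTime: Timestamp, value: Double)

    val spark = SparkSession.builder.appName("dataset-window").master("local").getOrCreate()
    import spark.implicits._

    val ds = Seq(
      Event("a", Timestamp.valueOf("2017-12-25 10:00:00"), 1.0),
      Event("a", Timestamp.valueOf("2017-12-25 10:07:00"), 2.0),
      Event("b", Timestamp.valueOf("2017-12-25 10:20:00"), 3.0)
    ).toDS()

    // window() takes a Column, so reference the Dataset's timestamp column by name.
    ds.groupBy(window($"eventTime", "10 minutes"), $"id")
      .agg(count("*").as("n"))
      .show(truncate = false)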

Re: Apache Spark - Structured Streaming from file - checkpointing

2017-12-25 Thread Diogo Munaro Vieira
Can you please post your code here? On 25 Dec 2017 19:24, "M Singh" wrote: > Hi: > > I am using spark structured streaming (v 2.2.0) to read data from files. I > have configured checkpoint location. On stopping and restarting the > application, it looks

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-25 Thread Diogo Munaro Vieira
Hi M Singh! Here I'm using query.stop(). On 25 Dec 2017 19:19, "M Singh" wrote: > Hi: > Are there any patterns/recommendations for gracefully stopping a > structured streaming application ? > Thanks > > >

Apache Spark - (2.2.0) - window function for DataSet

2017-12-25 Thread M Singh
Hi: I would like to use window function on a DataSet stream (Spark 2.2.0). The window function requires Column as argument and can be used with DataFrames by passing the column. Is there any analogous window function or pointers to how window function can be used for DataSets? Thanks

Apache Spark - Structured Streaming from file - checkpointing

2017-12-25 Thread M Singh
Hi: I am using spark structured streaming (v 2.2.0) to read data from files. I have configured checkpoint location. On stopping and restarting the application, it looks like it is reading the previously ingested files. Is that expected behavior? Is there any way to prevent reading files that
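
For context, a sketch of the setup being described (paths and schema are placeholders): with checkpointLocation set and kept stable across restarts, the file source records already-processed files in the checkpoint, so a restart should only pick up files that arrived while the query was down rather than re-ingesting everything:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("file-stream-checkpoint").getOrCreate()

    val schema = new StructType().add("id", StringType).add("value", DoubleType)

    val input = spark.readStream
      .schema(schema)
      .json("/data/incoming")

    val query = input.writeStream
      .format("parquet")
      .option("path", "/data/out")
      .option("checkpointLocation", "/data/checkpoints/file-stream")   // must not change between runs
      .start()

    query.awaitTermination()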

Apache Spark - Structured Streaming graceful shutdown

2017-12-25 Thread M Singh
Hi: Are there any patterns/recommendations for gracefully stopping a structured streaming application? Thanks

Re: Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Scott Reynolds
which > describe the method train(). > > I ended up looking at the source code and found the method in the scala > source code. > (You can see the code here: > https://github.com/apache/spark/blob/v2.1.0/ml

Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Michael Segel
/apache/spark/blob/v2.1.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ) So the method(s) exist, but are not covered in the Scala API doc. How do you raise this as a ‘bug’? Thx -Mike

Apache Spark 2.3 and Apache ORC 1.4 finally

2017-12-05 Thread Dongjoon Hyun
Hi, All. Today, Apache Spark starts to use Apache ORC 1.4 as a `native` ORC implementation. SPARK-20728 Make OrcFileFormat configurable between `sql/hive` and `sql/core`. - https://github.com/apache/spark/commit/326f1d6728a7734c228d8bfaa69442a1c7b92e9b Thank you so much for all your support

Apache Spark Downloads Page Error

2017-11-16 Thread rjsullivan
I just noticed that there's a problem on the Apache Spark Downloads page at: https://spark.apache.org/downloads.html Regardless of which option is selected from the 'Choose a package type:' pulldown menu, the file listed for download is always: spark-2.2.0-bin-hadoop2.7.tgz I'm using Chrome

[ANNOUNCE] Apache Spark 2.1.2

2017-10-25 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.2! Apache Spark 2.1.2 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend all 2.1.x users to upgrade to this stable release. To download Apache Spark 2.1.2 visit http://spark.apache.org

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-20 Thread lucas.g...@gmail.com
Right, that makes sense and I understood that. The thing I'm wondering about (And i think the answer is 'no' at this stage). When the optimizer is running and pushing predicates down, does it take into account indexing and other storage layer strategies in determining which predicates are

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-20 Thread Mich Talebzadeh
here below Gary filtered_df = spark.hiveContext.sql(""" SELECT * FROM df WHERE type = 'type' AND action = 'action' AND audited_changes LIKE '---\ncompany_id:\n- %' """) filtered_audits.registerTempTable("filtered_df") you are using hql to read

Re: Prediction using Classification with text attributes in Apache Spark MLLib

2017-10-20 Thread lmk
Trying to improve the old solution. Do we have a better text classifier now in Spark Mllib? Regards, lmk -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread lucas.g...@gmail.com
Ok, so when Spark is forming queries it's ignorant of the underlying storage layer index. If there is an index on a table Spark doesn't take that into account when doing the predicate push down in optimization. In that case why does spark push 2 of my conditions (where fieldx = 'action') to the

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread Mich Talebzadeh
remember your indexes are in RDBMS. In this case MySQL. When you are reading from that table you have an 'id' column which I assume is an integer and you are making parallel threads through JDBC connection to that table. You can see the threads in MySQL if you query it. You can see multiple

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread lucas.g...@gmail.com
If the underlying table(s) have indexes on them. Does spark use those indexes to optimize the query? IE if I had a table in my JDBC data source (mysql in this case) had several indexes and my query was filtering on one of the fields with an index. Would spark know to push that predicate to the

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread Mich Talebzadeh
sorry what do you mean my JDBC table has an index on it? Where are you reading the data from the table? I assume you are referring to "id" column on the table that you are reading through JDBC connection. Then you are creating a temp Table called "df". That temp table is created in temporary

Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread lucas.g...@gmail.com
IE: If my JDBC table has an index on it, will the optimizer consider that when pushing predicates down? I noticed in a query like this: df = spark.hiveContext.read.jdbc( url=jdbc_url, table="schema.table", column="id", lowerBound=lower_bound_id, upperBound=upper_bound_id,
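
A sketch of the partitioned JDBC read from this thread plus a way to see what actually gets pushed down (connection details and column names are placeholders): simple filters show up as PushedFilters in the physical plan and end up in the generated SQL, but as far as I know the optimizer does not consult the database's indexes or statistics when deciding this:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("jdbc-pushdown").getOrCreate()
    import spark.implicits._

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://host:3306/mydb")
      .option("dbtable", "schema.table")
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")     // parallel reads are split on this column
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load()

    val filtered = df.filter($"type" === "type" && $"action" === "action")

    // Look for "PushedFilters: [...]" in the physical plan.
    filtered.explain(true)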

Apache Spark GraphX: java.lang.ArrayIndexOutOfBoundsException: -1

2017-10-16 Thread Andy Long
) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433

Re: Apache Spark-Subtract two datasets

2017-10-13 Thread Nathan Kronenfeld
ouble] scala> fooDs.join(barDs, Seq("a"), "left_anti").collect.foreach(println) [a,1] On Thu, Oct 12, 2017 at 1:16 PM, Shashikant Kulkarni < shashikant.kulka...@gmail.com> wrote: > Hello, > > I have 2 datasets, Dataset and other is Dataset. I want > the l

Re: Apache Spark-Subtract two datasets

2017-10-12 Thread Imran Rajjad
have 2 datasets, Dataset and other is Dataset. I want > the list of records which are in Dataset but not in > Dataset. How can I do this in Apache Spark using Java Connector? I > am using Apache Spark 2.2.0 > > Thank you >

Apache Spark-Subtract two datasets

2017-10-12 Thread Shashikant Kulkarni
Hello, I have 2 datasets, Dataset and other is Dataset. I want the list of records which are in Dataset but not in Dataset. How can I do this in Apache Spark using Java Connector? I am using Apache Spark 2.2.0 Thank you
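
Two ways to express "records in one Dataset but not in the other", sketched in Scala (the same Dataset methods are callable from the Java API): except() does a whole-row set difference, while a left_anti join (as in the reply above) decides membership by a key column:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("dataset-subtract").master("local").getOrCreate()
    import spark.implicits._

    val left  = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")
    val right = Seq(("b", 2)).toDF("key", "value")

    // Whole-row set difference: rows of `left` not present in `right`.
    left.except(right).show()

    // Anti join: keep rows of `left` whose key has no match in `right`.
    left.join(right, Seq("key"), "left_anti").show()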

Re: Apache Spark - MLLib challenges

2017-09-23 Thread vaquar khan
MLlib is the old RDD-based API; since Apache Spark 2 it is recommended to use the Dataset-based APIs to get good performance, and that is what introduced ML. ML contains a new API built around Datasets and ML Pipelines; mllib is slowly being deprecated (this already happened in the case of linear regression). MLlib currently

Re: Apache Spark - MLLib challenges

2017-09-23 Thread Koert Kuipers
our main challenge has been the lack of support for missing values generally On Sat, Sep 23, 2017 at 3:41 AM, Irfan Kabli wrote: > Dear All, > > We are looking to position MLLib in our organisation for machine learning > tasks and are keen to understand if there are

Re: Apache Spark - MLLib challenges

2017-09-23 Thread Aseem Bansal
This is something I wrote specifically for the challenges that we faced when taking spark ml models to production http://www.tothenew.com/blog/when-you-take-your-machine-learning-models-to-production-for-real-time-predictions/ On Sat, Sep 23, 2017 at 1:33 PM, Jörn Franke

Re: Apache Spark - MLLib challenges

2017-09-23 Thread Jörn Franke
As far as I know there is currently no encryption in-memory in Spark. There are some research projects to create secure enclaves in-memory based on Intel sgx, but there is still a lot to do in terms of performance and security objectives. The more interesting question is why would you need this

Apache Spark - MLLib challenges

2017-09-23 Thread Irfan Kabli
Dear All, We are looking to position MLLib in our organisation for machine learning tasks and are keen to understand if there are any challenges that you might have seen with MLLib in production. We will be going with the pure open-source approach here, rather than using one of the hadoop

CVE-2017-12612 Unsafe deserialization in Apache Spark launcher API

2017-09-08 Thread Sean Owen
Severity: Medium Vendor: The Apache Software Foundation Versions Affected: Versions of Apache Spark from 1.6.0 until 2.1.1 Description: In Apache Spark 1.6.0 until 2.1.1, the launcher API performs unsafe deserialization of data received by its socket. This makes applications launched

Re: [Meetup] Apache Spark and Ignite for IoT scenarios

2017-09-07 Thread Denis Magda
Hello Anjaneya, Marco, Honestly, I’m not aware whether video broadcasting or recording is planned. Could you go to the meetup page [1] and raise the question there? Just in case, here you can find a list of all upcoming Ignite-related events [2]. Probably some of them will be in close

Re: [Meetup] Apache Spark and Ignite for IoT scenarios

2017-09-07 Thread Marco Mistroni
Hi Will there be a podcast to view afterwards for remote EMEA users? Kr On Sep 7, 2017 12:15 AM, "Denis Magda" wrote: > Folks, > > Those who are craving for mind food this weekend come over the meetup - > Santa Clara, Sept 9, 9.30 AM: >

[Meetup] Apache Spark and Ignite for IoT scenarios

2017-09-06 Thread Denis Magda
Folks, Those who are craving for mind food this weekend come over the meetup - Santa Clara, Sept 9, 9.30 AM: https://www.meetup.com/datariders/events/242523245/?a=socialmedia — Denis

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Bryan Cutler
k You very much. It is great help, I will try spark-sklearn. >> >> Prem >> >> >> >> >> >> >> >> >> >> *From: *Yanbo Liang <yblia...@gmail.com> >> *Date: *Tuesday, September 5, 2017 at 10:40 AM >> *To: *Patrick Mc

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Yanbo Liang
> > > > > > *From: *Yanbo Liang <yblia...@gmail.com> > *Date: *Tuesday, September 5, 2017 at 10:40 AM > *To: *Patrick McCarthy <pmccar...@dstillery.com> > *Cc: *"Timsina, Prem" <prem.tims...@mssm.edu>, "user@spark.apache.org" &

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Timsina, Prem
mber 5, 2017 at 10:40 AM To: Patrick McCarthy <pmccar...@dstillery.com> Cc: "Timsina, Prem" <prem.tims...@mssm.edu>, "user@spark.apache.org" <user@spark.apache.org> Subject: Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm Hi

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Yanbo Liang
Hi Prem, How large is your dataset? Can it fit on a single node? If not, Spark MLlib provides cross-validation, which can run multiple machine learning algorithms in parallel on a distributed dataset and do parameter search. FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
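
The cross-validation being pointed to, sketched for a single pipeline in the shape of the linked ml-tuning guide; the `training` DataFrame with "text" and "label" columns is assumed to exist, and the fold/grid evaluation runs over the distributed dataset:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // `training` is assumed to be a DataFrame with "text" and "label" columns.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(1000, 10000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(10)   // matches the 10-fold validation mentioned in the question

    val cvModel = cv.fit(training)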

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Patrick McCarthy
You might benefit from watching this JIRA issue - https://issues.apache.org/jira/browse/SPARK-19071 On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem wrote: > Is there a way to parallelize multiple ML algorithms in Spark. My use case > is something like this: > > A) Run

Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-03 Thread Timsina, Prem
Is there a way to parallelize multiple ML algorithms in Spark? My use case is something like this: A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random Forest, etc.) in parallel. 1) Validate each algorithm using 10-fold cross-validation B) Feed the output of step A) in second

Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-03 Thread prtimsina
And it seems cross-validation also cannot be done in parallel. I appreciate any suggestion to parallelize this use case. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: [Spark] Can Apache Spark be used with time series processing?

2017-08-30 Thread vaquar khan
Hi Alex, Hope the following links help you understand why Spark is a good fit for your use case. - https://www.youtube.com/watch?v=tKkneWcAIqU=youtu.be - https://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/ - http

Re: [Spark] Can Apache Spark be used with time series processing?

2017-08-30 Thread Irving Duran
xandr.poru...@gmail.com> wrote: > > > > Hello, > > > > I am new in Apache Spark. I need to process different time series data > (numeric values which depend on time) and react on next actions: > > 1. Data is changing up or down too fast. > > 2. Data is changing constant

Re: [Spark] Can Apache Spark be used with time series processing?

2017-08-30 Thread kanth909
I don't see why not Sent from my iPhone > On Aug 24, 2017, at 1:52 PM, Alexandr Porunov <alexandr.poru...@gmail.com> > wrote: > > Hello, > > I am new in Apache Spark. I need to process different time series data > (numeric values which depend on time) and react

[Upvote] for Apache Spark for 2017 Innovation Award

2017-08-29 Thread Jules Damji
Fellow Spark users, If you think, and believe, deep in your hearts that Apache Spark deserves an innovation award, cast your vote here: https://jaxlondon.com/jax-awards Cheers, Jules Sent from my iPhone Pardon the dumb thumb typos :)

[Spark] Can Apache Spark be used with time series processing?

2017-08-24 Thread Alexandr Porunov
Hello, I am new to Apache Spark. I need to process different time series data (numeric values which depend on time) and react to the following conditions: 1. Data is changing up or down too fast. 2. Data is changing constantly up or down for too long. For example, if the data has changed 30% up or down

SWOT Analysis on Apache Spark

2017-08-24 Thread Irfan Kabli
Hi All, I am not sure if the users list is the right list for this query, but I am hoping that if this is the wrong forum someone will point me to the right one. I work for a company which uses a proprietary analytical ecosystem. I am evangelising open-source and have been requested by management to

Apache Spark on Kubernetes: New Release for Spark 2.2

2017-08-14 Thread Erik Erlandson
The Apache Spark on Kubernetes Community Development Project is pleased to announce the latest release of Apache Spark with native Scheduler Backend for Kubernetes! Features provided in this release include: - Cluster-mode submission of Spark jobs to a Kubernetes cluster - Support

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-17 Thread Sam Elamin
;> Awesome! Congrats! Can't wait!! >> >> jg >> >> >> On Jul 11, 2017, at 18:48, Michael Armbrust <mich...@databricks.com> >> wrote: >> >> Hi all, >> >> Apache Spark 2.2.0 is the third release of the Spark 2.x line. This >>

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-17 Thread kant kodali
+1 On Tue, Jul 11, 2017 at 3:56 PM, Jean Georges Perrin <j...@jgp.net> wrote: > Awesome! Congrats! Can't wait!! > > jg > > > On Jul 11, 2017, at 18:48, Michael Armbrust <mich...@databricks.com> > wrote: > > Hi all, > > Apache Spark 2.2.0 is the third

CVE-2017-7678 Apache Spark XSS web UI MHTML vulnerability

2017-07-12 Thread Sean Owen
Severity: Low Vendor: The Apache Software Foundation Versions Affected: Versions of Apache Spark before 2.2.0 Description: It is possible for an attacker to take advantage of a user's trust in the server to trick them into visiting a link that points to a shared Spark cluster and submits data

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-11 Thread Jean Georges Perrin
Awesome! Congrats! Can't wait!! jg > On Jul 11, 2017, at 18:48, Michael Armbrust <mich...@databricks.com> wrote: > > Hi all, > > Apache Spark 2.2.0 is the third release of the Spark 2.x line. This release > removes the experimental tag from Structured Streaming. In a

[ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-11 Thread Michael Armbrust
Hi all, Apache Spark 2.2.0 is the third release of the Spark 2.x line. This release removes the experimental tag from Structured Streaming. In addition, this release focuses on usability, stability, and polish, resolving over 1100 tickets. We'd like to thank our contributors and users

RE: using Apache Spark standalone on a server for a class/multiple users, db.lck does not get removed

2017-06-29 Thread Mahesh Sawaiker
PM To: user@spark.apache.org Subject: using Apache Spark standalone on a server for a class/multiple users, db.lck does not get removed We have a Big Data class planned and we’d like students to be able to start spark-shell or pyspark as their own user. However the Derby database locks

using Apache Spark standalone on a server for a class/multiple users, db.lck does not get removed

2017-06-28 Thread Robert Kudyba
We have a Big Data class planned and we’d like students to be able to start spark-shell or pyspark as their own user. However the Derby database lock prevents the process from starting as another user: -rw-r--r-- 1 myuser staff 38 Jun 28 10:40 db.lck And these errors appear: ERROR PoolWatchThread:

[apache-spark] Re: Problem with master webui with reverse proxy when workers >= 10

2017-05-31 Thread Trevor McKay
ht? > > Thanks, > > Trevor McKay > > > > -- > > > View this message in context: http://apache-spark-user- list.1001560.n3.nabble.com/Problem-with-master-webui-with-reverse- proxy-when-workers-10-tp28729.html > > Sent from the Apache Spark User List mailing list arc

Re: Synonym handling replacement issue with UDF in Apache Spark

2017-05-10 Thread albertoandreotti
ark-has.html> Alberto. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Synonym-handling-replacement-issue-with-UDF-in-Apache-Spark-tp28638p28675.html Sent from the Apache Spark User List mailing list archive at Nabb

Re: Join streams Apache Spark

2017-05-09 Thread tencas
;sink" that can receive different sources stream throught the same port?? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Join-streams-Apache-Spark-tp28603p28670.html Sent from the Apache Spark User List mailing list a

Re: Join streams Apache Spark

2017-05-08 Thread saulshanabrook
I, actually, just ran it in a Docker image. But the point is, it doesn't need to run in the JVM, because it just runs as a separate process. Then your Java (or any other client) code sends messages to it over TCP and it relays them to Spark. On Mon, May 8, 2017 at 4:07 AM tencas [via Apache Spark

Re: Join streams Apache Spark

2017-05-08 Thread tencas
Yep, I mean the first script you posted. So, you can compile it to Java binaries for example ? Ok, I have no idea about Go. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Join-streams-Apache-Spark-tp28603p28662.html Sent from the Apache Spark User List

Re: Join streams Apache Spark

2017-05-08 Thread Gourav Sengupta
> > I am using Apache Spark Streaming using a TCP connector to receive data. > I have a python application that connects to a sensor, and create a TCP > server that waits connection from Apache Spark, and then, sends json data > through this socket. > > How can I manage to

Re: Join streams Apache Spark

2017-05-07 Thread saulshanabrook
The script I wrote in Go? No I do not, but it's very easy to compile it to whatever platform you are running on! Doesn't need to be integrated in the same language as the rest of your code. On Sat, May 6, 2017 at 3:13 PM tencas [via Apache Spark User List] < ml+s1001560n28658...@n3.nabble.

Re: Join streams Apache Spark

2017-05-06 Thread tencas
There is a Spark Streaming example of the classic word count, using the Apache Kafka connector: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java (maybe you already know it) The point is, what are the benefits from using

Re: Join streams Apache Spark

2017-05-06 Thread saulshanabrook
er TCP and writes each newline it receives to a new file in a folder. Then Spark can read them from that folder. On Sat, May 6, 2017 at 2:38 PM tencas [via Apache Spark User List] < ml+s1001560n2865...@n3.nabble.com> wrote: > Thanks @saulshanabrook, I'll have a look at it. > > I thi

Re: Join streams Apache Spark

2017-05-06 Thread tencas
Thanks @saulshanabrook, I'll have a look at it. I think apache kafka could be an alternative solution, but I haven't checked it yet. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Join-streams-Apache-Spark-tp28603p28656.html Sent from the Apache Spark

Re: Join streams Apache Spark

2017-05-06 Thread saulshanabrook
it up and release it with some documentation. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Join-streams-Apache-Spark-tp28603p28655.html Sent from the Apache Spark User List mailing list archive at Nabb

Re: Synonym handling replacement issue with UDF in Apache Spark

2017-05-03 Thread JayeshLalwani
, you get the output that you are getting -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Synonym-handling-replacement-issue-with-UDF-in-Apache-Spark-tp28638p28648.html Sent from the Apache Spark User List mailing list archive at N

[ANNOUNCE] Apache Spark 2.1.1

2017-05-02 Thread Michael Armbrust
We are happy to announce the availability of Spark 2.1.1! Apache Spark 2.1.1 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend all 2.1.x users to upgrade to this stable release. To download Apache Spark 2.1.1 visit http://spark.apache.org

Re: Reading table from sql database to apache spark dataframe/RDD

2017-05-01 Thread vincent gromakowski
Use cache or persist. The dataframe will be materialized when the first action is called and then reused from memory for each following usage. On May 1, 2017 4:51 PM, "Saulo Ricci" <infsau...@gmail.com> wrote: > Hi, > > > I have the following code that is readi
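
A sketch of that pattern against the JDBC table from the question below (connection details and column names are placeholders): persist/cache marks the DataFrame, the first action materializes it, and later actions reuse the in-memory copy instead of going back to the database:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder.appName("jdbc-cache").getOrCreate()

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql:host/database")
      .option("dbtable", "tablename")
      .option("user", "user")
      .option("password", "password")
      .load()
      .persist(StorageLevel.MEMORY_AND_DISK)   // or simply .cache()

    // The first action materializes the table into executor memory...
    println(df.count())
    // ...subsequent actions reuse the cached data rather than re-reading over JDBC.
    df.groupBy("some_column").count().show()   // "some_column" is a placeholder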

Reading table from sql database to apache spark dataframe/RDD

2017-05-01 Thread Saulo Ricci
Hi, I have the following code that is reading a table into an Apache Spark DataFrame: val df = spark.read.format("jdbc") .option("url","jdbc:postgresql:host/database") .option("dbtable","tablename").option("user","use

Synonym handling replacement issue with UDF in Apache Spark

2017-04-28 Thread Nishanth
anufacturerNames.get(str).toString())); } dataFileContent.show(); I got to know that the amount of data is too huge for regexp_replace so got a solution to use UDF: http://stackoverflow.com/questions/43413513/issue-in-regex-replace-in-apache-spark-java Method 2 (UDF) Li

Re: Synonym handling replacement issue with UDF in Apache Spark

2017-04-27 Thread Yanbo Liang
Content = > sqlContext.load("com.databricks.spark.csv", > options); > > > while(names.hasMoreElements()) { > str = (String) names.nextElement(); > dataFileContent=dataFileContent.withColumn("ManufacturerSource", > regexp_re

Synonym handling replacement issue with UDF in Apache Spark

2017-04-27 Thread Nishanth
anufacturerNames.get(str).toString())); } dataFileContent.show(); I got to know that the amount of data is too huge for regexp_replace so got a solution to use UDF: http://stackoverflow.com/questions/43413513/issue-in-regex-replace-in-apache-spark-java Method 2 (UDF) Li
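
A Scala sketch of the UDF route (the thread itself is Java; the map contents and column names here are made up): broadcast the synonym map once and do a single lookup per row, instead of chaining one regexp_replace per synonym:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    val spark = SparkSession.builder.appName("synonym-replace").getOrCreate()

    // Illustrative synonym table: raw manufacturer string -> canonical name.
    val manufacturerNames = Map("H.P." -> "HP", "Hewlett Packard" -> "HP", "I.B.M." -> "IBM")
    val namesBc = spark.sparkContext.broadcast(manufacturerNames)

    val normalize = udf { raw: String =>
      if (raw == null) null else namesBc.value.getOrElse(raw, raw)
    }

    // Column names are placeholders for the ones used in the thread.
    val dataFileContent = spark.read.option("header", "true").csv("/data/products.csv")
    dataFileContent
      .withColumn("ManufacturerSource", normalize(col("Manufacturer")))
      .show()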

An Apache Spark metric sink for Kafka

2017-04-18 Thread Erik Erlandson
I wrote up a simple metric sink for Spark that publishes metrics to a Kafka broker. Each metric is published as a message (in json format), with the metric name as the message key. https://github.com/erikerlandson/spark-kafka-sink Build with "(x)sbt assembly" and make sure the resulting jar

Join streams Apache Spark

2017-04-15 Thread tencas
Hi everybody, I am using Apache Spark Streaming with a TCP connector to receive data. I have a python application that connects to a sensor and creates a TCP server that waits for a connection from Apache Spark, and then sends json data through this socket. How can I manage to join many independent
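
For reference, a sketch of the receiving side with the DStream socket source (hosts, ports and the batch interval are placeholders): each sensor/TCP-server pair gets its own socketTextStream, and union is the simplest way to combine the independent streams (a keyed join would additionally need the records paired up by key):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("tcp-json-streams")
    val ssc = new StreamingContext(conf, Seconds(5))

    // One socket stream per sensor-side TCP server.
    val sensorA = ssc.socketTextStream("sensor-host-a", 9999)
    val sensorB = ssc.socketTextStream("sensor-host-b", 9998)

    // Simplest combination: a union of the two JSON-line streams.
    val all = sensorA.union(sensorB)
    all.foreachRDD { rdd => rdd.take(5).foreach(println) }

    ssc.start()
    ssc.awaitTermination()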

Re: Does Apache Spark use any Dependency Injection framework?

2017-04-03 Thread Jacek Laskowski
ot; <kanth...@gmail.com> wrote: > Hi All, > > I am wondering if can get SparkConf > <https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/SparkConf.html> > object > through Dependency Injection? I currently use HOCON > <https://github.com/typesafehub/c

Does Apache Spark use any Dependency Injection framework?

2017-04-02 Thread kant kodali
Hi All, I am wondering if I can get a SparkConf <https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/SparkConf.html> object through Dependency Injection? I currently use the HOCON <https://github.com/typesafehub/config/blob/master/HOCON.md> library to store all key/value pa
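
Spark itself does not ship a DI framework, but nothing stops you from building the SparkConf out of the HOCON file and handing it to whatever container you use; a sketch with the Typesafe Config library (the config section and key names are made up for illustration):

    import com.typesafe.config.ConfigFactory
    import org.apache.spark.SparkConf
    import scala.collection.JavaConverters._

    // application.conf on the classpath might contain (illustrative):
    //   spark {
    //     app.name        = "my-app"
    //     executor.memory = "4g"
    //   }
    val sparkSection = ConfigFactory.load().getConfig("spark")

    val conf = new SparkConf()
    sparkSection.entrySet().asScala.foreach { entry =>
      conf.set("spark." + entry.getKey, entry.getValue.unwrapped().toString)
    }
    // `conf` can now be bound in Guice/Spring/etc. like any other object.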

Re: apache-spark: Converting List of Rows into Dataset Java

2017-03-30 Thread Karin Valisova
Looks like the parallelization into RDD was the right move I was omitting, JavaRDD jsonRDD = new JavaSparkContext(sparkSession.sparkContext()).parallelize(results); then I created a schema as List fields = new ArrayList(); fields.add(DataTypes.createStructField("column_name1",
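
The same shape in Scala for reference (column names follow the snippet above; the thread itself is Java): parallelize the rows, declare the schema explicitly, then call createDataFrame:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val spark = SparkSession.builder.appName("rows-to-dataset").master("local").getOrCreate()

    val results: Seq[Row] = Seq(Row("first", 1), Row("second", 2))

    val schema = StructType(Seq(
      StructField("column_name1", StringType, nullable = true),
      StructField("column_name2", IntegerType, nullable = true)
    ))

    // createDataFrame needs the rows (as an RDD or java.util.List) plus an explicit schema.
    val df = spark.createDataFrame(spark.sparkContext.parallelize(results), schema)
    df.show()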

Re: Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Pariksheet Barapatre
D[(DeviceKey, Int)] = ShuffledRDD[1] at > repartitionAndSortWithinPartitions at :30 > > > Yong > > > ------ > *From:* Pariksheet Barapatre <pbarapa...@gmail.com> > *Sent:* Wednesday, March 29, 2017 9:02 AM > *To:* user > *Subject:* Second

Re: Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Yong Zhang
la> t.repartitionAndSortWithinPartitions(new DeviceKeyPartitioner(2)) res0: org.apache.spark.rdd.RDD[(DeviceKey, Int)] = ShuffledRDD[1] at repartitionAndSortWithinPartitions at :30 Yong From: Pariksheet Barapatre <pbarapa...@gmail.com> Sent: Wednesday, March

Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Pariksheet Barapatre
Hi, <http://stackoverflow.com/questions/43038682/secondary-sort-using-apache-spark-1-6#> I am referring to the web link http://codingjunkie.net/spark-secondary-sort/ to implement secondary sort in my spark job. I have defined my key case class as case class DeviceKey(serialNum: String, eve
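
A sketch of the approach in the linked post, with field names guessed from the DeviceKey fragment above: partition only on serialNum, and let repartitionAndSortWithinPartitions pick up an Ordering that sorts by serial number and then event time:

    import org.apache.spark.{Partitioner, SparkConf, SparkContext}

    case class DeviceKey(serialNum: String, eventTime: Long)

    // Route all events of one device to the same partition.
    class DeviceKeyPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        val k = key.asInstanceOf[DeviceKey]
        // Normalize to a non-negative partition index.
        (k.serialNum.hashCode % numPartitions + numPartitions) % numPartitions
      }
    }

    object SecondarySortExample {
      // Within each partition, sort by serialNum and then eventTime.
      implicit val deviceKeyOrdering: Ordering[DeviceKey] =
        Ordering.by[DeviceKey, (String, Long)](k => (k.serialNum, k.eventTime))

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("secondary-sort").setMaster("local[2]"))
        val data = sc.parallelize(Seq(
          (DeviceKey("sn-1", 200L), 1),
          (DeviceKey("sn-1", 100L), 2),
          (DeviceKey("sn-2", 150L), 3)
        ))
        data.repartitionAndSortWithinPartitions(new DeviceKeyPartitioner(2))
            .collect()
            .foreach(println)
        sc.stop()
      }
    }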

Re: apache-spark: Converting List of Rows into Dataset Java

2017-03-28 Thread Richard Xin
Maybe you could try something like that:        SparkSession sparkSession = SparkSession     .builder()     .appName("Rows2DataSet")     .master("local")     .getOrCreate();         List results = new LinkedList();         JavaRDD jsonRDD =          

GraphFrames 0.4.0 release, with Apache Spark 2.1 support

2017-03-28 Thread Joseph Bradley
Hi Spark dev & users, For those who use GraphFrames <http://graphframes.github.io/> (DataFrame-based graphs), we have published a new release 0.4.0. It adds support for Apache Spark 2.1, with versions published for Spark 2.1 and 2.0 and for Scala 2.10 and 2.11. *D

apache-spark: Converting List of Rows into Dataset Java

2017-03-28 Thread Karin Valisova
Hello! I am running Spark on Java and bumped into a problem I can't solve or find anything helpful among answered questions, so I would really appreciate your help. I am running some calculations, creating rows for each result: List results = new LinkedList(); for(something){

apache-spark: Converting List of Rows into Dataset Java

2017-03-27 Thread Karin Valisova
Hello! I am running Spark on Java and bumped into a problem I can't solve or find anything helpful among answered questions, so I would really appreciate your help. I am running some calculations, creating rows for each result: List results = new LinkedList(); for(something){

Re: Apache Spark MLIB

2017-02-24 Thread Jon Gregg
Here's a high level overview of Spark's ML Pipelines around when it came out: https://www.youtube.com/watch?v=OednhGRp938. But reading your description, you might be able to build a basic version of this without ML. Spark has broadcast variables

Apache Spark MLIB

2017-02-23 Thread Mina Aslani
Hi, I am going to start working on anomaly detection using Spark MLIB. Please note that I have not used Spark so far. I would like to read some data and, if a user logged in from a different IP address which is not common, consider it an anomaly, similar to what apple/google does. My preferred

java-lang-noclassdeffounderror-org-apache-spark-streaming-api-java-javastreamin

2017-02-09 Thread sathyanarayanan mudhaliyar
ingContext(sc, new Duration(2000)); Map<String, String> kafkaParams = new HashMap<>(); kafkaParams.put("metadata.broker.list", "localhost:9092"); error : Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/
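
That NoClassDefFoundError usually means the spark-streaming module (and, for the Kafka code shown, the matching integration artifact) is missing from the runtime classpath; an illustrative sbt fragment, with versions as placeholders and "provided" assuming the job is launched through spark-submit:

    // build.sbt (illustrative versions)
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                % "2.2.0" % "provided",
      "org.apache.spark" %% "spark-streaming"           % "2.2.0" % "provided",
      // Kafka 0.8 direct-stream integration, matching the metadata.broker.list parameter above;
      // bundle it into the application jar (e.g. via sbt-assembly).
      "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.2.0"
    )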

[ML - Intermediate - Debug] - Loading Customized Transformers in Apache Spark raised a NullPointerException

2017-01-24 Thread Saulo Ricci
Hi, sorry if I'm being short here. I'm facing the issue described in this link <http://stackoverflow.com/questions/41844035/loading-customized-transformers-in-apache-spark-raised-a-nullpointerexception>; I would really appreciate any help from the team and am happy to talk and discuss more

Re: apache-spark doesn't work correctly with Russian alphabet

2017-01-18 Thread Sergey B.
exModestov <aleksandrmodes...@gmail.com> wrote: > I want to use Apache Spark for working with text data. There are some > Russian > symbols but Apache Spark shows me strings which look like as > "...\u0413\u041e\u0420\u041e...". What should I do for correcting them. > >

apache-spark doesn't work correctly with Russian alphabet

2017-01-18 Thread AlexModestov
I want to use Apache Spark for working with text data. There are some Russian symbols but Apache Spark shows me strings which look like "...\u0413\u041e\u0420\u041e...". What should I do to correct them? -- View this message in context: http://apache-spark-user-list.

Re: Apache Spark example split/merge shards

2017-01-16 Thread Takeshi Yamamuro
/github.com/apache/spark/blob/master/external/ > kinesis-asl/src/main/scala/org/apache/spark/examples/ > streaming/KinesisWordCountASL.scala. > > It works well. But I do have one question. Every time I split/merge Kinesis > shards from the Amazon API, I have to restart the applic

Apache Spark example split/merge shards

2017-01-16 Thread noppanit
I'm totally new to Spark and I'm trying to learn from the example. I'm following this example https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala. It works well. But I do have one question. Every time I

Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
> > Jacek > > On 29 Dec 2016 5:03 p.m., "Yin Huai" <yh...@databricks.com> wrote: > >> Hi all, >> >> Apache Spark 2.1.0 is the second release of Spark 2.x line. This release >> makes significant strides in the production readiness of Structured

Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Jacek Laskowski
... Jacek On 29 Dec 2016 5:03 p.m., "Yin Huai" <yh...@databricks.com> wrote: > Hi all, > > Apache Spark 2.1.0 is the second release of Spark 2.x line. This release > makes significant strides in the production readiness of Structured > Streaming, with added support for
