Re: REST Structured Streaming Sink

2020-07-01 Thread Burak Yavuz
Well, the difference is, a technical user writes the UDF and a non-technical user may use this built-in thing (misconfigure it) and shoot themselves in the foot. On Wed, Jul 1, 2020, 6:40 PM Andrew Melo wrote: > On Wed, Jul 1, 2020 at 8:13 PM Burak Yavuz wrote: > > > > I'm

Re: REST Structured Streaming Sink

2020-07-01 Thread Burak Yavuz
I'm not sure having a built-in sink that allows you to DDOS servers is the best idea either. foreachWriter is typically used for such use cases, not foreachBatch. It's also pretty hard to guarantee exactly-once, rate limiting, etc. Best, Burak On Wed, Jul 1, 2020 at 5:54 PM Holden Karau wrote:

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Burak Yavuz
Hi Rishi, That is exactly why Trigger.Once was created for Structured Streaming. The way we look at streaming is that it doesn't have to be always real time, or 24-7 always on. We see streaming as a workflow that you have to repeat indefinitely. See this blog post for more details!

Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Burak Yavuz
Just set `failOnDataLoss=false` as an option in readStream? On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote: > Hi all, > > I have a spark structured streaming app that is consuming from a kafka > topic with retention set up. Sometimes I face an issue where my query has > not finished
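A minimal sketch of that suggestion, with placeholder broker and topic names:

```
// With failOnDataLoss disabled, the query falls back to the earliest available
// offset instead of failing when the requested offsets have aged out of Kafka's retention.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "my_topic")
  .option("failOnDataLoss", "false")
  .load()
```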

Re: ForEachBatch collecting batch to driver

2020-03-11 Thread Burak Yavuz
foreachBatch gives you the micro-batch as a DataFrame, which is distributed. If you don't call collect on that DataFrame, it shouldn't have any memory implications on the Driver. On Tue, Mar 10, 2020 at 3:46 PM Ruijing Li wrote: > Hi all, > > I’m curious on how foreachbatch works in spark
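A short sketch of the point above, with a placeholder output path and checkpoint location:

```
// foreachBatch hands each micro-batch to the function as a regular DataFrame that
// stays distributed; only an explicit collect() would pull it onto the driver.
df.writeStream
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
    batch.write.mode("append").parquet("/data/output")   // runs on the executors
    // batch.collect()                                   // this, by contrast, would materialize the batch on the driver
  }
  .option("checkpointLocation", "/chk/output")
  .start()
```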

Re: Best way to read batch from Kafka and Offsets

2020-02-04 Thread Burak Yavuz
write mode but > my thinking was at that point it was easier to read from kafka through > batch mode. > > Thanks, > RJ > > On Tue, Feb 4, 2020 at 4:20 PM Burak Yavuz wrote: > >> Hi Ruijing, >> >> Why do you not want to use structured streaming here?

Re: Best way to read batch from Kafka and Offsets

2020-02-04 Thread Burak Yavuz
Hi Ruijing, Why do you not want to use structured streaming here? This is exactly why structured streaming + Trigger.Once was built, just so that you don't build that solution yourself. You also get exactly once semantics if you use the built in sinks. Best, Burak On Mon, Feb 3, 2020 at 3:15 PM
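A sketch of the Trigger.Once pattern being recommended, with placeholder broker, topic, and paths:

```
import org.apache.spark.sql.streaming.Trigger

// A Trigger.Once query processes whatever arrived in Kafka since the last run and
// then stops; the checkpoint tracks offsets, and the built-in file sink gives
// exactly-once output.
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/chk/events")
  .trigger(Trigger.Once())
  .start("/data/events")

query.awaitTermination()
```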

Re: Structured Streaming & Enrichment Broadcasts

2019-11-18 Thread Burak Yavuz
If you store the data that you're going to broadcast as a Delta table (see delta.io) and perform a stream-batch (where your Delta table is the batch) join, it will auto-update once the table receives any updates. Best, Burak On Mon, Nov 18, 2019, 6:21 AM Bryan Jeffrey wrote: > Hello. > > We're
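A rough sketch of the stream-batch join described above, with placeholder paths and an assumed join column "id":

```
// The static side is a Delta table; each micro-batch reads its latest snapshot,
// so updates to the table are picked up without restarting the query.
val dim = spark.read.format("delta").load("/tables/dim")          // "batch" side

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

val enriched = events.join(dim, Seq("id"))
```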

Re: Delta with intelligent upsert

2019-11-02 Thread Burak Yavuz
You can just add the target partitioning filter to your MERGE or UPDATE condition, e.g. MERGE INTO target USING source ON target.key = source.key AND target.year = year(current_date()) ... Best, Burak On Thu, Oct 31, 2019, 10:15 PM ayan guha wrote: > > Hi > > we have a scenario where we have
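A fuller sketch of that statement, with illustrative table and column names and explicit action clauses:

```
// The extra predicate on target.year lets Delta prune the MERGE to the current
// year's partition instead of scanning the whole target table.
spark.sql("""
  MERGE INTO target
  USING source
    ON target.key = source.key AND target.year = year(current_date())
  WHEN MATCHED THEN
    UPDATE SET target.value = source.value
  WHEN NOT MATCHED THEN
    INSERT (key, year, value) VALUES (source.key, source.year, source.value)
""")
```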

Re: Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Burak Yavuz
ion is using offsets that are no longer available in >>> Kafka it will reset to earliest or latest offset available in Kafka and the >>> next request made to Kafka should provide proper data. But in case for all >>> micro-batches the offsets are getting reseted and the b

Re: Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Burak Yavuz
Do you have rate limiting set on your stream? It may be that you are trying to process records that have passed the retention period within Kafka. On Wed, Sep 11, 2019 at 2:39 PM Charles vinodh wrote: > > Hi, > > I am trying to run a spark application ingesting data from Kafka using the > Spark

Re: Static partitioning in partitionBy()

2019-05-07 Thread Burak Yavuz
It depends on the data source. Delta Lake (https://delta.io) allows you to do it with the .option("replaceWhere", "c = c1"). With other file formats, you can write directly into the partition directory (tablePath/c=c1), but you lose atomicity. On Tue, May 7, 2019, 6:36 AM Shubham Chaurasia
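A sketch of both options, with a placeholder path and partition column:

```
// Delta's replaceWhere overwrites only the rows matching the predicate, atomically.
df.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "c = 'c1'")
  .save("/tables/target")

// Non-atomic alternative for plain file formats: write into the partition directory.
df.where("c = 'c1'").drop("c")
  .write.mode("overwrite")
  .parquet("/tables/target/c=c1")
```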

Re: Spark Structured Streaming using S3 as data source

2018-08-26 Thread Burak Yavuz
Yes, the checkpoint makes sure that you start off from where you left off. On Sun, Aug 26, 2018 at 2:22 AM sherif98 wrote: > I have data that is continuously pushed to multiple S3 buckets. I want to > set > up a structured streaming application that uses the S3 buckets as the data > source and

Re: Kafka backlog - spark structured streaming

2018-07-30 Thread Burak Yavuz
If you don't set rate limiting through `maxOffsetsPerTrigger`, Structured Streaming will always process until the end of the stream. So number of records waiting to be processed should be 0 at the start of each trigger. On Mon, Jul 30, 2018 at 8:03 AM, Kailash Kalahasti <
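A small sketch of the rate-limiting option, with placeholder broker and topic names:

```
// Caps how many Kafka offsets each trigger consumes; without it, every micro-batch
// runs to the current end of the stream.
val limited = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "10000")   // at most ~10k records per micro-batch
  .load()
```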

Re: Structured Streaming: distinct (Spark 2.2)

2018-03-19 Thread Burak Yavuz
I believe the docs are out of date regarding distinct. The behavior should be as follows: - Distinct should be applied across triggers - In order to prevent the state from growing indefinitely, you need to add a watermark - If you don't have a watermark, but your key space is small, that's
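A minimal sketch of deduplication with a watermark, with placeholder column names:

```
// The watermark lets Spark eventually drop state kept for old keys.
val deduped = events
  .withWatermark("eventTime", "1 hour")
  .dropDuplicates("eventId", "eventTime")
```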

Re: Infer JSON schema in structured streaming Kafka.

2017-12-11 Thread Burak Yavuz
In Spark 2.2, you can read from Kafka in batch mode, and then use the json reader to infer schema: val df = spark.read.format("kafka")... .select($"value".cast("string")).as[String] val json = spark.read.json(df) val schema = json.schema While the above should be slow (since you're reading almost all data

Re: Reload some static data during struct streaming

2017-11-13 Thread Burak Yavuz
I think if you don't cache the jdbc table, then it should auto-refresh. On Mon, Nov 13, 2017 at 1:21 PM, spark receiver wrote: > Hi > > I’m using struct streaming(spark 2.2) to receive Kafka msg ,it works > great. The thing is I need to join the Kafka message with a

Re: Getting Message From Structured Streaming Format Kafka

2017-11-02 Thread Burak Yavuz
Hi Daniel, Several things: 1) Your error seems to suggest you're using a different version of Spark and a different version of the sql-kafka connector. Could you make sure they are on the same Spark version? 2) With Structured Streaming, you may remove everything related to a StreamingContext.

Re: Spark Structured Streaming not connecting to Kafka using kerberos

2017-10-16 Thread Burak Yavuz
Hi Darshan, How are you creating your kafka stream? Can you please share the options you provide? spark.readStream.format("kafka") .option(...) // all these please .load() On Sat, Oct 14, 2017 at 1:55 AM, Darshan Pandya wrote: > Hello, > > I'm using Spark 2.1.0

Re: Structured streaming coding question

2017-09-20 Thread Burak Yavuz
).foreach(new > KafkaSink("hello2")).start(); > > query1.awaitTermination(); > query2.awaitTermination(); > sparkSession.streams().awaitAnyTermination(); > > > > > > On Tue, Sep 19, 2017 at 11:48 PM, Burak Yavuz <brk...@gmail.com> wrote: > >> H

Re: Structured streaming coding question

2017-09-20 Thread Burak Yavuz
Hey Kant, That won't work either. Your second query may fail, and as long as your first query is running, you will not know. Put this as the last line instead: spark.streams.awaitAnyTermination() On Tue, Sep 19, 2017 at 10:11 PM, kant kodali wrote: > Looks like my problem
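A small sketch of the recommended structure:

```
// Start every query first, then block once on the StreamingQueryManager so a
// failure in either query surfaces immediately.
val query1 = df1.writeStream.format("console").start()
val query2 = df2.writeStream.format("console").start()

spark.streams.awaitAnyTermination()   // returns (or rethrows) as soon as any query stops
```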

Re: [Structured Streaming] Trying to use Spark structured streaming

2017-09-11 Thread Burak Yavuz
Hi Eduardo, What you have written out is to output counts "as fast as possible" for windows of 5 minute length and with a sliding window of 1 minute. So for a record at 10:13, you would get that record included in the count for 10:09-10:14, 10:10-10:15, 10:11-10:16, 10:12-10:17, 10:13-10:18.
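A sketch of the windowed aggregation being described, with a placeholder event-time column:

```
import org.apache.spark.sql.functions.{col, window}

// 5-minute windows sliding every 1 minute, so each event is counted in five
// overlapping windows; results are emitted as the counts are updated.
val counts = events
  .groupBy(window(col("eventTime"), "5 minutes", "1 minute"))
  .count()
```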

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread Burak Yavuz
>> Spark 2.1.1. I may be able to upgrade to 2.2.0 if that makes life easier. >> >> >> >> On Tue, Aug 29, 2017 at 5:25 PM, Burak Yavuz <brk...@gmail.com> wrote: >> >>> Hey TD, >>> >>> If I understood the question correctly, your so

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread Burak Yavuz
Hey TD, If I understood the question correctly, your solution wouldn't return the exact answer, since it also groups by destination. I would say the easiest solution would be to use flatMapGroupsWithState, where you: .groupByKey(_.train) and keep in state the row with the maximum time. On
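A rough sketch of that suggestion, with an assumed Event case class and field names:

```
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Assumed record shape; in the real job this comes from the stream's schema.
case class Event(train: String, dest: String, time: java.sql.Timestamp)

import spark.implicits._

// Keep only the row with the maximum timestamp per train in state, and emit it.
val latestPerTrain = events.as[Event]
  .groupByKey(_.train)
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout()) {
    (train: String, rows: Iterator[Event], state: GroupState[Event]) =>
      val newest = (rows ++ state.getOption.iterator).maxBy(_.time.getTime)
      state.update(newest)
      Iterator(newest)
  }
```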

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Burak Yavuz
Hi Jacek, The way the memory sink is architected at the moment is that it either appends a row (append/update mode) or replaces all rows (complete mode). When a user specifies a checkpoint location, the guarantee Structured Streaming provides is that output sinks will not lose data and will be

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-28 Thread Burak Yavuz
Hi Priyank, You may register them as temporary tables to use across language boundaries. Python: df = spark.readStream... # Python logic df.createOrReplaceTempView("tmp1") Scala: val df = spark.table("tmp1") df.writeStream .foreach(...) On Fri, Jul 28, 2017 at 3:06 PM, Priyank Shrivastava

Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Burak Yavuz
I think Kant meant time windowing functions. You can use `window(TIMESTAMP, '24 hours', '24 hours')` On Tue, Jul 25, 2017 at 9:26 AM, Keith Chapman wrote: > Here is an example of a window lead function, > > select *, lead(someColumn1) over ( partition by someColumn2

Re: to_json not working with selectExpr

2017-07-16 Thread Burak Yavuz
Hi Matthew, Which Spark version are you using? The expression `to_json` was added in 2.2 with this commit: https://github.com/apache/spark/commit/0cdcf9114527a2c359c25e46fd6556b3855bfb28 Best, Burak On Sun, Jul 16, 2017 at 6:24 PM, Matthew cao wrote: > Hi all, > I just
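A one-line sketch of the usage in question:

```
// In Spark 2.2+, to_json is available as a SQL expression, so selectExpr can pack
// all columns into a single JSON string column (e.g. as a Kafka value).
val out = df.selectExpr("to_json(struct(*)) AS value")
```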

Re: Querying on Deeply Nested JSON Structures

2017-07-16 Thread Burak Yavuz
Have you checked out this blog post? https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html Shows tools and tips on how to work with nested data. You can access data through `field1.field2.field3` and such with JSON. Best, Burak On Sat,

Re: How save streaming aggregations on 'Structured Streams' in parquet format ?

2017-06-19 Thread Burak Yavuz
Hi Kaniska, In order to use append mode with aggregations, you need to set an event time watermark (using `withWatermark`). Otherwise, Spark doesn't know when to output an aggregation result as "final". Best, Burak On Mon, Jun 19, 2017 at 11:03 AM, kaniska Mandal
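A sketch of the watermark + append-mode pattern, with placeholder columns and paths:

```
import org.apache.spark.sql.functions.{col, window}

// The event-time watermark tells Spark when a windowed aggregate is final, which
// append mode needs before it can write the result to parquet.
val agg = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()

agg.writeStream
  .format("parquet")
  .outputMode("append")
  .option("checkpointLocation", "/chk/agg")
  .start("/data/agg")
```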

Re: Spark SQL within a DStream map function

2017-06-16 Thread Burak Yavuz
Do you really need to create a DStream from the original messaging queue? Can't you just read them in a while loop or something on the driver? On Fri, Jun 16, 2017 at 1:01 PM, Mike Hugo wrote: > Hello, > > I have a web application that publishes JSON messages on to a messaging

Re: Structured Streaming from Parquet

2017-05-25 Thread Burak Yavuz
Hi Paul, From what you're describing, it seems that stream1 is possibly generating tons of small files and stream2 is OOMing because it tries to maintain an in-memory list of files. Some notes/questions: 1. Parquet files are splittable, therefore having large parquet files shouldn't be a

Re: couple naive questions on Spark Structured Streaming

2017-05-22 Thread Burak Yavuz
Hi Kant, > > > 1. Can we use Spark Structured Streaming for stateless transformations > just like we would do with DStreams or Spark Structured Streaming is only > meant for stateful computations? > Of course you can do stateless transformations. Any map, filter, select, type of transformation

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Burak Yavuz
tly the same schema, but > one side support null and the other doesn't, this exception (in union > dataset) will be thrown? > > > > 2017-05-08 16:41 GMT-03:00 Burak Yavuz <brk...@gmail.com>: > >> I also want to add that generally these may be caused by the >> `nu

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Burak Yavuz
I also want to add that generally these may be caused by the `nullability` field in the schema. On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu wrote: > This is because RDD.union doesn't check the schema, so you won't see the > problem unless you run RDD and hit

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Burak Yavuz
Hi Everett, IIRC we added unionAll in Spark 2.0 which is the same implementation as rdd union. The union in DataFrames with Spark 2.0 does deduplication, and that's why you should be seeing the slowdown. Best, Burak On Thu, Mar 16, 2017 at 4:14 PM, Everett Anderson

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Burak Yavuz
le. How can I > do that? > > 2017-02-06 14:25 GMT-08:00 Burak Yavuz <brk...@gmail.com>: > >> Hi Egor, >> >> Structured Streaming handles all of its metadata itself, which files are >> actually valid, etc. You may use the "create table" syntax in SQL

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Burak Yavuz
Hi Egor, Structured Streaming handles all of its metadata itself, which files are actually valid, etc. You may use the "create table" syntax in SQL to treat it like a hive table, but it will handle all partitioning information in its own metadata log. Is there a specific reason that you want to

Re: eager? in dataframe's checkpoint

2017-01-31 Thread Burak Yavuz
ount happens most likely rdd.isCheckpointed > will be false, and the count will be on the rdd before it was checkpointed. > what is the benefit of that? > > > On Thu, Jan 26, 2017 at 11:19 PM, Burak Yavuz <brk...@gmail.com> wrote: > >> Hi, >> >> One of the goal

Re: eager? in dataframe's checkpoint

2017-01-26 Thread Burak Yavuz
Hi, One of the goals of checkpointing is to cut the RDD lineage. Otherwise you run into StackOverflowExceptions. If you eagerly checkpoint, you basically cut the lineage there, and the next operations all depend on the checkpointed DataFrame. If you don't checkpoint, you continue to build the
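A small sketch of the difference being described, with a placeholder checkpoint directory:

```
// Eager checkpointing materializes the DataFrame immediately and cuts its lineage;
// with eager = false the cut only happens when a later action finally evaluates the plan.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
val snapshotted = df.checkpoint(eager = true)   // downstream operations depend on the saved data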

Re: Java heap error during matrix multiplication

2017-01-26 Thread Burak Yavuz
Hi, Have you tried creating more column blocks? BlockMatrix matrix = cmatrix.toBlockMatrix(100, 100); for example. Is your data randomly spread out, or do you generally have clusters of data points together? On Wed, Jan 25, 2017 at 4:23 AM, Petr Shestov wrote: > Hi

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
deshpande <deshpandesh...@gmail.com> wrote: > Thanks Burak. But with BloomFilter, won't I be getting a false poisitve? > > On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz <brk...@gmail.com> wrote: > >> I noticed that 1 wouldn't be a problem, because you'll save t

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
gave me 2 solutions > 1. Bloom filter --> problem in repopulating the bloom filter on restarts > 2. keeping the state of the unique ids > > Please elaborate on 2. > > > > On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz <brk...@gmail.com> wrote: > >> I don't

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
> Thanks > > On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz <brk...@gmail.com> wrote: > >> Off the top of my head... (Each may have its own issues) >> >> If upstream you add a uniqueId to all your records, then you may use a >> BloomFilter to appro

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
Off the top of my head... (Each may have its own issues) If upstream you add a uniqueId to all your records, then you may use a BloomFilter to approximate if you've seen a row before. The problem I can see with that approach is how to repopulate the bloom filter on restarts. If you are certain

Re: Spark Streaming - join streaming and static data

2016-12-06 Thread Burak Yavuz
Hi Daniela, This is trivial with Structured Streaming. If your Kafka cluster is 0.10.0 or above, you may use Spark 2.0.2 to create a Streaming DataFrame from Kafka, and then also create a DataFrame using the JDBC connection, and you may join those. In Spark 2.1, there's support for a function
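A sketch of the stream-static join being described, with placeholder connection details and an assumed join column "id":

```
// Streaming side from Kafka, static side read over JDBC, joined per micro-batch.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS id")

val reference = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "reference_data")
  .option("user", "user")
  .option("password", "secret")
  .load()

val enriched = stream.join(reference, Seq("id"))
```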

Re: How to cause a stage to fail (using spark-shell)?

2016-06-18 Thread Burak Yavuz
Hi Jacek, Can't you simply have a mapPartitions task throw an exception or something? Are you trying to do something more esoteric? Best, Burak On Sat, Jun 18, 2016 at 5:35 AM, Jacek Laskowski wrote: > Hi, > > Following up on this question, is a stage considered failed only

Re: Any NLP lib could be used on spark?

2016-04-19 Thread Burak Yavuz
A quick search on spark-packages returns: http://spark-packages.org/package/databricks/spark-corenlp. You may need to build it locally and add it to your session by --jars. On Tue, Apr 19, 2016 at 10:47 AM, Gavin Yue wrote: > Hey, > > Want to try the NLP on the spark.

Re: Calculation of histogram bins and frequency in Apache spark 1.6

2016-02-23 Thread Burak Yavuz
You could use the Bucketizer transformer in Spark ML. Best, Burak On Tue, Feb 23, 2016 at 9:13 AM, Arunkumar Pillai wrote: > Hi > Is there any predefined method to calculate histogram bins and frequency > in spark. Currently I take range and find bins then count
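A sketch of the Bucketizer approach, with illustrative columns and splits:

```
import org.apache.spark.ml.feature.Bucketizer

// Bucketizer maps each value to a bin defined by explicit splits; grouping on the
// bin column then yields the histogram frequencies.
val bucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bucket")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, 20.0, Double.PositiveInfinity))

val histogram = bucketizer.transform(df).groupBy("bucket").count()
```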

Re: Using SPARK packages in Spark Cluster

2016-02-12 Thread Burak Yavuz
Hello Gourav, The packages need to be loaded BEFORE you start the JVM, therefore you won't be able to add packages dynamically in code. You should use the --packages with pyspark before you start your application. One option is to add a `conf` that will load some packages if you are constantly

Re: Redirect Spark Logs to Kafka

2016-02-01 Thread Burak Yavuz
You can use the KafkaLog4jAppender ( https://github.com/apache/kafka/blob/trunk/log4j-appender/src/main/java/org/apache/kafka/log4jappender/KafkaLog4jAppender.java ). Best, Burak On Mon, Feb 1, 2016 at 12:20 PM, Ashish Soni wrote: > Hi All , > > Please let me know how we

Re: Guidelines for writing SPARK packages

2016-02-01 Thread Burak Yavuz
Thanks for the reply David, just wanted to fix one part of your response: > If you > want to register a release for your package you will also need to push > the artifacts for your package to Maven central. > It is NOT necessary to push to Maven Central in order to make a release. There are

Re: Optimized way to multiply two large matrices and save output using Spark and Scala

2016-01-13 Thread Burak Yavuz
BlockMatrix.multiply is the suggested method of multiplying two large matrices. Is there a reason that you didn't use BlockMatrices? You can load the matrices and convert to and from RowMatrix. If it's in sparse format (i, j, v), then you can also use the CoordinateMatrix to load, BlockMatrix to
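A sketch of the suggested path, with placeholder input paths and an illustrative block size:

```
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

// Load sparse (i, j, v) entries as CoordinateMatrices, convert to BlockMatrices, multiply.
def loadEntries(path: String): RDD[MatrixEntry] =
  sc.textFile(path).map { line =>
    val Array(i, j, v) = line.split(",")
    MatrixEntry(i.toLong, j.toLong, v.toDouble)
  }

val a = new CoordinateMatrix(loadEntries("/data/A")).toBlockMatrix(1024, 1024).cache()
val b = new CoordinateMatrix(loadEntries("/data/B")).toBlockMatrix(1024, 1024).cache()
val product = a.multiply(b)   // requires a.numCols == b.numRows
```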

Re: number of blocks in ALS/recommendation API

2015-12-17 Thread Burak Yavuz
Copying the first part from the scaladoc: " This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Burak Yavuz
I've noticed this happening when there was some dependency conflicts, and it is super hard to debug. It seems that the KinesisClientLibrary version in Spark 1.5.2 is 1.3.0, but it is 1.2.1 in Spark 1.5.1. I feel like that seems to be the problem... Brian, did you verify that it works with the

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Burak Yavuz
> wrote: > >> Yes, it worked in the 1.6 branch as of commit >> db5165246f2888537dd0f3d4c5a515875c7358ed. That makes it much less >> serious of an issue, although it would be nice to know what the root cause >> is to avoid a regression. >> >> On Thu, De

Re: Spark Streaming idempotent writes to HDFS

2015-11-23 Thread Burak Yavuz
Not sure if it would be the most efficient, but maybe you can think of the filesystem as a key value store, and write each batch to a sub-directory, where the directory name is the batch time. If the directory already exists, then you shouldn't write it. Then you may have a following batch job

Re: large, dense matrix multiplication

2015-11-13 Thread Burak Yavuz
Hi, The BlockMatrix multiplication should be much more efficient on the current master (and will be available with Spark 1.6). Could you please give that a try if you have the chance? Thanks, Burak On Fri, Nov 13, 2015 at 10:11 AM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote:

Re: Spark Packages Configuration Not Found

2015-11-11 Thread Burak Yavuz
Hi Jakob, > As another, general question, are spark packages the go-to way of extending spark functionality? Definitely. There are ~150 Spark Packages out there in spark-packages.org. I use a lot of them in every day Spark work. The number of released packages have steadily increased rate over

Re: spark-submit --packages using different resolver

2015-10-03 Thread Burak Yavuz
Hi Jerry, The --packages feature doesn't support private repositories right now. However, in the case of s3, maybe it might work. Could you please try using the --repositories flag and provide the address: `$ spark-submit --packages my:awesome:package --repositories

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-09-09 Thread Burak Yavuz
ES PLEASE! > > :))) > > On Tue, Aug 25, 2015 at 1:57 PM, Burak Yavuz <brk...@gmail.com> wrote: > >> Hmm. I have a lot of code on the local linear algebra operations using >> Spark's Matrix and Vector representations >> done for https://issues.apache.org/jira/bro

Re: Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread Burak Yavuz
Or you can just call describe() on the dataframe? In addition to min-max, you'll also get the mean, and count of non-null and non-NA elements as well. Burak On Fri, Aug 28, 2015 at 10:09 AM, java8964 java8...@hotmail.com wrote: Or RDD.max() and RDD.min() won't work for you? Yong
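A one-line sketch, with placeholder column names:

```
// describe() reports count, mean, stddev, min, and max for the selected numeric columns.
df.describe("price", "quantity").show()
```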

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Burak Yavuz
Hmm. I have a lot of code on the local linear algebra operations using Spark's Matrix and Vector representations done for https://issues.apache.org/jira/browse/SPARK-6442. I can make a Spark package with that code if people are interested. Best, Burak On Tue, Aug 25, 2015 at 10:54 AM, Kristina

Re: Unable to catch SparkContext methods exceptions

2015-08-24 Thread Burak Yavuz
textFile is a lazy operation. It doesn't evaluate until you call an action on it, such as .count(). Therefore, you won't catch the exception there. Best, Burak On Mon, Aug 24, 2015 at 9:09 AM, Roberto Coluccio roberto.coluc...@gmail.com wrote: Hello folks, I'm experiencing an unexpected
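A small sketch of where the try/catch needs to sit, with a placeholder path:

```
// textFile only records the lineage, so a missing path is reported at the first action.
val lines = sc.textFile("/path/that/may/not/exist")   // no exception thrown here
try {
  println(s"line count: ${lines.count()}")            // failure surfaces here
} catch {
  case e: Exception => println(s"could not read input: ${e.getMessage}")
}
```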

Re: Unable to catch SparkContext methods exceptions

2015-08-24 Thread Burak Yavuz
to evaluate the actual result), and there I can observe and catch the exception. Even considering Spark's laziness, shouldn't I catch the exception while occurring in the try..catch statement that encloses the textFile invocation? Best, Roberto On Mon, Aug 24, 2015 at 7:38 PM, Burak Yavuz brk

Re: Convert mllib.linalg.Matrix to Breeze

2015-08-21 Thread Burak Yavuz
:50 PM, Burak Yavuz wrote: Matrix.toBreeze is a private method. MLlib matrices have the same structure as Breeze Matrices. Just create a new Breeze matrix like this https://github.com/apache/spark/blob/43e0135421b2262cbb0e06aae53523f663b4f959/mllib/src/main/scala/org/apache/spark/mllib/linalg

Re: Convert mllib.linalg.Matrix to Breeze

2015-08-20 Thread Burak Yavuz
Matrix.toBreeze is a private method. MLlib matrices have the same structure as Breeze Matrices. Just create a new Breeze matrix like this https://github.com/apache/spark/blob/43e0135421b2262cbb0e06aae53523f663b4f959/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L270 . Best,
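A minimal sketch of the conversion being suggested:

```
import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.Matrix

// Matrix.toArray returns the values in column-major order, which is also Breeze's
// layout, so a Breeze DenseMatrix can be built directly from it.
def toBreeze(m: Matrix): BDM[Double] =
  new BDM(m.numRows, m.numCols, m.toArray)
```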

Re: Creating Spark DataFrame from large pandas DataFrame

2015-08-20 Thread Burak Yavuz
If you would like to try using spark-csv, please use `pyspark --packages com.databricks:spark-csv_2.11:1.2.0` You're missing a dependency. Best, Burak On Thu, Aug 20, 2015 at 1:08 PM, Charlie Hack charles.t.h...@gmail.com wrote: Hi, I'm new to spark and am trying to create a Spark df from a

Re: Unit Testing

2015-08-13 Thread Burak Yavuz
I would recommend this spark package for your unit testing needs ( http://spark-packages.org/package/holdenk/spark-testing-base). Best, Burak On Thu, Aug 13, 2015 at 5:51 AM, jay vyas jayunit100.apa...@gmail.com wrote: yes there certainly is, so long as eclipse has the right plugins and so on

Re: Cannot Import Package (spark-csv)

2015-08-03 Thread Burak Yavuz
Hi, there was this issue for Scala 2.11. https://issues.apache.org/jira/browse/SPARK-7944 It should be fixed on master branch. You may be hitting that. Best, Burak On Sun, Aug 2, 2015 at 9:06 PM, Ted Yu yuzhih...@gmail.com wrote: I tried the following command on master branch: bin/spark-shell

Re: Cannot Import Package (spark-csv)

2015-08-03 Thread Burak Yavuz
In addition, you do not need to use --jars with --packages. --packages will get the jar for you. Best, Burak On Mon, Aug 3, 2015 at 9:01 AM, Burak Yavuz brk...@gmail.com wrote: Hi, there was this issue for Scala 2.11. https://issues.apache.org/jira/browse/SPARK-7944 It should be fixed

Re: Which directory contains third party libraries for Spark

2015-07-28 Thread Burak Yavuz
Hey Stephen, In case these libraries exist on the client as a form of maven library, you can use --packages to ship the library and all it's dependencies, without building an uber jar. Best, Burak On Tue, Jul 28, 2015 at 10:23 AM, Marcelo Vanzin van...@cloudera.com wrote: Hi Stephen, There

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Burak Yavuz
Hi Jonathan, I believe calling persist with StorageLevel.NONE doesn't do anything. That's why the unpersist has an if statement before it. Could you give more information about your setup please? Number of cores, memory, number of partitions of ratings_train? Thanks, Burak On Wed, Jul 22, 2015

Re: LinearRegressionWithSGD Outputs NaN

2015-07-21 Thread Burak Yavuz
Hi, Could you please decrease your step size to 0.1, and also try 0.01? You could also try running L-BFGS, which doesn't have step size tuning, to get better results. Best, Burak On Tue, Jul 21, 2015 at 2:59 AM, Naveen nav...@formcept.com wrote: Hi , I am trying to use

Re: RowId unique key for Dataframes

2015-07-21 Thread Burak Yavuz
Would monotonicallyIncreasingId https://github.com/apache/spark/blob/d4c7a7a3642a74ad40093c96c4bf45a62a470605/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L637 work for you? Best, Burak On Tue, Jul 21, 2015 at 4:55 PM, Srikanth srikanth...@gmail.com wrote: Hello, I'm
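A small sketch; in current Spark the function is spelled monotonically_increasing_id (it was monotonicallyIncreasingId in the 1.x API):

```
import org.apache.spark.sql.functions.monotonically_increasing_id

// Adds a 64-bit id that is unique across the DataFrame (partition id in the upper
// bits), though not consecutive.
val withRowId = df.withColumn("rowId", monotonically_increasing_id())
```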

Re: Strange behavoir of pyspark with --jars option

2015-07-15 Thread Burak Yavuz
Hi, I believe the HiveContext uses a different class loader. It then falls back to the system class loader if it can't find the classes in the context class loader. The system class loader contains the classpath passed through --driver-class-path and spark.executor.extraClassPath. The JVM is

Re: MLlib LogisticRegressionWithLBFGS error

2015-07-15 Thread Burak Yavuz
Hi, Is this in LibSVM format? If so, the indices should be sorted in increasing order. It seems like they are not sorted. Best, Burak On Tue, Jul 14, 2015 at 7:31 PM, Vi Ngo Van ngovi.se@gmail.com wrote: Hi All, I've met a issue with MLlib when i use LogisticRegressionWithLBFGS my

Re: creating a distributed index

2015-07-15 Thread Burak Yavuz
Hi Swetha, IndexedRDD is available as a package on Spark Packages http://spark-packages.org/package/amplab/spark-indexedrdd. Best, Burak On Tue, Jul 14, 2015 at 5:23 PM, swetha swethakasire...@gmail.com wrote: Hi Ankur, Is IndexedRDD available in Spark 1.4.0? We would like to use this in

Re: Running mllib from R in Spark 1.4

2015-07-15 Thread Burak Yavuz
Hi, There is no MLlib support in SparkR in 1.4. There will be some support in 1.5. You can check these JIRAs for progress: https://issues.apache.org/jira/browse/SPARK-6805 https://issues.apache.org/jira/browse/SPARK-6823 Best, Burak On Wed, Jul 15, 2015 at 6:00 AM, madhu phatak

Re: To access elements of a org.apache.spark.mllib.linalg.Vector

2015-07-14 Thread Burak Yavuz
Hi Dan, You could zip the indices with the values if you like. ``` val sVec = sparseVector(1).asInstanceOf[ org.apache.spark.mllib.linalg.SparseVector] val map = sVec.indices.zip(sVec.values).toMap ``` Best, Burak On Tue, Jul 14, 2015 at 12:23 PM, Dan Dong dongda...@gmail.com wrote: Hi,

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote: Hi, How are you running K-Means? What is your k? What is the dimension of your dataset (columns)? Which Spark version are you using? Thanks, Burak On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com wrote

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
, Nirmal Fernando nir...@wso2.com wrote: I'm using; org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20); Cpu cores: 8 (using default Spark conf thought) On partitions, I'm not sure how to find that. On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote: What

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
Hi, How are you running K-Means? What is your k? What is the dimension of your dataset (columns)? Which Spark version are you using? Thanks, Burak On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com wrote: Hi, For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot

Re: Unit tests of spark application

2015-07-10 Thread Burak Yavuz
I can +1 Holden's spark-testing-base package. Burak On Fri, Jul 10, 2015 at 12:23 PM, Holden Karau hol...@pigscanfly.ca wrote: Somewhat biased of course, but you can also use spark-testing-base from spark-packages.org as a basis for your unittests. On Fri, Jul 10, 2015 at 12:03 PM, Daniel

Re: How to ignore features in mllib

2015-07-09 Thread Burak Yavuz
If you use the Pipelines Api with DataFrames, you select which columns you would like to train on using the VectorAssembler. While using the VectorAssembler, you can choose not to select some features if you like. Best, Burak On Thu, Jul 9, 2015 at 10:38 AM, Arun Luthra arun.lut...@gmail.com
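A short sketch of feature selection via VectorAssembler, with placeholder column names:

```
import org.apache.spark.ml.feature.VectorAssembler

// Only the listed columns go into the feature vector, so leaving a column out of
// setInputCols effectively ignores that feature.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income"))   // e.g. "zipCode" deliberately omitted
  .setOutputCol("features")

val prepared = assembler.transform(df)
```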

Re: spark-submit can not resolve spark-hive_2.10

2015-07-07 Thread Burak Yavuz
spark-hive is excluded when using --packages, because it can be included in the spark-assembly by adding -Phive during mvn package or sbt assembly. Best, Burak On Tue, Jul 7, 2015 at 8:06 AM, Hao Ren inv...@gmail.com wrote: I want to add spark-hive as a dependence to submit my job, but it

Re: Spark 1.4 MLLib Bug?: Multiclass Classification requirement failed: sizeInBytes was negative

2015-07-03 Thread Burak Yavuz
How many partitions do you have? It might be that one partition is too large, and there is Integer overflow. Could you double your number of partitions? Burak On Fri, Jul 3, 2015 at 4:41 AM, Danny kont...@dannylinden.de wrote: hi, i want to run a multiclass classification with 390 classes

Re: coalesce on dataFrame

2015-07-01 Thread Burak Yavuz
You can use df.repartition(1) in Spark 1.4. See here https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1396 . Best, Burak On Wed, Jul 1, 2015 at 3:05 AM, Olivier Girardot ssab...@gmail.com wrote: PySpark or Spark (scala) ? When you use

Re: Can Dependencies Be Resolved on Spark Cluster?

2015-06-30 Thread Burak Yavuz
, Jun 29, 2015 at 11:33 PM, SLiZn Liu sliznmail...@gmail.com wrote: Hi Burak, Is `--package` flag only available for maven, no sbt support? On Tue, Jun 30, 2015 at 2:26 PM Burak Yavuz brk...@gmail.com wrote: You can pass `--packages your:comma-separated:maven-dependencies` to spark submit

Re: Can Dependencies Be Resolved on Spark Cluster?

2015-06-30 Thread Burak Yavuz
You can pass `--packages your:comma-separated:maven-dependencies` to spark submit if you have Spark 1.3 or greater. Best regards, Burak On Mon, Jun 29, 2015 at 10:46 PM, SLiZn Liu sliznmail...@gmail.com wrote: Hey Spark Users, I'm writing a demo with Spark and HBase. What I've done is

Re: Understanding accumulator during transformations

2015-06-24 Thread Burak Yavuz
Hi Wei, For example, when a straggler executor gets killed in the middle of a map operation and its task is restarted at a different instance, the accumulator will be updated more than once. Best, Burak On Wed, Jun 24, 2015 at 1:08 PM, Wei Zhou zhweisop...@gmail.com wrote: Quoting from Spark

Re: Understanding accumulator during transformations

2015-06-24 Thread Burak Yavuz
the transformation ended up updating accumulator more than once? Best, Wei 2015-06-24 13:23 GMT-07:00 Burak Yavuz brk...@gmail.com: Hi Wei, For example, when a straggler executor gets killed in the middle of a map operation and it's task is restarted at a different instance, the accumulator

Re: Confusion matrix for binary classification

2015-06-22 Thread Burak Yavuz
Hi, In Spark 1.4, you may use DataFrame.stat.crosstab to generate the confusion matrix. This would be very simple if you are using the ML Pipelines Api, and are working with DataFrames. Best, Burak On Mon, Jun 22, 2015 at 4:21 AM, CD Athuraliya cdathural...@gmail.com wrote: Hi, I am looking
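A one-line sketch, with placeholder column names:

```
// A crosstab of label vs. prediction is the confusion matrix as counts.
predictions.stat.crosstab("label", "prediction").show()
```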

Re: SparkSubmit with Ivy jars is very slow to load with no internet access

2015-06-18 Thread Burak Yavuz
Hey Nathan, I like the first idea better. Let's see what others think. I'd be happy to review your PR afterwards! Best, Burak On Thu, Jun 18, 2015 at 9:53 PM, Nathan McCarthy nathan.mccar...@quantium.com.au wrote: Hey, Spark Submit adds maven central spark bintray to the ChainResolver

Re: --packages Failed to load class for data source v1.4

2015-06-14 Thread Burak Yavuz
Hi Don, This seems related to a known issue, where the classpath on the driver is missing the related classes. This is a bug in py4j as py4j uses the System Classloader rather than Spark's Context Classloader. However, this problem existed in 1.3.0 as well, therefore I'm curious whether it's the

Re: How to read avro in SparkR

2015-06-13 Thread Burak Yavuz
Hi, Not sure if this is it, but could you please try com.databricks.spark.avro instead of just avro. Thanks, Burak On Jun 13, 2015 9:55 AM, Shing Hing Man mat...@yahoo.com.invalid wrote: Hi, I am trying to read a avro file in SparkR (in Spark 1.4.0). I started R using the following.

Re: foreach plus accumulator Vs mapPartitions performance

2015-05-21 Thread Burak Yavuz
Or you can simply use `reduceByKeyLocally` if you don't want to worry about implementing accumulators and such, and assuming that the reduced values will fit in memory of the driver (which you are assuming by using accumulators). Best, Burak On Thu, May 21, 2015 at 2:46 PM, ben
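A small sketch of the suggestion:

```
// reduceByKeyLocally merges the values for each key and returns a Map on the driver,
// so the reduced result must fit in driver memory (the same assumption an
// accumulator-based approach makes).
val pairs: org.apache.spark.rdd.RDD[(String, Int)] =
  sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val totals: scala.collection.Map[String, Int] = pairs.reduceByKeyLocally(_ + _)
```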

Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Burak Yavuz
Could you please open a JIRA for it? The maxBins input is missing for the Python Api. Is it possible if you can use the current master? In the current master, you should be able to use trees with the Pipeline Api and DataFrames. Best, Burak On Wed, May 20, 2015 at 2:44 PM, Don Drake

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Burak Yavuz
I think this Spark Package may be what you're looking for! http://spark-packages.org/package/tresata/spark-sorted Best, Burak On Mon, May 4, 2015 at 12:56 PM, Imran Rashid iras...@cloudera.com wrote: oh wow, that is a really interesting observation, Marco Jerry. I wonder if this is worth

Re: DataFrame filter referencing error

2015-04-30 Thread Burak Yavuz
Is new a reserved word for MySQL? On Thu, Apr 30, 2015 at 2:41 PM, Francesco Bigarella francesco.bigare...@gmail.com wrote: Do you know how I can check that? I googled a bit but couldn't find a clear explanation about it. I also tried to use explain() but it doesn't really help. I still
