Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
I think it is one of the conceptual differences in Spark compared to other languages: there is no indexing in plain RDDs. This was the thread with Ankit: Yes. So order preservation cannot be guaranteed in the case of failure. Also not sure if partitions are ordered. Can you get the same sequence
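
A minimal sketch (spark-shell style, where sc is already in scope) of the usual workaround when an index is actually needed: zipWithIndex turns each element's current position into an explicit field, so the order survives later transformations as data rather than as an implicit guarantee:

    val data = sc.parallelize(Seq("a", "b", "c", "d"), 2)

    // zipWithIndex records each element's current position as data
    val indexed = data.zipWithIndex()                     // RDD[(String, Long)]
    val mapped  = indexed.map { case (v, i) => (v.toUpperCase, i) }
    val inOrder = mapped.sortBy(_._2).map(_._1).collect() // Array(A, B, C, D)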

Re: compile error: No classtag available while calling RDD.zip()

2017-09-13 Thread bluejoe
Thanks for your reply! Actually, it is OK when I use RDD.zip() like this: def zipDatasets(m: Dataset[String], n: Dataset[Int]) = { m.sparkSession.createDataset(m.rdd.zip(n.rdd)); } But in my project, the type of the Dataset is designated by the caller, so I introduce X, Y: def
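
A sketch of one way to make the generic version compile (an assumption, since the thread's code is cut off here): RDD.zip is declared as zip[U: ClassTag](other: RDD[U]), so a bare type parameter Y needs a ClassTag context bound alongside its Encoder, and the Encoder for the pair can be built with Encoders.tuple:

    import scala.reflect.ClassTag
    import org.apache.spark.sql.{Dataset, Encoder, Encoders}

    // Y needs a ClassTag because RDD.zip demands one for the other RDD's
    // element type; Encoders.tuple supplies the Encoder for the pair
    def zipDatasets[X: Encoder, Y: Encoder: ClassTag](m: Dataset[X], n: Dataset[Y]): Dataset[(X, Y)] = {
      implicit val pairEnc: Encoder[(X, Y)] =
        Encoders.tuple(implicitly[Encoder[X]], implicitly[Encoder[Y]])
      m.sparkSession.createDataset(m.rdd.zip(n.rdd))
    }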

Re-sharded kinesis stream starts generating warnings after kinesis shard numbers were doubled

2017-09-13 Thread Mikhailau, Alex
Has anyone seen the following warnings in the log after a Kinesis stream has been re-sharded? com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask WARN Cannot get the shard for this ProcessTask, so duplicate KPL user records in the event of resharding will not be dropped during

how sequence of chained jars in spark.(driver/executor).extraClassPath matters

2017-09-13 Thread Richard Xin
So let's say I have a chained path in spark.driver.extraClassPath/spark.executor.extraClassPath, such as /path1/*:/path2/*, and I have different versions of the same jar under those two directories. How does Spark pick the version of the jar to use? From /path1/*? Thanks.
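
On a standard JVM classpath the first matching entry wins, so with /path1/*:/path2/* the copy under /path1 should be the one loaded, though this is worth verifying rather than trusting. A quick spark-shell check, where MyClass is a placeholder for any class present in both jars:

    // prints the jar the class was actually loaded from
    // (getCodeSource can be null for JDK bootstrap classes)
    val loadedFrom = classOf[MyClass].getProtectionDomain.getCodeSource.getLocation
    println(s"MyClass loaded from: $loadedFrom")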

Should I use Dstream or Structured Stream to transfer data from source to sink and then back from sink to source?

2017-09-13 Thread kant kodali
Hi All, I am trying to read data from Kafka, insert into Mongo, and then read from Mongo and insert back into Kafka. I went with the structured streaming approach first; however, I believe I am making some naive error because my map operations are not getting invoked. The pseudo code looks like this: DataSet
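
One common cause of map operations that never run (an assumption here, since the full code isn't shown) is that structured streaming is lazy: nothing executes until a sink is attached and the query is started. A minimal sketch of the Kafka-reading side, assuming the spark-sql-kafka-0-10 package is on the classpath and with placeholder broker and topic names:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("kafka-roundtrip").getOrCreate()
    import spark.implicits._

    val source = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "input-topic")                  // placeholder
      .load()

    // this map only runs once a query is started below
    val mapped = source.selectExpr("CAST(value AS STRING)").as[String]
      .map(_.toUpperCase)

    val query = mapped.writeStream.format("console").start() // console sink for the sketch
    query.awaitTermination()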

Re: compile error: No classtag available while calling RDD.zip()

2017-09-13 Thread Anastasios Zouzias
Hi there, if it is OK for you to work with DataFrames, you can do the following: https://gist.github.com/zouzias/44723de11222535223fe59b4b0bc228c import org.apache.spark.sql.Row import org.apache.spark.sql.types.{StructField, StructType, IntegerType, LongType} val df = sc.parallelize(Seq((1.0, 2.0), (0.0,
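
The gist is truncated here; the usual pattern for zipping two DataFrames, which it presumably illustrates, is to attach a positional index to each frame and join on it. A hedged sketch, with dfA and dfB standing in for the two frames:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // pair each Row with its position via RDD.zipWithIndex, then rebuild
    // a DataFrame with the index as an extra column
    def withRowIndex(df: DataFrame): DataFrame = {
      val schema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
      val rows = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
      df.sparkSession.createDataFrame(rows, schema)
    }

    val zipped = withRowIndex(dfA).join(withRowIndex(dfB), "row_idx")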

Re: RDD order preservation through transformations

2017-09-13 Thread lucas.g...@gmail.com
I'm wondering why you need order preserved. We've had situations where keeping the source as an artificial field in the dataset was important, and I had to go through contortions to inject that (in this case the data source had no unique key). Is this similar? On 13 September 2017 at 10:46, Suzen, Mehmet

Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
But what happens if one of the partitions fails? How does fault tolerance recover elements in the other partitions? On 13 Sep 2017 18:39, "Ankit Maloo" wrote: > AFAIK, the order of an RDD is maintained within a partition for map > operations. There is no way a map operation can

Re: RDD order preservation through transformations

2017-09-13 Thread Ankit Maloo
AFAIK, the order of an RDD is maintained within a partition for map operations. There is no way a map operation can change the sequence within a partition, as the partition is local and computation happens one record at a time. On 13-Sep-2017 9:54 PM, "Suzen, Mehmet" wrote: I think the
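
A small spark-shell experiment that makes the claim concrete: tagging elements with their partition id after a map shows the per-partition order unchanged:

    val rdd = sc.parallelize(1 to 10, 2)

    // each element keeps its position within its partition through map
    rdd.map(_ * 10)
       .mapPartitionsWithIndex { (pid, it) => it.map(x => (pid, x)) }
       .collect()
       .foreach(println) // (0,10)...(0,50) then (1,60)...(1,100)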

Re: Chaining Spark Streaming Jobs

2017-09-13 Thread Sunita Arvind
Thanks for your suggestion, Vincent. I do not have much experience with Akka as such; I will explore this option. On Tue, Sep 12, 2017 at 11:01 PM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote: > What about chaining with Akka or Akka Streams and the fair scheduler? > > On 13 Sept.

Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
I think the order has no meaning in RDDs; see this post, especially the zip methods: https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method

RDD order preservation through transformations

2017-09-13 Thread johan.grande.ext
Hi, I'm a beginner using Spark with Scala and I'm having trouble understanding ordering in RDDs. I understand that RDDs are ordered (as they can be sorted) but that some transformations don't preserve order. How can I know which transformations preserve order and which don't? Regarding map,
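
As a rough rule (worth double-checking against the docs for your version): narrow, per-partition transformations such as map and filter keep the order within each partition, while anything that shuffles, such as repartition, distinct, or groupByKey, gives no ordering guarantee; sortBy re-establishes a total order. A spark-shell sketch:

    val r = sc.parallelize(1 to 8, 2)
    r.map(_ + 100).collect()       // per-partition order preserved
    r.filter(_ % 2 == 0).collect() // per-partition order preserved
    r.repartition(4).collect()     // shuffled: no ordering guarantee
    r.sortBy(identity).collect()   // total order re-imposed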

Re: Minimum cost flow problem solving in Spark

2017-09-13 Thread Michael Malak
You might be interested in "Maximum Flow implementation on Spark GraphX" done by a Colorado School of Mines grad student a couple of years ago. http://datascienceassn.org/2016-01-27-maximum-flow-implementation-spark-graphx From: Swapnil Shinde

Minimum cost flow problem solving in Spark

2017-09-13 Thread Swapnil Shinde
Hello, has anyone used Spark to solve minimum cost flow problems? I am quite new to combinatorial optimization algorithms, so any help, suggestions, or libraries would be much appreciated. Thanks, Swapnil

HiveThriftserver does not seem to respect partitions

2017-09-13 Thread Yana Kadiyska
Hi folks, I have created a table in the following manner: CREATE EXTERNAL TABLE IF NOT EXISTS rum_beacon_partition ( list of columns ) COMMENT 'User Information' PARTITIONED BY (account_id String, product String, group_id String, year String, month String, day String) STORED AS
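
A hedged first step for this kind of problem: run the query through EXPLAIN and check whether the partition filters actually appear in the plan or whether a full scan is planned (the predicate values below are placeholders):

    // look for the partition columns in the pushed filters / partition
    // pruning section of the plan output
    spark.sql("""
      EXPLAIN EXTENDED
      SELECT count(*) FROM rum_beacon_partition
      WHERE account_id = '123' AND year = '2017' AND month = '09'
    """).show(truncate = false)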

compile error: No classtag available while calling RDD.zip()

2017-09-13 Thread 沈志宏
Hello, since Dataset has no zip(...) method, I wrote the following code to zip two Datasets: def zipDatasets[X: Encoder, Y: Encoder](spark: SparkSession, m: Dataset[X], n: Dataset[Y]) = { val rdd = m.rdd.zip(n.rdd); import spark.implicits._

[Structured Streaming] Multiple sources best practice/recommendation

2017-09-13 Thread JG Perrin
Hi, I have different files being dumped on S3, and I want to ingest them and join them. What sounds better to you: one "directory" for all, or one per file format? If I have one directory for all, can I get some metadata about the file, like its name? If there are multiple directories, how can I
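
On the file-name question: the input_file_name function exposes each row's source file regardless of which directory layout you pick, which may make the single-directory option workable. A small sketch with a placeholder path:

    import org.apache.spark.sql.functions.input_file_name

    // tags every row with the full path of the file it came from
    val df = spark.read.json("s3a://my-bucket/dumps/")
      .withColumn("source_file", input_file_name())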

[Spark Dataframe] How can I write a correct filter so the Hive table partitions are pruned correctly

2017-09-13 Thread Patrick Duin
Hi Spark users, I've got an issue where I wrote a filter on a Hive table using DataFrames, and despite setting spark.sql.hive.metastorePartitionPruning=true, no partitions are being pruned. In short, doing this: table.filter("partition=x or partition=y") will result in Spark fetching all
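
A hedged workaround worth trying (not taken from the thread): express the predicate as an IN over the partition column, which is sometimes pushed down where an OR of equalities is not, and confirm with explain():

    import org.apache.spark.sql.functions.col

    // check the physical plan for a partition filter rather than a full scan
    val pruned = table.filter(col("partition").isin("x", "y"))
    pruned.explain(true)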

Re: Chaining Spark Streaming Jobs

2017-09-13 Thread vincent gromakowski
What about chaining with Akka or Akka Streams and the fair scheduler? On 13 Sept 2017 at 01:51, "Sunita Arvind" wrote: Hi Michael, I am wondering what I am doing wrong. I get an error like: Exception in thread "main" java.lang.IllegalArgumentException: Schema must be
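
The truncated exception looks like the one file-based streaming sources throw when no schema is supplied (an assumption, since the message is cut off); those sources need an explicit schema up front. A minimal sketch with placeholder fields and path:

    import org.apache.spark.sql.types.{LongType, StringType, StructType}

    // file-based readStream sources do not infer schemas by default
    val schema = new StructType()
      .add("id", LongType)
      .add("payload", StringType)

    val stream = spark.readStream.schema(schema).json("s3a://bucket/input/")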