Standalone executor memory is fixed while executor cores are load balanced between workers

2016-08-18 Thread Petr Novak
Hello, when I set spark.executor.cores e.g. to 8 cores and spark.executor.memory to 8GB, Spark can allocate more executors with fewer cores for my app, but each executor gets 8GB RAM. It is a problem because I can allocate more memory across the cluster than expected; the worst case is 8x 1-core executors.
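
To make the arithmetic concrete, here is a sketch of the reported worst case (the spark.cores.max cap and the values are illustrative; only the executor cores/memory settings come from the original mail):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.cores", "8")   // desired cores per executor
      .set("spark.executor.memory", "8g") // memory is fixed PER executor, not per core
      .set("spark.cores.max", "8")        // total cores requested for the app

    // Expected:            1 executor  x 8 cores x 8 GB =  8 GB across the cluster
    // Reported worst case: 8 executors x 1 core  x 8 GB = 64 GB across the cluster,
    // because cores are spread across workers while memory stays per-executor.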

Re: question about Reynold's talk: " The Future of Real Time"

2016-04-22 Thread Petr Novak
Hi, I understand it just as that they will provide some lower-latency interface, probably using JDBC, so that 3rd-party BI tools can integrate and query streams as if they were static datasets. If the BI tool repeats the query, the result will be updated. I don't know if BI tools are already heading

Re: Impala can't read partitioned Parquet files saved from DF.partitionBy

2016-04-21 Thread Petr Novak
I have to ask my colleague if there is any specific error, but I think it just doesn't see the files. Petr On Thu, Apr 21, 2016 at 11:54 AM, Petr Novak <oss.mli...@gmail.com> wrote: > Hello, > Impala (v2.1.0, Spark 1.6.0) can't read partitioned Parquet files saved > from DF.partitionB

Impala can't read partitioned Parquet files saved from DF.partitionBy

2016-04-21 Thread Petr Novak
Hello, Impala (v2.1.0, Spark 1.6.0) can't read partitioned Parquet files saved from DF.partitionBy (using Python). Is there any known reason, some config? Or should it generally work, in which case it is likely something wrong solely on our side? Many thanks, Petr

Standalone vs. Mesos for production installation on a smallish cluster

2016-02-26 Thread Petr Novak
Hi all, I believe the documentation used to say that Standalone mode is not for production. Either I'm wrong or it has already been removed. For a small cluster of 5-10 nodes, is Standalone recommended for production? I would like to go with Mesos, but the question is whether there is real

Re: Apache Arrow + Spark examples?

2016-02-24 Thread Petr Novak
How does Arrow collide with Tungsten and its binary in-memory format? Spark will still have to convert between them. I assume they use similar concepts/layout, hence the conversion can likely be quite efficient. Or is there a chance that the current Tungsten in-memory format would be replaced by

Spark runs only on Mesos v0.21?

2016-02-12 Thread Petr Novak
Hi all, based on the documentation: "Spark 1.6.0 is designed for use with Mesos 0.21.0 and does not require any special patches of Mesos." We are considering Mesos for our use, but this concerns me a lot. Mesos is currently on v0.27, which we need for its Volumes feature, but Spark locks us to 0.21

Re: --class always has to be specified in spark-submit even if it is defined in the jar manifest?

2015-09-25 Thread Petr Novak
Setting it programmatically doesn't work either: sparkConf.setIfMissing("class", "...Main") In my current setup, moving main to another package requires propagating the change to deploy scripts. Doesn't matter, I will find some other way. Petr On Fri, Sep 25, 2015 at 4:40 PM,

--class always has to be specified in spark-submit even if it is defined in the jar manifest?

2015-09-25 Thread Petr Novak
Otherwise it seems it tries to load from a checkpoint which I have deleted and cannot be found. Or it should work and I have something else wrong. The documentation doesn't mention the option with the jar manifest, so I assume it doesn't work this way. Many thanks, Petr

Re: --class always has to be specified in spark-submit even if it is defined in the jar manifest?

2015-09-25 Thread Petr Novak
I'm sorry. Both approaches actually work. It was something else wrong with my cluster. Petr On Fri, Sep 25, 2015 at 4:53 PM, Petr Novak <oss.mli...@gmail.com> wrote: > Setting it programmatically doesn't work either: > sparkConf.setIfMissing("class", "...Main")

Re: Kafka & Spark Streaming

2015-09-25 Thread Petr Novak
You can have offsetRanges on workers, e.g. object Something { var offsetRanges = Array[OffsetRange]() def create[F: ClassTag](stream: InputDStream[Array[Byte]]) (implicit codec: Codec[F]): DStream[F] = { stream transform { rdd => offsetRanges =
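
A self-contained sketch of the pattern the truncated snippet points at, assuming the Spark 1.x direct Kafka API; `stream` is an already-created direct stream and the variable is captured on the driver:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    var offsetRanges = Array.empty[OffsetRange]

    stream.transform { rdd =>
      // KafkaRDD implements HasOffsetRanges; grab the ranges here, before
      // further transformations lose the concrete RDD type
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.foreachRDD { rdd =>
      for (o <- offsetRanges)
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
    }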

Re: Checkpoint files are saved before stream is saved to file (rdd.toDF().write ...)?

2015-09-25 Thread Petr Novak
PM, Petr Novak <oss.mli...@gmail.com> wrote: > Many thanks Cody, it explains quite a bit. > > I had a couple of problems with checkpointing and graceful shutdown moving > from working code in Spark 1.3.0 to 1.5.0. Having InterruptedExceptions, > KafkaDirectStream couldn't initia

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-23 Thread Petr Novak
You can implement your own case class supporting more than 22 fields. It is something like: class MyRecord(val val1: String, val val2: String, ... more than 22, in this case e.g. 26) extends Product with Serializable { def canEqual(that: Any): Boolean = that.isInstanceOf[MyRecord] def
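
A compilable sketch of that approach, shortened to 3 fields for brevity; a real class would enumerate all 26 in productElement:

    class MyRecord(val val1: String, val val2: String, val val3: String)
        extends Product with Serializable {
      def canEqual(that: Any): Boolean = that.isInstanceOf[MyRecord]
      def productArity: Int = 3
      def productElement(n: Int): Any = n match {
        case 0 => val1
        case 1 => val2
        case 2 => val3
        case _ => throw new IndexOutOfBoundsException(n.toString)
      }
    }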

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-23 Thread Petr Novak
If you need to understand what the magic Product is, then google up Algebraic Data Types and learn it together with what a Sum type is. One option is http://www.stephanboyer.com/post/18/algebraic-data-types Enjoy, Petr On Wed, Sep 23, 2015 at 9:07 AM, Petr Novak <oss.mli...@gmail.com> wrote:

Checkpoint files are saved before stream is saved to file (rdd.toDF().write ...)?

2015-09-23 Thread Petr Novak
Hi, I have 2 streams and checkpointing with code based on the documentation. One stream transforms data from Kafka and saves it to a Parquet file. The other stream uses the same source stream and does updateStateByKey to compute some aggregations. There is no gracefulShutdown. Both use roughly this code
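
A rough sketch of the described setup, assuming Spark 1.5-era APIs; the broker, topic, paths, and toy parsing are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("two-streams")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/checkpoints/app") // updateStateByKey requires a checkpoint dir

    val source = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))
      .map { case (_, v) => (v, 1L) } // toy parsing: one count per message

    // Stream 1: save every batch as Parquet
    source.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      rdd.toDF("key", "count").write.parquet(s"/data/batch-${System.currentTimeMillis}")
    }

    // Stream 2: running totals over the same source
    val totals = source.updateStateByKey[Long] { (vals: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + vals.sum)
    }
    totals.print()

    ssc.start()
    ssc.awaitTermination()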

Re: passing SparkContext as parameter

2015-09-22 Thread Petr Novak
And probably the original source code https://gist.github.com/koen-dejonghe/39c10357607c698c0b04 On Tue, Sep 22, 2015 at 10:37 AM, Petr Novak <oss.mli...@gmail.com> wrote: > To complete the design pattern: > > http://stackoverflow.com/questions/30450763/spark-streaming-and-

Re: Deploying spark-streaming application on production

2015-09-22 Thread Petr Novak
AM, Petr Novak <oss.mli...@gmail.com> wrote: > Ahh, the problem is probably async ingestion into Spark receiver buffers, > hence a WAL is required, I would say. > > Petr > > On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak <oss.mli...@gmail.com> wrote: > >> If MQTT can

Re: passing SparkContext as parameter

2015-09-22 Thread Petr Novak
Padma Ch On Mon, Sep 21, 2015 at 5:10 PM, Ted Yu <yuzhih...@gmail.com> wrote: > You can use a broadcast variable for passing connection information. > Cheers
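
A sketch of Ted's suggestion; ConnectionInfo, connect(), and write() are hypothetical placeholders. Broadcast the small connection config, never the SparkContext itself:

    case class ConnectionInfo(host: String, port: Int) // hypothetical

    val connInfo = sc.broadcast(ConnectionInfo("db.example.com", 5432))

    rdd.foreachPartition { records =>
      val conn = connect(connInfo.value) // hypothetical; one connection per partition
      try records.foreach(r => write(conn, r)) // hypothetical write()
      finally conn.close()
    }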

Re: Deploying spark-streaming application on production

2015-09-22 Thread Petr Novak
to restart the Spark job. On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak <oss.mli...@gmail.com> wrote: > Ahh, the problem is probably async ingestion into Spark receiver buffers, > hence a WAL is required, I would say. > > On Tue, Sep 22, 2015 at 10:52 AM, Petr Novak <oss.mli...@gmail.com>

Re: Deploying spark-streaming application on production

2015-09-22 Thread Petr Novak
Ahh, the problem is probably async ingestion into Spark receiver buffers, hence a WAL is required, I would say. Petr On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak <oss.mli...@gmail.com> wrote: > If MQTT can be configured with a long enough timeout for ACK and can buffer > enough events w

Re: Spark does not yet support its JDBC component for Scala 2.11.

2015-09-21 Thread Petr Novak
Nice, thanks. So the note in the build instructions for 2.11 is obsolete? Or are there still some limitations? http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 On Fri, Sep 11, 2015 at 2:19 PM, Petr Novak <oss.mli...@gmail.com> wrote: > Nice, thanks. > &

Re: Deploying spark-streaming application on production

2015-09-21 Thread Petr Novak
, 2015 at 11:26 AM, Petr Novak <oss.mli...@gmail.com> wrote: > I should read my posts at least once to avoid so many typos. Hopefully you > are brave enough to read through. > > Petr > > On Mon, Sep 21, 2015 at 11:23 AM, Petr Novak <oss.mli...@gmail.com> wrote: > >

Re: passing SparkContext as parameter

2015-09-21 Thread Petr Novak
add @transient? On Mon, Sep 21, 2015 at 11:36 AM, Petr Novak <oss.mli...@gmail.com> wrote: > add @transient? > > On Mon, Sep 21, 2015 at 11:27 AM, Priya Ch <learnings.chitt...@gmail.com> > wrote: >> Hello All, >> >> How can I pass sparkCo

Re: Spark + Druid

2015-09-21 Thread Petr Novak
Great work. On Fri, Sep 18, 2015 at 6:51 PM, Harish Butani wrote: > Hi, > > I have just posted a Blog on this: > https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani > > regards, > Harish Butani. > > On Tue, Sep 1, 2015 at

Re: What's the best practice to parse JSON using spark

2015-09-21 Thread Petr Novak
Internally, Spark is using json4s with the Jackson parser, v3.2.10, AFAIK. So if you are using Scala, they should be available without adding dependencies. There is v3.2.11 already available, but adding it to my app caused a NoSuchMethod exception, so I would have to shade it. I'm simply staying on v3.2.10
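
A minimal parsing sketch with the json4s/Jackson version bundled with Spark; the Event class and the JSON string are examples, not from the thread:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    case class Event(id: Int, name: String)

    implicit val formats = DefaultFormats

    val event = parse("""{"id": 1, "name": "test"}""").extract[Event]
    // Event(1,test)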

Re: What's the best practice to parse JSON using spark

2015-09-21 Thread Petr Novak
The provided version is fine for us for now. Regards, Petr On Mon, Sep 21, 2015 at 11:06 AM, Petr Novak <oss.mli...@gmail.com> wrote: > Internally, Spark is using json4s with the Jackson parser, v3.2.10, AFAIK. So if > you are using Scala, they should be available without adding dependencies. > T

Re: Deploying spark-streaming application on production

2015-09-21 Thread Petr Novak
I think you would have to persist the events somehow if you don't want to miss them. I don't see any other option there. Either in MQTT, if it is supported there, or by routing them through Kafka. There is a WriteAheadLog in Spark, but you would have to decouple MQTT stream reading and processing into 2
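
If the receiver-plus-WAL route is taken, enabling it is a single setting plus a reliable checkpoint directory for the log segments; a sketch, assuming Spark 1.x:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/app") // WAL segments are written here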

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-21 Thread Petr Novak
I'd just need a confirmation from the community that checkpointing and graceful shutdown actually work with KafkaDirectStream on 1.5.0, so that I can look for a problem on my side. Many thanks, Petr On Sun, Sep 20, 2015 at 12:58 PM, Petr Novak <oss.mli...@gmail.com> wrote: > Hi Michal, > ye

Re: Deploying spark-streaming application on production

2015-09-21 Thread Petr Novak
I should read my posts at least once to avoid so many typos. Hopefully you are brave enough to read through. Petr On Mon, Sep 21, 2015 at 11:23 AM, Petr Novak <oss.mli...@gmail.com> wrote: > I think you would have to persist events somehow if you don't want to miss > them. I don't s

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-20 Thread Petr Novak
On 18 September 2015 at 10:28, Petr Novak <oss.mli...@gmail.com> wrote: >> It might be connected with my problems with gracefulShutdown in Spark >> 1.5.0 2.11 >> https://mail.google.com/mail/#search/petr/14fb6bd5166f9395 >> >> Maybe Ctrl+C corrupts

Re: Kafka createDirectStream issue

2015-09-20 Thread Petr Novak
val topics="first" shouldn't it be val topics = Set("first") ? On Sun, Sep 20, 2015 at 1:01 PM, Petr Novak <oss.mli...@gmail.com> wrote: > val topics="first" > > shouldn't it be val topics = Set("first") ? > > On Sat, Sep 19, 2015 a

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-18 Thread Petr Novak
It might be connected with my problems with gracefulShutdown in Spark 1.5.0 2.11 https://mail.google.com/mail/#search/petr/14fb6bd5166f9395 Maybe Ctrl+C corrupts checkpoints and breaks gracefulShutdown? Petr On Fri, Sep 18, 2015 at 4:10 PM, Petr Novak <oss.mli...@gmail.com>

Re: Spark Streaming stop gracefully doesn't return to command line after upgrade to 1.4.0 and beyond

2015-09-18 Thread Petr Novak
text) [2015-09-11 22:33:05,899] INFO Shutdown hook called (org.apache.spark.util.ShutdownHookManager) [2015-09-11 22:33:05,899] INFO Deleting directory /dfs/spark/tmp/spark-b466fc2e-9ab8-4783-87c2-485bac5c3cd6 (org.apache.spark.util.ShutdownHookManager) Thanks, Petr On Mon, Sep 14, 2015 at 3:10

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-18 Thread Petr Novak
This one is generated, I suppose, after Ctrl+C 15/09/18 14:38:25 INFO Worker: Asked to kill executor app-20150918143823-0001/0 15/09/18 14:38:25 INFO Worker: Asked to kill executor app-20150918143823-0001/0 15/09/18 14:38:25 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor]

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-18 Thread Petr Novak
I have tried it on Spark 1.3.0 2.10 and it works. The same code doesn't on Spark 1.5.0 2.11. It would be nice if anybody could try on another installation to ensure it is something wrong on my cluster. Many thanks, Petr On Fri, Sep 18, 2015 at 4:07 PM, Petr Novak <oss.mli...@gmail.com>

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-18 Thread Petr Novak
...to ensure it is not something wrong on my cluster. On Fri, Sep 18, 2015 at 4:09 PM, Petr Novak <oss.mli...@gmail.com> wrote: > I have tried it on Spark 1.3.0 2.10 and it works. The same code doesn't on > Spark 1.5.0 2.11. It would be nice if anybody could try on another &g

KafkaDirectStream can't be recovered from checkpoint

2015-09-17 Thread Petr Novak
Hi all, it throws FileBasedWriteAheadLogReader: Error reading next item, EOF reached java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.streaming.util.FileBasedWriteAheadLogReader.hasNext(FileBasedWriteAheadLogReader.scala:47) WAL is not
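
For context, checkpoint recovery is expected to go through StreamingContext.getOrCreate, roughly as in this sketch (Spark 1.x; conf and the directory are placeholders):

    val checkpointDir = "hdfs:///checkpoints/app"

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // create the direct Kafka stream and all outputs HERE, not outside,
      // otherwise they are not restored when recovering from the checkpoint
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()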

Spark does not yet support its JDBC component for Scala 2.11.

2015-09-11 Thread Petr Novak
Does it still apply to 1.5.0? What actual limitation does it mean when I switch to 2.11? No JDBC Thrift server? No JDBC DataSource? No JdbcRDD (which is already obsolete, I believe)? Anything more? Which library is the blocker to upgrading the JDBC component to 2.11? Is there any estimate when it could be

Spark Streaming stop gracefully doesn't return to command line after upgrade to 1.4.0 and beyond

2015-09-10 Thread Petr Novak
Hello, my Spark Streaming v1.3.0 code uses sys.ShutdownHookThread { ssc.stop(stopSparkContext = true, stopGracefully = true) } so that Ctrl+C on the command line stops it. It returned to the command line after it finished the batch, but it doesn't with v1.4.0-v1.5.0. Was the behaviour or required
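
A likely relevant detail: Spark Streaming 1.4 added its own shutdown hook, and the supported way to get a graceful stop on Ctrl+C is the configuration below rather than a custom sys.ShutdownHookThread (a sketch; worth verifying against the 1.4 release notes):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.streaming.stopGracefullyOnShutdown", "true")
    // With this set, Spark's own hook performs the equivalent of
    // ssc.stop(stopSparkContext = true, stopGracefully = true) on shutdown.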

Spark-shell throws Hive error when SQLContext.parquetFile, v1.3

2015-09-10 Thread Petr Novak
Hello, sqlContext.parquetFile(dir) throws the exception "Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient". The strange thing is that on the second attempt to open the file it is successful: try { sqlContext.parquetFile(dir) } catch { case e: Exception =>
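
A sketch of the retry workaround described, using scala.util.Try; it papers over the metastore error rather than fixing it:

    import scala.util.{Failure, Success, Try}

    val df = Try(sqlContext.parquetFile(dir)) match {
      case Success(d) => d
      case Failure(_) => sqlContext.parquetFile(dir) // second attempt reportedly succeeds
    }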

Re: DataFrame Parquet Writer doesn't keep schema

2015-08-26 Thread Petr Novak
The same as https://mail.google.com/mail/#label/Spark%2Fuser/14f64c75c15f5ccd Please follow the discussion there. On Tue, Aug 25, 2015 at 5:02 PM, Petr Novak oss.mli...@gmail.com wrote: Hi all, when I read parquet files with required fields aka nullable=false they are read correctly. Then I

DataFrame Parquet Writer doesn't keep schema

2015-08-25 Thread Petr Novak
Hi all, when I read Parquet files with required fields, aka nullable=false, they are read correctly. Then I save them (df.write.parquet) and read them again, and all my fields are saved and read as optional, aka nullable=true. This means I suddenly have files with incompatible schemas. This happens on
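
A small sketch to inspect the reported behaviour; paths are placeholders and the read API assumes Spark 1.4+:

    val in = sqlContext.read.parquet("/data/in") // fields arrive with nullable=false
    in.schema.fields.foreach(f => println(s"${f.name} nullable=${f.nullable}"))

    in.write.parquet("/data/out")

    val out = sqlContext.read.parquet("/data/out")
    out.schema.fields.foreach(f => println(s"${f.name} nullable=${f.nullable}")) // reportedly all true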

Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-18 Thread Petr Novak
{ (_: Nothing) = } } } Many thanks for any advice, I'm sure it's a noob question. Petr On Mon, Aug 17, 2015 at 1:12 PM, Petr Novak oss.mli...@gmail.com wrote: Or can I generally create a new RDD from a transformation and enrich its partitions with some metadata so that I could copy OffsetRanges into my

Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-17 Thread Petr Novak
Or can I generally create a new RDD from a transformation and enrich its partitions with some metadata, so that I could copy OffsetRanges into my new RDD in the DStream? On Mon, Aug 17, 2015 at 1:08 PM, Petr Novak oss.mli...@gmail.com wrote: Hi all, I need to transform a KafkaRDD into a new stream

Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-17 Thread Petr Novak
Hi all, I need to transform a KafkaRDD into a new stream of deserialized case classes. I want to use the new stream to save it to files and to perform additional transformations on it. To save it, I want to use offsets in the filenames, hence I need OffsetRanges in the transformed RDD. But KafkaRDD is
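
One way to keep the offsets without keeping the KafkaRDD type is to read them first inside foreachRDD and only then transform; a sketch with a hypothetical deserialize():

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val name = ranges
        .map(r => s"${r.topic}-${r.partition}-${r.fromOffset}-${r.untilOffset}")
        .mkString("_")
      rdd.map(bytes => deserialize(bytes)) // hypothetical deserialize() into case classes
         .saveAsTextFile(s"/data/events-$name")
    }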

spark-streaming-kafka_2.11 not available yet?

2015-05-26 Thread Petr Novak
Hello, I would like to switch from Scala 2.10 to 2.11 for Spark app development. It seems that the only thing blocking me is a missing spark-streaming-kafka_2.11 Maven package. Any plan to add it, or am I just blind? Many thanks, Vladimir

Re: Does Spark Driver work with HDFS in HA mode

2014-09-29 Thread Petr Novak
Thank you. HADOOP_CONF_DIR was missing. On Wed, Sep 24, 2014 at 4:48 PM, Matt Narrell matt.narr...@gmail.com wrote: Yes, this works. Make sure you have HADOOP_CONF_DIR set on your Spark machines. mn On Sep 24, 2014, at 5:35 AM, Petr Novak oss.mli...@gmail.com wrote: Hello, if our

Does Spark Driver work with HDFS in HA mode

2014-09-24 Thread Petr Novak
Hello, if our Hadoop cluster is configured with HA and fs.defaultFS points to a namespace instead of a namenode hostname - hdfs://namespace_name/ - then our Spark job fails with an exception. Is there anything to configure, or is it not implemented? Exception in thread main
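
For reference, once HADOOP_CONF_DIR points at client configs that define the nameservice, an HA URI resolves without naming a specific NameNode; a sketch reusing the namespace from the mail:

    // hdfs-site.xml under HADOOP_CONF_DIR must define dfs.nameservices=namespace_name,
    // the per-NameNode RPC addresses, and the client failover proxy provider.
    val rdd = sc.textFile("hdfs://namespace_name/data/input")
    println(rdd.count())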