[Spark2] huge BloomFilters

2016-11-02 Thread ponkin
Hi, I need to build a huge BloomFilter with 150 million or even more insertions. import org.apache.spark.util.sketch.BloomFilter val bf = spark.read.avro("/hdfs/path").filter("some == 1").stat.bloomFilter("id", 15000, 0.01) if I use keys for serialization implicit val bfEncoder =
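A minimal sketch of building such a filter, assuming the spark-avro package is on the classpath and using the 150 million figure from the message as the expected number of insertions; the path and column names are the placeholders from the post:

    import com.databricks.spark.avro._
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.util.sketch.BloomFilter

    val spark = SparkSession.builder().appName("huge-bloom-filter").getOrCreate()

    // expectedNumItems is a Long, so very large filters are allowed; fpp = 0.01 means ~1% false positives
    val bf: BloomFilter = spark.read.avro("/hdfs/path")
      .filter("some == 1")
      .stat
      .bloomFilter("id", 150000000L, 0.01)

    println(s"expected false positive probability: ${bf.expectedFpp()}")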

Run spark-shell inside Docker container against remote YARN cluster

2016-10-27 Thread ponkin
=hdfs://spark-assembly-1.6.2-hadoop2.2.0.jar? Here is my Dockerfile: https://gist.github.com/ponkin/cac0a071e7fe75ca7c390b7388cf4f91

Re: Spark 2.0 regression when querying very wide data frames

2016-08-20 Thread ponkin
I generated a CSV file with 300 columns, and it seems to work fine with Spark DataFrames (Spark 2.0). I think you need to post your issue to the spark-cassandra-connector community (https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user), if you are using it.
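For reference, a minimal sketch of that kind of check; the column count matches the message, the output path is arbitrary:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("wide-df-check").getOrCreate()

    // build a 300-column DataFrame, round-trip it through CSV, and read it back
    val wide = spark.range(10).selectExpr((1 to 300).map(i => s"id as c$i"): _*)
    wide.write.mode("overwrite").option("header", "true").csv("/tmp/wide_csv")

    val back = spark.read.option("header", "true").csv("/tmp/wide_csv")
    println(back.columns.length) // 300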

Re: Spark 2.0 regression when querying very wide data frames

2016-08-20 Thread ponkin
Did you try to load a wide CSV or Parquet file, for example? Maybe the problem is in spark-cassandra-connector, not Spark itself. Are you using spark-cassandra-connector (https://github.com/datastax/spark-cassandra-connector)?

Re: Spark 2.0 regression when querying very wide data frames

2016-08-20 Thread ponkin
Hi, what kind of data source do you have? CSV, Avro, Parquet?

[KafkaRDD]: rdd.cache() does not seem to work

2016-01-11 Thread ponkin
Hi, here is my use case: I have a Kafka topic. The job is fairly simple: it reads the topic and saves the data to several HDFS paths. I create the RDD with the following code: val r = KafkaUtils.createRDD[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](context, kafkaParams, range) Then I am trying to
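A self-contained sketch of that createRDD call, assuming the 0.8 direct API (spark-streaming-kafka); the broker address, offsets, and output paths are hypothetical:

    import kafka.serializer.DefaultDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val context = new SparkContext(new SparkConf().setAppName("kafka-to-hdfs"))

    val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092") // hypothetical broker
    val range = Array(OffsetRange("my-topic", 0, fromOffset = 0L, untilOffset = 100000L)) // hypothetical offsets

    val r = KafkaUtils.createRDD[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
      context, kafkaParams, range)

    r.cache() // the thread reports that caching this RDD does not behave as expected
    r.saveAsObjectFile("/hdfs/path/one") // hypothetical output paths
    r.saveAsObjectFile("/hdfs/path/two")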

[streaming] KafkaUtils.createDirectStream - how to start streaming from checkpoints?

2015-11-24 Thread ponkin
checkpoints)? Is it possible at all, or do I need to store offsets in an external datastore? Alexey Ponkin
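The external-datastore route mentioned in the question means passing the saved offsets back via fromOffsets. A minimal sketch, assuming the 0.8 direct API, an existing StreamingContext ssc, and a hypothetical broker and stored offset:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092") // hypothetical broker
    // offsets previously saved by the job itself (e.g. in ZooKeeper, HBase, a database...)
    val fromOffsets = Map(TopicAndPartition("my-topic", 0) -> 12345L)

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))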

[streaming] reading Kafka direct stream throws kafka.common.OffsetOutOfRangeException

2015-09-30 Thread Alexey Ponkin
Hi, I have a simple spark-streaming job (8 executors with 1 core each, on an 8-node cluster) that reads from a Kafka topic (3 brokers with 8 partitions) and saves to Cassandra. The problem is that when I increase the number of incoming messages in the topic, the job starts to fail with

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Koeninger" <c...@koeninger.org>: >  Can you provide more info (what version of spark, code example)? > >  On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin <alexey.pon...@ya.ru> wrote: >>  Hi, >> >>  I have an application with 2 streams, which are joined together.

[streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Hi, I have an application with 2 streams which are joined together. Stream1 is a simple DStream (relatively small batch chunks); Stream2 is a windowed DStream (with a duration of, for example, 60 seconds). Stream1 and Stream2 are Kafka direct streams. The problem is that, according to the logs, window
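A rough sketch of that topology, assuming the 0.8 direct API and an existing StreamingContext ssc; topic names and broker list are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092") // hypothetical broker

    // Stream1: plain direct stream with small batches
    val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("topic1"))

    // Stream2: another direct stream, windowed over 60 seconds
    val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("topic2"))
      .window(Seconds(60))

    // both streams are (key, value) pairs, so they can be joined by key
    val joined = stream1.join(stream2)
    joined.foreachRDD(rdd => println(s"joined records in this batch: ${rdd.count()}"))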

[streaming] Using org.apache.spark.Logging will silently break task execution

2015-09-06 Thread Alexey Ponkin
Hi, I have the following code: object MyJob extends org.apache.spark.Logging { ... val source: DStream[SomeType] ... source.foreachRDD { rdd => logInfo(s"""+++ForEachRDD+++""") rdd.foreachPartition { partitionOfRecords => logInfo(s"""+++ForEachPartition+++""") } } I
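The snippet from the message, laid out for readability; SomeType and the source stream are placeholders from the original post (String stands in for SomeType here):

    import org.apache.spark.streaming.dstream.DStream

    object MyJob extends org.apache.spark.Logging {

      def process(source: DStream[String]): Unit = {
        source.foreachRDD { rdd =>
          logInfo("+++ForEachRDD+++")                 // runs on the driver
          rdd.foreachPartition { partitionOfRecords =>
            logInfo("+++ForEachPartition+++")         // runs inside the tasks on the executors
          }
        }
      }
    }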

[spark-streaming] New directStream API reads topic's partitions sequentially. Why?

2015-09-04 Thread ponkin
Hi, I am trying to read a Kafka topic with the new directStream method in KafkaUtils. I have a Kafka topic with 8 partitions. I am running the streaming job on YARN with 8 executors, 1 core each. I noticed that Spark reads all of the topic's partitions in one executor sequentially, which is obviously

Re: Spark shell and StackOverFlowError

2015-09-01 Thread ponkin
Hi, I cannot reproduce your error on Spark 1.2.1. There is not enough information. What are your command-line arguments when you start spark-shell? What data are you reading? etc.

Re: Spark executor OOM issue on YARN

2015-09-01 Thread ponkin
Hi, can you please post your stack trace with the exceptions, and also your command-line arguments for spark-submit?

Re: query avro hive table in spark sql

2015-08-27 Thread ponkin
Can you select something from this table using Hive? And also, could you post the Spark code which leads to this exception?

Re: error accessing vertexRDD

2015-08-27 Thread ponkin
Check the permissions for the user which runs spark-shell. "Permission denied" means that you do not have permissions on /tmp.

Re: use GraphX with Spark Streaming

2015-08-25 Thread ponkin
Hi, sure you can. StreamingContext has the property def sparkContext: SparkContext (see the docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext). Think of a DStream, the main abstraction in Spark Streaming, as a sequence of RDDs. Each DStream can be
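A minimal sketch of the idea, assuming a socket source with comma-separated edge lines purely for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("graphx-streaming"), Seconds(10))
    val sc  = ssc.sparkContext // the underlying SparkContext exposed by StreamingContext

    // hypothetical source: each line is "srcId,dstId"
    val edges = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(src, dst) = line.split(",").map(_.trim.toLong)
      Edge(src, dst, 1)
    }

    // build a GraphX graph from the RDD backing each micro-batch
    edges.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val graph = Graph.fromEdges(rdd, defaultValue = 0)
        println(s"vertices in this batch: ${graph.vertices.count()}")
      }
    }

    ssc.start()
    ssc.awaitTermination()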

Re: Spark Number of Partitions Recommendations

2015-07-29 Thread ponkin
Hi Rahul, where did you see such a recommendation? I personally define the number of partitions with the following formula: partitions = nextPrimeNumberAbove( K * (--num-executors * --executor-cores) ), where nextPrimeNumberAbove(x) is a prime number greater than x, and K is a multiplier used to calculate the start
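A small sketch of that formula in code; the executor numbers and K below are made-up values:

    // smallest prime strictly greater than x
    def nextPrimeNumberAbove(x: Int): Int = {
      def isPrime(n: Int): Boolean = n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)
      Iterator.from(x + 1).find(isPrime).get
    }

    val numExecutors  = 8   // --num-executors  (hypothetical)
    val executorCores = 4   // --executor-cores (hypothetical)
    val k             = 3   // job-specific multiplier

    val partitions = nextPrimeNumberAbove(k * numExecutors * executorCores)
    // partitions == 97 for the values above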

Re: How do we control output part files created by Spark job?

2015-07-07 Thread ponkin
Hi, did you try to reduce the number of executors and cores? Usually num-executors * executor-cores = number of parallel tasks, so you can reduce the number of parallel tasks on the command line, e.g. ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master yarn-cluster \ --num-executors