[Spark2] huge BloomFilters

2016-11-02 Thread ponkin
Hi,
I need to build a huge BloomFilter with 150 million or even more insertions:
import org.apache.spark.util.sketch.BloomFilter
val bf = spark.read.avro("/hdfs/path").filter("some == 1").stat.bloomFilter("id", 15000, 0.01)

If I use Kryo for serialization,
implicit val bfEncoder = org.apache.spark.sql.Encoders.kryo[BloomFilter]
and then try to save this filter to HDFS, the serialized filter is more than 1 GB.

Is there any way to compress the BloomFilter?
Does anybody have experience with such huge Bloom filters?

In general I need to check some condition in a Spark Streaming application,
and I was thinking of using Bloom filters for that.
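For reference, a minimal sketch (not the code from the job above; the paths and column names are placeholders, and the ~150 million expected insertions is taken from the prose, not the snippet) of persisting such a filter without Kryo, using BloomFilter's own writeTo/readFrom against HDFS streams:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

object BloomFilterToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bloom-filter-to-hdfs").getOrCreate()

    // Build the filter with DataFrameStatFunctions (placeholder path and column names;
    // requires the spark-avro package on the classpath).
    val df = spark.read.format("com.databricks.spark.avro").load("/hdfs/path")
    val bf: BloomFilter = df.filter("some = 1")
      .stat.bloomFilter("id", 150000000L, 0.01) // ~150M expected items, 1% fpp

    // Persist the filter's compact bit-array representation directly to HDFS.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val out = fs.create(new Path("/tmp/bloom-filter.bin"))
    try bf.writeTo(out) finally out.close()

    // Later (e.g. in the streaming job) the filter can be loaded back:
    val in = fs.open(new Path("/tmp/bloom-filter.bin"))
    val restored = try BloomFilter.readFrom(in) finally in.close()
    println(restored.mightContain(42L)) // probe type must match the "id" column type

    spark.stop()
  }
}

Once the filter is well loaded its bit array is close to random, so general-purpose compression gains little; the effective levers on size are the expected-item count and the false-positive probability passed to stat.bloomFilter.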







Run spark-shell inside Docker container against remote YARN cluster

2016-10-27 Thread ponkin
Hi,
Maybe someone already has experience building a Docker image for Spark?
I want to build a Docker image with Spark inside, but configured against a remote
YARN cluster.
I have already created an image with Spark 1.6.2 inside.
But when I run
spark-shell --master yarn --deploy-mode client --driver-memory 32G --executor-memory 32G --executor-cores 8
inside Docker, I get the following exception:
Diagnostics: java.io.FileNotFoundException: File file:/usr/local/spark/lib/spark-assembly-1.6.2-hadoop2.2.0.jar does not exist

Any suggestions?
Do I need to upload the spark-assembly jar to HDFS and set
spark.yarn.jar=hdfs://spark-assembly-1.6.2-hadoop2.2.0.jar ?

Here is my Dockerfile
https://gist.github.com/ponkin/cac0a071e7fe75ca7c390b7388cf4f91
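For reference, a minimal sketch (Spark 1.6-era API; the HDFS location is a placeholder) of pointing a YARN-client application at an assembly jar already uploaded to HDFS via the spark.yarn.jar property; with spark-shell the same property would normally be supplied through spark-defaults.conf or a --conf flag rather than set in code:

import org.apache.spark.{SparkConf, SparkContext}

object YarnClientFromDocker {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("docker-yarn-client")
      .setMaster("yarn-client")
      // Assembly uploaded beforehand, e.g. with `hdfs dfs -put`, to a shared location
      // (the path below is a placeholder).
      .set("spark.yarn.jar",
           "hdfs:///user/spark/share/lib/spark-assembly-1.6.2-hadoop2.2.0.jar")

    val sc = new SparkContext(conf)
    // Trivial job just to verify that executors really start on the remote cluster.
    println(sc.parallelize(1 to 10).sum())
    sc.stop()
  }
}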






Re: Spark 2.0 regression when querying very wide data frames

2016-08-20 Thread ponkin
I generated a CSV file with 300 columns, and it seems to work fine with Spark
DataFrames (Spark 2.0).
I think you need to post your issue to the spark-cassandra-connector community
(https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user)
- if you are using it.






Re: Spark 2.0 regression when querying very wide data frames

2016-08-20 Thread ponkin
Did you try to load a wide CSV or Parquet file, for example? Maybe the problem is in
spark-cassandra-connector rather than in Spark itself? Are you using
spark-cassandra-connector (https://github.com/datastax/spark-cassandra-connector)?






Re: Spark 2.0 regression when querying very wide data frames

2016-08-20 Thread ponkin
Hi,
What kind of datasource do you have? CSV, Avro, Parquet?






[KafkaRDD]: rdd.cache() does not seem to work

2016-01-11 Thread ponkin
Hi,

Here is my use case :
I have a Kafka topic. The job is fairly simple - it reads the topic and saves the data to
several HDFS paths.
I create the RDD with the following code:
 val r = KafkaUtils.createRDD[Array[Byte],Array[Byte],DefaultDecoder,DefaultDecoder](context, kafkaParams, range)
Then I try to cache that RDD with
 r.cache()
and then save it to several HDFS locations.
But it seems that the KafkaRDD fetches data from the Kafka broker every time I call
saveAsNewAPIHadoopFile.

How can I cache data from Kafka in memory?

P.S. When I do a repartition it seems to work properly (Kafka is read only once),
but then Spark stores the shuffled data locally.
Is it possible to keep the data in memory?
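For reference, a minimal sketch of the single-read pattern in question (Spark 1.x Kafka API; broker, topic, offsets and output paths are placeholders): decode the messages, persist the RDD, force it to materialize once, then reuse the cached blocks for every save.

import kafka.serializer.DefaultDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object CacheKafkaRddOnce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kafka-rdd-cache"))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val range = Array(OffsetRange("my-topic", 0, 0L, 1000L))

    val r = KafkaUtils
      .createRDD[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](sc, kafkaParams, range)
      .map { case (_, v) => new String(v, "UTF-8") }  // decode before caching
      .persist(StorageLevel.MEMORY_AND_DISK)          // spill instead of recomputing (= re-reading Kafka)

    r.count()                           // force the one and only read from the brokers
    r.saveAsTextFile("/hdfs/path/one")  // these saves now reuse the cached blocks
    r.saveAsTextFile("/hdfs/path/two")
    r.unpersist()
    sc.stop()
  }
}

Note that cache() only marks the RDD for caching; any partition that is evicted or not yet computed is recomputed from the source, which for a KafkaRDD means another trip to the brokers, so a spill-to-disk storage level plus an eager action is the usual way to guarantee Kafka is read only once.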





[streaming] KafkaUtils.createDirectStream - how to start streming from checkpoints?

2015-11-24 Thread ponkin
Hi,

When I create a stream with KafkaUtils.createDirectStream I can explicitly define
the position - "largest" or "smallest" - from which to read the topic.
What if I have previous checkpoints (in HDFS, for example) with offsets, and I
want to start reading from the last checkpoint?
In the source code of KafkaUtils.createDirectStream I see the following:

 val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)

 (for {
   topicPartitions <- kc.getPartitions(topics).right
   leaderOffsets <- (if (reset == Some("smallest")) {
     kc.getEarliestLeaderOffsets(topicPartitions)
   } else {
     kc.getLatestLeaderOffsets(topicPartitions)
   }).right
 ...

So it turns out that I have no option to start reading from checkpoints (and
offsets)? Am I right?
How can I force Spark to start reading from saved offsets (in checkpoints)? Is
it possible at all, or do I need to store the offsets in an external datastore?
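For reference, a minimal sketch (Spark 1.x API; broker, topic and offset values are placeholders) of the createDirectStream overload that takes explicit fromOffsets, which is the documented way to resume from offsets you have saved yourself in an external store:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object ResumeFromSavedOffsets {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("resume-from-offsets"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Offsets previously saved to an external store (HDFS file, ZooKeeper, a DB, ...).
    val fromOffsets: Map[TopicAndPartition, Long] = Map(
      TopicAndPartition("my-topic", 0) -> 12345L,
      TopicAndPartition("my-topic", 1) -> 67890L)

    val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)

    stream.foreachRDD { rdd =>
      val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd, then write each range's untilOffset back to the external store
    }

    ssc.start()
    ssc.awaitTermination()
  }
}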

Alexey Ponkin





[streaming] reading Kafka direct stream throws kafka.common.OffsetOutOfRangeException

2015-09-30 Thread Alexey Ponkin
Hi

I have a simple Spark Streaming job (8 executors with 1 core each, on an 8-node cluster) -
it reads from a Kafka topic (3 brokers, 8 partitions) and saves to Cassandra.
The problem is that when I increase the number of incoming messages in the topic, the
job starts to fail with kafka.common.OffsetOutOfRangeException.
The job fails starting from 100 events per second.

Thanks in advance




Re: [streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Ok.
Spark 1.4.1 on YARN.

Here is my application.
I have 4 different Kafka topics (different object streams):

type Edge = (String, String)

val a = KafkaUtils.createDirectStream[...](sc, "A", params).filter( nonEmpty ).map( toEdge )
val b = KafkaUtils.createDirectStream[...](sc, "B", params).filter( nonEmpty ).map( toEdge )
val c = KafkaUtils.createDirectStream[...](sc, "C", params).filter( nonEmpty ).map( toEdge )

val u = a union b union c

val source = u.window(Seconds(600), Seconds(10))

val z = KafkaUtils.createDirectStream[...](sc, "Z", params).filter( nonEmpty ).map( toEdge )

val joinResult = source.rightOuterJoin( z )
joinResult.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // save to result topic in kafka
  }
}

The 'window' operation in the code above keeps growing constantly,
no matter how many events appear in the corresponding Kafka topics.

But if I change one line from

val source = u.window(Seconds(600), Seconds(10))

to

val partitioner = ssc.sparkContext.broadcast(new HashPartitioner(8))

val source = u.transform(_.partitionBy(partitioner.value)).window(Seconds(600), Seconds(10))

everything works perfectly.

Perhaps the problem was in WindowedDStream.

I forced it to use PartitionerAwareUnionRDD (partitionBy with the same partitioner)
instead of UnionRDD.

Nonetheless, I did not see any hints about such behaviour in the docs.
Is it a bug or absolutely normal behaviour?





08.09.2015, 17:03, "Cody Koeninger" <c...@koeninger.org>:
>  Can you provide more info (what version of spark, code example)?
>
>  On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin <alexey.pon...@ya.ru> wrote:
>>  Hi,
>>
>>  I have an application with 2 streams, which are joined together.
>>  Stream1 is a simple DStream (relatively small batch chunks).
>>  Stream2 is a windowed DStream (with a duration of, for example, 60 seconds).
>>
>>  Stream1 and Stream2 are Kafka direct streams.
>>  The problem is that according to the logs the window operation is constantly
>> increasing (screen: http://en.zimagez.com/zimage/screenshot2015-09-0816-06-55.php).
>>  I also see a gap in the processing window (screen:
>> http://en.zimagez.com/zimage/screenshot2015-09-0816-10-22.php);
>>  in the logs there are no events in that period.
>>  So what happens in that gap, and why is the window constantly increasing?
>>
>>  Thank you in advance
>>




[streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Hi,

I have an application with 2 streams, which are joined together.
Stream1 is a simple DStream (relatively small batch chunks).
Stream2 is a windowed DStream (with a duration of, for example, 60 seconds).

Stream1 and Stream2 are Kafka direct streams.
The problem is that according to the logs the window operation is constantly
increasing (screen: http://en.zimagez.com/zimage/screenshot2015-09-0816-06-55.php).
I also see a gap in the processing window (screen:
http://en.zimagez.com/zimage/screenshot2015-09-0816-10-22.php);
in the logs there are no events in that period.
So what happens in that gap, and why is the window constantly increasing?

Thank you in advance




[streaming] Using org.apache.spark.Logging will silently break task execution

2015-09-06 Thread Alexey Ponkin
Hi,

I have the following code:

object MyJob extends org.apache.spark.Logging {
  ...
  val source: DStream[SomeType] = ...

  source.foreachRDD { rdd =>
    logInfo(s"""+++ForEachRDD+++""")
    rdd.foreachPartition { partitionOfRecords =>
      logInfo(s"""+++ForEachPartition+++""")
    }
  }
}

I was expecting to see both log messages in the job log.
But unfortunately you will never see the string '+++ForEachPartition+++' in the logs,
because the foreachPartition block never executes.
There is also no error message or anything else in the logs.
I wonder, is this a bug or known behavior?
I know that org.apache.spark.Logging is a DeveloperApi, but why does it silently
fail with no messages?
What should be used instead of org.apache.spark.Logging in spark-streaming jobs?

P.S. running Spark 1.4.1 (on YARN)
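For reference - not an explanation of the silent behaviour described above - a minimal sketch of a pattern commonly used for logging inside partitions of streaming jobs: obtain a plain SLF4J logger inside the closure, so nothing non-serializable is captured, and keep in mind that messages logged inside foreachPartition end up in the executor (YARN container) logs rather than in the driver log. Names here are hypothetical.

import org.apache.spark.streaming.dstream.DStream
import org.slf4j.LoggerFactory

// Hypothetical job skeleton; `source` stands for any DStream of strings.
object MyJobWithSlf4j {
  private val driverLog = LoggerFactory.getLogger("MyJob")

  def process(source: DStream[String]): Unit = {
    source.foreachRDD { rdd =>
      driverLog.info("+++ForEachRDD+++") // runs on the driver, shows up in the driver log
      rdd.foreachPartition { partitionOfRecords =>
        // Logger obtained inside the closure, on the executor; these messages
        // appear in the executor (YARN container) logs, not in the driver log.
        val executorLog = LoggerFactory.getLogger("MyJob.partition")
        executorLog.info("+++ForEachPartition+++")
        partitionOfRecords.foreach(_ => ())
      }
    }
  }
}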

Thanks in advance




[spark-streaming] New directStream API reads topic's partitions sequentially. Why?

2015-09-04 Thread ponkin
Hi,
I am trying to read a Kafka topic with the new directStream method in KafkaUtils.
I have a Kafka topic with 8 partitions.
I am running the streaming job on YARN with 8 executors, 1 core each.
I noticed that Spark reads all of the topic's partitions in one executor,
sequentially - which is obviously not what I want.
I want Spark to read all partitions in parallel.
How can I achieve that?
 
Thank you, in advance.
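
For reference, a small sketch (broker and topic names are placeholders) of how to inspect how a direct stream's batches map onto Spark partitions; the direct API creates one RDD partition per Kafka partition, so printing the offset ranges per batch helps show where the work actually lands:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object InspectDirectStreamPartitions {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("inspect-partitions"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

    stream.foreachRDD { rdd =>
      // One Spark partition per Kafka partition; print which offsets each one holds.
      rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { r =>
        println(s"kafka partition ${r.partition}: offsets ${r.fromOffset}..${r.untilOffset}")
      }
      println(s"spark partitions in this batch: ${rdd.partitions.length}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}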






Re: Spark shell and StackOverFlowError

2015-09-01 Thread ponkin
Hi,
I cannot reproduce your error on Spark 1.2.1 - there is not enough information.
What are your command-line arguments when you start spark-shell? What data
are you reading? etc.






Re: Spark executor OOM issue on YARN

2015-09-01 Thread ponkin
Hi,
Can you please post your stack trace with the exceptions, and also your command-line
arguments to spark-submit?







Re: query avro hive table in spark sql

2015-08-27 Thread ponkin
Can you select something from this table using Hive? And could you also post
the Spark code that leads to this exception?






Re: error accessing vertexRDD

2015-08-27 Thread ponkin
Check the permissions of the user that runs spark-shell.
(Permission denied) means that you do not have permission on /tmp.






Re: use GraphX with Spark Streaming

2015-08-25 Thread ponkin
Hi,
Sure you can. StreamingContext has the property def sparkContext: SparkContext (see the docs:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext).
Think of a DStream - the main abstraction in Spark Streaming - as a sequence of RDDs.
Each batch of a DStream can be transformed as an RDD with the transform method (see the docs:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.DStream).
So you can use whatever you want, depending on your problem.
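
For reference, a minimal sketch of that idea (the socket source and edge format are placeholders): inside DStream.transform every micro-batch is an ordinary RDD, so GraphX can be applied to it directly.

import org.apache.spark.SparkConf
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GraphXOnStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("graphx-on-stream"), Seconds(10))

    // Placeholder source: lines like "1,2" meaning an edge from vertex 1 to vertex 2.
    val edgeStream = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .filter(_.length == 2)
      .map(a => (a(0).toLong, a(1).toLong))

    // Build a small graph from each micro-batch and compute per-batch vertex degrees.
    val degreeStream = edgeStream.transform { rdd =>
      val edges = rdd.map { case (src, dst) => Edge(src, dst, 1) }
      val graph = Graph.fromEdges(edges, 0)
      val degrees: RDD[(VertexId, Int)] = graph.degrees
      degrees
    }

    degreeStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}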






Re: Spark Number of Partitions Recommendations

2015-07-29 Thread ponkin
Hi Rahul,

Where did you see such a recommendation?
I personally define the number of partitions with the following formula:

partitions = nextPrimeNumberAbove( K * (--num-executors * --executor-cores) )

where
nextPrimeNumberAbove(x) - the prime number that is greater than x
K - a multiplier; start with 1 and increase it until join performance starts to degrade.
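
For reference, a minimal sketch of this heuristic (the prime search is a naive trial-division check, which is fine for the small numbers involved):

object PartitionHeuristic {
  private def isPrime(n: Int): Boolean =
    n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)

  // Smallest prime strictly greater than x.
  def nextPrimeNumberAbove(x: Int): Int =
    Iterator.from(x + 1).find(isPrime).get

  // partitions = nextPrimeNumberAbove(K * numExecutors * executorCores)
  def partitions(numExecutors: Int, executorCores: Int, k: Int = 1): Int =
    nextPrimeNumberAbove(k * numExecutors * executorCores)
}

For example, PartitionHeuristic.partitions(numExecutors = 8, executorCores = 4, k = 2) gives 67, the first prime above 64.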







Re: How do we control output part files created by Spark job?

2015-07-07 Thread ponkin
Hi,
Did you try to reduce the number of executors and cores? Usually num-executors *
executor-cores = the number of parallel tasks, so you can reduce the number of
parallel tasks on the command line, for example:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 4g \
  --executor-memory 2g \
  --executor-cores 1 \
  --queue thequeue \
  lib/spark-examples*.jar \
  10
For more details see
https://spark.apache.org/docs/1.2.0/running-on-yarn.html


