Hi,
I need to build a huge BloomFilter with 150 million or even more insertions:

import org.apache.spark.util.sketch.BloomFilter
val bf = spark.read.avro("/hdfs/path").filter("some == 1").stat.bloomFilter("id", 15000, 0.01)
If I use Kryo for serialization:
implicit val bfEncoder = …
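For reference, a minimal sketch of the whole flow, assuming Spark 2.x with the Databricks spark-avro package on the classpath; the path and filter column follow the fragments above, and the 150000000L count is only an illustration of sizing expectedNumItems to the real insertion volume:

import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

val spark = SparkSession.builder().appName("bloom-filter-build").getOrCreate()

// expectedNumItems should match the real insertion count (~150 million here),
// otherwise the observed false-positive rate will be far above the requested fpp
val df = spark.read.format("com.databricks.spark.avro").load("/hdfs/path")
val bf: BloomFilter = df.filter("some == 1").stat.bloomFilter("id", 150000000L, 0.01)

// Kryo encoder for the filter, in case it has to travel through a Dataset
implicit val bfEncoder = org.apache.spark.sql.Encoders.kryo[BloomFilter]

Note that at fpp = 0.01 a Bloom filter needs roughly 9.6 bits per expected item, so 150 million ids come out around 170-180 MB; the driver has to have enough memory to hold that aggregated result.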
… spark.yarn.jar=hdfs://…/spark-assembly-1.6.2-hadoop2.2.0.jar ?
Here is my Dockerfile
https://gist.github.com/ponkin/cac0a071e7fe75ca7c390b7388cf4f91
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Run-spark-shell-inside-Docker-container-against-remote-YARN-cluster-tp27967.html
I generated a CSV file with 300 columns, and it seems to work fine with Spark DataFrames (Spark 2.0).
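For reference, a minimal sketch of how such a wide test file can be produced; the row count, column names, and /tmp path are made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wide-df-test").getOrCreate()
import spark.implicits._

// build a 300-column DataFrame from a single id column
val cols = (1 to 300).map(i => ($"id" * i).as(s"c$i"))
val wide = spark.range(1000000).toDF("id").select(cols: _*)

wide.write.mode("overwrite").option("header", "true").csv("/tmp/wide_csv")

// read it back and touch a few columns to exercise the wide schema
spark.read.option("header", "true").csv("/tmp/wide_csv").select("c1", "c150", "c300").show(5)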
I think you need to post your issue to the spark-cassandra-connector community
(https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user),
if you are using it.
Did you try to load a wide CSV or Parquet file, for example? Maybe the
problem is in spark-cassandra-connector, not Spark itself. Are you using
spark-cassandra-connector (https://github.com/datastax/spark-cassandra-connector)?
Hi,
What kind of data source do you have: CSV, Avro, or Parquet?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-regression-when-querying-very-wide-data-frames-tp27567p27569.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
Here is my use case:
I have a Kafka topic. The job is fairly simple: it reads the topic and saves the data to
several HDFS paths.
I create the RDD with the following code:

val r = KafkaUtils.createRDD[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](context, kafkaParams, range)
Then I am trying to start streaming from those offsets (checkpoints)? Is
it possible at all, or do I need to store the offsets in an external datastore?
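In case it helps frame the question: a hedged sketch of the usual pattern with the Kafka 0.8 direct API, where loadOffsets/saveOffsets are hypothetical helpers backed by an external store, and ssc/kafkaParams are assumed to be in scope:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.DefaultDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// read the last committed offsets from your own datastore (hypothetical helper)
val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets()

val stream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte],
    DefaultDecoder, DefaultDecoder, (Array[Byte], Array[Byte])](
  ssc, kafkaParams, fromOffsets,
  (m: MessageAndMetadata[Array[Byte], Array[Byte]]) => (m.key, m.message))

stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // save the batch to the several HDFS paths first, then persist the offsets
  saveOffsets(ranges) // hypothetical helper
}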
Alexey Ponkin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/streaming-KafkaUtils-createDirectStream-how-to-start-streming-from-checkpoints-tp25461.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
I have a simple Spark Streaming job (8 executors with 1 core each, on an 8-node cluster):
it reads from a Kafka topic (3 brokers, 8 partitions) and saves to Cassandra.
The problem is that when I increase the number of incoming messages in the topic, the
job starts to fail with …
… "Cody Koeninger" <c...@koeninger.org>:
> Can you provide more info (what version of spark, code example)?
>
> On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin <alexey.pon...@ya.ru> wrote:
>> Hi,
>>
>> I have an application with 2 streams, which are joined together.
Hi,
I have an application with 2 streams, which are joined together.
Stream1 is a simple DStream (relatively small batch chunks).
Stream2 is a windowed DStream (with a duration of, for example, 60 seconds).
Both Stream1 and Stream2 are Kafka direct streams.
The problem is that, according to the logs, the window …
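A minimal sketch of that topology, assuming keyed records; directStream1/directStream2 and extractKey are hypothetical stand-ins for the two Kafka direct streams and the join key:

import org.apache.spark.streaming.Seconds

// key both streams so they can be joined (extractKey is hypothetical)
val stream1 = directStream1.map(r => (extractKey(r), r))                     // small per-batch chunks
val stream2 = directStream2.map(r => (extractKey(r), r)).window(Seconds(60)) // 60-second window

// the join is evaluated per batch: each batch of stream1 is joined
// against everything currently inside stream2's 60-second window
val joined = stream1.join(stream2)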
Hi,
I have the following code:

object MyJob extends org.apache.spark.Logging {
  ...
  val source: DStream[SomeType] = ...

  source.foreachRDD { rdd =>
    logInfo("+++ForEachRDD+++")           // executed on the driver, once per batch
    rdd.foreachPartition { partitionOfRecords =>
      logInfo("+++ForEachPartition+++")   // executed on the executors, so it appears in executor logs, not the driver log
    }
  }
}

I …
Hi,
I am trying to read a Kafka topic with the new directStream method in KafkaUtils.
I have a Kafka topic with 8 partitions.
I am running the streaming job on YARN with 8 executors, 1 core each.
I noticed that Spark reads all of the topic's partitions in one executor,
sequentially, which is obviously …
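A small diagnostic sketch, assuming the Kafka 0.8 direct API and that stream is the direct stream from the job; the direct approach creates one RDD partition per Kafka topic partition, so an 8-partition topic should produce 8 tasks per batch that YARN can spread over the 8 executors:

stream.foreachRDD { rdd =>
  // the direct stream gives one RDD partition per Kafka topic partition: expect 8 here,
  // and each partition becomes its own task that can run on a different executor
  println(s"partitions in this batch = ${rdd.partitions.length}")
}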
Hi,
I cannot reproduce your error on Spark 1.2.1; there is not enough information.
What are your command-line arguments when you start spark-shell? What data
are you reading? etc.
Hi,
Can you please post your stack trace with the exceptions, and also your
command-line arguments for spark-submit?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-executor-OOM-issue-on-YARN-tp24522p24530.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Can you select something from this table using Hive? Could you also post
the Spark code which leads to this exception?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/query-avro-hive-table-in-spark-sql-tp24462p24468.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Check the permissions of the user which runs spark-shell.
"(Permission denied)" means that you do not have permission to write to /tmp.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/error-accessing-vertexRDD-tp24466p24469.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
Sure you can. StreamingContext has the property def sparkContext: SparkContext (see the docs:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext
). Think of a DStream, the main abstraction in Spark Streaming, as a sequence
of RDDs. Each DStream can be …
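A minimal sketch of pulling the SparkContext out of a StreamingContext; the app name, batch interval, and path are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-app")
val ssc  = new StreamingContext(conf, Seconds(10))

// the SparkContext backing the streaming job; reuse it for batch-style work
val sc = ssc.sparkContext
val referenceData = sc.textFile("/some/reference/path") // e.g. static data to join with the stream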
Hi Rahul,
Where did you see such a recommendation?
I personally define partitions with the following formula:

partitions = nextPrimeNumberAbove( K * (--num-executors * --executor-cores) )

where nextPrimeNumberAbove(x) is the smallest prime number greater than x,
and K is a multiplier to calculate the start…
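A hedged sketch of that formula in code; the executor counts and K are example values only:

// smallest prime strictly greater than x
def nextPrimeNumberAbove(x: Int): Int = {
  def isPrime(n: Int): Boolean =
    n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)
  Iterator.from(x + 1).find(isPrime).get
}

val numExecutors  = 8  // --num-executors
val executorCores = 2  // --executor-cores
val k             = 3  // multiplier, chosen by hand

val partitions = nextPrimeNumberAbove(k * numExecutors * executorCores)
// k * 8 * 2 = 48, and the next prime above 48 is 53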
Hi,
Did you try to reduce the number of executors and cores? Usually
num-executors * executor-cores = the number of parallel tasks, so you can
reduce the number of parallel tasks on the command line, e.g.:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors …