I want to read from many topics in Kafka and know from where each message
is coming (topic1, topic2 and so on).
val kafkaParams = Map[String, String]("metadata.broker.list" -> "myKafka:9092")
val topics = Set("EntryLog", "presOpManager")
val directKafkaStream = KafkaUtils.createDirectStream[String,
And how important is it to have a production environment?
On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote:
There are questions in all three languages.
2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com:
I too have similar question.
My understanding is since Spark
I'm trying to execute the Hello World example with Spark + Kafka (
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala)
with createDirectStream and I get this error.
java.lang.NoSuchMethodError:
56 MB / 26 MB is a very small size; do you observe data skew? More precisely, many
records with the same chrname / name? And can you also double-check the JVM
settings for the executor process?
From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 5, 2015 7:50 PM
To: Cheng,
Hi
We are using pyspark 1.3 and input is text files located on hdfs.
file structure
day1
  file1.txt
  file2.txt
day2
  file1.txt
  file2.txt
...
Question:
1) What is the right way to provide these files as input to a PySpark job?
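A hedged sketch (the paths are assumptions): Spark's textFile accepts Hadoop glob patterns, so the whole day1/day2/... tree can be read as one RDD. Shown in Scala; the PySpark calls have the same names (sc.textFile / sc.wholeTextFiles).

```scala
// Read every day's files as a single RDD (glob pattern over an assumed layout)
val all = sc.textFile("hdfs:///data/day*/file*.txt")

// Or keep track of which file each record came from:
// wholeTextFiles yields (path, fileContents) pairs
val byFile = sc.wholeTextFiles("hdfs:///data/day*")
```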
Production - not a whole lot of companies have implemented Spark in
production, so though it is good to have, it is not a must.
If you are on LinkedIn, a group of folks including myself are preparing for
Spark certification, learning in group makes learning easy and fun.
Kartik
On May 5, 2015 7:31 AM,
I suggest you try date_dim.d_year in the query
On Tue, May 5, 2015 at 10:47 PM, Ishwardeep Singh
ishwardeep.si...@impetus.co.in wrote:
Hi Ankit,
printSchema() works fine for all the tables.
hiveStoreSalesDF.printSchema()
root
|-- store_sales.ss_sold_date_sk: integer
Sorry, I had a duplicate kafka dependency with an older version in
another pom.xml
2015-05-05 14:46 GMT+02:00 Guillermo Ortiz konstt2...@gmail.com:
I'm trying to execute the Hello World example with Spark + Kafka (
I might join in on this conversation with an ask. Would someone point me to
a decent exercise that would approximate the level of this exam (from
above)? Thanks!
On Tue, May 5, 2015 at 3:37 PM Kartik Mehta kartik.meht...@gmail.com
wrote:
Production - not whole lot of companies have implemented
Hi Ankit,
printSchema() works fine for all the tables.
hiveStoreSalesDF.printSchema()
root
|-- store_sales.ss_sold_date_sk: integer (nullable = true)
|-- store_sales.ss_sold_time_sk: integer (nullable = true)
|-- store_sales.ss_item_sk: integer (nullable = true)
|-- store_sales.ss_customer_sk:
Also, if not already done, you may want to try repartitioning your data into 50
partitions.
On 6 May 2015 05:56, Manu Kaul manohar.k...@gmail.com wrote:
Hi All,
For a job I am running on Spark with a dataset of say 350,000 lines (not
big), I am finding that even though my cluster has a large
Hi I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro
file was created using Avro 1.7.7. Similar to the example mentioned in
http://www.infoobjects.com/spark-with-avro/
I am getting a nullPointerException on schema read. It could be an Avro
version mismatch. Has anybody had a
What Spark tarball are you using? You may want to try the one for hadoop
2.6 (the one for hadoop 2.4 may cause that issue, IIRC).
On Tue, May 5, 2015 at 6:54 PM, felicia shsh...@tsmc.com wrote:
Hi all,
We're trying to implement SparkSQL on CDH5.3.0 with cluster mode,
and we get this error
I am not using Kryo. I was using the regular sqlContext avro files method to open.
The file loads properly with the schema. The exception happens when I try to
read it. Will try the Kryo serializer and see if that helps.
On May 5, 2015 9:02 PM, Todd Nist tsind...@gmail.com wrote:
Are you using Kryo or Java
Another thing - could it be a permission problem?
It creates all the directory structure /tmp/wordcount/
_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1
so I am guessing not.
On Tue, May 5, 2015 at 7:27 PM, Sudarshan Murty njmu...@gmail.com wrote:
You are
If you are interested in multilabel (not multiclass), you might want to take a
look at SPARK-7015 https://github.com/apache/spark/pull/5830/files. It is
supposed to perform one-versus-all transformation on classes, which is usually
how multilabel classifiers are built.
Alexander
From: Joseph
Can you give us a bit more information ?
Such as release of Spark you're using, version of Scala, etc.
Thanks
On Tue, May 5, 2015 at 6:37 PM, xweb ashish8...@gmail.com wrote:
I am getting on following code
Error:(164, 25) *overloaded method constructor Strategy with alternatives:*
(algo:
Hi all,
I am planning to load data from Kafka to HDFS. Is it normal to use Spark
Streaming to load data from Kafka to HDFS? What are the concerns in doing this?
There is no processing to be done by Spark, only storing data to HDFS
from Kafka for storage and further Spark processing
Rendy
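For reference, a minimal hedged sketch of the pipeline the question describes (Spark 1.3 API; the topic name, output path, and batch interval are all assumptions):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(conf, Seconds(60))
val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("events"))

// Write each batch's messages under hdfs:///archive/events-<batch timestamp>
stream.map(_._2).saveAsTextFiles("hdfs:///archive/events")

ssc.start()
ssc.awaitTermination()
```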
You are most probably right. I assumed others may have run into this.
When I try to put the files in there, it creates a directory structure with
the part-0 and part-1 files, but these files are of size 0 - no
content. The client error and the server logs have the error message shown
-
Thanks.
Since the join will be done on a regular basis in a short period of time (let's
say 20s), do you have any suggestions on how to make it faster?
I am thinking of partitioning the data set and caching it.
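A hedged sketch of that idea (the RDD names and partition count are assumptions): pre-partition both sides with the same partitioner and cache the big side, so repeated joins reuse the one shuffle.

```scala
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(48)  // partition count is an assumption

// Shuffle the large pair RDD once and keep it in memory
val bigByKey = bigRdd.partitionBy(partitioner).cache()

// Joins against RDDs sharing the same partitioner avoid re-shuffling bigByKey
val joined = bigByKey.join(smallRdd.partitionBy(partitioner))
```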
Rendy
On Apr 30, 2015 6:31 AM, Tathagata Das t...@databricks.com wrote:
Have you taken a look at the
hi all:
I've run into an issue with MLlib. I posted it to the community earlier but it
seems I put it in the wrong place :( Then I put it on Stack Overflow; for a
well-formatted version with details please see
http://stackoverflow.com/questions/30048344/spark-mllib-libsvm-isssues-with-data.
Hope someone could help.
I guess it's due to my
Hi all,
We're trying to implement SparkSQL on CDH5.3.0 with cluster mode,
and we get this error either using java or python;
Application application_1430482716098_0607 failed 2 times due to AM
Container for appattempt_1430482716098_0607_02 exited with exitCode: 10
due to: Exception from
You need to add a select clause to at least one dataframe to give them the
same schema before you can union them (much like in SQL).
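A minimal sketch of that (the column names are hypothetical): project both frames onto their common columns, then union.

```scala
// Columns present in both schemas (assumed names)
val common = Seq("id", "name", "value")

val left  = df1.select(common.head, common.tail: _*)
val right = df2.select(common.head, common.tail: _*)

// Spark 1.3 spelling; later versions call this `union`
val merged = left.unionAll(right)
```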
On Tue, May 5, 2015 at 3:24 AM, Wilhelm niznik.pa...@gmail.com wrote:
Hey there,
1.) I'm loading 2 avro files that have slightly different schemas
df1 =
I am using Spark 1.3.0 and Scala 2.10.
Thanks
On Tue, May 5, 2015 at 6:48 PM, Ted Yu yuzhih...@gmail.com wrote:
Can you give us a bit more information ?
Such as release of Spark you're using, version of Scala, etc.
Thanks
On Tue, May 5, 2015 at 6:37 PM, xweb ashish8...@gmail.com wrote:
I am running hive queries from HiveContext, for which we need a
hive-site.xml.
Is it possible to replace it with hive-config.xml? I tried but it does not
work. Just want a confirmation.
Are you using Kryo or Java serialization? I found this post useful:
http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist
If using kryo, you need to register the classes with kryo, something like
this:
sc.registerKryoClasses(Array(
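The snippet above is cut off; for reference, a hedged sketch of the registration (note registerKryoClasses lives on SparkConf since Spark 1.2; the class names below are placeholders, not from this thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// MyRecord / MyOtherRecord are hypothetical application classes
conf.registerKryoClasses(Array(classOf[MyRecord], classOf[MyOtherRecord]))

val sc = new SparkContext(conf)
```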
why not try https://github.com/linkedin/camus - camus is kafka to HDFS
pipeline
On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior
rendy.b.jun...@gmail.com wrote:
Hi all,
I am planning to load data from Kafka to HDFS. Is it normal to use spark
streaming to load data from Kafka to HDFS?
You might be interested in https://issues.apache.org/jira/browse/SPARK-6593
and the discussion around the PRs.
This is probably more complicated than what you are looking for, but you
could copy the code for HadoopReliableRDD in the PR into your own code and
use it, without having to wait for the
Hi Eric.
Q1:
When I read parquet files, I've observed that Spark generates as many
partitions as there are parquet files in the path.
Q2:
To reduce the number of partitions you can use rdd.repartition(x), x = the
desired number of partitions. Depending on your case, repartition could be a
heavy task.
Regards.
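A brief sketch of the Q2 advice (assumed path; Spark 1.3 API). coalesce avoids a full shuffle when only reducing the partition count, while repartition shuffles but balances partition sizes:

```scala
val df = sqlContext.parquetFile("/assumed/path/to/parquet")

// Cheap: merges existing partitions without a full shuffle
val fewer = df.rdd.coalesce(200)

// Heavier: full shuffle, but evenly sized partitions
val balanced = df.rdd.repartition(200)
```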
Very interested @Kartik/Zoltan. Please let me know how to connect on LI
On Tue, May 5, 2015 at 11:47 PM, Zoltán Zvara zoltan.zv...@gmail.com
wrote:
I might join in to this conversation with an ask. Would someone point me
to a decent exercise that would approximate the level of this exam (from
Gerard is totally correct -- to expand a little more, I think what you want
to do is a solrInputDocumentJavaRDD.foreachPartition, instead of
solrInputDocumentJavaRDD.foreach:
solrInputDocumentJavaRDD.foreachPartition(
    new VoidFunction<Iterator<SolrInputDocument>>() {
      @Override
      public void call(Iterator<SolrInputDocument> it) {
        // open one SolrServer per partition and index the documents in the iterator
      }
    });
Hi folks, we have been using a JDBC connection to Spark's Thrift Server
so far, using JDBC prepared statements to escape potentially malicious
user input.
I am trying to port our code directly to HiveContext now (i.e. eliminate
the use of Thrift Server) and I am not quite sure how to
Hi,
When we start a Spark job it starts a new HTTP server for each new job.
Is it possible to disable the HTTP server for each job?
Thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-disable-Spark-HTTP-server-tp22772.html
Sent from the Apache
Can you give your entire spark-submit command? Are you missing
--executor-cores num_cpu? Also, if you intend to use all 6 nodes, you
also need --num-executors 6.
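For reference, a hedged sketch of a full submit command with those flags (the class, jar, and memory values are placeholders):

```
spark-submit \
  --master yarn-client \
  --num-executors 6 \
  --executor-cores 4 \
  --executor-memory 8g \
  --class com.example.MyApp \
  my-app.jar
```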
On Mon, May 4, 2015 at 2:07 AM, Xi Shen davidshe...@gmail.com wrote:
Hi,
I have two small RDDs, each with about 600 records. In my
Hi,
I'm using persist with different storage levels, but I found no difference in
performance between MEMORY_ONLY and DISK_ONLY. I think there might
be something wrong with my code... So where can I find the persisted RDDs
on disk so that I can make sure they were persisted indeed?
Thank
Are you setting a really large max buffer size for Kryo?
Was this fixed by https://issues.apache.org/jira/browse/SPARK-6405 ?
If not, we should open up another issue to get a better warning in these
cases.
On Tue, May 5, 2015 at 2:47 AM, shahab shahab.mok...@gmail.com wrote:
Thanks Tristan
Make sure to read
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
The directStream / KafkaRDD has a 1 : 1 relationship between kafka
topic/partition and spark partition. So a given spark partition only has
messages from 1 kafka topic. You can tell what topic that is
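A hedged sketch of that pattern (Spark 1.3 direct stream; the stream and variable names carry over assumptions from the question upthread). Because each Spark partition maps 1:1 to a Kafka topic/partition, the offset range at the same index gives the topic of every message in that partition:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val directKafkaStream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

// Tag every message with the Kafka topic it came from
val withTopic = directKafkaStream.transform { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.mapPartitionsWithIndex { (i, iter) =>
    iter.map { case (key, msg) => (ranges(i).topic, msg) }
  }
}
```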
Hello guys
Q1: How does Spark determine the number of partitions when reading a Parquet
file?
val df = sqlContext.parquetFile(path)
Is it somehow related to the number of Parquet row groups in my input?
Q2: How can I reduce this number of partitions? Doing this:
df.rdd.coalesce(200).count
Hi,
I think all the required materials for reference are mentioned here:
http://www.oreilly.com/data/sparkcert.html?cmp=ex-strata-na-lp-na_apache_spark_certification
My question was regarding the proficiency level required for Java. There
are detailed examples and code mentioned for JAVA, Python
Hi all,
We will drop support for Java 6 starting with Spark 1.5, tentatively scheduled
to be released in Sep 2015. Spark 1.4, scheduled to be released in June 2015,
will be the last minor release that supports Java 6. That is to say:
Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8.
Spark 1.5+ (~
What happens when you try to put files into your HDFS from the local filesystem?
Looks like it's an HDFS issue rather than a Spark thing.
On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote:
I have searched all replies to this question not found an answer.
I am running standalone Spark 1.3.1 and
If you mean multilabel (predicting multiple label values), then MLlib
does not yet support that. You would need to predict each label separately.
If you mean multiclass (1 label taking > 2 categorical values), then MLlib
supports it via LogisticRegression (as DB said), as well as DecisionTree
and
Hi All,
When I try to run Spark SQL in standalone mode it appears to be missing
the parquet jar; I have to pass it with --jars and that works:
sbin/start-thriftserver.sh --jars lib/parquet-hive-bundle-1.6.0.jar
--driver-memory 28g --master local[10]
Any ideas on why? I downloaded the one pre
SPARK-3490 introduced spark.ui.enabled
FYI
On Tue, May 5, 2015 at 8:41 AM, roy rp...@njit.edu wrote:
Hi,
When we start spark job it start new HTTP server for each new job.
Is it possible to disable HTTP server for each job ?
Thanks
Thanks for the info !
Shing
On Tuesday, 5 May 2015, 15:11, Imran Rashid iras...@cloudera.com wrote:
You might be interested in https://issues.apache.org/jira/browse/SPARK-6593
and the discussion around the PRs.
This is probably more complicated than what you are looking for, but
Hi.
I have a Spark application where I store the results into a table (with
HiveContext). Some of these columns allow nulls. In Scala, these columns are
represented through Option[Int] or Option[Double], depending on the data type.
For example:
val hc = new HiveContext(sc)
var col1:
Hi all,
I'm not able to access the Spark Streaming applications that I'm
submitting to the EC2 standalone cluster (spark 1.3.1) via port 4040. The
problem is that I don't even see running applications in the master's web UI
(I do see running drivers). This is the command I use to
Let say I am storing my data in HDFS with folder structure and file
partitioning as per below:
/analytics/2015/05/02/partition-2015-05-02-13-50-
Note that new file is created every 5 minutes.
As per my understanding, storing 5-minute files means we could not create
an RDD more granular than
As per my understanding, storing 5-minute files means we could not create
an RDD more granular than 5 minutes.
This depends on the file format. Many file formats are splittable (like
parquet), meaning that you can seek into various points of the file.
2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior
I have searched all replies to this question and not found an answer. I am
running standalone Spark 1.3.1 and Hortonworks HDP 2.2 VM, side by side, on
the same machine, and trying to write the output of the wordcount program into
HDFS (works fine writing to a local file, /tmp/wordcount). The only line I added to
Option only works when you are going from case classes. Just put null into
the Row, when you want the value to be null.
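A minimal sketch of that (the schema and values are assumptions): build Rows with null directly in the nullable slots, then create the DataFrame from an explicit schema.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("score", DoubleType, nullable = true)))

val rows = sc.parallelize(Seq(
  Row(1, 95.5),
  Row(2, null)))  // null goes straight into the Row for the nullable column

val df = hc.createDataFrame(rows, schema)
```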
On Tue, May 5, 2015 at 9:00 AM, Masf masfwo...@gmail.com wrote:
Hi.
I have a spark application where I store the results into table (with
HiveContext). Some of these
This should work from java too:
http://spark.apache.org/docs/1.3.1/api/java/index.html#org.apache.spark.sql.functions$
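A short sketch (Scala shown; from Java the same static method is reachable as org.apache.spark.sql.functions.lit; the column name is an assumption):

```scala
import org.apache.spark.sql.functions.lit

// Append a constant column to an existing DataFrame
val withConst = df.withColumn("one", lit(1))
```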
On Tue, May 5, 2015 at 4:15 AM, Jan-Paul Bultmann janpaulbultm...@me.com
wrote:
Hey,
What is the recommended way to create literal columns in java?
Scala has the `lit`
Hi,
I have a DataFrame representing my data that looks like this:
+----------+-----------+
| col_name | data_type |
+----------+-----------+
| obj_id   | string    |
| type     | string    |
| name
Have you configured SPARK_CLASSPATH with the MySQL jar in spark-env.sh? For
example: export SPARK_CLASSPATH+=:/path/to/mysql-connector-java-5.1.18-bin.jar
On 5 May 2015, at 3:32, 980548...@qq.com wrote:
my metastore is like this
property
Can you activate your eventLogs and send them to us?
Thank you!
On Tue, 5 May 2015 at 04:56, luohui20001 luohui20...@sina.com wrote:
Yes, just 1 executor by default. Thanks.
Sent from my Xiaomi phone
On 4 May 2015 at 10:01 PM, ayan guha guha.a...@gmail.com wrote:
Are you using only 1 executor?
On Mon, May 4, 2015
Michael
Are there plans to add LIMIT push down? It's quite a natural thing to do in
interactive querying.
Sent from my iPhone
On 4 May 2015, at 22:57, Michael Armbrust mich...@databricks.com wrote:
The JDBC interface for Spark SQL does not support pushing down limits today.
On Mon, May
Thanks Tristan for sharing this. Actually this happens when I am reading a
csv file of 3.5 GB.
best,
/Shahab
On Tue, May 5, 2015 at 9:15 AM, Tristan Blakers tris...@blackfrog.org
wrote:
Hi Shahab,
I’ve seen exceptions very similar to this (it also manifests as negative
array size
It could be filling up your /tmp directory. You need to set your
spark.local.dir or you can also specify SPARK_WORKER_DIR to another
location which has sufficient space.
Thanks
Best Regards
On Mon, May 4, 2015 at 7:27 PM, shahab shahab.mok...@gmail.com wrote:
Hi,
I am getting No space left
Hi Shahab,
I’ve seen exceptions very similar to this (it also manifests as a negative
array size exception), and I believe it’s really a bug in Kryo.
See this thread:
How did you configure your metastore?
Thanks,
Daoyuan
From: 鹰 [mailto:980548...@qq.com]
Sent: Tuesday, May 05, 2015 3:11 PM
To: luohui20001
Cc: user
Subject: Re: spark Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
hi luo,
thanks for your reply in fact I can use hive
my metastore is like this
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://192.168.1.40:3306/hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver
Turned out that it was sufficient to do repartitionAndSortWithinPartitions
... so far so good ;)
On Tue, May 5, 2015 at 9:45 AM Marius Danciu marius.dan...@gmail.com
wrote:
Hi Imran,
Yes that's what MyPartitioner does. I do see (using traces from
MyPartitioner) that the key is partitioned on
Hi Imran,
Yes that's what MyPartitioner does. I do see (using traces from
MyPartitioner) that the key is partitioned on partition 0 but then I see
this record arriving in both Yarn containers (I see it in the logs).
Basically I need to emulate a Hadoop map-reduce job in Spark and groupByKey
hi luo,
thanks for your reply. In fact I can use Hive via Spark on my Spark master
machine, but when I copy my Spark files to another machine and want to
access Hive via Spark, I get the error "Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient". I have copied
thanks jeanlyn, it works
-- Original Message --
From: jeanlyn oujianl...@jd.com
Date: Tuesday, May 5, 2015 3:40 PM
To: 鹰 980548...@qq.com
Cc: Wang, Daoyuan daoyuan.w...@intel.com; user user@spark.apache.org
Subject: Re: spark Unable to instantiate
Hi ,
I am using Spark 1.3.0.
I was able to join a JSON file on HDFS registered as a TempTable with a table
in MySQL. On the same lines I tried to join a table in Hive with another table
in Teradata but I get a query parse exception.
Regards,
Ishwardeep
From: ankitjindal [via Apache Spark
I know these methods, but I need to create events using the timestamps in
the data tuples, meaning every time a new tuple is generated it uses the
timestamp in the CSV file. This will be useful to simulate the data rate
over time, just like real sensor data.
On Fri, May 1, 2015 at 2:52 PM, Juan
Hi All,
For a job I am running on Spark with a dataset of say 350,000 lines (not
big), I am finding that even though my cluster has a large number of cores
available (like 100 cores), the Spark system seems to stop after using just
4 cores and after that the runtime is pretty much a straight line
I downloaded the 1.3.1 source distribution and built on Windows (laptop 8.0 and
desktop 8.1)
Here’s what I’m running:
Desktop:
Spark Master (%SPARK_HOME%\bin\spark-class2.cmd
org.apache.spark.deploy.master.Master -h desktop --port 7077)
Spark Worker (%SPARK_HOME%\bin\spark-class2.cmd
Hi,
do you have information on how many partitions/tasks the stage/job is
running? By default there is 1 core per task, and your number of concurrent
tasks may be limiting core utilization.
There are a few settings you could play with, assuming your issue is
related to the above:
Hi all,
I'm looking to implement a Multilabel classification algorithm but I am
surprised to find that there are not any in the spark-mllib core library. Am
I missing something? Would someone point me in the right direction?
Thanks!
Peter
LogisticRegression in the MLlib package supports multilabel classification.
Sincerely,
DB Tsai
---
Blog: https://www.dbtsai.com
On Tue, May 5, 2015 at 1:13 PM, peterg pe...@garbers.me wrote:
Hi all,
I'm looking to implement a Multilabel
Hello Friends:
Here's sample output from a SparkSQL query that works, just so you can
see the
underlying data structure; followed by one that fails.
# Just so you can see the DataFrame structure ...
resultsRDD = sqlCtx.sql("SELECT * FROM rides WHERE trip_time_in_secs = 3780")
Hi,
In Hive, I am using unix_timestamp() as 'update_on' to insert the current date
into the 'update_on' column of the table. Now I am converting it to Spark SQL.
Please suggest example code to insert the current date and time into a column
of the table using Spark SQL.
Cheers, Kiran.
Hi all,
I have declared a SparkContext at the start of my program and then I want to
change its configuration at some later stage in my code, as written below:
val conf = new SparkConf().setAppName("Cassandra Demo")
var sc: SparkContext = new SparkContext(conf)
Hi,
how important is JAVA for Spark certification? Will learning only Python
and Scala not work?
Regards,
Gourav
Hi all,
I have a large RDD that I map a function over. Based on the nature of each
record in the input RDD, I will generate two types of data. I would like to
save each type into its own RDD. But I can't seem to find an efficient way
to do it. Any suggestions?
Many thanks.
Bill
--
Many
Have you looked at RDD#randomSplit() (as example) ?
Cheers
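Two hedged sketches: randomSplit (as pointed out above) divides by random weights; when the split depends on each record's content, two filter passes over a cached RDD are the usual alternative (isTypeA is a hypothetical predicate, not from this thread):

```scala
// Random 50/50 split into two RDDs
val Array(sampleA, sampleB) = rdd.randomSplit(Array(0.5, 0.5))

// Content-based split: cache once, filter twice
rdd.cache()
val typeA = rdd.filter(r => isTypeA(r))
val typeB = rdd.filter(r => !isTypeA(r))
```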
On Tue, May 5, 2015 at 2:42 PM, Bill Q bill.q@gmail.com wrote:
Hi all,
I have a large RDD that I map a function to it. Based on the nature of
each record in the input RDD, I will generate two types of data. I would
like to save
Hi
I have a 1.0.0 cluster with multiple worker nodes that deploy a number of
external tasks through getRuntime().exec. Currently I have no control over how
many nodes get deployed for a given task. At times the scheduler evenly
distributes the executors among all nodes and at other times it only
I too have similar question.
My understanding is that since Spark is written in Scala, having done it in
Scala will be OK for certification.
If someone who has done the certification can confirm.
Thanks,
Kartik
On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote:
Hi,
how important is
There are questions in all three languages.
2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com:
I too have similar question.
My understanding is that since Spark is written in Scala, having done it in
Scala will be OK for certification.
If someone who has done certification can confirm.
Hey,
What is the recommended way to create literal columns in java?
Scala has the `lit` function from `org.apache.spark.sql.functions`.
Should it be called from java as well?
Cheers jan
Hi, Spark experts:
I did rdd.coalesce(numPartitions).saveAsSequenceFile(dir) in my code, which
generates the RDDs in streamed batches. It generates numPartitions files as
expected, with names dir/part-x. However, the first couple of files (e.g.,
part-0, part-1) have many times of