How to separate messages of different topics.

2015-05-05 Thread Guillermo Ortiz
I want to read from many topics in Kafka and know which topic each message came from (topic1, topic2 and so on). val kafkaParams = Map[String, String]("metadata.broker.list" -> "myKafka:9092") val topics = Set("EntryLog", "presOpManager") val directKafkaStream = KafkaUtils.createDirectStream[String,

Re: JAVA for SPARK certification

2015-05-05 Thread ayan guha
And how important is it to have a production environment? On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote: There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have similar question. My understanding is since Spark

Spark + Kafka with directStream

2015-05-05 Thread Guillermo Ortiz
I'm trying to execute the Hello World example with Spark + Kafka ( https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala) with createDirectStream and I get this error. java.lang.NoSuchMethodError:

RE: Re: sparksql running slow while joining 2 tables.

2015-05-05 Thread Cheng, Hao
56MB / 26MB is a very small data size; do you observe data skew? More precisely, many records with the same chrname / name? And can you also double-check the JVM settings for the executor process? From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: Tuesday, May 5, 2015 7:50 PM To: Cheng,

multiple hdfs folder files input to PySpark

2015-05-05 Thread Oleg Ruchovets
Hi We are using pyspark 1.3 and the input is text files located on HDFS, with a structure like: day1/file1.txt, day1/file2.txt, day2/file1.txt, day2/file2.txt, ... Question: 1) What is the way to provide these as input for a PySpark job
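
A minimal Scala sketch of the usual answer (PySpark's SparkContext.textFile accepts the same path patterns; the hdfs:///data prefix and file names are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("multi-folder-input"))

    // comma-separated list of files or directories
    val explicit = sc.textFile("hdfs:///data/day1/file1.txt,hdfs:///data/day2/file2.txt")

    // or a glob that picks up every file under every day folder
    val everything = sc.textFile("hdfs:///data/day*/*.txt")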

Re: JAVA for SPARK certification

2015-05-05 Thread Kartik Mehta
Production - not a whole lot of companies have implemented Spark in production, so though it is good to have, it is not a must. If you are on LinkedIn, a group of folks including myself are preparing for Spark certification; learning in a group makes learning easy and fun. Kartik On May 5, 2015 7:31 AM,

Re: Unable to join table across data sources using sparkSQL

2015-05-05 Thread ayan guha
I suggest you to try with date_dim.d_year in the query On Tue, May 5, 2015 at 10:47 PM, Ishwardeep Singh ishwardeep.si...@impetus.co.in wrote: Hi Ankit, printSchema() works fine for all the tables. hiveStoreSalesDF.printSchema() root |-- store_sales.ss_sold_date_sk: integer

Re: Spark + Kafka with directStream

2015-05-05 Thread Guillermo Ortiz
Sorry, I had a duplicated kafka dependency with another older version in another pom.xml 2015-05-05 14:46 GMT+02:00 Guillermo Ortiz konstt2...@gmail.com: I'm tryting to execute the Hello World example with Spark + Kafka (

Re: JAVA for SPARK certification

2015-05-05 Thread Zoltán Zvara
I might join in to this conversation with an ask. Would someone point me to a decent exercise that would approximate the level of this exam (from above)? Thanks! On Tue, May 5, 2015 at 3:37 PM Kartik Mehta kartik.meht...@gmail.com wrote: Production - not whole lot of companies have implemented

RE: Unable to join table across data sources using sparkSQL

2015-05-05 Thread Ishwardeep Singh
Hi Ankit, printSchema() works fine for all the tables. hiveStoreSalesDF.printSchema() root |-- store_sales.ss_sold_date_sk: integer (nullable = true) |-- store_sales.ss_sold_time_sk: integer (nullable = true) |-- store_sales.ss_item_sk: integer (nullable = true) |-- store_sales.ss_customer_sk:

Re: Maximum Core Utilization

2015-05-05 Thread ayan guha
Also, if not already done, you may want to try repartitioning your data to 50 partitions On 6 May 2015 05:56, Manu Kaul manohar.k...@gmail.com wrote: Hi All, For a job I am running on Spark with a dataset of say 350,000 lines (not big), I am finding that even though my cluster has a large

AvroFiles

2015-05-05 Thread Pankaj Deshpande
Hi I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro file was created using Avro 1.7.7. Similar to the example mentioned in http://www.infoobjects.com/spark-with-avro/ I am getting a nullPointerException on Schema read. It could be an Avro version mismatch. Has anybody had a

Re: what does Container exited with a non-zero exit code 10 means?

2015-05-05 Thread Marcelo Vanzin
What Spark tarball are you using? You may want to try the one for hadoop 2.6 (the one for hadoop 2.4 may cause that issue, IIRC). On Tue, May 5, 2015 at 6:54 PM, felicia shsh...@tsmc.com wrote: Hi all, We're trying to implement SparkSQL on CDH5.3.0 with cluster mode, and we get this error

Re: AvroFiles

2015-05-05 Thread Pankaj Deshpande
I am not using Kryo. I was using the regular sqlcontext.avrofiles to open. The file loads properly with the schema. The exception happens when I try to read it. Will try the Kryo serializer and see if that helps. On May 5, 2015 9:02 PM, Todd Nist tsind...@gmail.com wrote: Are you using Kryo or Java

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
Another thing - could it be a permission problem? It creates all the directory structure /tmp/wordcount/_temporary/0/_temporary/attempt_201505051439_0001_m_01_3/part-1 so I am guessing not. On Tue, May 5, 2015 at 7:27 PM, Sudarshan Murty njmu...@gmail.com wrote: You are

RE: Multilabel Classification in spark

2015-05-05 Thread Ulanov, Alexander
If you are interested in multilabel (not multiclass), you might want to take a look at SPARK-7015 https://github.com/apache/spark/pull/5830/files. It is supposed to perform one-versus-all transformation on classes, which is usually how multilabel classifiers are built. Alexander From: Joseph

Re: overloaded method constructor Strategy with alternatives

2015-05-05 Thread Ted Yu
Can you give us a bit more information? Such as the release of Spark you're using, version of Scala, etc. Thanks On Tue, May 5, 2015 at 6:37 PM, xweb ashish8...@gmail.com wrote: I am getting the following error on this code: Error:(164, 25) *overloaded method constructor Strategy with alternatives:* (algo:

Using spark streaming to load data from Kafka to HDFS

2015-05-05 Thread Rendy Bambang Junior
Hi all, I am planning to load data from Kafka to HDFS. Is it normal to use Spark Streaming to load data from Kafka to HDFS? What are the concerns in doing this? There is no processing to be done by Spark, only storing data to HDFS from Kafka for storage and for further Spark processing. Rendy
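
A rough sketch of the streaming path (Scala; broker address, topic name, and output prefix are placeholders, and sc is an existing SparkContext):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(60))
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker:9092"), Set("events"))

    // writes one directory of part files per batch interval under the given prefix
    stream.map(_._2).saveAsTextFiles("hdfs:///data/kafka/events")

    ssc.start()
    ssc.awaitTermination()

One trade-off to keep in mind is that a short batch interval produces many small HDFS files.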

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan Murty
You are most probably right. I assumed others may have run into this. When I try to put the files in there, it creates a directory structure with the part-0 and part-1 files but these files are of size 0 - no content. The client error and the server logs have the error message shown -

Re: Join between Streaming data vs Historical Data in spark

2015-05-05 Thread Rendy Bambang Junior
Thanks. Since the join will be done on a regular basis over a short period of time (let's say 20s), do you have any suggestions on how to make it faster? I am thinking of partitioning the data set and caching it. Rendy On Apr 30, 2015 6:31 AM, Tathagata Das t...@databricks.com wrote: Have you taken a look at the

MLlib libsvm issues with data

2015-05-05 Thread doyere
hi all: I've met an issue with MLlib. I posted to the community before but it seems I put it in the wrong place :( Then I put it on Stack Overflow; for well-formatted details please see http://stackoverflow.com/questions/30048344/spark-mllib-libsvm-isssues-with-data. Hope someone could help. I guess it's due to my

what does Container exited with a non-zero exit code 10 means?

2015-05-05 Thread felicia
Hi all, We're trying to implement SparkSQL on CDH5.3.0 with cluster mode, and we get this error either using java or python; Application application_1430482716098_0607 failed 2 times due to AM Container for appattempt_1430482716098_0607_02 exited with exitCode: 10 due to: Exception from

Re: Two DataFrames with different schema, unionAll issue.

2015-05-05 Thread Michael Armbrust
You need to add a select clause to at least one dataframe to give them the same schema before you can union them (much like in SQL). On Tue, May 5, 2015 at 3:24 AM, Wilhelm niznik.pa...@gmail.com wrote: Hey there, 1.) I'm loading 2 avro files that have slightly different schemas df1 =
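
A small sketch of what that looks like (column names are placeholders):

    // project both frames onto the shared columns so the schemas line up
    val left  = df1.select("id", "name", "value")
    val right = df2.select("id", "name", "value")
    val combined = left.unionAll(right)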

Re: overloaded method constructor Strategy with alternatives

2015-05-05 Thread Ash G
I am using Spark 1.3.0 and Scala 2.10. Thanks On Tue, May 5, 2015 at 6:48 PM, Ted Yu yuzhih...@gmail.com wrote: Can you give us a bit more information ? Such as release of Spark you're using, version of Scala, etc. Thanks On Tue, May 5, 2015 at 6:37 PM, xweb ashish8...@gmail.com wrote:

Possible to use hive-config.xml instead of hive-site.xml for HiveContext?

2015-05-05 Thread nitinkak001
I am running hive queries from HiveContext, for which we need a hive-site.xml. Is it possible to replace it with hive-config.xml? I tried but it does not work. Just want a confirmation. -- View this message in context:

Re: AvroFiles

2015-05-05 Thread Todd Nist
Are you using Kryo or Java serialization? I found this post useful: http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist If using kryo, you need to register the classes with kryo, something like this: sc.registerKryoClasses(Array(
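
For reference, in recent releases the registration goes on the SparkConf rather than the SparkContext; a minimal sketch (MyRecord stands in for whatever class ends up inside the RDD):

    import org.apache.spark.SparkConf

    case class MyRecord(id: Int, name: String)

    val conf = new SparkConf()
      .setAppName("kryo-registration")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord]))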

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-05 Thread MrAsanjar .
why not try https://github.com/linkedin/camus - Camus is a Kafka to HDFS pipeline On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior rendy.b.jun...@gmail.com wrote: Hi all, I am planning to load data from Kafka to HDFS. Is it normal to use spark streaming to load data from Kafka to HDFS?

Re: How to skip corrupted avro files

2015-05-05 Thread Imran Rashid
You might be interested in https://issues.apache.org/jira/browse/SPARK-6593 and the discussion around the PRs. This is probably more complicated than what you are looking for, but you could copy the code for HadoopReliableRDD in the PR into your own code and use it, without having to wait for the

Re: Parquet number of partitions

2015-05-05 Thread Masf
Hi Eric. Q1: When I read parquet files, I've found that Spark generates as many partitions as there are parquet files in the path. Q2: To reduce the number of partitions you can use rdd.repartition(x), x = number of partitions. Depending on your case, repartition could be a heavy task. Regards.
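
A short sketch of both calls (200 is an arbitrary target):

    val df = sqlContext.parquetFile(path)

    // coalesce merges existing partitions and avoids a full shuffle
    val merged = df.rdd.coalesce(200)

    // repartition shuffles, but balances records evenly across partitions
    val balanced = df.rdd.repartition(200)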

Re: JAVA for SPARK certification

2015-05-05 Thread ayan guha
Very interested @Kartik/Zoltan. Please let me know how to connect on LI On Tue, May 5, 2015 at 11:47 PM, Zoltán Zvara zoltan.zv...@gmail.com wrote: I might join in to this conversation with an ask. Would someone point me to a decent exercise that would approximate the level of this exam (from

Re: How to deal with code that runs before foreach block in Apache Spark?

2015-05-05 Thread Imran Rashid
Gerard is totally correct -- to expand a little more, I think what you want to do is a solrInputDocumentJavaRDD.foreachPartition, instead of solrInputDocumentJavaRDD.foreach: solrInputDocumentJavaRDD.foreachPartition( new VoidFunctionIteratorSolrInputDocument() { @Override public void
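
The same pattern in Scala, as a sketch (assuming SolrJ 4.x; the Solr URL is a placeholder):

    import org.apache.solr.client.solrj.impl.HttpSolrServer

    solrInputDocumentJavaRDD.rdd.foreachPartition { docs =>
      // one client per partition instead of one per record
      val server = new HttpSolrServer("http://solr-host:8983/solr/collection1")
      docs.foreach(doc => server.add(doc))
      server.commit()
      server.shutdown()
    }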

Escaping user input for Hive queries

2015-05-05 Thread Yana Kadiyska
Hi folks, we have been using a JDBC connection to Spark's Thrift Server so far and using JDBC prepared statements to escape potentially malicious user input. I am trying to port our code directly to HiveContext now (i.e. eliminate the use of Thrift Server) and I am not quite sure how to

Possible to disable Spark HTTP server ?

2015-05-05 Thread roy
Hi, When we start a spark job it starts a new HTTP server for each new job. Is it possible to disable the HTTP server for each job? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-disable-Spark-HTTP-server-tp22772.html Sent from the Apache

Re: Spark job concurrency problem

2015-05-05 Thread Imran Rashid
can you give your entire spark submit command? Are you missing --executor-cores num_cpu? Also, if you intend to use all 6 nodes, you also need --num-executors 6 On Mon, May 4, 2015 at 2:07 AM, Xi Shen davidshe...@gmail.com wrote: Hi, I have two small RDD, each has about 600 records. In my

Where does Spark persist RDDs on disk?

2015-05-05 Thread Haoliang Quan
Hi, I'm using persist on different storage levels, but I found no difference on performance when I was using MEMORY_ONLY and DISK_ONLY. I think there might be something wrong with my code... So where can I find the persisted RDDs on disk so that I can make sure they were persisted indeed? Thank

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-05 Thread Imran Rashid
Are you setting a really large max buffer size for kryo? Was this fixed by https://issues.apache.org/jira/browse/SPARK-6405 ? If not, we should open up another issue to get a better warning in these cases. On Tue, May 5, 2015 at 2:47 AM, shahab shahab.mok...@gmail.com wrote: Thanks Tristan

Re: How to separate messages of different topics.

2015-05-05 Thread Cody Koeninger
Make sure to read https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md The directStream / KafkaRDD has a 1 : 1 relationship between kafka topic/partition and spark partition. So a given spark partition only has messages from 1 kafka topic. You can tell what topic that is
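
A sketch of that approach, following the pattern in the blog post linked above (directKafkaStream is the stream from the original question; the println is just for illustration):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    directKafkaStream.foreachRDD { rdd =>
      // the i-th offset range describes the i-th spark partition
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.mapPartitionsWithIndex { (i, messages) =>
        val topic = offsetRanges(i).topic
        messages.map(m => (topic, m))
      }.foreach { case (topic, message) => println(s"$topic: $message") }
    }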

Parquet number of partitions

2015-05-05 Thread Eric Eijkelenboom
Hello guys Q1: How does Spark determine the number of partitions when reading a Parquet file? val df = sqlContext.parquetFile(path) Is it some way related to the number of Parquet row groups in my input? Q2: How can I reduce this number of partitions? Doing this: df.rdd.coalesce(200).count

Re: JAVA for SPARK certification

2015-05-05 Thread Gourav Sengupta
Hi, I think all the required materials for reference are mentioned here: http://www.oreilly.com/data/sparkcert.html?cmp=ex-strata-na-lp-na_apache_spark_certification My question was regarding the proficiency level required for Java. There are detailed examples and code mentioned for JAVA, Python

[ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-05 Thread Reynold Xin
Hi all, We will drop support for Java 6 starting with Spark 1.5, tentatively scheduled to be released in Sep 2015. Spark 1.4, scheduled to be released in June 2015, will be the last minor release that supports Java 6. That is to say: Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8. Spark 1.5+ (~

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread ayan guha
What happens when you try to put files into your hdfs from the local filesystem? Looks like it's a hdfs issue rather than a spark thing. On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote: I have searched all replies to this question and not found an answer. I am running standalone Spark 1.3.1 and

Re: Multilabel Classification in spark

2015-05-05 Thread Joseph Bradley
If you mean multilabel (predicting multiple label values), then MLlib does not yet support that. You would need to predict each label separately. If you mean multiclass (1 label taking more than 2 categorical values), then MLlib supports it via LogisticRegression (as DB said), as well as DecisionTree and
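
For the multiclass case, a minimal MLlib sketch (training is assumed to be an RDD[LabeledPoint] with labels 0.0 through 9.0, and testFeatures a single feature Vector):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(10)        // more than 2 classes makes this multinomial logistic regression
      .run(training)

    val predictedLabel = model.predict(testFeatures)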

Spark SQL Standalone mode missing parquet?

2015-05-05 Thread Manu Mukerji
Hi All, When I try to run Spark SQL in standalone mode it appears to be missing the parquet jar; I have to pass it with --jars and that works. sbin/start-thriftserver.sh --jars lib/parquet-hive-bundle-1.6.0.jar --driver-memory 28g --master local[10] Any ideas on why? I downloaded the one pre

Re: Possible to disable Spark HTTP server ?

2015-05-05 Thread Ted Yu
SPARK-3490 introduced spark.ui.enabled FYI On Tue, May 5, 2015 at 8:41 AM, roy rp...@njit.edu wrote: Hi, When we start spark job it start new HTTP server for each new job. Is it possible to disable HTTP server for each job ? Thanks -- View this message in context:
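
So disabling the per-application web UI is a one-line configuration change, e.g. (sketch):

    val conf = new SparkConf()
      .setAppName("no-ui-job")
      .set("spark.ui.enabled", "false")   // do not start the HTTP server for this application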

Re: How to skip corrupted avro files

2015-05-05 Thread Shing Hing Man
Thanks for the info ! Shing On Tuesday, 5 May 2015, 15:11, Imran Rashid iras...@cloudera.com wrote: You might be interested in https://issues.apache.org/jira/browse/SPARK-6593 and the discussion around the PRs. This is probably more complicated than what you are looking for, but

Inserting Nulls

2015-05-05 Thread Masf
Hi. I have a spark application where I store the results into a table (with HiveContext). Some of these columns allow nulls. In Scala, these columns are represented through Option[Int] or Option[Double], depending on the data type. For example: *val hc = new HiveContext(sc)* *var col1:

Spark applications Web UI at 4040 doesn't exist

2015-05-05 Thread marco.doncel
Hi all, I'm not able to access the running Spark Streaming applications that I'm submitting to the EC2 standalone cluster (spark 1.3.1) via port 4040. The problem is that I don't even see running applications in the master's web UI (I do see running drivers). This is the command I use to

Number of files to load

2015-05-05 Thread Rendy Bambang Junior
Let's say I am storing my data in HDFS with the folder structure and file partitioning as per below: /analytics/2015/05/02/partition-2015-05-02-13-50- Note that a new file is created every 5 minutes. As per my understanding, storing 5-minute files means we could not create an RDD more granular than

Re: Number of files to load

2015-05-05 Thread Jonathan Coveney
As per my understanding, storing 5-minute files means we could not create an RDD more granular than 5 minutes. This depends on the file format. Many file formats are splittable (like parquet), meaning that you can seek into various points of the file. 2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior

saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread Sudarshan
I have searched all replies to this question but not found an answer. I am running standalone Spark 1.3.1 and Hortonworks' HDP 2.2 VM, side by side, on the same machine, and trying to write the output of a wordcount program into HDFS (works fine writing to a local file, /tmp/wordcount). The only line I added to

Re: Inserting Nulls

2015-05-05 Thread Michael Armbrust
Option only works when you are going from case classes. Just put null into the Row, when you want the value to be null. On Tue, May 5, 2015 at 9:00 AM, Masf masfwo...@gmail.com wrote: Hi. I have a spark application where I store the results into table (with HiveContext). Some of these
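
A tiny sketch of that (the column layout is a placeholder):

    import org.apache.spark.sql.Row

    // the middle column is nullable, so pass null directly rather than None
    val row = Row(42, null, 3.14)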

Re: spark sql, creating literal columns in java.

2015-05-05 Thread Michael Armbrust
This should work from java too: http://spark.apache.org/docs/1.3.1/api/java/index.html#org.apache.spark.sql.functions$ On Tue, May 5, 2015 at 4:15 AM, Jan-Paul Bultmann janpaulbultm...@me.com wrote: Hey, What is the recommended way to create literal columns in java? Scala has the `lit`

Parquet Partition Strategy - how to partition data correctly

2015-05-05 Thread Todd Nist
Hi, I have a DataFrame that represents my data; its schema (col_name: data_type) looks like this: obj_id: string, type: string, name

Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread jeanlyn
Have you configured SPARK_CLASSPATH with the mysql jar in spark-env.sh? For example (export SPARK_CLASSPATH+=:/path/to/mysql-connector-java-5.1.18-bin.jar) On May 5, 2015 at 3:32, 980548...@qq.com wrote: my metastore is like this property

Re: Re: sparksql running slow while joining 2 tables.

2015-05-05 Thread Olivier Girardot
Can you activate your eventLogs and send them to us? Thank you! On Tue, May 5, 2015 at 04:56, luohui20001 luohui20...@sina.com wrote: Yes, just the default 1 executor. Thanks. Sent from my Mi phone. On May 4, 2015 at 10:01 PM, ayan guha guha.a...@gmail.com wrote: Are you using only 1 executor? On Mon, May 4, 2015

Re: Is LIMIT n in Spark SQL useful?

2015-05-05 Thread Robin East
Michael Are there plans to add LIMIT push down? It's quite a natural thing to do in interactive querying. Sent from my iPhone On 4 May 2015, at 22:57, Michael Armbrust mich...@databricks.com wrote: The JDBC interface for Spark SQL does not support pushing down limits today. On Mon, May

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-05 Thread shahab
Thanks Tristan for sharing this. Actually this happens when I am reading a csv file of 3.5 GB. best, /Shahab On Tue, May 5, 2015 at 9:15 AM, Tristan Blakers tris...@blackfrog.org wrote: Hi Shahab, I’ve seen exceptions very similar to this (it also manifests as negative array size

Re: java.io.IOException: No space left on device while doing repartitioning in Spark

2015-05-05 Thread Akhil Das
It could be filling up your /tmp directory. You need to set your spark.local.dir or you can also specify SPARK_WORKER_DIR to another location which has sufficient space. Thanks Best Regards On Mon, May 4, 2015 at 7:27 PM, shahab shahab.mok...@gmail.com wrote: Hi, I am getting No space left
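
As a sketch (the path is a placeholder for a volume with enough free space):

    val conf = new SparkConf()
      .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")   // shuffle and spill files land here instead of /tmp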

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-05 Thread Tristan Blakers
Hi Shahab, I've seen exceptions very similar to this (it also manifests as a negative array size exception), and I believe it's really a bug in Kryo. See this thread:

RE: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread Wang, Daoyuan
How did you configure your metastore? Thanks, Daoyuan From: 鹰 [mailto:980548...@qq.com] Sent: Tuesday, May 05, 2015 3:11 PM To: luohui20001 Cc: user Subject: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient hi luo, thanks for your reply in fact I can use hive

Re: RE: Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread 鹰
my metastore is like this: <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://192.168.1.40:3306/hive</value> </property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> <description>Driver

Re: Spark partitioning question

2015-05-05 Thread Marius Danciu
Turned out that it was sufficient to do repartitionAndSortWithinPartitions ... so far so good ;) On Tue, May 5, 2015 at 9:45 AM Marius Danciu marius.dan...@gmail.com wrote: Hi Imran, Yes that's what MyPartitioner does. I do see (using traces from MyPartitioner) that the key is partitioned on
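
For reference, a minimal sketch of that call (pairRdd and the partition count are placeholders; it needs a pair RDD with an Ordering available for the key type):

    import org.apache.spark.HashPartitioner

    // shuffles once and sorts each output partition by key in the same pass,
    // which is cheaper than repartition followed by a separate sort
    val partitionedAndSorted =
      pairRdd.repartitionAndSortWithinPartitions(new HashPartitioner(48))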

Re: Spark partitioning question

2015-05-05 Thread Marius Danciu
Hi Imran, Yes that's what MyPartitioner does. I do see (using traces from MyPartitioner) that the key is partitioned on partition 0 but then I see this record arriving in both Yarn containers (I see it in the logs). Basically I need to emulate a Hadoop map-reduce job in Spark and groupByKey

Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread 鹰
hi luo, thanks for your reply. In fact I can use hive by spark on my spark master machine, but when I copy my spark files to another machine and I want to access hive by spark I get the error Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient. I have copied

Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-05 Thread 鹰
thanks jeanlyn, it works ------------------ Original ------------------ From: jeanlyn oujianl...@jd.com; Date: Tuesday, May 5, 2015, 3:40; To: 鹰 980548...@qq.com; Cc: Wang, Daoyuan daoyuan.w...@intel.com; user user@spark.apache.org; Subject: Re: spark Unable to instantiate

RE: Unable to join table across data sources using sparkSQL

2015-05-05 Thread Ishwardeep Singh
Hi, I am using Spark 1.3.0. I was able to join a JSON file on HDFS registered as a TempTable with a table in MySQL. Along the same lines I tried to join a table in Hive with another table in Teradata but I get a query parse exception. Regards, Ishwardeep From: ankitjindal [via Apache Spark

Re: Event generator for SPARK-Streaming from csv

2015-05-05 Thread anshu shukla
I know these methods, but I need to create events using the timestamps in the data tuples, meaning every time a new tuple is generated it uses the timestamp in a CSV file. This will be useful to simulate the data rate over time, just like real sensor data. On Fri, May 1, 2015 at 2:52 PM, Juan

Maximum Core Utilization

2015-05-05 Thread Manu Kaul
Hi All, For a job I am running on Spark with a dataset of say 350,000 lines (not big), I am finding that even though my cluster has a large number of cores available (like 100 cores), the Spark system seems to stop after using just 4 cores and after that the runtime is pretty much a straight line

RE: Remoting warning when submitting to cluster

2015-05-05 Thread Javier Delgadillo
I downloaded the 1.3.1 source distribution and built on Windows (laptop 8.0 and desktop 8.1) Here’s what I’m running: Desktop: Spark Master (%SPARK_HOME%\bin\spark-class2.cmd org.apache.spark.deploy.master.Master -h desktop --port 7077) Spark Worker (%SPARK_HOME%\bin\spark-class2.cmd

Re: Maximum Core Utilization

2015-05-05 Thread Richard Marscher
Hi, do you have information on how many partitions/tasks the stage/job is running? By default there is 1 core per task, and your number of concurrent tasks may be limiting core utilization. There are a few settings you could play with, assuming your issue is related to the above:
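
Two settings often mentioned in this situation, as a sketch (values are placeholders; the original message's full list is truncated above, and which settings apply depends on the cluster manager):

    val conf = new SparkConf()
      .set("spark.default.parallelism", "100")   // default task count for shuffles and parallelize
      .set("spark.cores.max", "100")             // standalone mode: cap on total cores for the app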

Multilabel Classification in spark

2015-05-05 Thread peterg
Hi all, I'm looking to implement a Multilabel classification algorithm but I am surprised to find that there are not any in the spark-mllib core library. Am I missing something? Would someone point me in the right direction? Thanks! Peter -- View this message in context:

Re: Multilabel Classification in spark

2015-05-05 Thread DB Tsai
LogisticRegression in the MLlib package supports multilabel classification. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Tue, May 5, 2015 at 1:13 PM, peterg pe...@garbers.me wrote: Hi all, I'm looking to implement a Multilabel

Help with datetime comparison in SparkSQL statement ...

2015-05-05 Thread subscripti...@prismalytics.io
Hello Friends: Here's sample output from a SparkSQL query that works, just so you can see the underlying data structure, followed by one that fails. # Just so you can see the DataFrame structure ... resultsRDD = sqlCtx.sql("SELECT * FROM rides WHERE trip_time_in_secs = 3780")

example code for current date in spark sql

2015-05-05 Thread kiran mavatoor
Hi, In Hive, I am using unix_timestamp() as 'update_on' to insert the current date in the 'update_on' column of the table. Now I am converting it into spark sql. Please suggest example code to insert the current date and time into a column of the table using spark sql. Cheers, Kiran.
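
With a HiveContext the Hive date/time UDFs stay available from SQL, so one possible sketch (table and column names are placeholders):

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    hc.sql("""
      INSERT INTO TABLE target
      SELECT col1, col2, from_unixtime(unix_timestamp()) AS update_on
      FROM source
    """)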

setting spark configuration properties problem

2015-05-05 Thread Hafiz Mujadid
Hi all, I have declared a spark context at the start of my program and then I want to change its configuration at some later stage in my code, as written below val conf = new SparkConf().setAppName("Cassandra Demo") var sc: SparkContext = new SparkContext(conf)

JAVA for SPARK certification

2015-05-05 Thread Gourav Sengupta
Hi, how important is JAVA for Spark certification? Will learning only Python and Scala not work? Regards, Gourav

Map one RDD into two RDD

2015-05-05 Thread Bill Q
Hi all, I have a large RDD that I map a function over. Based on the nature of each record in the input RDD, I will generate two types of data. I would like to save each type into its own RDD. But I can't seem to find an efficient way to do it. Any suggestions? Many thanks. Bill -- Many

Re: Map one RDD into two RDD

2015-05-05 Thread Ted Yu
Have you looked at RDD#randomSplit() (as example) ? Cheers On Tue, May 5, 2015 at 2:42 PM, Bill Q bill.q@gmail.com wrote: Hi all, I have a large RDD that I map a function to it. Based on the nature of each record in the input RDD, I will generate two types of data. I would like to save
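
Two common approaches, sketched (inputRdd and isTypeA stand in for the real RDD and whatever predicate distinguishes the record types):

    // cache once, then take two filtered passes over the same data
    val source = inputRdd.cache()
    val typeA = source.filter(r => isTypeA(r))
    val typeB = source.filter(r => !isTypeA(r))

    // or, when a random split is enough, as suggested above:
    val Array(left, right) = inputRdd.randomSplit(Array(0.5, 0.5))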

Configuring Number of Nodes with Standalone Scheduler

2015-05-05 Thread Nastooh Avessta (navesta)
Hi I have a 1.0.0 cluster with multiple worker nodes that deploy a number of external tasks, through getRuntime().exec. Currently I have no control on how many nodes get deployed for a given task. At times scheduler evenly distributes the executors among all nodes and at other times it only

Re: JAVA for SPARK certification

2015-05-05 Thread Kartik Mehta
I too have a similar question. My understanding is that since Spark is written in Scala, having done it in Scala will be OK for certification. If someone who has done the certification can confirm. Thanks, Kartik On May 5, 2015 5:57 AM, Gourav Sengupta gourav.sengu...@gmail.com wrote: Hi, how important is

Re: JAVA for SPARK certification

2015-05-05 Thread Stephen Boesch
There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have similar question. My understanding is since Spark written in scala, having done in Scala will be ok for certification. If someone who has done certification can confirm.

spark sql, creating literal columns in java.

2015-05-05 Thread Jan-Paul Bultmann
Hey, What is the recommended way to create literal columns in java? Scala has the `lit` function from `org.apache.spark.sql.functions`. Should it be called from java as well? Cheers jan - To unsubscribe, e-mail:

Re: RDD coalesce or repartition by #records or #bytes?

2015-05-05 Thread Du Li
Hi, Spark experts: I did rdd.coalesce(numPartitions).saveAsSequenceFile(dir) in my code, which generates the rdd's in streamed batches. It generates numPartitions files as expected with names dir/part-x. However, the first couple of files (e.g., part-0, part-1) have many times of
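
If the uneven sizes come from coalesce simply concatenating neighbouring partitions, one thing to try (a sketch) is forcing a shuffle so records are spread evenly:

    // shuffle = true redistributes records instead of merging adjacent partitions
    rdd.coalesce(numPartitions, shuffle = true).saveAsSequenceFile(dir)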