Re: Problem with StreamingContext - getting SPARK-2243

2015-01-08 Thread Rishi Yadav
you can also access SparkConf using sc.getConf in the Spark shell, though for StreamingContext you can directly refer to sc as Akhil suggested. On Sun, Dec 28, 2014 at 12:13 AM, Akhil Das ak...@sigmoidanalytics.com wrote: In the shell you could do: val ssc = new StreamingContext(sc, Seconds(1)) as
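A minimal sketch of the pattern being described, assuming a spark-shell session where sc already exists:

    // in spark-shell, sc is already created; reuse it for streaming
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    val ssc = new StreamingContext(sc, Seconds(1))
    // the underlying SparkConf can be inspected via sc.getConf
    val conf = sc.getConf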

Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Manoj Samel
Hi, For running spark 1.2 on a Hadoop cluster with Kerberos, what spark configurations are required? Using an existing keytab, can any examples be submitted to the secured cluster? How? Thanks,

[ANNOUNCE] Apache Science and Healthcare Track @ApacheCon NA 2015

2015-01-08 Thread Lewis John Mcgibbney
Hi Folks, Apologies for cross posting :( As some of you may already know, @ApacheCon NA 2015 is happening in Austin, TX April 13th-16th. This email is specifically written to attract all folks interested in Science and Healthcare... this is an official call to arms! I am aware that there are

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
I installed the custom build in standalone mode as normal. The master and slaves started successfully. However, I got an error when I ran a job. It seems to me from the error message that some library was compiled against hadoop1, but my spark was compiled against hadoop2. 15/01/08 23:27:36 INFO

Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Marcelo Vanzin
Hi Manoj, As long as you're logged in (i.e. you've run kinit), everything should just work. You can run klist to make sure you're logged in. On Thu, Jan 8, 2015 at 3:49 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, For running spark 1.2 on Hadoop cluster with Kerberos, what spark

Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Manoj Samel
Please ignore the keytab question for now, the question wasn't fully described. Some old communication (Oct 14) says Spark is not certified with Kerberos. Can someone comment on this aspect? On Thu, Jan 8, 2015 at 3:53 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Manoj, As long as you're

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread Marcelo Vanzin
I ran this with CDH 5.2 without a problem (sorry don't have 5.3 readily available at the moment): $ HBASE='/opt/cloudera/parcels/CDH/lib/hbase/\*' $ spark-submit --driver-class-path $HBASE --conf spark.executor.extraClassPath=$HBASE --master yarn --class org.apache.spark.examples.HBaseTest

Re: JavaRDD (Data Aggregation) based on key

2015-01-08 Thread Rishi Yadav
One approach is to first transform this RDD into a PairRDD by taking the field you are going to aggregate on as the key. On Tue, Dec 23, 2014 at 1:47 AM, sachin Singh sachin.sha...@gmail.com wrote: Hi, I have a csv file having fields a, b, c. I want to do aggregation (sum, average, ...) based
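A hedged sketch of that approach; the file path and column layout (fields a,b,c with a numeric third field) are assumptions for illustration:

    // turn each CSV row into (key, value) pairs, then aggregate by key
    val rows  = sc.textFile("data.csv").map(_.split(","))    // path is hypothetical
    val pairs = rows.map(f => (f(0), f(2).toDouble))          // key on column a, aggregate column c
    val sums  = pairs.reduceByKey(_ + _)                      // sum per key
    val avgs  = pairs.mapValues(v => (v, 1L))
                     .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
                     .mapValues { case (s, n) => s / n }      // average per key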

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
I ran the release spark in cdh5.3.0 but got the same error. Has anyone tried to run spark in cdh5.3.0 using its newAPIHadoopRDD? Command: spark-submit --master spark://master:7077 --jars /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar

Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Marcelo Vanzin
On Thu, Jan 8, 2015 at 4:09 PM, Manoj Samel manojsamelt...@gmail.com wrote: Some old communication (Oct 14) says Spark is not certified with Kerberos. Can someone comment on this aspect ? Spark standalone doesn't support kerberos. Spark running on top of Yarn works fine with kerberos. --

Failed to save RDD as text file to local file system

2015-01-08 Thread NingjunWang
I try to save an RDD as a text file to the local file system (Linux) but it does not work. Launch spark-shell and run the following: val r = sc.parallelize(Array("a", "b", "c")) r.saveAsTextFile("file:///home/cloudera/tmp/out1") IOException: Mkdirs failed to create

skipping header from each file

2015-01-08 Thread Hafiz Mujadid
Suppose I give three file paths to the Spark context to read, and each file has a schema in its first row. How can we skip the schema/header lines? val rdd = sc.textFile("file1,file2,file3");

Re: Did anyone tried overcommit of CPU cores?

2015-01-08 Thread Jörn Franke
Hello, Based on experiences with other software in virtualized environments I cannot really recommend this. However, I am not sure how Spark reacts. You may face unpredictable task failures depending on utilization, tasks connecting to external systems (databases etc.) may fail unexpectedly and

Parallel execution on one node

2015-01-08 Thread mikens
Hello, I am new to Spark. I have adapted an example code to do binary classification using logistic regression. I tried it on rcv1_train.binary dataset using LBFGS.runLBFGS solver, and obtained correct loss. Now, I'd like to run code in parallel across 16 cores of my single CPU socket. If I

Re: Cannot save RDD as text file to local file system

2015-01-08 Thread Akhil Das
Are you running the program in local mode or in standalone cluster mode? Thanks Best Regards On Fri, Jan 9, 2015 at 10:12 AM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I try to save RDD as text file to local file system (Linux) but it does not work Launch spark-shell

Re: Getting Output From a Cluster

2015-01-08 Thread Akhil Das
saveAsHadoopFiles requires you to specify the output format, which I believe you are not specifying anywhere and hence the program crashes. You could try something like this: Class<? extends OutputFormat<?,?>> outputFormatClass = (Class<? extends OutputFormat<?,?>>) (Class<?>)
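In Scala the same idea would look roughly like the sketch below, assuming a hypothetical pair DStream named counts of (String, Long); the output format is passed explicitly rather than left implicit:

    // the output format must be supplied; TextOutputFormat is just one choice
    import org.apache.hadoop.mapred.TextOutputFormat
    counts.saveAsHadoopFiles[TextOutputFormat[String, Long]]("hdfs:///out/counts", "txt")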

Re: Join RDDs with DStreams

2015-01-08 Thread Akhil Das
Here's how you do it: val joined_stream = myStream.transform((x: RDD[(String, String)]) => { val prdd = new PairRDDFunctions[String, String](x); prdd.join(myRDD) }) Thanks Best Regards On Thu, Jan 8, 2015 at 10:20 PM, Asim Jalis asimja...@gmail.com wrote: Is there a way

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-08 Thread Sasi
Thank you Pankaj. We are able to create the Uber JAR (very good for binding all dependency JARs together) and run it on spark-jobserver. One step further than where we were. However, we are now facing *SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up*. We may need to

Re: skipping header from each file

2015-01-08 Thread Akhil Das
Did you try something like: val file = sc.textFile("/home/akhld/sigmoid/input") val skipped = file.filter(row => !row.contains("header")) skipped.take(10).foreach(println) Thanks Best Regards On Fri, Jan 9, 2015 at 11:48 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Suppose I
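Another commonly used variant (not from this thread) captures the actual header line and filters it from every file, assuming all three files share the same header:

    val rdd    = sc.textFile("file1,file2,file3")   // paths are placeholders
    val header = rdd.first()                          // header line of the first file
    val data   = rdd.filter(_ != header)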

Re: Getting Output From a Cluster

2015-01-08 Thread Su She
1) Thank you everyone for the help once again...the support here is really amazing and I hope to contribute soon! 2) The solution I actually ended up using was from this thread:

Re: RDD Moving Average

2015-01-08 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 9:47 AM, Asim Jalis asimja...@gmail.com wrote: One approach I was considering was to use mapPartitions. It is straightforward to compute the moving average over a partition, except for near the end point. Does anyone see how to fix that? Well, I guess this is not
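One way to sidestep the partition-boundary problem (not the approach discussed in the thread) is MLlib's sliding() helper, a developer API that builds windows across partition edges; a minimal sketch:

    import org.apache.spark.mllib.rdd.RDDFunctions._    // brings sliding() into scope
    val values    = sc.parallelize(1 to 100).map(_.toDouble)
    val movingAvg = values.sliding(3).map(w => w.sum / w.length)   // 3-element moving average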

Re: Failed to save RDD as text file to local file system

2015-01-08 Thread VISHNU SUBRAMANIAN
Looks like it is trying to save the file in HDFS. Check if you have set any Hadoop path in your system. On Fri, Jan 9, 2015 at 12:14 PM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: Can you check permissions etc as I am able to run r.saveAsTextFile("file:///home/cloudera/tmp/out1")

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Raghavendra Pandey
I came across this: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/. You can take a look. On Fri Jan 09 2015 at 12:08:49 PM Raghavendra Pandey raghavendra.pan...@gmail.com wrote: I have a similar kind of requirement where I want to push avro data into parquet. But it seems you have

Re: Getting Output From a Cluster

2015-01-08 Thread Su She
Yes, I am calling the saveAsHadoopFiles on the Dstream. However, when I call print on the Dstream it works? If I had to do foreachRDD to saveAsHadoopFile, then why is it working for print? Also, if I am doing foreachRDD, do I need connections, or can I simply put the saveAsHadoopFiles inside the

Did anyone tried overcommit of CPU cores?

2015-01-08 Thread Xuelin Cao
Hi, I'm wondering whether it is a good idea to overcommit CPU cores on the spark cluster. For example, in our testing cluster, each worker machine has 24 physical CPU cores. However, we are allowed to set the CPU core number to 48 or more in the spark configuration file. As a result,

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Raghavendra Pandey
I have a similar kind of requirement where I want to push avro data into parquet. But it seems you have to do it on your own. There is the parquet-mr project that uses hadoop to do so. I am trying to write a spark job to do a similar kind of thing. On Fri, Jan 9, 2015 at 3:20 AM, Jerry Lam

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-08 Thread Nathan McCarthy
Any ideas? :) From: Nathan nathan.mccar...@quantium.com.au Date: Wednesday, 7 January 2015 2:53 pm To: user@spark.apache.org Subject: SparkSQL schemaRDD MapPartitions calls -

Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Jerry Lam
Hi spark users, I'm using spark SQL to create parquet files on HDFS. I would like to store the avro schema into the parquet meta so that non spark sql applications can marshall the data without avro schema using the avro parquet reader. Currently, schemaRDD.saveAsParquetFile does not allow to do

Is the Thrift server right for me?

2015-01-08 Thread sjbrunst
I'm building a system that collects data using Spark Streaming, does some processing with it, then saves the data. I want the data to be queried by multiple applications, and it sounds like the Thrift JDBC/ODBC server might be the right tool to handle the queries. However, the documentation for

Re: Profiling a spark application.

2015-01-08 Thread Rishi Yadav
As per my understanding, RDDs do not get replicated; the underlying data does if it's in HDFS. On Thu, Dec 25, 2014 at 9:04 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I want to find the time taken for replicating an rdd in spark cluster along with the computation time on the

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread Marcelo Vanzin
On Thu, Jan 8, 2015 at 3:33 PM, freedafeng freedaf...@yahoo.com wrote: I installed the custom build in standalone mode as normal. The master and slaves started successfully. However, I got an error when I ran a job. It seems to me from the error message that some library was compiled against hadoop1,

Re: Getting Output From a Cluster

2015-01-08 Thread Yana Kadiyska
Are you calling the saveAsTextFiles on the DStream? Looks like it. Look at the section called Design Patterns for using foreachRDD in the link you sent -- you want to do dstream.foreachRDD(rdd => rdd.saveAs) On Thu, Jan 8, 2015 at 5:20 PM, Su She suhsheka...@gmail.com wrote: Hello
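A minimal sketch of that pattern, with a hypothetical output prefix; using the batch time in the path keeps successive batches from colliding:

    dstream.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile(s"hdfs:///output/batch-${time.milliseconds}")   // path is an assumption
    }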

Getting Output From a Cluster

2015-01-08 Thread Su She
Hello Everyone, Thanks in advance for the help! I successfully got my Kafka/Spark WordCount app to print locally. However, I want to run it on a cluster, which means that I will have to save it to HDFS if I want to be able to read the output. I am running Spark 1.1.0, which means according to

SparkSQL

2015-01-08 Thread Abhi Basu
I am working with CDH5.2 (Spark 1.0.0) and wondering which version of Spark comes with SparkSQL by default. Also, will SparkSQL come enabled to access the Hive Metastore? Is there an easier way to enable Hive support without having to build the code with various switches? Thanks, Abhi -- Abhi

Re: SparkSQL

2015-01-08 Thread Marcelo Vanzin
Disclaimer: this seems more of a CDH question, I'd suggest sending these to the CDH mailing list in the future. CDH 5.2 actually has Spark 1.1. It comes with SparkSQL built-in, but it does not include the thrift server because of incompatibilities with the CDH version of Hive. To use Hive

Re: Implement customized Join for SparkSQL

2015-01-08 Thread Rishi Yadav
Hi Kevin, Say A has 10 ids, so you are pulling data from B's data source only for these 10 ids? What if you load A and B as separate schemaRDDs and then do a join? Spark will optimize the path anyway when the action is fired. On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin yun...@ebay.com wrote: Hi,

correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
Could anyone share your experience on how to do this? I have created a cluster and installed cdh5.3.0 on it with basically core + HBase, but Cloudera installed and configured the Spark in its parcels anyway. I'd like to install our custom spark on this cluster to use the hadoop and hbase

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread Marcelo Vanzin
Disclaimer: CDH questions are better handled at cdh-us...@cloudera.org. But the question I'd like to ask is: why do you need your own Spark build? What's wrong with CDH's Spark that it doesn't work for you? On Thu, Jan 8, 2015 at 3:01 PM, freedafeng freedaf...@yahoo.com wrote: Could anyone come

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
Hi, Cheng, I checked the input data for each stage. For example, in my attached screen snapshot, the input data is 1212.5MB, which is the total amount of the whole table. And I also checked the input data for each task (in the stage detail page), and the sum of

Re: Trying to execute Spark in Yarn

2015-01-08 Thread Shixiong Zhu
`--jars` accepts a comma-separated list of jars. See the usage about `--jars` --jars JARS Comma-separated list of local jars to include on the driver and executor classpaths. Best Regards, Shixiong Zhu 2015-01-08 19:23 GMT+08:00 Guillermo Ortiz konstt2...@gmail.com: I'm trying to execute

Trying to execute Spark in Yarn

2015-01-08 Thread Guillermo Ortiz
I'm trying to execute Spark from a Hadoop Cluster, I have created this script to try it: #!/bin/bash export HADOOP_CONF_DIR=/etc/hadoop/conf SPARK_CLASSPATH= for lib in `ls /user/local/etc/lib/*.jar` do SPARK_CLASSPATH=$SPARK_CLASSPATH:$lib done

SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Mukesh Jha
Hi Experts, I am running spark inside a YARN job. The spark-streaming job is running fine in CDH-5.0.0 but after the upgrade to 5.3.0 it cannot fetch containers, with the below errors. Looks like the container id is incorrect and a string is present in a place where it's expecting a number.

Re: Trying to execute Spark in Yarn

2015-01-08 Thread Guillermo Ortiz
thanks! 2015-01-08 12:59 GMT+01:00 Shixiong Zhu zsxw...@gmail.com: `--jars` accepts a comma-separated list of jars. See the usage about `--jars` --jars JARS Comma-separated list of local jars to include on the driver and executor classpaths. Best Regards, Shixiong Zhu 2015-01-08

Executing Spark, Error creating path from empty String.

2015-01-08 Thread Guillermo Ortiz
When I try to execute my task with Spark it starts to copy the jars it needs to HDFS and it finally fails, I don't know exactly why. I have checked HDFS and it copies the files, so, it seems to work that part. I changed the log level to debug but there's nothing else to help. What else does Spark

RE: Spark History Server can't read event logs

2015-01-08 Thread michael.england
Hi Vanzin, I am using the MapR distribution of Hadoop. The history server logs are created by a job with the permissions: drwxrwx--- - myusername mygroup 2 2015-01-08 09:14 /apps/spark/historyserver/logs/spark-1420708455212 However, the permissions of the higher directories

Re: Spark Standalone Cluster not correctly configured

2015-01-08 Thread frodo777
Hello everyone. With respect to the configuration problem that I explained before: do you have any idea what is wrong there? The problem in a nutshell: - When more than one master is started in the cluster, all of them are scheduling independently, thinking they are all leaders. - zookeeper

Several applications share the same Spark executors (or their cache)

2015-01-08 Thread preeze
Hi all, We have a web application that connects to a Spark cluster to trigger some calculation there. It also caches a big amount of data in the Spark executors' cache. To meet high availability requirements we need to run 2 instances of our web application on different hosts. Doing this

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
Hey Xuelin, which data item in the Web UI did you check? On 1/7/15 5:37 PM, Xuelin Cao wrote: Hi, Curious and curious. I'm puzzled by the Spark SQL cached table. Theoretically, the cached table should be columnar table, and only scan the column that included in my SQL. However, in my

Re: spark-network-yarn 2.11 depends on spark-network-shuffle 2.10

2015-01-08 Thread Aniket Bhatnagar
Actually it does cause builds with SBT 0.13.7 to fail with the error Conflicting cross-version suffixes. I have raised a defect, SPARK-5143, for this. On Wed Jan 07 2015 at 23:44:21 Marcelo Vanzin van...@cloudera.com wrote: This particular case shouldn't cause problems since both of those

Spark 1.2.0 ec2 launch script hadoop native libraries not found warning

2015-01-08 Thread critikaled
Hi, I'm facing this error on a Spark EC2 cluster: when a job is submitted it says that native Hadoop libraries are not found. I have checked spark-env.sh and all the folders in the path but am unable to find the problem, even though the folders contain them. Are there any performance drawbacks if we use

Eclipse flags error on KafkaUtils.createStream()

2015-01-08 Thread kc66
Hi, I am using Eclipse writing Java code. I am trying to create a Kafka receiver by: JavaPairReceiverInputDStream<String, kafka.message.Message> a = KafkaUtils.createStream(jssc, String.class, Message.class, StringDecoder.class, DefaultDecoder.class, kafkaParams,

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
Weird, which version did you use? Just tried a small snippet in the Spark 1.2.0 shell as follows; the result showed in the web UI meets the expectation quite well: import org.apache.spark.sql.SQLContext import sc._ val sqlContext = new SQLContext(sc) import sqlContext._

Re: example insert statement in Spark SQL

2015-01-08 Thread Cheng Lian
Spark SQL supports Hive insertion statement (Hive 0.14.0 style insertion is not supported though) https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries The small SQL dialect provided in Spark SQL doesn't support insertion

Re: SparkSQL support for reading Avro files

2015-01-08 Thread Cheng Lian
This package is moved here: https://github.com/databricks/spark-avro On 1/6/15 5:12 AM, yanenli2 wrote: Hi All, I want to use the SparkSQL to manipulate the data with Avro format. I found a solution at https://github.com/marmbrus/sql-avro . However it doesn't compile successfully anymore with

Re: Does SparkSQL not support nested IF(1=1, 1, IF(2=2, 2, 3)) statements?

2015-01-08 Thread Cheng Lian
The + operator only handles numeric data types; you may register your own concat function like this: sqlContext.registerFunction("concat", (s: String, t: String) => s + t) sqlContext.sql("select concat('$', col1) from tbl") Cheng On 1/5/15 1:13 PM, RK wrote: The issue is happening when I try

Parquet compression codecs not applied

2015-01-08 Thread Ayoub Benali
Hello, I tried to save a table created via the hive context as a parquet file, but whatever compression codec (uncompressed, snappy, gzip or lzo) I set via setConf, like: setConf("spark.sql.parquet.compression.codec", "gzip") the size of the generated files is always the same, so it seems like
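For reference, a hedged sketch of the sequence the poster describes, assuming a HiveContext and an existing Hive table named some_table (both placeholders):

    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)
    hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")
    // write the query result out as Parquet; output path is an assumption
    hiveContext.sql("SELECT * FROM some_table").saveAsParquetFile("/tmp/out.parquet")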

Re: Spark History Server can't read event logs

2015-01-08 Thread Marcelo Vanzin
Hmm. Can you set the permissions of /apps/spark/historyserver/logs to 3777? I'm not sure HDFS respects the group id bit, but it's worth a try. (BTW that would only affect newly created log directories.) On Thu, Jan 8, 2015 at 1:22 AM, michael.engl...@nomura.com wrote: Hi Vanzin, I am using

Spark Streaming Checkpointing

2015-01-08 Thread Asim Jalis
Since checkpointing in streaming apps happens every checkpoint duration, in the event of failure, how is the system able to recover the state changes that happened after the last checkpoint?

Re: Spark History Server can't read event logs

2015-01-08 Thread Marcelo Vanzin
Nevermind my last e-mail. HDFS complains about not understanding 3777... On Thu, Jan 8, 2015 at 9:46 AM, Marcelo Vanzin van...@cloudera.com wrote: Hmm. Can you set the permissions of /apps/spark/historyserver/logs to 3777? I'm not sure HDFS respects the group id bit, but it's worth a try. (BTW

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
Hi, Cheng, In your code: cacheTable("tbl") sql("select * from tbl").collect() sql("select name from tbl").collect() Running the first sql, the whole table is not cached yet, so the input data will be the original JSON file. After it is cached, the json format data is removed, so the

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
Ah, my bad... You're absolutely right! Just checked how this number is computed. It turned out that once an RDD block is retrieved from the block manager, the size of the block is added to the input bytes. Spark SQL's in-memory columnar format stores all columns within a single partition into a

Spark Project Fails to run multicore in local mode.

2015-01-08 Thread mixtou
I am new to Apache Spark; now I am trying my first project, a Space Saving counting algorithm. While it works on a single core using .setMaster("local"), it fails when using .setMaster("local[4]") or any number > 1. My code follows: import

Re: Executing Spark, Error creating path from empty String.

2015-01-08 Thread Guillermo Ortiz
I was adding some bad jars I guess. I deleted all the jars and copied them again and it works. 2015-01-08 14:15 GMT+01:00 Guillermo Ortiz konstt2...@gmail.com: When I try to execute my task with Spark it starts to copy the jars it needs to HDFS and it finally fails, I don't know exactly why. I

Build spark source code with Maven in Intellij Idea

2015-01-08 Thread Todd
Hi, I have imported the Spark source code in Intellij Idea as an SBT project. I tried to do a maven install in Intellij Idea by clicking Install in the Spark Project Parent POM (root), but it failed. I would ask which profiles should be checked. What I want to achieve is starting Spark in the IDE and Hadoop

Re: Build spark source code with Maven in Intellij Idea

2015-01-08 Thread Sean Owen
Popular topic in the last 48 hours! Just about 20 minutes ago I collected some recent information on just this topic into a pull request. https://github.com/apache/spark/pull/3952 On Thu, Jan 8, 2015 at 2:24 PM, Todd bit1...@163.com wrote: Hi, I have imported the Spark source code in Intellij

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2015-01-08 Thread Spidy
Hi, Can you please explain which settings you changed?

Parquet compression codecs not applied

2015-01-08 Thread Ayoub
Hello, I tried to save a table created via the hive context as a parquet file, but whatever compression codec (uncompressed, snappy, gzip or lzo) I set via setConf, like: setConf("spark.sql.parquet.compression.codec", "gzip") the size of the generated files is always the same, so it seems like

Re: Saving partial (top 10) DStream windows to hdfs

2015-01-08 Thread Yana Kadiyska
I'm glad you solved this issue but have a followup question for you. Wouldn't Akhil's solution be better for you after all? I run similar computation where a large set of data gets reduced to a much smaller aggregate in an interval. If you do saveAsText without coalescing, I believe you'd get the

Re: Registering custom metrics

2015-01-08 Thread Gerard Maas
Very interesting approach. Thanks for sharing it! On Thu, Jan 8, 2015 at 5:30 PM, Enno Shioji eshi...@gmail.com wrote: FYI I found this approach by Ooyala. /** Instrumentation for Spark based on accumulators. * * Usage: * val instrumentation = new

Join RDDs with DStreams

2015-01-08 Thread Asim Jalis
Is there a way to join non-DStream RDDs with DStream RDDs? Here is the use case. I have a lookup table stored in HDFS that I want to read as an RDD. Then I want to join it with the RDDs that are coming in through the DStream. How can I do this? Thanks. Asim

Re: Join RDDs with DStreams

2015-01-08 Thread Gerard Maas
You are looking for dstream.transform(rdd => rdd.op(otherRdd)). The docs contain an example on how to use transform. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams -kr, Gerard. On Thu, Jan 8, 2015 at 5:50 PM, Asim Jalis asimja...@gmail.com
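Putting the pieces together, a hedged sketch for the original use case; the HDFS path and the pair DStream name are placeholders, not from the thread:

    // static lookup table read once from HDFS as a pair RDD
    val lookup = sc.textFile("hdfs:///lookup/table.csv")
                   .map { line => val f = line.split(","); (f(0), f(1)) }
    // join each batch of the pair DStream against it
    val joined = myPairDStream.transform(rdd => rdd.join(lookup))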

Re: Registering custom metrics

2015-01-08 Thread Enno Shioji
FYI I found this approach by Ooyala. /** Instrumentation for Spark based on accumulators. * * Usage: * val instrumentation = new SparkInstrumentation(example.metrics) * val numReqs = sc.accumulator(0L) * instrumentation.source.registerDailyAccumulator(numReqs, numReqs) *

Re: Spark on teradata?

2015-01-08 Thread Reynold Xin
Depending on your use cases. If the use case is to extract small amount of data out of teradata, then you can use the JdbcRDD and soon a jdbc input source based on the new Spark SQL external data source API. On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote: Hi, I have a
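A hedged JdbcRDD sketch; the JDBC URL, query, bounds and row mapping are placeholders for illustration, not Teradata specifics from the thread:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(sc,
      () => DriverManager.getConnection("jdbc:teradata://host/DATABASE=db", "user", "pass"),
      "SELECT id, amount FROM sales WHERE id >= ? AND id <= ?",   // the two ?s bound each partition
      1L, 1000000L, 10,                                            // lower bound, upper bound, partitions
      rs => (rs.getLong(1), rs.getDouble(2)))                      // map each ResultSet row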

Re: Spark on teradata?

2015-01-08 Thread gen tang
Thanks a lot for your reply. In fact, I need to work on almost all the data in teradata (~100T). So, I don't think that jdbcRDD is a good choice. Cheers Gen On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin r...@databricks.com wrote: Depending on your use cases. If the use case is to extract small

Re: SparkSQL support for reading Avro files

2015-01-08 Thread yanenli2
thanks for the reply! Now I know that this package is moved here: https://github.com/databricks/spark-avro

Find S3 file attributes by Spark

2015-01-08 Thread rajnish
Hi, We have a file in an AWS S3 bucket that is loaded frequently. When accessing that file from Spark, can we get the file properties by some method in Spark? Regards Raj

Re: Spark History Server can't read event logs

2015-01-08 Thread Marcelo Vanzin
Sorry for the noise; but I just remembered you're actually using MapR (and not HDFS), so maybe the 3777 trick could work... On Thu, Jan 8, 2015 at 10:32 AM, Marcelo Vanzin van...@cloudera.com wrote: Nevermind my last e-mail. HDFS complains about not understanding 3777... On Thu, Jan 8, 2015 at

Re: Spark Project Fails to run multicore in local mode.

2015-01-08 Thread Dean Wampler
Use local[*] instead of local to grab all available cores. Using local just grabs one. Dean On Thursday, January 8, 2015, mixtou mix...@gmail.com wrote: I am new to Apache Spark, now i am trying my first project Space Saving Counting Algorithm and while it compiles in single core using
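A minimal sketch of the suggestion, assuming the app builds its own SparkConf:

    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf()
      .setAppName("SpaceSaving")
      .setMaster("local[*]")      // all available local cores; "local[4]" would pin it to four
    val sc = new SparkContext(conf)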

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2015-01-08 Thread Aaron Davidson
Do note that this problem may be fixed in Spark 1.2, as we changed the default transfer service to use a Netty-based one rather than the ConnectionManager. On Thu, Jan 8, 2015 at 7:05 AM, Spidy yoni...@gmail.com wrote: Hi, Can you please explain which settings did you changed? -- View

Re: Spark Standalone Cluster not correctly configured

2015-01-08 Thread Josh Rosen
Can you please file a JIRA issue for this? This will make it easier to triage this issue. https://issues.apache.org/jira/browse/SPARK Thanks, Josh On Thu, Jan 8, 2015 at 2:34 AM, frodo777 roberto.vaquer...@bitmonlab.com wrote: Hello everyone. With respect to the configuration problem that

Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Sandy Ryza
Hi Mukesh, Those line numbers in ConverterUtils in the stack trace don't appear to line up with CDH 5.3: https://github.com/cloudera/hadoop-common/blob/cdh5-2.5.0_5.3.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java Is it possible

Re: Discrepancy in PCA values

2015-01-08 Thread Xiangrui Meng
The Julia code is computing the SVD of the Gram matrix. PCA should be applied to the covariance matrix. -Xiangrui On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara upulband...@gmail.com wrote: Hi All, I tried to do PCA for the Iris dataset [https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib

Data locality running Spark on Mesos

2015-01-08 Thread mvle
Hi, I've noticed running Spark apps on Mesos is significantly slower compared to stand-alone or Spark on YARN. I don't think it should be the case, so I am posting the problem here in case someone has some explanation or can point me to some configuration options i've missed. I'm running the

Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Marcelo Vanzin
Just to add to Sandy's comment, check your client configuration (generally in /etc/spark/conf). If you're using CM, you may need to run the Deploy Client Configuration command on the cluster to update the configs to match the new version of CDH. On Thu, Jan 8, 2015 at 11:38 AM, Sandy Ryza

Re: Spark on teradata?

2015-01-08 Thread Evan R. Sparks
Have you taken a look at the TeradataDBInputFormat? Spark is compatible with arbitrary hadoop input formats - so this might work for you: http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw On Thu, Jan 8, 2015 at 10:53 AM, gen tang gen.tan...@gmail.com

Re: Data locality running Spark on Mesos

2015-01-08 Thread Tim Chen
How did you run this benchmark, and is there an open version I can try it with? And what are your configurations, like spark.locality.wait, etc.? Tim On Thu, Jan 8, 2015 at 11:44 AM, mvle m...@us.ibm.com wrote: Hi, I've noticed running Spark apps on Mesos is significantly slower compared to

Initial State of updateStateByKey

2015-01-08 Thread Asim Jalis
In Spark Streaming, is there a way to initialize the state of updateStateByKey before it starts processing RDDs? I noticed that there is an overload of updateStateByKey that takes an initialRDD in the latest sources (although not in the 1.2.0 release). Is there another way to do this until this