Re: Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-24 Thread Aniket Bhatnagar
Hi all, I was finally able to get this working by setting SPARK_EXECUTOR_INSTANCES to a high number. However, I am wondering if this is a bug, because the app gets submitted but ceases to run because it can't run the desired number of workers. Shouldn't the app be rejected if it can't be run on the

Re: HdfsWordCount only counts some of the words

2014-09-24 Thread aka.fe2s
I guess it is because this example is stateless, so it outputs counts only for the given RDD. Take a look at the stateful word counter StatefulNetworkWordCount.scala On Wed, Sep 24, 2014 at 4:29 AM, SK skrishna...@gmail.com wrote: I execute it as follows: $SPARK_HOME/bin/spark-submit --master master url

Re: SparkSQL: Freezing while running TPC-H query 5

2014-09-24 Thread Samay
Hey Dan, Thanks for your reply. I have a couple of questions. 1) Were you able to verify that this is because of GC? If yes, then could you let me know how. 2) If this is GC, then do you know of any tuning I can do to reduce this GC pause? Regards, Samay On Tue, Sep 23, 2014 at 11:15 PM, Dan

[Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2014-09-24 Thread Aniket Bhatnagar
Hi all Reading through Spark streaming's custom receiver documentation, it is recommended that the onStart and onStop methods should not block indefinitely. However, looking at the source code of KinesisReceiver, the onStart method calls worker.run, which blocks until the worker is shut down (via a call to

Re: HdfsWordCount only counts some of the words

2014-09-24 Thread Sean Owen
If you look at the code for HdfsWordCount, you see it calls print(), which defaults to printing 10 elements of each RDD. If you are just talking about the console output, then it is not expected to print all words to begin with. On Wed, Sep 24, 2014 at 2:29 AM, SK skrishna...@gmail.com wrote: I

Re: Converting one RDD to another

2014-09-24 Thread Sean Owen
top returns the specified number of largest elements in your RDD. They are returned to the driver as an Array. If you want to make an RDD out of them again, call SparkContext.parallelize(...). Make sure this is what you mean though. On Wed, Sep 24, 2014 at 5:33 AM, Deep Pradhan
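A minimal sketch of what this describes (the RDD contents and names below are illustrative, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("TopExample").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq(5, 3, 9, 1, 7))

    val top3: Array[Int] = rdd.top(3)   // Array(9, 7, 5), collected to the driver
    val topRdd = sc.parallelize(top3)   // turn the driver-side result back into an RDD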

sortByKey trouble

2014-09-24 Thread david
Hi, Does anybody know how to use sortByKey in Scala on an RDD like: val rddToSave = file.map(l => l.split("\\|")).map(r => (r(34)+"-"+r(3), r(4), r(10), r(12))) because I received an error: sortByKey is not a member of org.apache.spark.rdd.RDD[(String,String,String,String)]. What I try to do

RE: sortByKey trouble

2014-09-24 Thread Shao, Saisai
Hi, sortByKey is only for RDD[(K, V)]; each tuple can only have two members, and Spark will sort by the first member. If you want to use sortByKey, you have to change your RDD[(String, String, String, String)] into RDD[(String, (String, String, String))]. Thanks Jerry -Original Message-

Re: sortByKey trouble

2014-09-24 Thread Liquan Pei
Hi David, Can you try val rddToSave = file.map(l => l.split("\\|")).map(r => (r(34)+"-"+r(3), (r(4), r(10), r(12)))) ? That should work. Liquan On Wed, Sep 24, 2014 at 1:29 AM, david david...@free.fr wrote: Hi, Does anybody know how to use sortByKey in Scala on an RDD like: val
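A cleaned-up, hedged version of the same suggestion (the quoting and closing parentheses were mangled by the archive; file is assumed to come from sc.textFile on pipe-delimited lines):

    import org.apache.spark.SparkContext._   // brings in pair-RDD functions such as sortByKey (Spark 1.x)

    val file = sc.textFile("input.txt")      // placeholder path, pipe-delimited lines
    val rddToSave = file
      .map(l => l.split("\\|"))
      .map(r => (r(34) + "-" + r(3), (r(4), r(10), r(12))))   // (key, (v1, v2, v3))
      .sortByKey()

In Spark 1.x the SparkContext._ import is what makes sortByKey available on a pair RDD in compiled code; spark-shell imports it automatically, which may explain why the same snippet compiles in the shell but not in Eclipse.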

All-time stream re-processing

2014-09-24 Thread Tobias Pfeiffer
Hi, I have a setup (in mind) where data is written to Kafka and this data is persisted in HDFS (e.g., using camus) so that I have an all-time archive of all stream data ever received. Now I want to process that all-time archive and when I am done with that, continue with the live stream, using

Spark Streaming

2014-09-24 Thread Reddy Raja
Given this program, I have the following queries: val sparkConf = new SparkConf().setAppName("NetworkWordCount") sparkConf.set("spark.master", "local[2]") val ssc = new StreamingContext(sparkConf, Seconds(10)) val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.

find subgraph in Graphx

2014-09-24 Thread uuree
Hi, I want to extract a subgraph using two clauses such as (?person worksAt MIT MIT locatesIn USA), I have this query. My data is constructed as triplets so I assume GraphX will suit it best. But I found out that the subgraph method only supports checking a single edge and it seems to me that I

Does Spark Driver works with HDFS in HA mode

2014-09-24 Thread Petr Novak
Hello, if our Hadoop cluster is configured with HA and fs.defaultFS points to a namespace instead of a namenode hostname - hdfs://namespace_name/ - then our Spark job fails with an exception. Is there anything to configure or is it not implemented? Exception in thread main

Re: java.lang.NumberFormatException while starting spark-worker

2014-09-24 Thread jishnu.prathap
Hi , I am getting this weird error while starting Worker. -bash-4.1$ spark-class org.apache.spark.deploy.worker.Worker spark://osebi-UServer:59468 Spark assembly has been built with Hive, including Datanucleus jars on classpath 14/09/24 16:22:04 INFO worker.Worker: Registered signal

RE: java.lang.NumberFormatException while starting spark-worker

2014-09-24 Thread jishnu.prathap
Hi , I am getting this weird error while starting Worker. -bash-4.1$ spark-class org.apache.spark.deploy.worker.Worker spark://osebi-UServer:59468 Spark assembly has been built with Hive, including Datanucleus jars on classpath 14/09/24 16:22:04 INFO worker.Worker: Registered signal

java.lang.NumberFormatException while starting spark-worker

2014-09-24 Thread jishnu.prathap
Hi , I am getting this weird error while starting Worker. -bash-4.1$ spark-class org.apache.spark.deploy.worker.Worker spark://osebi-UServer:59468 Spark assembly has been built with Hive, including Datanucleus jars on classpath 14/09/24 16:22:04 INFO worker.Worker: Registered signal

Re: sortByKey trouble

2014-09-24 Thread david
Thanks, I've already tried this solution but it does not compile (in Eclipse). I'm surprised to see that in the Spark shell, sortByKey works fine on both forms: (String,String,String,String) and (String,(String,String,String))

Re: All-time stream re-processing

2014-09-24 Thread Tobias Pfeiffer
Hi, On Wed, Sep 24, 2014 at 7:23 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: So you have a single Kafka topic which has very high retention period ( that decides the storage capacity of a given Kafka topic) and you want to process all historical data first using Camus and

How to sort rdd filled with existing data structures?

2014-09-24 Thread Tao Xiao
Hi, I have the following RDD: val conf = new SparkConf().setAppName("Testing Sorting") val sc = new SparkContext(conf) val L = List( (new Student("XiaoTao", 80, 29), "I'm Xiaotao"), (new Student("CCC", 100, 24), "I'm CCC"), (new Student("Jack", 90, 25), "I'm Jack"),

Spark Cassandra Connector Issue and performance

2014-09-24 Thread pouryas
Hey all, I tried the Spark connector with Cassandra and I ran into a problem that I was blocked on for a couple of weeks. I managed to find a solution to the problem but I am not sure whether it was a bug in the connector/Spark or not. I had three tables in Cassandra (running Cassandra on a 5 node

Re: Why recommend 2-3 tasks per CPU core ?

2014-09-24 Thread myasuka
Thanks for your reply! Specifically, in our development environment, I want to know whether there is any solution to let every task do just one sub-matrix multiplication. After 'groupBy', every node with 16 cores should get 16 pairs of matrices, thus I set every node with 16 tasks, hoping every core does one

Optimal Cluster Setup for Spark

2014-09-24 Thread pouryas
Hi there, What is an optimal cluster setup for Spark? Given X amount of resources, would you favour more worker nodes with fewer resources, or fewer worker nodes with more resources? Is this application dependent? If so, what are the things to consider and what are good practices? Cheers

Re: java.lang.NumberFormatException while starting spark-worker

2014-09-24 Thread Victor Tso-Guillen
Do you have a newline or some other strange character in an argument you pass to spark that includes that 61608? On Wed, Sep 24, 2014 at 4:11 AM, jishnu.prat...@wipro.com wrote: Hi , *I am getting this weird error while starting Worker. * -bash-4.1$ spark-class

Re: Spark Code to read RCFiles

2014-09-24 Thread cem
I was able to read RC files with the following line: val file: RDD[(LongWritable, BytesRefArrayWritable)] = sc.hadoopFile(hdfs://day=2014-08-10/hour=00/, classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]], classOf[LongWritable], classOf[BytesRefArrayWritable],500) Try with
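The same snippet with the quoting that the archive stripped restored and the needed imports added, as a hedged sketch (the path and partition count are placeholders):

    import org.apache.hadoop.io.LongWritable
    import org.apache.hadoop.hive.ql.io.RCFileInputFormat
    import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable
    import org.apache.spark.rdd.RDD

    val file: RDD[(LongWritable, BytesRefArrayWritable)] = sc.hadoopFile(
      "hdfs:///logs/day=2014-08-10/hour=00/",
      classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
      classOf[LongWritable],
      classOf[BytesRefArrayWritable],
      500)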

Re: Spark Streaming

2014-09-24 Thread Akhil Das
See the inline response. On Wed, Sep 24, 2014 at 4:05 PM, Reddy Raja areddyr...@gmail.com wrote: Given this program.. I have the following queries.. val sparkConf = new SparkConf().setAppName(NetworkWordCount) sparkConf.set(spark.master, local[2]) val ssc = new

Re: Multiple Kafka Receivers and Union

2014-09-24 Thread Matt Narrell
The part that works is the commented out, single receiver stream below the loop. It seems that when I call KafkaUtils.createStream more than once, I don’t receive any messages. I’ll dig through the logs, but at first glance yesterday I didn’t see anything suspect. I’ll have to look closer.

parquetFile and wildcards

2014-09-24 Thread Marius Soutier
Hello, sc.textFile and so on support wildcards in their path, but apparently sqlc.parquetFile() does not. I always receive "File /file/to/path/*/input.parquet does not exist". Is this normal or a bug? Is there a workaround? Thanks - Marius

Re: Does Spark Driver works with HDFS in HA mode

2014-09-24 Thread Matt Narrell
Yes, this works. Make sure you have HADOOP_CONF_DIR set on your Spark machines mn On Sep 24, 2014, at 5:35 AM, Petr Novak oss.mli...@gmail.com wrote: Hello, if our Hadoop cluster is configured with HA and fs.defaultFS points to a namespace instead of a namenode hostname -

Specifying Spark Executor Java options using Spark Submit

2014-09-24 Thread Arun Ahuja
What is the proper way to specify Java options for the Spark executors using spark-submit? We had done this previously using export SPARK_JAVA_OPTS='..', for example to attach a debugger to each executor or add -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. On spark-submit I

RDD of Iterable[String]

2014-09-24 Thread Deep Pradhan
Can we iterate over RDD of Iterable[String]? How do we do that? Because the entire Iterable[String] seems to be a single element in the RDD. Thank You

Re: Specifying Spark Executor Java options using Spark Submit

2014-09-24 Thread Larry Xiao
Hi Arun! I think you can find info at https://spark.apache.org/docs/latest/configuration.html quote: Spark provides three locations to configure the system: * Spark properties (https://spark.apache.org/docs/latest/configuration.html#spark-properties) control most application
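As a hedged sketch of how that typically looks in Spark 1.x (not quoted from the thread): executor JVM options go in the spark.executor.extraJavaOptions property, which can be set in SparkConf, in spark-defaults.conf, or via --conf on spark-submit; driver options use spark.driver.extraJavaOptions.

    import org.apache.spark.{SparkConf, SparkContext}

    // Replacement for the old SPARK_JAVA_OPTS for executor GC-logging / debugging flags
    val conf = new SparkConf()
      .setAppName("GcLogging")
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    val sc = new SparkContext(conf)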

Spark : java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2014-09-24 Thread Madabhattula Rajesh Kumar
Hi Team, I'm getting the below exception. Could you please help me resolve this issue? Below is my piece of code: val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]) var s = rdd.map(x

[no subject]

2014-09-24 Thread Jianshi Huang
One of my big Spark programs always gets stuck at 99%, where a few tasks never finish. I debugged it by printing out thread stack traces, and found there are workers stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup. Anyone had a similar problem? I'm using Spark 1.1.0 built for HDP2.1. The

Re:

2014-09-24 Thread Ted Yu
bq. at com.paypal.risk.rds.dragon.storage.hbase.HbaseRDDBatch$$anonfun$batchInsertEdges$3.apply(HbaseRDDBatch.scala:179) Can you reveal what HbaseRDDBatch.scala does? Cheers On Wed, Sep 24, 2014 at 8:46 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: One of my big spark program

Executor/Worker stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup and never finishes.

2014-09-24 Thread Jianshi Huang
(Forgot to add the subject, so again...) One of my big Spark programs always gets stuck at 99%, where a few tasks never finish. I debugged it by printing out thread stack traces, and found there are workers stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup. Anyone had a similar problem? I'm

Re:

2014-09-24 Thread Jianshi Huang
Hi Ted, It converts RDD[Edge] to HBase rowkey and columns and insert them to HBase (in batch). BTW, I found batched Put actually faster than generating HFiles... Jianshi On Wed, Sep 24, 2014 at 11:49 PM, Ted Yu yuzhih...@gmail.com wrote: bq. at

Re:

2014-09-24 Thread Ted Yu
Just a shot in the dark: have you checked region server (logs) to see if region server had trouble keeping up ? Cheers On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, It converts RDD[Edge] to HBase rowkey and columns and insert them to HBase (in batch).

Re:

2014-09-24 Thread Debasish Das
The HBase regionserver needs to be balanced... you might have some skewness in the row keys and one regionserver is under pressure... try finding that key and replicating it using a random salt. On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, It converts RDD[Edge] to

Re:

2014-09-24 Thread Ted Yu
I was thinking along the same line. Jianshi: See http://hbase.apache.org/book.html#d0e6369 On Wed, Sep 24, 2014 at 8:56 AM, Debasish Das debasish.da...@gmail.com wrote: HBase regionserver needs to be balanced... you might have some skewness in row keys and one regionserver is under

Re:

2014-09-24 Thread Jianshi Huang
The worker is not writing to HBase, it's just stuck. One task usually finishes in around 20~40 mins, and I waited 6 hours for the buggy ones before killing it. Only 6 out of 3000 tasks got stuck. Jianshi On Wed, Sep 24, 2014 at 11:55 PM, Ted Yu yuzhih...@gmail.com wrote: Just a shot in

Re:

2014-09-24 Thread Jianshi Huang
Hi Debasish, Tables are presplit and balanced; rows have some skew but it is not that serious. I checked the HBase Master UI, all region servers are idle (0 requests) and the table is not under any compaction. Jianshi On Wed, Sep 24, 2014 at 11:56 PM, Debasish Das debasish.da...@gmail.com wrote: HBase

WindowedDStreams and hierarchies

2014-09-24 Thread Pablo Medina
Hi everyone, I'm trying to understand how the windowed operations work. What I want to achieve is the following: val ssc = new StreamingContext(sc, Seconds(1)) val lines = ssc.socketTextStream("localhost", ) val window5 = lines.window(Seconds(5),Seconds(5)).reduce((time1:

Re:

2014-09-24 Thread Jianshi Huang
Hi Ted, See my previous reply to Debasish, all region servers are idle. I don't think it's caused by hotspotting. Besides, only 6 out of 3000 tasks were stuck, and their inputs are about only 80MB each. Jianshi On Wed, Sep 24, 2014 at 11:58 PM, Ted Yu yuzhih...@gmail.com wrote: I was

Re: task getting stuck

2014-09-24 Thread Ted Yu
Adding a subject. bq. at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:599) Looks like there might be some issue reading the Parquet file. Cheers On Wed, Sep 24, 2014 at 9:10 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, See my

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
I tried newAPIHadoopFile and it works, except that my original InputFormat extends InputFormat<Text,Text> and has a RecordReader<Text,Text>. This throws a NotSerializableException on Text - changing the type to InputFormat<StringBuffer, StringBuffer> works with minor code changes. I do not, however,

Spark Hbase

2014-09-24 Thread Madabhattula Rajesh Kumar
Hi Team, Could you please point me to an example program for Spark HBase to read columns and values. Regards, Rajesh

Re: task getting stuck

2014-09-24 Thread Debasish Das
Spark SQL reads parquet files fine... did you follow one of these to read/write parquet from Spark? http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ On Wed, Sep 24, 2014 at 9:29 AM, Ted Yu yuzhih...@gmail.com wrote: Adding a subject. bq. at parquet.hadoop.ParquetFileReader$

Re: Spark Hbase

2014-09-24 Thread Ted Yu
Take a look at the following under examples: examples/src/main/python/hbase_inputformat.py examples/src/main/python/hbase_outputformat.py examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
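The HBaseTest example builds an RDD of (ImmutableBytesWritable, Result) pairs; as a hedged sketch of pulling values out of such an RDD (the table name, column family and qualifier below are placeholders), the trick is to convert each Result to plain Strings inside the map so nothing non-serializable has to leave the closure:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "mytable")

    val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // Assumes every row has column family "cf" with qualifier "col"
    val values = hbaseRDD.map { case (_, result) =>
      val rowKey = Bytes.toString(result.getRow)
      val col = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
      (rowKey, col)
    }
    values.take(5).foreach(println)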

Re: Spark Hbase

2014-09-24 Thread Madabhattula Rajesh Kumar
Hi Ted, Thank you for the quick response and details. I have verified the HBaseTest.scala program; it returns an hbaseRDD, but I'm not able to retrieve values from it. Could you help me retrieve the values from the hbaseRDD? Regards, Rajesh On Wed, Sep 24, 2014 at 10:12 PM, Ted Yu

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Russ Weeks
I use newAPIHadoopRDD with AccumuloInputFormat. It produces a PairRDD using Accumulo's Key and Value classes, both of which extend Writable. Works like a charm. I use the same InputFormat for all my MR jobs. -Russ On Wed, Sep 24, 2014 at 9:33 AM, Steve Lewis lordjoe2...@gmail.com wrote: I

Re: How to sort rdd filled with existing data structures?

2014-09-24 Thread Sean Owen
See the scaladoc for how to define an implicit ordering to use with sortByKey: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions Off the top of my head, I think this is 90% correct to order by age for example: implicit val studentOrdering:
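A hedged completion of the truncated snippet (the Student definition here is hypothetical; the real class in the thread is not shown):

    // Hypothetical Student class; adjust the field used for ordering as needed
    case class Student(name: String, score: Int, age: Int)

    // Order students by age
    implicit val studentOrdering: Ordering[Student] = Ordering.by(_.age)

    // With the implicit in scope, an RDD keyed by Student can be sorted
    import org.apache.spark.SparkContext._
    val rdd = sc.parallelize(Seq((Student("Jack", 90, 25), "I'm Jack"),
                                 (Student("CCC", 100, 24), "I'm CCC")))
    val sorted = rdd.sortByKey()   // sorted by age via studentOrdering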

Re: parquetFile and wildcards

2014-09-24 Thread Michael Armbrust
This behavior is inherited from the parquet input format that we use. You could list the files manually and pass them as a comma separated list. On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier mps@gmail.com wrote: Hello, sc.textFile and so on support wildcards in their path, but
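A hedged sketch of that workaround (paths are placeholders; sqlContext is a plain SQLContext): expand the glob with the Hadoop FileSystem API and hand parquetFile the resulting comma-separated list.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Expand the wildcard ourselves, then pass a comma-separated list of files
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val inputs = fs.globStatus(new Path("/file/to/path/*/input.parquet"))
      .map(_.getPath.toString)
      .mkString(",")
    val data = sqlContext.parquetFile(inputs)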

Re: parquetFile and wildcards

2014-09-24 Thread Marius Soutier
Thank you, that works! On 24.09.2014, at 19:01, Michael Armbrust mich...@databricks.com wrote: This behavior is inherited from the parquet input format that we use. You could list the files manually and pass them as a comma separated list. On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier

Re: parquetFile and wildcards

2014-09-24 Thread Nicholas Chammas
Does it make sense for us to open a JIRA to track enhancing the Parquet input format to support wildcards? Or is this something outside of Spark's control? Nick On Wed, Sep 24, 2014 at 1:01 PM, Michael Armbrust mich...@databricks.com wrote: This behavior is inherited from the parquet input

Re: RDD of Iterable[String]

2014-09-24 Thread Liquan Pei
Hi Deep, The Iterable trait in Scala has methods like map and reduce that you can use to iterate over the elements of an Iterable[String]. You can also create an Iterator from the Iterable. For example, suppose you have val rdd: RDD[Iterable[String]]; you can do rdd.map { x => // x has type Iterable[String]
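A small self-contained sketch of the same idea (the data is made up):

    import org.apache.spark.rdd.RDD

    val rdd: RDD[Iterable[String]] = sc.parallelize(Seq(Seq("a", "bb"), Seq("ccc")))

    // Each RDD element is one Iterable[String]; normal collection methods apply to it
    val totalLengths = rdd.map { x =>        // x has type Iterable[String]
      x.map(_.length).sum
    }

    // Or flatten it, so every String becomes its own RDD element
    val flat: RDD[String] = rdd.flatMap(identity)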

Re: Multiple Kafka Receivers and Union

2014-09-24 Thread Tim Smith
Maybe differences between JavaPairDStream and JavaPairReceiverInputDStream? On Wed, Sep 24, 2014 at 7:46 AM, Matt Narrell matt.narr...@gmail.com wrote: The part that works is the commented out, single receiver stream below the loop. It seems that when I call KafkaUtils.createStream more than

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
Do your custom Writable classes implement Serializable - I think that is the only real issue - my code uses vanilla Text

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Russ Weeks
No, they do not implement Serializable. There are a couple of places where I've had to do a Text-String conversion but generally it hasn't been a problem. -Russ On Wed, Sep 24, 2014 at 10:27 AM, Steve Lewis lordjoe2...@gmail.com wrote: Do your custom Writable classes implement Serializable - I
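A hedged sketch of that Text-to-String conversion, using the stock KeyValueTextInputFormat as a stand-in for whatever InputFormat is actually in use (the path is a placeholder):

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

    // Map the Writables to plain Strings right away, before anything
    // (a shuffle, a collect, ...) has to serialize Text
    val stringPairs = sc
      .newAPIHadoopFile("/data/kv",
        classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
      .map { case (k, v) => (k.toString, v.toString) }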

RDD save as Seq File

2014-09-24 Thread Shay Seng
Hi, Why does RDD.saveAsObjectFile() to S3 leave a bunch of *_$folder$ empty files around? Is it possible for the saveas to clean up? tks

Re: Spark with YARN

2014-09-24 Thread Greg Hill
Do you have YARN_CONF_DIR set in your environment to point Spark to where your yarn configs are? Greg From: Raghuveer Chanda raghuveer.cha...@gmail.com Date: Wednesday, September 24, 2014 12:25 PM To:

Re: Spark with YARN

2014-09-24 Thread Matt Narrell
This just shows the driver. Click the Executors tab in the Spark UI mn On Sep 24, 2014, at 11:25 AM, Raghuveer Chanda raghuveer.cha...@gmail.com wrote: Hi, I'm new to spark and facing problem with running a job in cluster using YARN. Initially i ran jobs using spark master as --master

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
Hmmm - I have only tested in local mode but I got a java.io.NotSerializableException: org.apache.hadoop.io.Text at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-24 Thread jatinpreet
Hi, I was able to get the training running in local mode with default settings; there was a problem with document labels which were quite large (not 20 as suggested earlier). I am currently training 175000 documents on a single node with 2GB of executor memory and 5GB of driver memory

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
You'll need to look at the driver output to have a better idea of what's going on. You can use yarn logs --applicationId blah after your app is finished (e.g. by killing it) to look at it. My guess is that your cluster doesn't have enough resources available to service the container request

Re: Spark with YARN

2014-09-24 Thread Raghuveer Chanda
Thanks for the reply, I have a doubt as to which path to set for YARN_CONF_DIR. My /etc/hadoop folder has the following sub folders: conf conf.cloudera.hdfs conf.cloudera.mapreduce conf.cloudera.yarn and both conf and conf.cloudera.yarn folders have yarn-site.xml. As of now I set the variable as

Re: Spark with YARN

2014-09-24 Thread Raghuveer Chanda
The screenshot executors.8080.png is of the Executors tab itself and only the driver is added, without workers, even though I set the master to yarn-cluster. On Wed, Sep 24, 2014 at 11:18 PM, Matt Narrell matt.narr...@gmail.com wrote: This just shows the driver. Click the Executors tab in the Spark UI

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2014-09-24 Thread Victor Tso-Guillen
Really? What should we make of this? 24 Sep 2014 10:03:36,772 ERROR [Executor task launch worker-52] Executor - Exception in task ID 40599 java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:789) at

Re: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2014-09-24 Thread Victor Tso-Guillen
Never mind: https://issues.apache.org/jira/browse/SPARK-1476 On Wed, Sep 24, 2014 at 11:10 AM, Victor Tso-Guillen v...@paxata.com wrote: Really? What should we make of this? 24 Sep 2014 10:03:36,772 ERROR [Executor task launch worker-52] Executor - Exception in task ID 40599

spark.sql.autoBroadcastJoinThreshold

2014-09-24 Thread sridhar1135
Does this work with spark-sql in 1.0.1 too? I tried it like this: sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=1;") But it still seems to trigger shuffleMapTask() and the amount of shuffle is the same with/without this parameter... Kindly request some help here. Thanks

Re: parquetFile and wildcards

2014-09-24 Thread Michael Armbrust
We could certainly do this. The comma separated support is something I added. On Wed, Sep 24, 2014 at 10:20 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Does it make sense for us to open a JIRA to track enhancing the Parquet input format to support wildcards? Or is this something

Re: RDD save as Seq File

2014-09-24 Thread Sean Owen
It's really Hadoop's support for S3. Hadoop FS semantics need directories, and S3 doesn't have a proper notion of directories. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html You should leave them AFAIK. On Wed, Sep 24, 2014 at 6:41 PM, Shay Seng

Re: Spark with YARN

2014-09-24 Thread Raghuveer Chanda
Thanks for the reply. This is the error in the logs obtained from the UI at http://dml3:8042/node/containerlogs/container_1411578463780_0001_02_01/chanda So now how do I set the log server URL? Failed while trying to construct the redirect url to the log server. Log Server url may not be

Re: Spark Code to read RCFiles

2014-09-24 Thread Pramod Biligiri
I'm afraid SparkSQL isn't an option for my use case, so I need to use the Spark API itself. I turned off Kryo, and I'm getting a NullPointerException now: scala> val ref = file.take(1)(0)._2 ref: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable =

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
You need to use the command line yarn application that I mentioned (yarn logs). You can't look at the logs through the UI after the app stops. On Wed, Sep 24, 2014 at 11:16 AM, Raghuveer Chanda raghuveer.cha...@gmail.com wrote: Thanks for the reply .. This is the error in the logs obtained from

Null values in pyspark Row

2014-09-24 Thread jamborta
Hi all, I have just updated to spark 1.1.0. The new row representation of the data in spark SQL is very handy. I have noticed that it does not set None for NULL values coming from Hive if the column was string type - seems it works with other types. Is that something that will be implemented?

Re: Null values in pyspark Row

2014-09-24 Thread Davies Liu
Could you create a JIRA and add test cases for it? Thanks! Davies On Wed, Sep 24, 2014 at 11:56 AM, jamborta jambo...@gmail.com wrote: Hi all, I have just updated to spark 1.1.0. The new row representation of the data in spark SQL is very handy. I have noticed that it does not set None for

Re: Spark with YARN

2014-09-24 Thread Raghuveer Chanda
Yeah I got the logs and it's reporting about the memory. 14/09/25 00:08:26 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory. Now I shifted to a bigger cluster with more memory but here I'm not

Re: Spark Code to read RCFiles

2014-09-24 Thread cem
I used the following code as an example to deserialize BytesRefArrayWritable. http://www.massapi.com/source/hive-0.5.0-dev/src/ql/src/test/org/apache/hadoop/hive/ql/io/TestRCFile.java.html Best Regards, Cem. On Wed, Sep 24, 2014 at 1:34 PM, Pramod Biligiri pramodbilig...@gmail.com wrote:

persist before or after checkpoint?

2014-09-24 Thread Shay Seng
Hey, I actually have 2 questions. (1) I want to generate unique IDs for each RDD element and I want to assign them in parallel, so I do rdd.mapPartitionsWithIndex((index, s) => { var count = 0L s.zipWithIndex.map { case (t, i) => { count += 1 (index *
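One hedged way to finish that pattern (not necessarily what the snippet above goes on to do) is to give each partition its own disjoint ID range; Spark's built-in zipWithUniqueId does something similar:

    val rdd = sc.parallelize(1 to 10, 3)          // example data

    // OFFSET must exceed the size of the largest partition (value here is illustrative)
    val OFFSET = 1000000L
    val withIds = rdd.mapPartitionsWithIndex { (index, iter) =>
      iter.zipWithIndex.map { case (t, i) => (index * OFFSET + i, t) }
    }

    // Built-in alternative: unique but not consecutive IDs
    val withIds2 = rdd.zipWithUniqueId()          // RDD[(Int, Long)] of (element, id)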

Hive Avro table fail (pyspark)

2014-09-24 Thread ilyagluk
Dear Spark Users, I'd highly appreciate any comments on whether there is a way to define an Avro table and then run select * without errors. Thank you very much in advance! Ilya. Attached please find a sample data directory and the Avro schema. analytics_events2.zip

Re: New sbt plugin to deploy jobs to EC2

2014-09-24 Thread Shafaq
Hi, testing out the Spark Ec2 deployment plugin: I try to compile using $sbt sparkLaunchCluster --- [info] Resolving org.fusesource.jansi#jansi;1.4 ... [warn]

Question About Submit Application

2014-09-24 Thread danilopds
Hello, I'm learning about Spark Streaming and I'm really excited. Today I was testing packaging some apps and submitting them to a standalone cluster locally on my computer. That worked ok. So, I created one Virtual Machine with a bridged network and I tried to submit the app again to this VM from my local

Re: Spark Streaming Twitter Example Error

2014-09-24 Thread danilopds
I solved this question using the SBT plugin sbt-assembly. It's very good! Bye.

Re: Memory/Network Intensive Workload

2014-09-24 Thread danilopds
Thank you for the suggestion! Bye.

Re: Question About Submit Application

2014-09-24 Thread danilopds
One more piece of information: when I submitted the application from my local PC to my VM, the VM was the master and worker and my local PC wasn't part of the cluster. Thanks.

Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
I am attempting to use Spark Streaming to summarize event data streaming in and save it to a MySQL table. The input source data is stored in 4 topics on Kafka with each topic having 12 partitions. My approach to doing this worked both in local development and a simulated load testing environment

Re: Logging in Spark through YARN.

2014-09-24 Thread Vipul Pandey
Archit, Are you able to get it to work with 1.0.0? I tried the --files suggestion from Marcelo and it just changed logging for my client and the appmaster and executors were still the same. ~Vipul On Jul 30, 2014, at 9:59 PM, Archit Thakur archit279tha...@gmail.com wrote: Hi Marcelo,

Re: Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
Oh, I should add that I've tried a range of batch durations and reduce-by-window durations to no effect. I'm not too sure how to choose these. Currently today I've been testing with batch durations of 1 minute - 10 minutes and reduce window durations of 10 or 20 minutes.

Re: Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
Another update: actually it just hit me, my problem is probably right here: https://gist.github.com/maddenpj/74a4c8ce372888ade92d#file-gistfile1-scala-L22 I'm creating a JDBC connection on every record; that's probably what's killing the performance. I assume the fix is just broadcast the
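The pattern usually recommended for this (for example in the Spark Streaming programming guide) is one connection per partition via foreachPartition rather than one per record. A hedged sketch, assuming a DStream[(String, Long)] of aggregated counts and a placeholder JDBC URL/table:

    import java.sql.DriverManager

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One connection and prepared statement per partition, reused for all its records
        val conn = DriverManager.getConnection("jdbc:mysql://dbhost/metrics", "user", "pass")
        val stmt = conn.prepareStatement("INSERT INTO events (name, count) VALUES (?, ?)")
        records.foreach { case (name, count) =>
          stmt.setString(1, name)
          stmt.setLong(2, count)
          stmt.executeUpdate()
        }
        stmt.close()
        conn.close()
      }
    }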

Re: mllib performance on mesos cluster

2014-09-24 Thread Sudha Krishna
Setting spark.mesos.coarse=true helped reduce the time on the mesos cluster from 17 min to around 6 min. The scheduler delay per task reduced from 40 ms to around 10 ms. thanks On Mon, Sep 22, 2014 at 12:36 PM, Xiangrui Meng men...@gmail.com wrote: 1) MLlib 1.1 should be faster than 1.0 in

Re: java.lang.OutOfMemoryError while running SVD MLLib example

2014-09-24 Thread Shailesh Birari
Note, the data is random numbers (double). Any suggestions/pointers will be highly appreciated. Thanks, Shailesh

MLUtils.loadLibSVMFile error

2014-09-24 Thread Sameer Tilak
Hi All, When I try to load a dataset using MLUtils.loadLibSVMFile, I have the following problem. Any help will be greatly appreciated! Code snippet: import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.util.MLUtils import org.apache.spark.rdd.RDD import

Bug in JettyUtils?

2014-09-24 Thread Jim Donahue
I've been seeing a failure in JettyUtils.createStaticHandler when running Spark 1.1 programs from Eclipse which I hadn't seen before. The problem seems to be line 129 Option(Utils.getSparkClassLoader.getResource(resourceBase)) match { ... getSparkClassLoader returns NULL, and it's all

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
If you launched the job in yarn-cluster mode, the tracking URL is printed on the output of the launched process. That will lead you to the Spark UI once the job is running. If you're using CM, you can reach the same link by clicking on the Resource Manager UI link on your Yarn service, then

Re: Question About Submit Application

2014-09-24 Thread Marcelo Vanzin
Sounds like spark-01 is not resolving correctly on your machine (or is the wrong address). Can you ping spark-01 and does that reach the VM where you set up the Spark Master? On Wed, Sep 24, 2014 at 1:12 PM, danilopds danilob...@gmail.com wrote: Hello, I'm learning about Spark Streaming and I'm

Re: MLUtils.loadLibSVMFile error

2014-09-24 Thread Liquan Pei
Hi Sameer, This seems to be a file format issue. Can you make sure that your data satisfies the format? Each line of libsvm format is as follows: label index1:value1 index2:value2 ... Thanks, Liquan On Wed, Sep 24, 2014 at 3:02 PM, Sameer Tilak ssti...@live.com wrote: Hi All, When I
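For reference, a hedged sketch of what a well-formed file and the load call look like (the path is a placeholder; MLlib expects one-based feature indices):

    // Example lines of a valid libsvm file:
    //   0 1:0.25 3:1.0
    //   1 2:0.5 4:0.75
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    data.take(1).foreach(println)   // each element is a LabeledPoint(label, sparse feature vector)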

Re: How to sort rdd filled with existing data structures?

2014-09-24 Thread Liquan Pei
You only need to define an ordering for Student; there is no need to modify the class definition of Student. It's like a Comparator class in Java. Currently, you have to map the rdd to sort by value. Liquan On Wed, Sep 24, 2014 at 9:52 AM, Sean Owen so...@cloudera.com wrote: See the scaladoc for how to

Re: Anybody built the branch for Adaptive Boosting, extension to MLlib by Manish Amde?

2014-09-24 Thread Aris
Hi Manish! Thanks for the reply and the explication on why the branches won't compile -- that makes perfect sense. You mentioned making the branch compatible with the latest master; could you share some more details? Which branch do you mean -- is it on your GitHub? And would I just be able to

Re: return probability \ confidence instead of actual class

2014-09-24 Thread Aris
Greetings Adamantios Korais, if that is indeed your name... Just to follow up on Liquan, you might be interested in removing the thresholds, and then treating the predictions as a probability from 0..1 inclusive. SVM with the linear kernel is a straightforward linear classifier -- so you with the
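In MLlib 1.x, the usual way to remove the threshold is model.clearThreshold(); after that, predict() returns the raw score instead of a 0/1 class. A hedged sketch using the generic MLlib workflow (path and iteration count are placeholders, not the poster's code):

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.util.MLUtils

    val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    val model = SVMWithSGD.train(training, 100)   // 100 iterations

    model.clearThreshold()                        // predict() now returns the raw score/margin
    val scored = training.map(p => (model.predict(p.features), p.label))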
