Re: Efficient Key Structure in pairRDD

2014-11-11 Thread nsareen
Spark Dev / Users, help in this regard would be appreciated, we are kind of stuck at this point. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-Key-Structure-in-pairRDD-tp18461p18557.html Sent from the Apache Spark User List mailing list archive

How to change the default delimiter for textFile function

2014-11-11 Thread Blind Faith
I am a newbie to spark, and I program in Python. I use the textFile function to make an RDD from a file. I notice that the default delimiter is newline. However, I want to change this default delimiter to something else. After searching the web, I came to know about textinputformat.record.delimiter
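A common workaround (not in the truncated message above) is to read the file through the new Hadoop API and set textinputformat.record.delimiter on the job configuration. A minimal Scala sketch, assuming a double-newline record delimiter and a placeholder input path; the same configuration key can be passed to PySpark's sc.newAPIHadoopFile via its conf argument:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // assumption: records are separated by a blank line; adjust the delimiter as needed
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("textinputformat.record.delimiter", "\n\n")

    val records = sc
      .newAPIHadoopFile("hdfs:///path/to/input", classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text], hadoopConf)
      .map { case (_, text) => text.toString }   // keep only the record text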

Re: Fwd: Executor Lost Failure

2014-11-11 Thread Ritesh Kumar Singh
Yes... found the output on web UI of the slave. Thanks :) On Tue, Nov 11, 2014 at 2:48 AM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote: Tasks are now getting submitted, but many tasks don't happen. Like, after

Re: disable log4j for spark-shell

2014-11-11 Thread Ritesh Kumar Singh
go to your spark home and then into the conf/ directory and edit the log4j.properties file, i.e.: gedit $SPARK_HOME/conf/log4j.properties and set the root logger to: log4j.rootCategory=WARN, console You don't need to build spark for the changes to take place. Whenever you open spark-shell, it
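For reference, the change described above amounts to a one-line edit (a sketch; console is the appender name used in the default template):

    # $SPARK_HOME/conf/log4j.properties
    log4j.rootCategory=WARN, console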

Fwd: disable log4j for spark-shell

2014-11-11 Thread Ritesh Kumar Singh
-- Forwarded message -- From: Ritesh Kumar Singh riteshoneinamill...@gmail.com Date: Tue, Nov 11, 2014 at 2:18 PM Subject: Re: disable log4j for spark-shell To: lordjoe lordjoe2...@gmail.com Cc: u...@spark.incubator.apache.org go to your spark home and then into the conf/

Is there any benchmark suite for spark?

2014-11-11 Thread Hu Liu
I need to do some testing on spark and am looking for some open source tools. Is there any benchmark suite for spark that covers common use cases, like Intel's HiBench (https://github.com/intel-hadoop/HiBench)? -- Best Regards, Hu Liu

Cassandra spark connector exception: NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;

2014-11-11 Thread shahab
Hi, I have a spark application which uses the Cassandra connector (spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar) to load data from Cassandra into spark. Everything works fine in local mode, when I run it in my IDE. But when I submit the application to be executed in a standalone Spark server,

JdbcRDD and ClassTag issue

2014-11-11 Thread nitinkalra2000
Hi All, I am trying to access SQL Server through JdbcRDD, but I am getting an error on the ClassTag placeholder. Here is the code which I wrote: public void readFromDB() { String sql = "Select * from Table_1 where values >= ? and values <= ?";

Where can I find logs from workers PySpark

2014-11-11 Thread jan.zikes
Hi,  I am trying to do some logging in my PySpark jobs, particularly in the map that is performed on workers. Unfortunately I am not able to find these logs. Based on the documentation it seems that the logs should be on masters in SPARK_HOME, in the work directory

java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener

2014-11-11 Thread Jishnu Menath Prathap (WT01 - BAS)
Hi I am getting the following error while executing a scala_twitter program for spark 14/11/11 16:39:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor with error: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V 14/11/11 16:39:23 ERROR

Re: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener

2014-11-11 Thread Akhil Das
You don't have the twitter4j jars in the classpath. While running it in cluster mode you need to ship those dependency jars. You can do it like: sparkConf.setJars(Seq("/home/akhld/jars/twitter4j-core-3.0.3.jar", "/home/akhld/jars/twitter4j-stream-3.0.3.jar")) You can make sure they are shipped by
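A minimal sketch of that suggestion (the jar paths are the ones quoted in the mail and are assumed to exist on the driver machine; setJars takes a Seq of paths):

    import org.apache.spark.SparkConf

    val sparkConf = new SparkConf()
      .setAppName("TwitterPopularTags")
      .setJars(Seq("/home/akhld/jars/twitter4j-core-3.0.3.jar",
                   "/home/akhld/jars/twitter4j-stream-3.0.3.jar"))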

save as file

2014-11-11 Thread Naveen Kumar Pokala
Hi, I am on Spark 1.1.0. I need help with saving an RDD to a JSON file. How do I do that? And how do I mention an HDFS path in the program? -Naveen

RE: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener

2014-11-11 Thread Jishnu Menath Prathap (WT01 - BAS)
Hi, Thank you Akhil for the reply. I am not using cluster mode, I am doing it in local mode: val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster("local").set("spark.eventLog.enabled","true") Also, is it documented anywhere which Twitter4j version is to be

Spark-submit and Windows / Linux mixed network

2014-11-11 Thread Ashic Mahtab
Hi, I'm trying to submit a spark application from a network share to the spark master. Network shares are configured so that the master and all nodes have access to the target jar at (say): \\shares\publish\Spark\app1\someJar.jar And this is mounted on each linux box (i.e. master and workers) at:

Re: java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener

2014-11-11 Thread Akhil Das
You can pick the dependency version from here http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-twitter_2.10 Thanks Best Regards On Tue, Nov 11, 2014 at 6:36 PM, Jishnu Menath Prathap (WT01 - BAS) jishnu.prat...@wipro.com wrote: Hi Thank you Akhil for reply.

Best practice for multi-user web controller in front of Spark

2014-11-11 Thread bethesda
We are relatively new to spark and so far have been manually submitting single jobs at a time for ML training, during our development process, using spark-submit. Each job accepts a small user-submitted data set and compares it to every data set in our hdfs corpus, which only changes

subscribe

2014-11-11 Thread DAVID SWEARINGEN
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

How to kill a Spark job running in cluster mode ?

2014-11-11 Thread Tao Xiao
I'm using Spark 1.0.0 and I'd like to kill a job running in cluster mode, which means the driver is not running on local node. So how can I kill such a job? Is there a command like hadoop job -kill job-id which kills a running MapReduce job ? Thanks

Re: save as file

2014-11-11 Thread Akhil Das
One approach would be to use saveAsNewAPIHadoopFile and specify a JSON output format. Another simple one would be like: val rdd = sc.parallelize(1 to 100) val json = rdd.map(x => { val m: Map[String, Int] = Map("id" -> x); new JSONObject(m) }) json.saveAsTextFile("output") Thanks Best

Re: save as file

2014-11-11 Thread Ritesh Kumar Singh
We have RDD.saveAsTextFile and RDD.saveAsObjectFile for saving the output to any location specified. The params to be provided are: path of storage location, no. of partitions. For giving an hdfs path we use the following format: /user/user-name/directory-to-store/ On Tue, Nov 11, 2014 at 6:28 PM,

Re: Custom persist or cache of RDD?

2014-11-11 Thread Daniel Siegmann
But that requires an (unnecessary) load from disk. I have run into this same issue, where we want to save intermediate results but continue processing. The cache / persist feature of Spark doesn't seem designed for this case. Unfortunately I'm not aware of a better solution with the current

Re: How to kill a Spark job running in cluster mode ?

2014-11-11 Thread Ritesh Kumar Singh
There is a property, spark.ui.killEnabled, which needs to be set to true for killing applications directly from the web UI. Check the link: Kill Enable spark job http://spark.apache.org/docs/latest/configuration.html#spark-ui Thanks On Tue, Nov 11, 2014 at 7:42 PM, Sonal Goyal
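For a standalone cluster there are two related knobs (a sketch, not from the thread; the master URL and driver ID below are placeholders):

    # conf/spark-defaults.conf -- enables the kill links in the web UI
    spark.ui.killEnabled true

    # kill a driver that was submitted in cluster mode, by the driver ID shown on the master UI
    ./bin/spark-class org.apache.spark.deploy.Client kill spark://master-host:7077 driver-20141111120000-0000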

Re: Spark + Tableau

2014-11-11 Thread Bojan Kostic
I finally solved the issue with the Spark Tableau connection. Thanks Denny Lee for the blog post: https://www.concur.com/blog/en-us/connect-tableau-to-sparksql The solution was to use Authentication type Username, and then use the username for the metastore. Best regards Bojan -- View this message in context:

Re: Broadcast failure with variable size of ~ 500mb with key already cancelled ?

2014-11-11 Thread Tom Seddon
Hi, Just wondering if anyone has any advice about this issue, as I am experiencing the same thing. I'm working with multiple broadcast variables in PySpark, most of which are small, but one of around 4.5GB, using 10 workers at 31GB memory each and driver with same spec. It's not running out of

Re: Cassandra spark connector exception: NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;

2014-11-11 Thread Helena Edelson
Hi, It looks like you are building from master (spark-cassandra-connector-assembly-1.2.0). - Append this to your com.google.guava declaration: % provided - Be sure your version of the connector dependency is the same as the assembly build. For instance, if you are using 1.1.0-beta1, build your
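In sbt terms the "% provided" suggestion reads roughly as below (the guava version shown is only an assumption; keep whatever version your build already declares):

    libraryDependencies += "com.google.guava" % "guava" % "16.0.1" % "provided"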

Re: Spark-submit and Windows / Linux mixed network

2014-11-11 Thread Ritesh Kumar Singh
Never tried this form but just guessing: what's the output when you submit this jar: \\shares\publish\Spark\app1\someJar.jar using spark-submit.cmd

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-11-11 Thread Tom Seddon
Yes please can you share. I am getting this error after expanding my application to include a large broadcast variable. Would be good to know if it can be fixed with configuration. On 23 October 2014 18:04, Michael Campbell michael.campb...@gmail.com wrote: Can you list what your fix was so

scala.MatchError

2014-11-11 Thread Naveen Kumar Pokala
Hi, This is my Instrument java constructor. public Instrument(Issue issue, Issuer issuer, Issuing issuing) { super(); this.issue = issue; this.issuer = issuer;

Combining data from two tables in two databases postgresql, JdbcRDD.

2014-11-11 Thread akshayhazari
I want to be able to perform a query on two tables in different databases. I want to know whether it can be done. I've heard about union of two RDD's but here I want to connect to something like different partitions of a table. Any help is appreciated import java.io.Serializable; //import

Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Sonal Goyal
I believe the Spark Job Server by Ooyala can help you share data across multiple jobs, take a look at http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It seems to fit closely to what you need. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co

Spark and Play

2014-11-11 Thread Akshat Aranya
Hi, Sorry if this has been asked before; I didn't find a satisfactory answer when searching. How can I integrate a Play application with Spark? I'm getting into issues of akka-actor versions. Play 2.2.x uses akka-actor 2.0, whereas Play 2.3.x uses akka-actor 2.3.4, neither of which work fine

Re: thrift jdbc server probably running queries as hive query

2014-11-11 Thread Sadhan Sood
Hi Cheng, I made sure the only hive server running on the machine is hivethriftserver2. /usr/lib/jvm/default-java/bin/java -cp /usr/lib/hadoop/lib/hadoop-lzo.jar::/mnt/sadhan/spark-3/sbin/../conf:/mnt/sadhan/spark-3/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.0.2.jar:/etc/hadoop/conf -Xms512m

What should be the number of partitions after a union and a subtractByKey

2014-11-11 Thread Darin McBeath
Assume the following where both updatePairRDD and deletePairRDD are both HashPartitioned.  Before the union, each one of these has 512 partitions.   The new created updateDeletePairRDD has 1024 partitions.  Is this the general/expected behavior for a union (the number of partitions to double)?

Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Evan R. Sparks
For sharing RDDs across multiple jobs - you could also have a look at Tachyon. It provides an HDFS compatible in-memory storage layer that keeps data in memory across multiple jobs/frameworks - http://tachyon-project.org/ . - On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal sonalgoy...@gmail.com

data locality, task distribution

2014-11-11 Thread Nathan Kronenfeld
Can anyone point me to a good primer on how spark decides where to send what task, how it distributes them, and how it determines data locality? I'm trying a pretty simple task - it's doing a foreach over cached data, accumulating some (relatively complex) values. So I see several

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread Xiangrui Meng
Vincenzo sent a PR and included k-means as an example. Sean is helping review it. PMML standard is quite large. So we may start with simple model export, like linear methods, then move forward to tree-based. -Xiangrui On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote: Hello

Re: Broadcast failure with variable size of ~ 500mb with key already cancelled ?

2014-11-11 Thread Davies Liu
There is a open PR [1] to support broadcast larger than 2G, could you try it? [1] https://github.com/apache/spark/pull/2659 On Tue, Nov 11, 2014 at 6:39 AM, Tom Seddon mr.tom.sed...@gmail.com wrote: Hi, Just wondering if anyone has any advice about this issue, as I am experiencing the same

Re: MLLib Decision Tress algorithm hangs, others fine

2014-11-11 Thread Xiangrui Meng
Could you provide more information? For example, spark version, dataset size (number of instances/number of features), cluster size, error messages from both the driver and the executor. -Xiangrui On Mon, Nov 10, 2014 at 11:28 AM, tsj tsj...@gmail.com wrote: Hello all, I have some text data

Re: scala.MatchError

2014-11-11 Thread Xiangrui Meng
I think you need a Java bean class instead of a normal class. See example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html (switch to the java tab). -Xiangrui On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi, This is my Instrument java

Re: scala.MatchError

2014-11-11 Thread Michael Armbrust
Xiangrui is correct that it must be a java bean; also, nested classes are not yet supported in java. On Tue, Nov 11, 2014 at 10:11 AM, Xiangrui Meng men...@gmail.com wrote: I think you need a Java bean class instead of a normal class. See example here:

Re: Still struggling with building documentation

2014-11-11 Thread Alessandro Baretta
Nichols and Patrick, Thanks for your help, but, no, it still does not work. The latest master produces the following scaladoc errors: [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55: not found: type Type [error] protected

Re: How to execute a function from class in distributed jar on each worker node?

2014-11-11 Thread aaronjosephs
I'm not sure that this will work but it makes sense to me. Basically you write the functionality in a static block in a class and broadcast that class. Not sure what your use case is but I need to load a native library and want to avoid running the init in mapPartitions if it's not necessary (just

S3 table to spark sql

2014-11-11 Thread Franco Barrientos
How can I create a date field in Spark SQL? I have an S3 table and I load it into an RDD. val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class trx_u3m(id: String, local: String, fechau3m: String, rubro: Int, sku: String, unidades: Double,

pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
Hello there, I am wondering how to get the column family names and column qualifier names when using pyspark to read an hbase table with multiple column families. I have a hbase table as follows, hbase(main):007:0> scan 'data1' ROW COLUMN+CELL

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread DB Tsai
JPMML evaluator just changed their license to AGPL or commercial license, and I think AGPL is not compatible with apache project. Any advice? https://github.com/jpmml/jpmml-evaluator Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

filtering out non English tweets using TwitterUtils

2014-11-11 Thread SK
Hi, Is there a way to extract only the English language tweets when using TwitterUtils.createStream()? The filters argument specifies the strings that need to be contained in the tweets, but I am not sure how this can be used to specify the language. thanks -- View this message in context:

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread Sean Owen
Yes, jpmml-evaluator is AGPL, but things like jpmml-model are not; they're 3-clause BSD: https://github.com/jpmml/jpmml-model So some of the scoring components are off-limits for an AL2 project but the core model components are OK. On Tue, Nov 11, 2014 at 7:40 PM, DB Tsai dbt...@dbtsai.com

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Tathagata Das
You could get all the tweets in the stream, and then apply filter transformation on the DStream of tweets to filter away non-english tweets. The tweets in the DStream is of type twitter4j.Status which has a field describing the language. You can use that in the filter. Though in practice, a lot
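A sketch of that filter, assuming the twitter4j version on the classpath actually exposes Status.getLang (as the follow-ups below note, older versions do not):

    import org.apache.spark.streaming.twitter.TwitterUtils

    val tweets = TwitterUtils.createStream(ssc, None)          // ssc: an existing StreamingContext
    val englishTweets = tweets.filter(status => status.getLang == "en")
    // with older twitter4j releases, status.getUser.getLang may be the closest available field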

Failed jobs showing as SUCCEEDED on web UI

2014-11-11 Thread Brett Meyer
I'm running a Python script using spark-submit on YARN in an EMR cluster, and if I have a job that fails due to ExecutorLostFailure or if I kill the job, it still shows up on the web UI with a FinalStatus of SUCCEEDED. Is this due to PySpark, or is there potentially some other issue with the job

Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
checked the source, found the following, class HBaseResultToStringConverter extends Converter[Any, String] { override def convert(obj: Any): String = { val result = obj.asInstanceOf[Result] Bytes.toStringBinary(result.value()) } } I feel using 'result.value()' here is a big

failed to create a table with python (single node)

2014-11-11 Thread Pagliari, Roberto
I'm executing this example from the documentation (in single node mode) # sc is an existing SparkContext. from pyspark.sql import HiveContext sqlContext = HiveContext(sc) sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") # Queries can be expressed in HiveQL. results =

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread SK
Thanks for the response. I tried the following: tweets.filter(_.getLang()="en") I get a compilation error: value getLang is not a member of twitter4j.Status But getLang() is one of the methods of twitter4j.Status since version 3.0.6 according to the doc at:

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread SK
Small typo in my code in the previous post. That should be: tweets.filter(_.getLang()=="en") -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614p18622.html Sent from the Apache Spark User List

groupBy for DStream

2014-11-11 Thread SK
Hi. 1) I don't see a groupBy() method for a DStream object. Not sure why that is not supported. Currently I am using filter() to separate out the different groups. I would like to know if there is a way to convert a DStream object to a regular RDD so that I can apply the RDD methods like
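DStream.transform() applies an arbitrary RDD-to-RDD function to each batch, which covers both questions above since any RDD method (including groupBy) is then usable per batch. A minimal sketch, with lines standing in for any DStream[String]:

    // group each batch's records by their first character (hypothetical grouping key)
    val groupedPerBatch = lines.transform(rdd => rdd.groupBy(line => line.headOption.getOrElse(' ')))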

Re: closure serialization behavior driving me crazy

2014-11-11 Thread Sandy Ryza
I tried turning on the extended debug info. The Scala output is a little opaque (lots of - field (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC), but it seems like, as expected, somehow the full array of OLSMultipleLinearRegression objects is

Re: Still struggling with building documentation

2014-11-11 Thread Patrick Wendell
The doc build appears to be broken in master. We'll get it patched up before the release: https://issues.apache.org/jira/browse/SPARK-4326 On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta alexbare...@gmail.com wrote: Nichols and Patrick, Thanks for your help, but, no, it still does not

Re: Spark and Play

2014-11-11 Thread Patrick Wendell
Hi There, Because Akka versions are not binary compatible with one another, it might not be possible to integrate Play with Spark 1.1.0. - Patrick On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya aara...@gmail.com wrote: Hi, Sorry if this has been asked before; I didn't find a satisfactory

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread DB Tsai
I also worry that the author of JPMML changed the license of jpmml-evaluator due to the interest of his commercial business, and he might change the license of jpmml-model in the future. Sincerely, DB Tsai --- My Blog:

Re: which is the recommended workflow engine for Apache Spark jobs?

2014-11-11 Thread Harry Brundage
We've had success with Azkaban from LinkedIn over Oozie and Luigi. http://azkaban.github.io/ Azkaban has support for many different job types, a fleshed out web UI with decent log reporting, a decent failure / retry model, a REST api, and I think support for multiple executor slaves is coming in

concat two Dstreams

2014-11-11 Thread Josh J
Hi, Is it possible to concatenate or append two Dstreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined Dstream. Thanks, Josh

Re: concat two Dstreams

2014-11-11 Thread Josh J
I think it's just called union On Tue, Nov 11, 2014 at 2:41 PM, Josh J joshjd...@gmail.com wrote: Hi, Is it possible to concatenate or append two Dstreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined

Re: S3 table to spark sql

2014-11-11 Thread Rishi Yadav
Simple Scala: val date = new java.text.SimpleDateFormat("mmdd").parse(fechau3m) should work. Replace "mmdd" with the format fechau3m is in. If you want to do it at the case class level: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) //HiveContext always a good idea import
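A small sketch combining the two suggestions above (the "yyyyMMdd" pattern is an assumption about fechau3m's real format, and trxRdd stands for the RDD of trx_u3m rows from the original post):

    import java.text.SimpleDateFormat

    val fmt = new SimpleDateFormat("yyyyMMdd")      // assumed format of fechau3m
    val withDate = trxRdd.map(t => (t.id, new java.sql.Date(fmt.parse(t.fechau3m).getTime)))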

RE: Spark and Play

2014-11-11 Thread Mohammed Guller
Actually, it is possible to integrate Spark 1.1.0 with Play 2.2.x. Here is a sample build.sbt file: name := "xyz" version := "0.1" scalaVersion := "2.10.4" libraryDependencies ++= Seq( jdbc, anorm, cache, "org.apache.spark" %% "spark-core" % "1.1.0", "com.typesafe.akka" %% "akka-actor" % "2.2.3",

RE: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Mohammed Guller
David, Here is what I would suggest: 1 - Does a new SparkContext get created in the web tier for each new request for processing? Create a single SparkContext that gets shared across multiple web requests. Depending on the framework that you are using for the web-tier, it should not be
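A minimal sketch of the shared-context idea (all names are hypothetical; the web framework wiring is omitted):

    import org.apache.spark.{SparkConf, SparkContext}

    object SharedSpark {
      // one SparkContext for the whole web tier, created lazily on first use
      lazy val sc: SparkContext =
        new SparkContext(new SparkConf().setAppName("web-controller"))
    }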

Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
Just wrote a custom converter in scala to replace HBaseResultToStringConverter. Just a couple of lines of code. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18639.html Sent from the

Re: Native / C/C++ code integration

2014-11-11 Thread Paul Wais
More thoughts. I took a deeper look at BlockManager, RDD, and friends. Suppose one wanted to get native code access to un-deserialized blocks. This task looks very hard. An RDD behaves much like a Scala iterator of deserialized values, and interop with BlockManager is all on deserialized data.

Converting Apache log string into map using delimiter

2014-11-11 Thread YaoPau
I have an RDD of logs that look like this:

in function prototypes?

2014-11-11 Thread spr
I am creating a workflow; I have an existing call to updateStateByKey that works fine, but when I created a second use where the key is a Tuple2, it's now failing with the dreaded overloaded method value updateStateByKey with alternatives ... cannot be applied to ... Comparing the two uses I'm

Re: Converting Apache log string into map using delimiter

2014-11-11 Thread YaoPau
OK I got it working with: z.map(row => (row.map(element => element.split("=")(0)) zip row.map(element => element.split("=")(1))).toMap) But I'm guessing there is a more efficient way than to create two separate lists and then zip them together and then convert the result into a map. -- View this

overloaded method value updateStateByKey ... cannot be applied to ... when Key is a Tuple2

2014-11-11 Thread spr
I am creating a workflow; I have an existing call to updateStateByKey that works fine, but when I created a second use where the key is a Tuple2, it's now failing with the dreaded overloaded method value updateStateByKey with alternatives ... cannot be applied to ... Comparing the two uses I'm

SVMWithSGD default threshold

2014-11-11 Thread Caron
I'm hoping to get a linear classifier on a dataset. I'm using SVMWithSGD to train the data. After running with the default options: val model = SVMWithSGD.train(training, numIterations), I don't think SVM has done the classification correctly. My observations: 1. the intercept is always 0.0 2.

Re: how to create a Graph in GraphX?

2014-11-11 Thread ankurdave
You should be able to construct the edges in a single map() call without using collect(): val edges: RDD[Edge[String]] = sc.textFile(...).map { line => val row = line.split(","); Edge(row(0).toLong, row(1).toLong, row(2)) } val graph: Graph[Int, String] = Graph.fromEdges(edges, defaultValue = 1) -- View this

Does spark can't work with HBase?

2014-11-11 Thread gzlj
Hello all, I have tested reading an HBase table with Spark 1.1 using SparkContext.newAPIHadoopRDD. I found the performance is much slower than reading from Hive. I also tried reading data using HFileScanner on one region HFile, but the performance is not good. So, how do I improve the performance of spark reading

Imbalanced shuffle read

2014-11-11 Thread ankits
I'm running a job that uses groupByKey(), so it generates a lot of shuffle data. Then it processes this and writes files to HDFS in a forEachPartition block. Looking at the forEachPartition stage details in the web console, all but one executor is idle (SUCCESS in 50-60ms), and one is RUNNING with

ISpark class not found

2014-11-11 Thread Laird, Benjamin
I've been experimenting with the ISpark extension to IScala (https://github.com/tribbloid/ISpark) Objects created in the REPL are not being loaded correctly on worker nodes, leading to a ClassNotFound exception. This does work correctly in spark-shell. I was curious if anyone has used ISpark

Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread alaa
Hey freedafeng, I'm exactly where you are. I want the output to show the rowkey and all column qualifiers that correspond to it. How did you write HBaseResultToStringConverter to do what you wanted it to do? -- View this message in context:

RE: Help with processing multiple RDDs

2014-11-11 Thread Kapil Malik
Hi, How is the 78g distributed across driver, daemon, executor? Can you please paste the logs regarding "I don't have enough memory to hold the data in memory"? Are you collecting any data in the driver? Lastly, did you try doing a re-partition to create smaller and evenly distributed partitions?

Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread Tobias Pfeiffer
Hi, On Tue, Nov 11, 2014 at 2:04 PM, hmxxyy hmx...@gmail.com wrote: If I run bin/spark-shell without connecting a master, it can access a hdfs file on a remote cluster with kerberos authentication. [...] However, if I start the master and slave on the same host and using bin/spark-shell

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread Sean Owen
Yes although I think this difference is on purpose as part of that commercial strategy. If future versions change license it would still be possible to not upgrade. Or fork / recreate the bean classes. Not worried so much but it is a good point. On Nov 11, 2014 10:06 PM, DB Tsai dbt...@dbtsai.com

Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread ramblingpolak
You need to set the spark configuration property: spark.yarn.access.namenodes to your namenode. e.g. spark.yarn.access.namenodes=hdfs://mynamenode:8020 Similarly, I'm curious if you're also running high availability HDFS with an HA nameservice. I currently have HA HDFS and kerberos and I've

Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread ramblingpolak
Only YARN mode is supported with kerberos. You can't use a spark:// master with kerberos. Tobias Pfeiffer wrote When you give a spark://* master, Spark will run on a different machine, where you have not yet authenticated to HDFS, I think. I don't know how to solve this, though, maybe some

Re: Help with processing multiple RDDs

2014-11-11 Thread buring
I think you can try to set a lower spark.storage.memoryFraction, for example 0.4: conf.set("spark.storage.memoryFraction", "0.4") // default 0.6 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-processing-multiple-RDDs-tp18628p18659.html Sent from the

MLLIB usage: BLAS dependency warning

2014-11-11 Thread jpl
Hi, I am having trouble using the BLAS libs with the MLLib functions. I am using org.apache.spark.mllib.clustering.KMeans (on a single machine) and running the Spark-shell with the kmeans example code (from https://spark.apache.org/docs/latest/mllib-clustering.html) which runs successfully but I

RE: Help with processing multiple RDDs

2014-11-11 Thread Khandeshi, Ami
I am running as Local in client mode. I have allocated as high as 85g to the driver, executor, and daemon. When I look at java processes I see two: 20974 SparkSubmitDriverBootstrapper, 21650 Jps, 21075 SparkSubmit. I have tried repartition before, but my understanding is that comes with

Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
In spark-1.0.2, I have come across an error when I try to broadcast a quite large numpy array(with 35M dimension). The error information except the java.lang.NegativeArraySizeException error and details is listed below. Moreover, when broadcast a relatively smaller numpy array(30M dimension),

Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Tobias Pfeiffer
Hi, also there is Spindle https://github.com/adobe-research/spindle which was introduced on this list some time ago. I haven't looked into it deeply, but you might gain some valuable insights from their architecture, they are also using Spark to fulfill requests coming from the web. Tobias

External table partitioned by date using Spark SQL

2014-11-11 Thread ehalpern
I have large json files stored in S3 grouped under a sub-key for each year like this: I've defined an external table that's partitioned by year to keep the year limited queries efficient. The table definition looks like this: But alas, a simple query like: yields no results. If I remove the

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Tobias Pfeiffer
Hi, On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote: But getLang() is one of the methods of twitter4j.Status since version 3.0.6 according to the doc at: http://twitter4j.org/javadoc/twitter4j/Status.html#getLang-- What version of twitter4j does Spark Streaming use?

Re: concat two Dstreams

2014-11-11 Thread Sean Owen
Concatenate? no. It doesn't make sense in this context to think about one potentially infinite stream coming after another one. Do you just want the union of batches from two streams? yes, just union(). You can union() with non-streaming RDDs too. On Tue, Nov 11, 2014 at 10:41 PM, Josh J
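A minimal sketch of that (incoming and generated are hypothetical DStreams of the same element type):

    val combined = incoming.union(generated)   // per-batch union of the two streams
    combined.print()                           // or any further DStream processing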

Re: Converting Apache log string into map using delimiter

2014-11-11 Thread Sean Owen
I think it would be faster/more compact as: z.map(_.map { element => val tokens = element.split("="); (tokens(0), tokens(1)) }.toMap) (That's probably 95% right but I didn't compile or test it.) On Wed, Nov 12, 2014 at 12:18 AM, YaoPau jonrgr...@gmail.com wrote: OK I got it working

Re: Still struggling with building documentation

2014-11-11 Thread Sean Owen
(I don't think that's the same issue. This looks like some local problem with tool installation?) On Tue, Nov 11, 2014 at 9:56 PM, Patrick Wendell pwend...@gmail.com wrote: The doc build appears to be broken in master. We'll get it patched up before the release:

Re: SVMWithSGD default threshold

2014-11-11 Thread Sean Owen
I think you need to use setIntercept(true) to get it to allow a non-zero intercept. I also kind of agree that's not obvious or the intuitive default. Is your data set highly imbalanced, with lots of positive examples? That could explain why predictions are heavily skewed. Iterations should
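A sketch of training with the intercept enabled (training is assumed to be an RDD[LabeledPoint]; the iteration count is arbitrary):

    import org.apache.spark.mllib.classification.SVMWithSGD

    val svm = new SVMWithSGD()
    svm.setIntercept(true)                 // fit a non-zero intercept
    svm.optimizer.setNumIterations(100)
    val model = svm.run(training)
    // model.clearThreshold() makes predict() return raw margins instead of 0/1 labels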

How to solve this core dump error

2014-11-11 Thread shiwentao
I got a core dump when I used spark 1.1.0. The environment is shown here. Software environment: OS: CentOS 6.3 jvm: Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode) Hardware environment: memory: 64G I run three spark processes with jvm -Xmx args like this: -Xmx 28G -Xmx 2G

Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread Davies Liu
This PR fix the problem: https://github.com/apache/spark/pull/2659 cc @josh Davies On Tue, Nov 11, 2014 at 7:47 PM, bliuab bli...@cse.ust.hk wrote: In spark-1.0.2, I have come across an error when I try to broadcast a quite large numpy array(with 35M dimension). The error information except

Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
Dear Liu: Thank you very much for your help. I will update to that patch. By the way, as I have succeeded in broadcasting an array of size 30M, the log said that such an array takes around 230MB of memory. As a result, I think the numpy array that leads to the error is much smaller than 2G. On Wed, Nov 12, 2014

Read a HDFS file from Spark source code

2014-11-11 Thread rapelly kartheek
Hi I am trying to access a file in HDFS from spark source code. Basically, I am tweaking the spark source code. I need to access a file in HDFS from the source code of the spark. I am really not understanding how to go about doing this. Can someone please help me out in this regard. Thank you!!

Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread hmxxyy
Thanks guys for the info. I have to use yarn to access a kerberos cluster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Strange-behavior-of-spark-shell-while-accessing-hdfs-tp18549p18677.html Sent from the Apache Spark User List mailing list archive at

Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread Nick Pentreath
Feel free to add that converter as an option in the Spark examples via a PR :) — Sent from Mailbox On Wed, Nov 12, 2014 at 3:27 AM, alaa contact.a...@gmail.com wrote: Hey freedafeng, I'm exactly where you are. I want the output to show the rowkey and all column qualifiers that correspond to

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Ryan Compton
Fwiw if you do decide to handle language detection on your machine this library works great on tweets https://github.com/carrotsearch/langid-java On Tue, Nov 11, 2014, 7:52 PM Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote: But

Re: Read a HDFS file from Spark source code

2014-11-11 Thread Samarth Mailinglist
Instead of a file path, use an HDFS URI. For example (in Python): data = sc.textFile("hdfs://localhost/user/someuser/data") On Wed, Nov 12, 2014 at 10:12 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi I am trying to access a file in HDFS from spark source code. Basically, I am
