Spark Dev / Users, help in this regard would be appreciated, we are kind of
stuck at this point.
I am a newbie to Spark, and I program in Python. I use the textFile function to
make an RDD from a file. I notice that the default delimiter is the newline.
However, I want to change this default delimiter to something else. After
searching the web, I came to know about textinputformat.record.delimiter
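For reference, here is an untested sketch of how that property can be passed in
Scala (the path and the "|" delimiter are placeholders; this relies on a Hadoop
2.x TextInputFormat that honors textinputformat.record.delimiter):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "|") // custom record delimiter

val records = sc
  .newAPIHadoopFile("hdfs:///path/to/input", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString } // keep only the record text

In PySpark, the same Hadoop configuration can be passed through the conf
argument of sc.newAPIHadoopFile.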
Yes... found the output on the web UI of the slave.
Thanks :)
On Tue, Nov 11, 2014 at 2:48 AM, Ankur Dave ankurd...@gmail.com wrote:
At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh
riteshoneinamill...@gmail.com wrote:
Tasks are now getting submitted, but many tasks don't happen.
Like, after
Go to your Spark home and then into the conf/ directory and then edit the
log4j.properties file, i.e.:
gedit $SPARK_HOME/conf/log4j.properties
and set the root logger to:
log4j.rootCategory=WARN, console
You don't need to rebuild Spark for the changes to take place. Whenever you
open spark-shell, it
-- Forwarded message --
From: Ritesh Kumar Singh riteshoneinamill...@gmail.com
Date: Tue, Nov 11, 2014 at 2:18 PM
Subject: Re: disable log4j for spark-shell
To: lordjoe lordjoe2...@gmail.com
Cc: u...@spark.incubator.apache.org
Go to your Spark home and then into the conf/
I need to do some testing on Spark and am looking for some open source
tools. Is there any benchmark suite for Spark that covers common use cases,
like Intel's HiBench (https://github.com/intel-hadoop/HiBench)?
--
Best Regards,
Hu Liu
Hi,
I have a spark application which uses the Cassandra connector
spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar to load
data from Cassandra into Spark.
Everything works fine in local mode when I run it in my IDE. But when I
submit the application to be executed on a standalone Spark server,
Hi All,
I am trying to access SQL Server through JdbcRDD, but I am getting an error on
the ClassTag placeholder.
Here is the code which I wrote:
public void readFromDB() {
    String sql = "Select * from Table_1 where values >= ? and values <= ?";
Hi,
I am trying to do some logging in my PySpark jobs, particularly in the map that is
performed on workers. Unfortunately I am not able to find these logs. Based on
the documentation it seems that the logs should be on the masters, in the
SPARK_HOME work directory
Hi, I am getting the following error while executing a Scala Twitter program for
Spark:
14/11/11 16:39:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor with
error: java.lang.NoSuchMethodError:
twitter4j.TwitterStream.addListener(Ltwitter4j/StatusListener;)V
14/11/11 16:39:23 ERROR
You don't have the twitter4j jars in the classpath. While running it in
cluster mode you need to ship those dependency jars.
You can do it like:
sparkConf.setJars(Seq("/home/akhld/jars/twitter4j-core-3.0.3.jar",
  "/home/akhld/jars/twitter4j-stream-3.0.3.jar"))
You can make sure they are shipped by
Hi,
I am on Spark 1.1.0. I need help with saving an RDD to a JSON file.
How to do that? And how to mention an HDFS path in the program?
-Naveen
Hi,
Thank you Akhil for the reply.
I am not using cluster mode; I am doing it in local mode:
val sparkConf = new
SparkConf().setAppName("TwitterPopularTags").setMaster("local").set("spark.eventLog.enabled", "true")
Also, is it documented anywhere which Twitter4j version to be
Hi,
I'm trying to submit a Spark application from a network share to the Spark master.
Network shares are configured so that the master and all nodes have access to
the target jar at (say):
\\shares\publish\Spark\app1\someJar.jar
And this is mounted on each Linux box (i.e. master and workers) at:
You can pick the dependency version from here
http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-twitter_2.10
Thanks
Best Regards
On Tue, Nov 11, 2014 at 6:36 PM, Jishnu Menath Prathap (WT01 - BAS)
jishnu.prat...@wipro.com wrote:
Hi,
Thank you Akhil for the reply.
We are relatively new to Spark and so far have been manually submitting
single jobs at a time for ML training, during our development process, using
spark-submit. Each job accepts a small user-submitted data set and compares
it to every data set in our hdfs corpus, which only changes
I'm using Spark 1.0.0 and I'd like to kill a job running in cluster mode,
which means the driver is not running on the local node.
So how can I kill such a job? Is there a command like 'hadoop job -kill
job-id' which kills a running MapReduce job?
Thanks
One approach would be to use saveAsNewAPIHadoopFile and specify a JSON
OutputFormat.
Another simple one would be like:
val rdd = sc.parallelize(1 to 100)
val json = rdd.map(x => {
  // JSONObject is scala.util.parsing.json.JSONObject
  val m: Map[String, Int] = Map("id" -> x)
  new JSONObject(m) })
json.saveAsTextFile("output")
Thanks
Best
We have RDD.saveAsTextFile and RDD.saveAsObjectFile for saving the output
to any location specified. The param to be provided is:
path of the storage location
(the number of output files follows the RDD's partitions; repartition first if
you need a different number)
For giving an HDFS path we use the following format:
/user/user-name/directory-to-store/
On Tue, Nov 11, 2014 at 6:28 PM,
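For example (untested; the namenode host/port are placeholders):

val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.saveAsTextFile("hdfs://namenode:8020/user/user-name/directory-to-store/")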
But that requires an (unnecessary) load from disk.
I have run into this same issue, where we want to save intermediate results
but continue processing. The cache / persist feature of Spark doesn't seem
designed for this case. Unfortunately I'm not aware of a better solution
with the current
There is a property:
spark.ui.killEnabled
which needs to be set to true for killing applications directly from the web UI.
Check the link:
http://spark.apache.org/docs/latest/configuration.html#spark-ui
Thanks
On Tue, Nov 11, 2014 at 7:42 PM, Sonal Goyal
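For example, a minimal sketch (the app name is a placeholder; the property must
be set before the SparkContext starts):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.ui.killEnabled", "true") // adds "kill" links to the web UI
val sc = new SparkContext(conf)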
I finally solved the issue with the Spark Tableau connection.
Thanks Denny Lee for the blog post:
https://www.concur.com/blog/en-us/connect-tableau-to-sparksql
The solution was to use Authentication type "Username", and then use the
username for the metastore.
Best regards
Bojan
Hi,
Just wondering if anyone has any advice about this issue, as I am
experiencing the same thing. I'm working with multiple broadcast variables
in PySpark, most of which are small, but one of around 4.5GB, using 10
workers with 31GB memory each and a driver with the same spec. It's not
running out of
Hi,
It looks like you are building from master
(spark-cassandra-connector-assembly-1.2.0).
- Append this to your com.google.guava declaration: % "provided"
- Be sure your version of the connector dependency is the same as the assembly
build. For instance, if you are using 1.1.0-beta1, build your
Never tried this form, but just guessing:
what's the output when you submit this jar:
\\shares\publish\Spark\app1\someJar.jar
using spark-submit.cmd?
Yes, please can you share? I am getting this error after expanding my
application to include a large broadcast variable. Would be good to know if
it can be fixed with configuration.
On 23 October 2014 18:04, Michael Campbell michael.campb...@gmail.com
wrote:
Can you list what your fix was so
Hi,
This is my Instrument Java constructor:
public Instrument(Issue issue, Issuer issuer, Issuing issuing) {
super();
this.issue = issue;
this.issuer = issuer;
I want to be able to perform a query on two tables in different databases. I
want to know whether it can be done. I've heard about the union of two RDDs, but
here I want to connect to something like different partitions of a table.
Any help is appreciated
import java.io.Serializable;
//import
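One way to sketch this with JdbcRDD (untested; the connection URLs, table
names, bounds, and column types are all placeholders):

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

def tableRdd(url: String, table: String) =
  new JdbcRDD(
    sc,
    () => DriverManager.getConnection(url),
    s"SELECT id, value FROM $table WHERE id >= ? AND id <= ?",
    1, 1000000, 4, // lower bound, upper bound, number of partitions
    row => (row.getLong(1), row.getString(2))) // map each ResultSet row

// one JdbcRDD per database, then union the results
val combined = tableRdd("jdbc:mysql://host1/db1", "table_a")
  .union(tableRdd("jdbc:mysql://host2/db2", "table_b"))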
I believe the Spark Job Server by Ooyala can help you share data across
multiple jobs; take a look at
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It
seems to closely fit what you need.
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
Hi,
Sorry if this has been asked before; I didn't find a satisfactory answer
when searching. How can I integrate a Play application with Spark? I'm
getting into issues of akka-actor versions. Play 2.2.x uses akka-actor
2.0, whereas Play 2.3.x uses akka-actor 2.3.4, neither of which works fine
Hi Cheng,
I made sure the only hive server running on the machine is
hivethriftserver2.
/usr/lib/jvm/default-java/bin/java -cp
/usr/lib/hadoop/lib/hadoop-lzo.jar::/mnt/sadhan/spark-3/sbin/../conf:/mnt/sadhan/spark-3/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.0.2.jar:/etc/hadoop/conf
-Xms512m
Assume the following, where updatePairRDD and deletePairRDD are both
hash-partitioned. Before the union, each one of these has 512 partitions. The
newly created updateDeletePairRDD has 1024 partitions. Is this the
general/expected behavior for a union (for the number of partitions to double)?
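Yes: union concatenates the partitions of its inputs, so 512 + 512 = 1024 is
expected. If downstream code relies on the hash partitioning, one option (at
the cost of a shuffle) is to re-apply the partitioner:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._ // PairRDDFunctions in Spark 1.x

val updateDeletePairRDD =
  updatePairRDD.union(deletePairRDD).partitionBy(new HashPartitioner(512))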
For sharing RDDs across multiple jobs - you could also have a look at
Tachyon. It provides an HDFS compatible in-memory storage layer that keeps
data in memory across multiple jobs/frameworks - http://tachyon-project.org/
.
-
On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal sonalgoy...@gmail.com
Can anyone point me to a good primer on how Spark decides where to send
what task, how it distributes them, and how it determines data locality?
I'm trying a pretty simple task - it's doing a foreach over cached data,
accumulating some (relatively complex) values.
So I see several
Vincenzo sent a PR and included k-means as an example. Sean is helping
review it. The PMML standard is quite large, so we may start with simple
model export, like linear methods, then move on to tree-based models.
-Xiangrui
On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote:
Hello
There is an open PR [1] to support broadcasts larger than 2G; could you try it?
[1] https://github.com/apache/spark/pull/2659
On Tue, Nov 11, 2014 at 6:39 AM, Tom Seddon mr.tom.sed...@gmail.com wrote:
Hi,
Just wondering if anyone has any advice about this issue, as I am
experiencing the same
Could you provide more information? For example, spark version,
dataset size (number of instances/number of features), cluster size,
error messages from both the driver and the executor. -Xiangrui
On Mon, Nov 10, 2014 at 11:28 AM, tsj tsj...@gmail.com wrote:
Hello all,
I have some text data
I think you need a Java bean class instead of a normal class. See
example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html
(switch to the java tab). -Xiangrui
On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala
npok...@spcapitaliq.com wrote:
Hi,
This is my Instrument java
Xiangrui is correct that it must be a Java bean; also, nested classes are
not yet supported in Java.
On Tue, Nov 11, 2014 at 10:11 AM, Xiangrui Meng men...@gmail.com wrote:
I think you need a Java bean class instead of a normal class. See
example here:
Nichols and Patrick,
Thanks for your help, but, no, it still does not work. The latest master
produces the following scaladoc errors:
[error]
/home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55:
not found: type Type
[error] protected
I'm not sure that this will work but it makes sense to me. Basically you
write the functionality in a static block in a class and broadcast that
class. Not sure what your use case is but I need to load a native library
and want to avoid running the init in mapPartitions if it's not necessary
(just
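A related pattern (not necessarily what the poster meant) is a singleton with
a lazy val, which runs the initialization at most once per executor JVM. A
rough sketch, with a hypothetical native library name and a placeholder rdd:

object NativeLib {
  // evaluated at most once per JVM, on first access
  lazy val loaded: Boolean = { System.loadLibrary("mylib"); true }
}

rdd.mapPartitions { iter =>
  NativeLib.loaded // forces the one-time initialization on this executor
  iter.map(x => x) // placeholder for work that uses the native library
}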
How can I create a date field in Spark SQL? I have an S3 table and I load it
into an RDD.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class trx_u3m(id: String, local: String, fechau3m: String, rubro: Int,
sku: String, unidades: Double,
Hello there,
I am wondering how to get the column family names and column qualifier names
when using PySpark to read an HBase table with multiple column families.
I have an HBase table as follows:
hbase(main):007:0> scan 'data1'
ROW COLUMN+CELL
JPMML evaluator just changed their license to AGPL or commercial
license, and I think AGPL is not compatible with apache project. Any
advice?
https://github.com/jpmml/jpmml-evaluator
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
Hi,
Is there a way to extract only the English language tweets when using
TwitterUtils.createStream()? The filters argument specifies the strings
that need to be contained in the tweets, but I am not sure how this can be
used to specify the language.
thanks
--
View this message in context:
Yes, jpmml-evaluator is AGPL, but things like jpmml-model are not; they're
3-clause BSD:
https://github.com/jpmml/jpmml-model
So some of the scoring components are off-limits for an AL2 project but the
core model components are OK.
On Tue, Nov 11, 2014 at 7:40 PM, DB Tsai dbt...@dbtsai.com
You could get all the tweets in the stream, and then apply a filter
transformation on the DStream of tweets to filter away non-English
tweets. The tweets in the DStream are of type twitter4j.Status, which
has a field describing the language. You can use that in the filter.
Though in practice, a lot
I'm running a Python script using spark-submit on YARN in an EMR cluster,
and if I have a job that fails due to ExecutorLostFailure or if I kill the
job, it still shows up on the web UI with a FinalStatus of SUCCEEDED. Is
this due to PySpark, or is there potentially some other issue with the job
Checked the source and found the following:
class HBaseResultToStringConverter extends Converter[Any, String] {
override def convert(obj: Any): String = {
val result = obj.asInstanceOf[Result]
Bytes.toStringBinary(result.value())
}
}
I feel using 'result.value()' here is a big
I'm executing this example from the documentation (in single node mode)
# sc is an existing SparkContext.
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
# Queries can be expressed in HiveQL.
results =
Thanks for the response. I tried the following:
tweets.filter(_.getLang()="en")
I get a compilation error:
value getLang is not a member of twitter4j.Status
But getLang() is one of the methods of twitter4j.Status since version 3.0.6
according to the doc at:
Small typo in my code in the previous post. That should be:
tweets.filter(_.getLang()=="en")
Hi.
1) I don't see a groupBy() method for a DStream object. Not sure why that is
not supported. Currently I am using filter() to separate out the different
groups. I would like to know if there is a way to convert a DStream object
to a regular RDD so that I can apply the RDD methods like
I tried turning on the extended debug info. The Scala output is a little
opaque (lots of - field (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name:
$iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC), but it seems
like, as expected, somehow the full array of OLSMultipleLinearRegression
objects is
The doc build appears to be broken in master. We'll get it patched up
before the release:
https://issues.apache.org/jira/browse/SPARK-4326
On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta
alexbare...@gmail.com wrote:
Nichols and Patrick,
Thanks for your help, but, no, it still does not
Hi There,
Because Akka versions are not binary compatible with one another, it
might not be possible to integrate Play with Spark 1.1.0.
- Patrick
On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya aara...@gmail.com wrote:
Hi,
Sorry if this has been asked before; I didn't find a satisfactory
I also worry that the author of JPMML changed the license of
jpmml-evaluator in the interest of his commercial business, and that he
might change the license of jpmml-model in the future.
Sincerely,
DB Tsai
---
My Blog:
We've had success with Azkaban from LinkedIn over Oozie and Luigi.
http://azkaban.github.io/
Azkaban has support for many different job types, a fleshed-out web UI with
decent log reporting, a decent failure/retry model, a REST API, and I
think support for multiple executor slaves is coming in
Hi,
Is it possible to concatenate or append two DStreams together? I have an
incoming stream that I wish to combine with data that's generated by a
utility. I then need to process the combined DStream.
Thanks,
Josh
I think it's just called union
On Tue, Nov 11, 2014 at 2:41 PM, Josh J joshjd...@gmail.com wrote:
Hi,
Is it possible to concatenate or append two Dstreams together? I have an
incoming stream that I wish to combine with data that's generated by a
utility. I then need to process the combined
Simple:
scala> val date = new java.text.SimpleDateFormat("mmdd").parse(fechau3m)
should work. Replace "mmdd" with the format fechau3m is in.
If you want to do it at the case class level:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// HiveContext is always a good idea
import
Actually, it is possible to integrate Spark 1.1.0 with Play 2.2.x.
Here is a sample build.sbt file:
name := "xyz"
version := "0.1"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  jdbc,
  anorm,
  cache,
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "com.typesafe.akka" %% "akka-actor" % "2.2.3",
David,
Here is what I would suggest:
1 - Does a new SparkContext get created in the web tier for each new request
for processing?
Create a single SparkContext that gets shared across multiple web requests.
Depending on the framework that you are using for the web-tier, it should not
be
Just wrote a custom converter in Scala to replace HBaseResultToStringConverter.
Just a couple of lines of code.
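The converter itself isn't shown in the thread, but a rough sketch of one
(untested, assuming the HBase 0.96+ Cell API) that emits every
family:qualifier=value pair instead of just result.value():

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.api.python.Converter

class HBaseResultToStringsConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = {
    val result = obj.asInstanceOf[Result]
    result.rawCells().map { cell =>
      Bytes.toStringBinary(CellUtil.cloneFamily(cell)) + ":" +
        Bytes.toStringBinary(CellUtil.cloneQualifier(cell)) + "=" +
        Bytes.toStringBinary(CellUtil.cloneValue(cell))
    }.mkString(",")
  }
}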
More thoughts. I took a deeper look at BlockManager, RDD, and friends.
Suppose one wanted to get native code access to un-deserialized blocks.
This task looks very hard. An RDD behaves much like a Scala iterator of
deserialized values, and interop with BlockManager is all on deserialized
data.
I have an RDD of logs that look like this:
I am creating a workflow; I have an existing call to updateStateByKey that
works fine, but when I created a second use where the key is a Tuple2, it's
now failing with the dreaded overloaded method value updateStateByKey with
alternatives ... cannot be applied to ... Comparing the two uses I'm
OK I got it working with:
z.map(row => (row.map(element => element.split("=")(0)) zip
  row.map(element => element.split("=")(1))).toMap)
But I'm guessing there is a more efficient way than to create two separate
lists, zip them together, and then convert the result into a map.
I'm hoping to get a linear classifier on a dataset.
I'm using SVMWithSGD to train the data.
After running with the default options: val model =
SVMWithSGD.train(training, numIterations),
I don't think SVM has done the classification correctly.
My observations:
1. the intercept is always 0.0
2.
You should be able to construct the edges in a single map() call without
using collect():
val edges: RDD[Edge[String]] = sc.textFile(...).map { line =>
  val row = line.split(",")
  Edge(row(0).toLong, row(1).toLong, row(2)) // vertex ids are Longs
}
val graph: Graph[Int, String] = Graph.fromEdges(edges, defaultValue = 1)
Hello all,
I have tested reading an HBase table with Spark 1.1 using
SparkContext.newAPIHadoopRDD. I found the performance is much slower than
reading from Hive. I also tried reading data using HFileScanner on one region
HFile, but the performance is not good. So, how do I improve the performance
of Spark reading
I'm running a job that uses groupByKey(), so it generates a lot of shuffle
data. Then it processes this and writes files to HDFS in a foreachPartition
block. Looking at the foreachPartition stage details in the web console, all
but one executor is idle (SUCCESS in 50-60ms), and one is RUNNING with
I've been experimenting with the ISpark extension to IScala
(https://github.com/tribbloid/ISpark)
Objects created in the REPL are not being loaded correctly on worker nodes,
leading to a ClassNotFound exception. This does work correctly in spark-shell.
I was curious if anyone has used ISpark
Hey freedafeng, I'm exactly where you are. I want the output to show the
rowkey and all column qualifiers that correspond to it. How did you write
HBaseResultToStringConverter to do what you wanted it to do?
Hi,
How is the 78g distributed among the driver, daemon, and executor?
Can you please paste the logs regarding "I don't have enough memory to
hold the data in memory"?
Are you collecting any data in driver ?
Lastly, did you try doing a re-partition to create smaller and evenly
distributed partitions?
Hi,
On Tue, Nov 11, 2014 at 2:04 PM, hmxxyy hmx...@gmail.com wrote:
If I run bin/spark-shell without connecting a master, it can access a hdfs
file on a remote cluster with kerberos authentication.
[...]
However, if I start the master and slave on the same host and using
bin/spark-shell
Yes although I think this difference is on purpose as part of that
commercial strategy. If future versions change license it would still be
possible to not upgrade. Or fork / recreate the bean classes. Not worried
so much but it is a good point.
On Nov 11, 2014 10:06 PM, DB Tsai dbt...@dbtsai.com
You need to set the spark configuration property: spark.yarn.access.namenodes
to your namenode.
e.g. spark.yarn.access.namenodes=hdfs://mynamenode:8020
Similarly, I'm curious if you're also running high availability HDFS with an
HA nameservice.
I currently have HA HDFS and kerberos and I've
Only YARN mode is supported with kerberos. You can't use a spark:// master
with kerberos.
Tobias Pfeiffer wrote
When you give a spark://* master, Spark will run on a different machine,
where you have not yet authenticated to HDFS, I think. I don't know how to
solve this, though, maybe some
I think you can try to set a lower spark.storage.memoryFraction, for example 0.4:
conf.set("spark.storage.memoryFraction", "0.4") // default 0.6
Hi,
I am having trouble using the BLAS libs with the MLlib functions. I am
using org.apache.spark.mllib.clustering.KMeans (on a single machine) and
running the spark-shell with the k-means example code (from
https://spark.apache.org/docs/latest/mllib-clustering.html), which runs
successfully but I
I am running as local in client mode. I have allocated as much as 85g to the
driver, executor, and daemon. When I look at the Java processes, I see two:
20974 SparkSubmitDriverBootstrapper
21650 Jps
21075 SparkSubmit
I have tried repartition before, but my understanding is that comes with
In spark-1.0.2, I have come across an error when I try to broadcast a quite
large numpy array (of 35M dimensions). The error information, apart from the
java.lang.NegativeArraySizeException itself, is listed in detail below.
Moreover, when broadcasting a relatively smaller numpy array (30M dimensions),
Hi,
also there is Spindle https://github.com/adobe-research/spindle which was
introduced on this list some time ago. I haven't looked into it deeply, but
you might gain some valuable insights from their architecture, they are
also using Spark to fulfill requests coming from the web.
Tobias
I have large json files stored in S3 grouped under a sub-key for each year
like this:
I've defined an external table that's partitioned by year to keep the
year-limited queries efficient. The table definition looks like this:
But alas, a simple query like:
yields no results.
If I remove the
Hi,
On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote:
But getLang() is one of the methods of twitter4j.Status since version 3.0.6
according to the doc at:
http://twitter4j.org/javadoc/twitter4j/Status.html#getLang--
What version of twitter4j does Spark Streaming use?
Concatenate? No. It doesn't make sense in this context to think about one
potentially infinite stream coming after another one. Do you just want the
union of batches from two streams? Yes, just union(). You can union() with
non-streaming RDDs too.
On Tue, Nov 11, 2014 at 10:41 PM, Josh J
I think it would be faster/more compact as:
z.map(_.map { element =>
  val tokens = element.split("=")
  (tokens(0), tokens(1))
}.toMap)
(That's probably 95% right but I didn't compile or test it.)
On Wed, Nov 12, 2014 at 12:18 AM, YaoPau jonrgr...@gmail.com wrote:
OK I got it working
(I don't think that's the same issue. This looks like some local problem
with tool installation?)
On Tue, Nov 11, 2014 at 9:56 PM, Patrick Wendell pwend...@gmail.com wrote:
The doc build appears to be broken in master. We'll get it patched up
before the release:
I think you need to use setIntercept(true) to get it to allow a non-zero
intercept. I also kind of agree that's not obvious or the intuitive default.
Is your data set highly imbalanced, with lots of positive examples? That
could explain why predictions are heavily skewed.
Iterations should
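For example (untested; the iteration count is arbitrary):

import org.apache.spark.mllib.classification.SVMWithSGD

val svm = new SVMWithSGD()
svm.setIntercept(true) // intercept fitting is off by default
svm.optimizer.setNumIterations(100)
val model = svm.run(training) // training: RDD[LabeledPoint]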
I got a core dump when I used Spark 1.1.0. The environment is shown here.
Software environment:
OS: CentOS 6.3
JVM: Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
Hardware environment:
memory: 64G
I run three Spark processes with JVM -Xmx args like this:
-Xmx 28G
-Xmx 2G
This PR fixes the problem: https://github.com/apache/spark/pull/2659
cc @josh
Davies
On Tue, Nov 11, 2014 at 7:47 PM, bliuab bli...@cse.ust.hk wrote:
In spark-1.0.2, I have come across an error when I try to broadcast a quite
large numpy array(with 35M dimension). The error information except
Dear Liu:
Thank you very much for your help. I will try that patch. By the way, as
I have succeeded in broadcasting an array of size 30M, the log said that such an
array takes around 230MB of memory. As a result, I think the numpy array that
leads to the error is much smaller than 2G.
On Wed, Nov 12, 2014
Hi
I am trying to access a file in HDFS from the Spark source code. Basically, I
am tweaking the Spark source code. I need to access a file in HDFS from the
source code of Spark. I am really not understanding how to go about
doing this.
Can someone please help me out in this regard.
Thank you!!
Thanks guys for the info.
I have to use YARN to access a kerberized cluster.
Feel free to add that converter as an option in the Spark examples via a PR :)
—
Sent from Mailbox
On Wed, Nov 12, 2014 at 3:27 AM, alaa contact.a...@gmail.com wrote:
Hey freedafeng, I'm exactly where you are. I want the output to show the
rowkey and all column qualifiers that correspond to
FWIW, if you do decide to handle language detection on your machine, this
library works great on tweets: https://github.com/carrotsearch/langid-java
On Tue, Nov 11, 2014, 7:52 PM Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote:
But
Instead of a file path, use an HDFS URI.
For example (in Python):
data = sc.textFile("hdfs://localhost/user/someuser/data")
On Wed, Nov 12, 2014 at 10:12 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi
I am trying to access a file in HDFS from spark source code. Basically,
I am