Hi. I am very fascinated by the Spark framework. I am trying to use PySpark +
BeautifulSoup to parse HTML files. I am facing problems loading an HTML file
into BeautifulSoup.
Example:

    filepath = "file:///path to html directory"

    def readhtml(inputhtml):
        soup = BeautifulSoup(inputhtml)  # to load the HTML
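A per-file parse function along these lines can be tested without Spark; in a job you would pass it to something like sc.wholeTextFiles(filepath).mapValues(parse_html). The stdlib HTMLParser below merely stands in for BeautifulSoup (BeautifulSoup(inputhtml, "html.parser") would do the same extraction); parse_html and TextExtractor are names chosen here, not from the original post.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def parse_html(content):
    # The function you would map over (path, content) records.
    parser = TextExtractor()
    parser.feed(content)
    return " ".join(parser.chunks)
```

Keeping the parser construction inside the mapped function matters in Spark: parser objects are generally not serialisable, so each task should build its own.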
Hi David,
Thanks for the reply and the effort you put into explaining the concepts.
Thanks for the example. It worked.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074p15844.html
Sent from the Apache Spark User
Hi all,
I am currently working with PySpark for NLP processing, using the TextBlob
Python library. In standalone mode it is easy to install external Python
libraries, but in cluster mode I am facing problems installing these
libraries on the worker nodes remotely. I cannot access each
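Two common ways to ship a pure-Python dependency such as TextBlob to the workers are spark-submit's --py-files flag and sc.addPyFile; the archive name deps.zip and the script name my_nlp_job.py below are placeholders. (Note that TextBlob also expects its NLTK corpora to be present on each node, which these flags do not cover.)

```shell
# Collect the library (and its pure-Python dependencies) into an archive,
# e.g. with: pip install textblob -t deps/ && (cd deps && zip -r ../deps.zip .)
# Then ship the archive alongside the job:
spark-submit --master yarn --py-files deps.zip my_nlp_job.py

# Alternatively, from a running driver:
#   sc.addPyFile("deps.zip")
```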
Hi Guys,
I am currently working with huge data. I have an RDD which returns
RDD[List[(tuples)]]. I need only the tuples to be written to the text file
output using the saveAsTextFile function.
Example: val mod = modify.saveAsTextFile() returns
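If I read this right, flattening the inner lists before saving is what flatMap is for: modify.flatMap(identity).saveAsTextFile(path) in Scala, or rdd.flatMap(lambda xs: xs).saveAsTextFile(path) in PySpark. The flattening step itself, sketched in plain Python:

```python
# An RDD[List[tuple]] holds one list per record; flatMap concatenates them,
# so each tuple becomes its own output record (one line per tuple on disk).
data = [[(1, "a"), (2, "b")], [(3, "c")]]

flat = [t for sublist in data for t in sublist]  # what flatMap does
```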
Hi Guys,
I just want to know whether there is any way to determine which file is
being handled by Spark from a group of files given as input inside a
directory. Suppose I have 1000 input files; I want to determine which file
is currently being handled by the Spark program, so that if any error
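One workaround (my suggestion, not from the thread): read the directory with sc.wholeTextFiles, which keys every record by its file path, and carry that path through the parse so any failure can be attributed to a file. A Spark-free sketch, with a hypothetical per-file parse step:

```python
def safe_parse(record):
    # record is a (path, content) pair, the shape sc.wholeTextFiles produces.
    path, content = record
    try:
        return (path, int(content))        # hypothetical parse of the file body
    except ValueError as err:
        return (path, f"error: {err}")     # the offending file stays identifiable

results = [safe_parse(r) for r in [("a.txt", "41"), ("b.txt", "oops")]]
```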
Hi,
Could anyone suggest how we can create a SparkContext object in other
classes or functions where we need to convert a Scala collection to an RDD
using the sc object, like sc.makeRDD(list), instead of using the main
class's SparkContext object?
Is there a way to pass the sc object as a parameter to
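Yes — the context is an ordinary object, so the usual fix is to pass it as an explicit parameter rather than reaching for the main class's field. FakeContext below is my own stand-in, only there to keep the sketch runnable without Spark; its makeRDD mirrors sc.makeRDD(list).

```python
class FakeContext:
    # Stand-in for SparkContext, so the pattern can run without a cluster.
    def makeRDD(self, values):
        return list(values)

def to_rdd(sc, values):
    # sc arrives as a parameter, so any class or function can build RDDs
    # without holding a global reference to the main class's context.
    return sc.makeRDD(values)

rdd = to_rdd(FakeContext(), [1, 2, 3])
```

The same dependency-injection shape works in Scala: declare the method as `def toRDD(sc: SparkContext, xs: List[Int])` and hand it the driver's one context.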
Hi,
I have a large dataset of elements [RDD] and I want to divide it into two
exactly equal-sized partitions while maintaining the order of elements. I
tried using RangePartitioner, like var data = partitionedFile.partitionBy(new
RangePartitioner(2, partitionedFile)).
This doesn't give satisfactory results.
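One alternative (my suggestion, not from the thread): zipWithIndex preserves order, and a partitioner on the index can send the first half to partition 0 and the rest to partition 1, e.g. data.zipWithIndex.map(_.swap).partitionBy(...) in Scala. The partition function itself:

```python
def partition_of(index, count):
    # Order-preserving two-way split: the first ceil(count/2) indices go to
    # partition 0, the remainder to partition 1.
    return 0 if index < (count + 1) // 2 else 1

assignments = [partition_of(i, 5) for i in range(5)]
```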
Hi,
I have a problem when I want to use the Spark Kryo serializer by extending a
KryoRegistrator class to register custom classes in order to create objects.
I am getting the following exception when I run the following program.
Please let me know what the problem could be...
] (run-main)
Hi Therry,
Thanks for the above responses. I implemented it using RangePartitioner. We
need to use one of the custom partitioners in order to perform this task.
Normally you can't maintain a counter, because count operations would have
to be performed on each partitioned block of data...
Hi,
Can we convert a Scala collection directly to a Spark RDD data type without
using the parallelize method?
Is there any way to create a custom converted RDD datatype from a Scala type
using some typecast like that?
Please suggest.
Hi,
I have an RDD of elements and want to create a new RDD by zipping it with
another RDD in order.
result[RDD] with a sequence of 10,20,30,40,50 ... elements.
I am facing problems, as the index is not an RDD... it gives an error...
Could anyone help me with how we can zip or map it in order to obtain the following
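In case it helps: rdd.zipWithIndex() produces (element, index) pairs directly, so no separate index RDD needs to be materialised. Its behaviour, mirrored in plain Python:

```python
def zip_with_index(elements):
    # Mirrors rdd.zipWithIndex(): each element paired with its position,
    # preserving the original order.
    return [(x, i) for i, x in enumerate(elements)]

pairs = zip_with_index([10, 20, 30, 40, 50])
```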
Thanks Sonal. Is there any other way, like mapping values with increasing
indexes, so that I can map(t => (i, t)) where the value of 'i' increases
after each map operation on an element?
Please help me in this aspect.
Hi,
I want to perform a map operation on an RDD of elements such that the
resulting RDD is of key-value pairs (counter, value).
For example: var k: RDD[Int] = 10,20,30,40,40,60...
k.map(t => (i, t)), where the 'i' value should act like a counter whose
value increments after each map operation...
Please help me.
I tried
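A mutable counter inside map will not behave as hoped, because each partition runs its copy of the closure independently. zipWithIndex followed by a swap gives the (counter, value) shape instead: k.zipWithIndex.map(_.swap) in Scala. The resulting pairs, sketched:

```python
def with_counter(values):
    # Equivalent of k.zipWithIndex.map(_.swap): (counter, value) pairs,
    # with the counter increasing in element order.
    return [(i, v) for i, v in enumerate(values)]

pairs = with_counter([10, 20, 30, 40, 40, 60])
```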
Hi,
Thanks Nanzhu. I tried to implement your suggestion on the following
scenario. I have an RDD of, say, 24 elements. When I partitioned it into two
groups of 12 elements each, the order of elements within the partitions was
lost. Elements are partitioned randomly. I need to preserve the order such
that the first
Hi Andriana,
Thanks for the suggestion. Could you please modify the part of my code where
I need to do so? I apologise for the inconvenience; because I am new to
Spark I couldn't apply it appropriately. I would be thankful to you.
Hi, I have a large data set of numbers, i.e. an RDD, and I want to perform a
computation on groups of two values at a time. For example, 1,2,3,4,5,6,7...
is an RDD. Can I group the RDD into (1,2),(3,4),(5,6)...? And perform the
respective computations on them in an efficient manner? As we don't have a
way to index
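One way (an assumption on my part, not a confirmed answer from the list): zipWithIndex, key each element by index / 2, and reduceByKey — both members of a pair then share a key, e.g. rdd.zipWithIndex().map(lambda t: (t[1] // 2, t[0])).reduceByKey(f) in PySpark. The grouping itself in plain Python:

```python
def pair_up(values):
    # Consecutive elements fall into the same pair (key = position // 2);
    # a trailing element with no partner is dropped here.
    return [tuple(values[i:i + 2]) for i in range(0, len(values) - 1, 2)]

groups = pair_up([1, 2, 3, 4, 5, 6, 7])
```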
We need someone who can explain, with a short code snippet on a given
example, so that we get a clear-cut idea of RDD indexing.
Guys, please help us.
Hi,
I am new to the Spark/Scala environment. Currently I am working on discrete
wavelet transformation algorithms on time-series data.
I have to perform recursive additions on successive elements in RDDs.
For example:
List of elements (RDDs) -- a1 a2 a3 a4.
Level-1 transformation -- a1+a2 a3+a4 a1-a2
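If this is the unnormalised Haar step — pairwise sums followed by pairwise differences — one transformation level looks like the sketch below. haar_level is my name for it; distributing it would again lean on zipWithIndex-style pairing of successive elements.

```python
def haar_level(values):
    # One unnormalised Haar DWT level: sums a1+a2, a3+a4, ... followed by
    # differences a1-a2, a3-a4, ... (assumes an even number of elements).
    sums = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    diffs = [values[i] - values[i + 1] for i in range(0, len(values), 2)]
    return sums + diffs

level1 = haar_level([1, 2, 3, 4])
```

The "recursive" part of the transform applies haar_level again to the sums half only, halving the data at every level.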