You can look at ways to group records from both RDDs together instead of
doing a Cartesian product. Say, generate a pair RDD from each with the first
letter as the key, then do a partition and a join.
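A minimal sketch of that blocking idea, assuming both inputs are plain
RDD[String]; the names rddA and rddB and the partition count are illustrative:

import org.apache.spark.HashPartitioner

// rddA and rddB are assumed to be RDD[String].
val byFirstA = rddA.filter(_.nonEmpty).keyBy(_.head)   // (first letter, string)
val byFirstB = rddB.filter(_.nonEmpty).keyBy(_.head)

// Co-partition both sides on the key, then join: only strings sharing a
// first letter get paired, instead of every string with every other one.
val part = new HashPartitioner(16)
val candidates = byFirstA.partitionBy(part).join(byFirstB.partitionBy(part))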
On May 25, 2016 8:04 PM, "Priya Ch" wrote:
Hi,
RDD A is of size 30 MB and RDD B is of size 8 MB. Upon matching, we would
like to filter out the strings that have a greater than 85% match and
generate a score for them, which is used in the subsequent calculations.
I tried generating a pair RDD from both RDDs A and B with the same key for
all the …
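A rough sketch of the fuzzy scoring described above, assuming a
Levenshtein-based similarity (the thread does not name a metric); the
levenshtein and score helpers are plain illustrations, not Spark APIs, and
candidates refers to the keyed pair RDD from the first sketch:

// Plain Levenshtein edit distance; an illustrative helper, not a Spark API.
def levenshtein(a: String, b: String): Int = {
  val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (j == 0) i else if (i == 0) j else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1),
                       d(i - 1)(j - 1) + cost)
  }
  d(a.length)(b.length)
}

// Similarity in [0, 1]: 1.0 means identical; pairs scoring above 0.85 are kept.
def score(a: String, b: String): Double =
  1.0 - levenshtein(a, b).toDouble / math.max(math.max(a.length, b.length), 1)

val matches = candidates
  .map { case (_, (a, b)) => (a, b, score(a, b)) }
  .filter { case (_, _, s) => s > 0.85 }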
Solr or Elasticsearch provide much more functionality and are faster in this
context. The decision for or against them depends on your current and future
use cases. Your current use case is still very abstract, so in order to get a
more proper recommendation you need to provide more details.
Why do I need to deploy Solr for text analytics? I have files placed in
HDFS; I just need to look for matches against each string in both files and
generate those records whose match is > 85%. We are trying fuzzy match
logic.
How can I use map/reduce operations across 2 RDDs?
Thanks,
Padma Ch
Alternatively, depending on the exact use case, you may employ Solr on Hadoop
for text analytics.
No, this is not needed; look at the map/reduce operations and the standard
Spark word count.
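For reference, the standard Spark word count referred to here looks roughly
like this (the input and output paths are placeholders):

// The canonical Spark word count: map each word to a count of 1,
// then reduce by key to sum the counts per word.
val counts = sc.textFile("hdfs:///input/strings")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///output/counts")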
On 25 May 2016, at 12:57, Priya Ch wrote:
Let's say I have RDD A of strings as {"hi","bye","ch"} and another RDD B of
strings as {"padma","hihi","chch","priya"}. For every string in RDD A I need
to check the matches found in RDD B; for string "hi" I have to check the
matches against all strings in RDD B, which means I need to generate a
Cartesian product of the two RDDs.
What is the use case for this? A Cartesian product is by definition slow in any
system. Why do you need it? How long does your application take now?
On 25 May 2016, at 12:42, Priya Ch wrote:
I tried
dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even
this is taking too much time.
Thanks,
Padma Ch
On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro wrote:
Why did you use Rdd#saveAsTextFile instead of DataFrame#save writing as
parquet, orc, ...?
// maropu
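A sketch of that direct write, assuming joinedDf is the joined DataFrame;
the name and the output path are placeholders:

// Write the joined DataFrame directly as Parquet instead of converting
// it to an RDD of strings first.
joinedDf.write.format("parquet").save("/hdfs_path/joined_parquet")
// or, equivalently:
// joinedDf.write.parquet("/hdfs_path/joined_parquet")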
On Wed, May 25, 2016 at 7:10 PM, Priya Ch wrote:
Hi, yes, I have joined using DataFrame join. Now, to save this into HDFS, I
am converting the joined DataFrame to an RDD (dataframe.rdd) and using
saveAsTextFile to save it. However, this is also taking too much time.
Thanks,
Padma Ch
On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro wrote:
Hi,
Seems you'd be better off using DataFrame#join instead of RDD#cartesian,
because the latter always needs shuffle operations, which have a lot of
overheads such as reflection, serialization, ...
In your case, since the smaller table is 7 MB, DataFrame#join uses a
broadcast strategy.
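A sketch of such a broadcast join; dfA, dfB, and the column name "key" are
illustrative:

import org.apache.spark.sql.functions.broadcast

// Hint the small side explicitly. Spark also broadcasts automatically when
// the smaller side is under spark.sql.autoBroadcastJoinThreshold
// (10 MB by default), which covers a 7 MB table.
val joined = dfA.join(broadcast(dfB), dfA("key") === dfB("key"))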
It is basically a Cartesian join, as in an RDBMS.
Example:
SELECT * FROM FinancialCodes, FinancialData
The result of this query matches every row in the FinancialCodes table
with every row in the FinancialData table. Each row consists of all
columns from the FinancialCodes table followed by all columns from the
FinancialData table.
Hi All,
I have two RDDs A and B, where A is of size 30 MB and B is of size 7 MB.
A.cartesian(B) is taking too much time. Is there any bottleneck in the
cartesian operation?
I am using Spark version 1.6.0.
Regards,
Padma Ch