Re: Python + Spark unable to connect to S3 bucket .... "Invalid hostname in URI"

2014-08-14 Thread Miroslaw
So after doing some more research I found the root cause of the problem: the
bucket name we were using contained an underscore ('_'), which violates the
current S3 bucket naming requirements. Using a bucket whose name does not
contain an underscore solved the issue.

If anyone else runs into this problem, I hope this will help them out. 
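For anyone who wants a concrete reference, here is a minimal PySpark sketch of the working setup; the bucket and key names are made up, and S3 credentials are assumed to be configured already (e.g. in core-site.xml):

from pyspark import SparkContext

sc = SparkContext(appName="s3-read-example")

# Bucket names containing underscores (e.g. "my_data_bucket") break the
# hostname the s3n:// connector builds for the request, which surfaces as
# "Invalid hostname in URI". A hyphenated bucket name works fine.
rdd = sc.textFile("s3n://my-data-bucket/logs/2014-08-14/*")
print(rdd.count())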



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-unable-to-connect-to-S3-bucket-Invalid-hostname-in-URI-tp12076p12169.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark SQL Stackoverflow error

2014-08-14 Thread Cheng, Hao
I couldn't reproduce the exception; it has probably been fixed in the latest code.

From: Vishal Vibhandik [mailto:vishal.vibhan...@gmail.com]
Sent: Thursday, August 14, 2014 11:17 AM
To: user@spark.apache.org
Subject: Spark SQL Stackoverflow error

Hi,
I tried running the sample SQL code JavaSparkSQL but keep getting this error:

the error comes on line
JavaSchemaRDD teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 
AND age <= 19");

C:\>spark-submit --class org.apache.spark.examples.sql.JavaSparkSQL --master 
local \spark-1.0.2-bin-hadoop1\lib\spark-examples-1.0.2-hadoop1.0.4.jar
=== Data source: RDD ===
WARN  org.apache.spark.util.SizeEstimator: Failed to check whether 
UseCompressedOops is set; assuming yes
Exception in thread "main" java.lang.StackOverflowError
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)



None in RDD

2014-08-14 Thread guoxu1231
Hi Guys, 

I have a serious problem regarding 'None' values in RDDs (PySpark).

Take an example of a transformation that produces 'None':
leftOuterJoin(self, other, numPartitions=None)
Perform a left outer join of self and other. For (K, V) and (K, W), returns a
dataset of (K, (V, W)) pairs with all pairs of elements for each key.

Because it is a left outer join, the result RDD also contains None as *(K, (V,
None))*. The None becomes a problem in subsequent transformations: every
transformation needs to check for None, otherwise an error will be thrown.


Another example is the CSV load below:
MOV = sc.textFile('/movie.csv')
MOV = MOV.map(lambda strLine: strLine.split(",")) \
         .map(lambda data: {"MOVIE_ID": int(data[0]),
                            "MOVIE_NAME": str(data[1]),
                            "MOVIE_DIRECTOR": str(data[2])})

Each line of the CSV file is expected to have 3 comma-separated fields, but
some dirty records may have only 2. Then "MOVIE_DIRECTOR": str(data[2]) is
dangerous (IndexError: list index out of range).


It is common to check for None or illegal formats in a general-purpose
programming language. For big data programming, however, it is tedious to
check for None or illegal data everywhere, even though illegal data is to be
expected.

Apache Pig has special handling for nulls, which looks better: no explicit
null check is needed, and illegal data is taken care of as well.
http://pig.apache.org/docs/r0.12.1/basic.html#nulls

For Spark, what is the best practice for handling None and illegal data as in
the examples above?
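Not an official best practice, but here is a minimal PySpark sketch of the pattern I have seen used for both cases (the file path and field names are just this thread's example):

from pyspark import SparkContext

sc = SparkContext(appName="none-handling-example")

# Case 1: leftOuterJoin produces (K, (V, None)) for unmatched keys.
# Either filter those pairs out, or substitute a default with mapValues.
left = sc.parallelize([("a", 1), ("b", 2)])
right = sc.parallelize([("a", 10)])
joined = left.leftOuterJoin(right)  # contains ("b", (2, None))
cleaned = joined.mapValues(lambda vw: (vw[0], vw[1] if vw[1] is not None else 0))

# Case 2: dirty CSV rows with too few fields. flatMap plus try/except
# drops unparseable lines instead of failing the whole job.
def parse(line):
    try:
        fields = line.split(",")
        return [{"MOVIE_ID": int(fields[0]),
                 "MOVIE_NAME": str(fields[1]),
                 "MOVIE_DIRECTOR": str(fields[2])}]
    except (IndexError, ValueError):
        return []  # skip malformed records

MOV = sc.textFile('/movie.csv').flatMap(parse)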




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/None-in-RDD-tp12167.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: spark streaming - lamda architecture

2014-08-14 Thread Shao, Saisai
Hi Ali,

Maybe you can take a look at Twitter's Summingbird project
(https://github.com/twitter/summingbird), which is currently one of the few
open-source implementations of the Lambda Architecture. There is an ongoing
sub-project called summingbird-spark that might be what you are looking for;
hopefully this helps.

Thanks
Jerry

-Original Message-
From: salemi [mailto:alireza.sal...@udo.edu] 
Sent: Friday, August 15, 2014 11:25 AM
To: u...@spark.incubator.apache.org
Subject: Re: spark streaming - lamda architecture

Below is what I understand by lambda architecture: the batch layer provides
the historical data and the speed layer provides the real-time view.

All data entering the system is dispatched to both the batch layer and the
speed layer for processing.
The batch layer has two functions:
(i) managing the master dataset (an immutable, append-only set of raw data), and
(ii) pre-computing the batch views.

The speed layer compensates for the high latency of updates to the serving
layer and deals with recent data only.

The serving layer indexes the batch views so that they can be queried in a
low-latency, ad-hoc way.

Any incoming query can be answered by merging results from the batch views
and the real-time views.

In my system I have events coming in from Kafka sources; currently we need to
process 10,000 messages per second, write them out to HDFS, and make them
available to be queried by a serving layer.

What would be your suggestion for architecturally solving this? Approximately
how many components, and which ones, would be needed for the proposed
architecture?

Thanks,
Ali


Tathagata Das wrote
> Can you be a bit more specific about what you mean by lambda architecture?
> 
> 
> On Thu, Aug 14, 2014 at 2:27 PM, salemi <

> alireza.salemi@

> > wrote:
> 
>> Hi,
>>
>> How would you implement the batch layer of lamda architecture with 
>> spark/spark streaming?
>>
>> Thanks,
>> Ali
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-l
>> amda-architecture-tp12142.html Sent from the Apache Spark User List 
>> mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: 

> user-unsubscribe@.apache

>> For additional commands, e-mail: 

> user-help@.apache

>>
>>






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142p12163.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark streaming - lamda architecture

2014-08-14 Thread Michael Hausenblas

>>> How would you implement the batch layer of lamda architecture with
>>> spark/spark streaming?

I assume you're familiar with resources such as
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark and
are after more detailed advice?

Cheers,
Michael

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

On 15 Aug 2014, at 05:25, salemi  wrote:

> below is what is what I understand under lambda architecture. The batch layer
> provides the historical data and the speed layer provides the real-time
> view!
> 
> All data entering the system is dispatched to both the batch layer and the
> speed layer for processing.
> The batch layer has two functions: 
> (i) managing the master dataset (an immutable, append-only set of raw data),
> and 
> (ii) to pre-compute the batch views.
> 
> The speed layer compensates for the high latency of updates to the serving
> layer and deals with recent data only.
> 
> The serving layer indexes the batch views so that they can be queried in
> low-latency, ad-hoc way.
> 
> Any incoming query can be answered by merging results from batch views and
> real-time views.
> 
> In my system I have events coming in from Kafka sources and currently we
> need to process 10,000 messages per second and write them out to hdfs and
> make them available to be queried by a serving layer.
> 
> What would be your suggestion to architecturally solve this issue? How many
> solution with which would approx. be needed for the proposed architecture.
> 
> Thanks,
> Ali
> 
> 
> Tathagata Das wrote
>> Can you be a bit more specific about what you mean by lambda architecture?
>> 
>> 
>> On Thu, Aug 14, 2014 at 2:27 PM, salemi <
> 
>> alireza.salemi@
> 
>> > wrote:
>> 
>>> Hi,
>>> 
>>> How would you implement the batch layer of lamda architecture with
>>> spark/spark streaming?
>>> 
>>> Thanks,
>>> Ali
>>> 
>>> 
>>> 
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> 
>>> -
>>> To unsubscribe, e-mail: 
> 
>> user-unsubscribe@.apache
> 
>>> For additional commands, e-mail: 
> 
>> user-help@.apache
> 
>>> 
>>> 
> 
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142p12163.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark working directories

2014-08-14 Thread Calvin
I've had this issue too running Spark 1.0.0 on YARN with HDFS: it
defaults to a working directory located in hdfs:///user/$USERNAME and
it's not clear how to set the working directory.

In the case where HDFS has a non-standard directory structure (i.e.,
home directories located in hdfs:///users/) Spark jobs will fail.

The MapReduce setting is "mapreduce.job.working.dir". Is there a Spark
equivalent?

On Thu, Aug 14, 2014 at 7:24 PM, Yana Kadiyska  wrote:
> Hi all, trying to change defaults of where stuff gets written.
>
> I've set "-Dspark.local.dir=/spark/tmp" and I can see that the setting is
> used when the executor is started.
>
> I do indeed see directories like spark-local-20140815004454-bb3f in this
> desired location but I also see undesired stuff under /tmp
>
> usr@executor:~# ls /tmp/spark-93f4d44c-ff4d-477d-8930-5884b10b065f/
> files  jars
>
> usr@driver: ls /tmp/spark-7e456342-a58c-4439-ab69-ff8e6d6b56a5/
> files jars
>
> Is there a way to move these directories off of /tmp? I am running 0.9.1
> (SPARK_WORKER_DIR is also exported on all nodes though all that I see there
> are executor logs)
>
> Thanks

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark streaming - lamda architecture

2014-08-14 Thread salemi
Below is what I understand by lambda architecture: the batch layer provides
the historical data and the speed layer provides the real-time view.

All data entering the system is dispatched to both the batch layer and the
speed layer for processing.
The batch layer has two functions:
(i) managing the master dataset (an immutable, append-only set of raw data), and
(ii) pre-computing the batch views.

The speed layer compensates for the high latency of updates to the serving
layer and deals with recent data only.

The serving layer indexes the batch views so that they can be queried in a
low-latency, ad-hoc way.

Any incoming query can be answered by merging results from the batch views
and the real-time views.

In my system I have events coming in from Kafka sources; currently we need to
process 10,000 messages per second, write them out to HDFS, and make them
available to be queried by a serving layer.

What would be your suggestion for architecturally solving this? Approximately
how many components, and which ones, would be needed for the proposed
architecture?

Thanks,
Ali
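For concreteness, a rough PySpark sketch of the batch/speed split described above; the host, port, and paths are made up, and it assumes a Spark release with the Python streaming API (in older releases the same thing would be written in Scala or Java):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="lambda-sketch")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# Stand-in for the Kafka source: a socket stream of raw event lines.
events = ssc.socketTextStream("localhost", 9999)

# Batch layer: append every micro-batch to the immutable master dataset on
# HDFS; periodic batch jobs then recompute the batch views from it.
events.saveAsTextFiles("hdfs:///data/master/events")

# Speed layer: a cheap real-time view over recent data only, here a rolling
# count over the last 60 seconds, updated every 10 seconds.
events.window(60, 10).count().pprint()

ssc.start()
ssc.awaitTermination()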


Tathagata Das wrote
> Can you be a bit more specific about what you mean by lambda architecture?
> 
> 
> On Thu, Aug 14, 2014 at 2:27 PM, salemi <

> alireza.salemi@

> > wrote:
> 
>> Hi,
>>
>> How would you implement the batch layer of lamda architecture with
>> spark/spark streaming?
>>
>> Thanks,
>> Ali
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: 

> user-unsubscribe@.apache

>> For additional commands, e-mail: 

> user-help@.apache

>>
>>






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142p12163.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark webUI - application details page

2014-08-14 Thread Andrew Or
Hi SK,

Not sure if I understand you correctly, but here is how the user normally
uses the event logging functionality:

After setting "spark.eventLog.enabled" and optionally "spark.eventLog.dir",
the user runs his/her Spark application and calls sc.stop() at the end of it.
Then he/she goes to the standalone Master UI (under http://:8080 by default)
and clicks on the application under the Completed Applications table. This
links to the Spark UI of the finished application in its completed state,
under a path that looks like "http://:8080/history/". It won't be on
"http://localhost:4040" anymore because that port is now freed for new
applications to bind their SparkUIs to. To access the file that stores the
raw statistics, go to the file specified in "spark.eventLog.dir". This is
"/tmp/spark-events" by default, though in Spark 1.0.1 it may be in HDFS under
the same path.

I could be misunderstanding what you mean by the stats being buried in the
console output, because the events are not logged to the console but to a
file in "spark.eventLog.dir". For all of this to work, of course, you have to
run Spark in standalone mode (i.e. with master set to spark://:7077). In
other modes, you will need to use the history server instead.

Does this make sense?
Andrew


2014-08-14 18:08 GMT-07:00 SK :

> More specifically, as indicated by Patrick above, in 1.0+, apps will have
> persistent state so that the UI can be reloaded. Is there a way to enable
> this feature in 1.0.1?
>
> thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12157.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SparkR: split, apply, combine strategy for dataframes?

2014-08-14 Thread Shivaram Venkataraman
Could you try increasing the number of slices with the large data set?
SparkR assumes that each slice (or partition, in Spark terminology) can fit
in the memory of a single machine. Also, is the error happening when you run
the map function, or when you combine the results?

Thanks
Shivaram


On Thu, Aug 14, 2014 at 3:53 PM, Carlos J. Gil Bellosta <
gilbello...@gmail.com> wrote:

> Hello,
>
> I am having problems trying to apply the split-apply-combine strategy
> for dataframes using SparkR.
>
> I have a largish dataframe and I would like to achieve something similar
> to what
>
> ddply(df, .(id), foo)
>
> would do, only that using SparkR as computing engine. My df has a few
> million records and I would like to split it by "id" and operate on
> the pieces. These pieces are quite small in size: just a few hundred
> records.
>
> I do something along the following lines:
>
> 1) Use split to transform df into a list of dfs.
> 2) parallelize the resulting list as a RDD (using a few thousand slices)
> 3) map my function on the pieces using Spark.
> 4) recombine the results (do.call, rbind, etc.)
>
> My cluster works and I can perform medium sized batch jobs.
>
> However, it fails with my full df: I get a heap space out of memory
> error. It is funny as the slices are very small in size.
>
> Should I send smaller batches to my cluster? Is there any recommended
> general approach to these kind of split-apply-combine problems?
>
> Best,
>
> Carlos J. Gil Bellosta
> http://www.datanalytics.com
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark streaming - lamda architecture

2014-08-14 Thread Tathagata Das
Can you be a bit more specific about what you mean by lambda architecture?


On Thu, Aug 14, 2014 at 2:27 PM, salemi  wrote:

> Hi,
>
> How would you implement the batch layer of lamda architecture with
> spark/spark streaming?
>
> Thanks,
> Ali
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark working directories

2014-08-14 Thread Yana Kadiyska
Hi all, trying to change defaults of where stuff gets written.

I've set "-Dspark.local.dir=/spark/tmp" and I can see that the setting is
used when the executor is started.

I do indeed see directories like spark-local-20140815004454-bb3f in this
desired location but I also see undesired stuff under /tmp

usr@executor:~# ls /tmp/spark-93f4d44c-ff4d-477d-8930-5884b10b065f/
files  jars

usr@driver: ls /tmp/spark-7e456342-a58c-4439-ab69-ff8e6d6b56a5/
files jars

Is there a way to move these directories off of /tmp? I am running
0.9.1 (SPARK_WORKER_DIR
is also exported on all nodes though all that I see there are executor logs)

Thanks
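For what it's worth, a minimal sketch of setting that property programmatically; the path is just an example, and I'm not certain the files/jars staging directories under /tmp honor it in 0.9.1:

from pyspark import SparkConf, SparkContext

# spark.local.dir controls where shuffle files and spilled data are written;
# it has to be set before the SparkContext (and its executors) start.
conf = (SparkConf()
        .setAppName("local-dir-example")
        .set("spark.local.dir", "/spark/tmp"))
sc = SparkContext(conf=conf)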


Re: Spark webUI - application details page

2014-08-14 Thread SK
More specifically, as indicated by Patrick above, in 1.0+, apps will have
persistent state so that the UI can be reloaded. Is there a way to enable
this feature in 1.0.1?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12157.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark webUI - application details page

2014-08-14 Thread SK
I set "spark.eventLog.enabled" to true in
$SPARK_HOME/conf/spark-defaults.conf and also configured logging to a file as
well as the console in log4j.properties. But I am not able to get a log of
the statistics in a file. On the console there are a lot of log messages
mixed in with the stats, so it is hard to separate them out. I prefer the
online format that appears on localhost:4040 - it is clearer. I am running
the job in standalone mode on my local machine. Is there some way to recreate
the stats online after the job has completed?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12156.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Getting hadoop distcp to work on ephemeral-hsfs in spark-ec2 cluster

2014-08-14 Thread Arpan Ghosh
Hi,

I have launched an AWS Spark cluster using the spark-ec2 script
(--hadoop-major-version=1). The ephemeral-HDFS is set up correctly and I can
see the name node at :50070. When I try to copy files from S3 into
ephemeral-HDFS with distcp, using the following command:

ephemeral-hdfs/bin/hadoop distcp  hdfs://:9001/data-platform/backfill/weekly-vel-acc-data-tables/data/drive-sample-distcp

I get the following:

Copy failed: java.net.ConnectException: Call to
ec2-54-89-53-102.compute-1.amazonaws.com/10.146.200.172:9001 failed on
connection exception: java.net.ConnectException: Connection refused

at org.apache.hadoop.ipc.Client.wrapException(Client.java:1099)

at org.apache.hadoop.ipc.Client.call(Client.java:1075)

at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)

at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown Source)

at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)

at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)

at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:480)

at org.apache.hadoop.mapred.JobClient.init(JobClient.java:474)

at org.apache.hadoop.mapred.JobClient.(JobClient.java:457)

at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1015)

at org.apache.hadoop.tools.DistCp.copy(DistCp.java:666)

at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)

at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)

Caused by: java.net.ConnectException: Connection refused

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)

at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)

at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)

at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)

at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)

at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)

at org.apache.hadoop.ipc.Client.getConnection(Client.java:1206)

at org.apache.hadoop.ipc.Client.call(Client.java:1050)

... 13 more


Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread lancezhange
I finally solved the problem with the following code:

var m: org.apache.spark.mllib.classification.LogisticRegressionModel = null

m = newModel   // newModel is the loaded one, see above post of mine

val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
  val prediction = m.predict(point.features)
  (point.label, prediction)
}  // this works!

Thanks to Shixiong for the illustrative code that led me to this solution.

By the way, according to this git commit, private[mllib] will be removed from
the linear models' constructors: "This is part of SPARK-2495 to allow users
to construct linear models manually."



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12154.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Compiling SNAPTSHOT

2014-08-14 Thread Jim Blomo
Tracked this down to an incompatibility between Scala's long generated class
file names and eCryptfs (hence the "File name too long" error). Resolved by
compiling in a directory not mounted with encryption (e.g. /tmp).

On Thu, Aug 14, 2014 at 3:25 PM, Jim Blomo  wrote:
> Hi, I'm having trouble compiling a snapshot, any advice would be
> appreciated.  I get the error below when compiling either master or
> branch-1.1.  The key error is, I believe, "[ERROR] File name too long"
> but I don't understand what it is referring to.  Thanks!
>
>
> ./make-distribution.sh --tgz --skip-java-test -Phadoop-2.4  -Pyarn
> -Dyarn.version=2.4.0
>
> [ERROR]
>  while compiling:
> /home/jblomo/src/spark/core/src/test/scala/org/apache/spark/util/random/XORShiftRandomSuite.scala
> during phase: jvm
>  library version: version 2.10.4
> compiler version: version 2.10.4
>   reconstructed args: -classpath
> /home/jblomo/src/spark/core/target/scala-2.10/test-classes:/home/jblomo/src/spark/core/target/scala-2.10/classes:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-client/2.4.0/hadoop-client-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-common/2.4.0/hadoop-common-2.4.0.jar:/home/jblomo/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/home/jblomo/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/home/jblomo/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/home/jblomo/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar:/home/jblomo/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar:/home/jblomo/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar:/home/jblomo/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/home/jblomo/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/home/jblomo/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/home/jblomo/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/home/jblomo/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.8.8/jackson-core-asl-1.8.8.jar:/home/jblomo/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.8.8/jackson-mapper-asl-1.8.8.jar:/home/jblomo/.m2/repository/org/apache/avro/avro/1.7.6/avro-1.7.6.jar:/home/jblomo/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-auth/2.4.0/hadoop-auth-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar:/home/jblomo/.m2/repository/org/tukaani/xz/1.0/xz-1.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.4.0/hadoop-hdfs-2.4.0.jar:/home/jblomo/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.4.0/hadoop-mapreduce-client-app-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-common/2.4.0/hadoop-mapreduce-client-common-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.4.0/hadoop-yarn-client-2.4.0.jar:/home/jblomo/.m2/repository/com/sun/jersey/jersey-client/1.9/jersey-client-1.9.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-server-common/2.4.0/hadoop-yarn-server-common-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.4.0/hadoop-mapreduce-client-shuffle-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-api/2.4.0/hadoop-yarn-api-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.4.0/hadoop-mapreduce-client-core-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-common/2.4.0/hadoop-yarn-common-2.4.0.jar:/home/jblomo/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/home/jblomo/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/home/jblomo/.m2/repository/javax/activation/activation/1.1/activation-1.1.jar:/home/jblomo/.m2/repository/com/sun/jersey/jersey-core/1.9/jersey-core-1.9.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.4.0/hadoop-mapreduce-client-jobclient-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-annotations/2.4.0/hadoop-annotations-2.4.0.jar:/home/jblomo/.m2/repository/net/java/dev/jets3t/jets3t/0.9.0/jets
3t-0.9.0.jar:/home/jblomo/.m2/repository/commons-codec/commons-codec/1.5/commons-codec-1.5.jar:/home/jblomo/.m2/repository/org/apache/httpcomponents/httpclient/4.1.2/httpclient-4.1.2.jar:/home/jblomo/.m2/repository/org/apache/httpcomponents/httpcore/4.1.2/httpcore-4.1.2.jar:/home/jblomo/.m2/repository/com/jamesmurty/utils/java-xmlbuilder/0.4/java-xmlbuilder-0.4.jar:/home/jblomo/.m2/repository/org/apache/curator/curator-recipes/2.4.0/curator-recipes-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/curator/curator-framework/2.4.0/curator-framework-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/curator/cu

Re: Spark webUI - application details page

2014-08-14 Thread Andrew Or
Hi all,

As Simon explained, you need to set "spark.eventLog.enabled" to true.

I'd like to add that the usage of SPARK_JAVA_OPTS to set spark
configurations is deprecated. I'm sure many of you have noticed this from
the scary warning message we print out. :) The recommended and supported
way of setting this is by adding the line "spark.eventLog.enabled true" to
$SPARK_HOME/conf/spark-defaults.conf. This will be picked up by Spark
submit and passed to your application.

Cheers,
Andrew
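If you prefer setting it from application code rather than spark-defaults.conf, here is a minimal PySpark sketch (the HDFS path is just an example):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("event-log-example")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "hdfs:///user/myname/spark-events"))
sc = SparkContext(conf=conf)

# ... run the job ...

# sc.stop() is what marks the application as completed, so the Master
# (or history server) can render its UI afterwards.
sc.stop()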


2014-08-14 15:45 GMT-07:00 durin :

> If I don't understand you wrong, setting event logging in the
> SPARK_JAVA_OPTS
> should achieve what you want. I'm logging to the HDFS, but according to the
> config page    a
> folder should be possible as well.
>
> Example with all other settings removed:
>
> SPARK_JAVA_OPTS="-Dspark.eventLog.enabled=true
> -Dspark.eventLog.dir=hdfs://idp11:9100/user/myname/logs/"
>
> This works with the Spark shell, I haven't tested other applications
> though.
>
>
> Note that the completed applications will disappear from the list if you
> restart Spark completely, even though they'll still be stored in the log
> folder.
>
>
> Best regards,
> Simon
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12150.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


SparkR: split, apply, combine strategy for dataframes?

2014-08-14 Thread Carlos J. Gil Bellosta
Hello,

I am having problems trying to apply the split-apply-combine strategy
for dataframes using SparkR.

I have a largish dataframe and I would like to achieve something similar to what

ddply(df, .(id), foo)

would do, only that using SparkR as computing engine. My df has a few
million records and I would like to split it by "id" and operate on
the pieces. These pieces are quite small in size: just a few hundred
records.

I do something along the following lines:

1) Use split to transform df into a list of dfs.
2) parallelize the resulting list as a RDD (using a few thousand slices)
3) map my function on the pieces using Spark.
4) recombine the results (do.call, rbind, etc.)

My cluster works and I can perform medium sized batch jobs.

However, it fails with my full df: I get a heap space out of memory
error. It is funny as the slices are very small in size.

Should I send smaller batches to my cluster? Is there any recommended
general approach to these kind of split-apply-combine problems?

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark webUI - application details page

2014-08-14 Thread durin
If I'm not misunderstanding you, setting event logging in SPARK_JAVA_OPTS
should achieve what you want. I'm logging to HDFS, but according to the
config page, a folder should be possible as well.

Example with all other settings removed:

SPARK_JAVA_OPTS="-Dspark.eventLog.enabled=true
-Dspark.eventLog.dir=hdfs://idp11:9100/user/myname/logs/"

This works with the Spark shell, I haven't tested other applications though.


Note that the completed applications will disappear from the list if you
restart Spark completely, even though they'll still be stored in the log
folder.


Best regards,
Simon



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12150.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
There may be cases where you want to adjust the number of partitions or
explicitly call RDD.repartition or RDD.coalesce. However, I would start
with the defaults and then adjust if necessary to improve performance (for
example, if cores are idling because there aren't enough tasks you may want
more partitions).

Looks like PairRDDFunctions has a countByKey (though you'd still need to
distinct first). You might also look at combineByKey and foldByKey as
alternatives to reduceByKey or groupByKey.
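A small PySpark illustration of the calls mentioned above, with toy data and arbitrary partition counts:

from pyspark import SparkContext

sc = SparkContext(appName="partitioning-example")

pairs = sc.parallelize([("a", 1), ("a", 1), ("b", 2)], 4)

# Adjust partition counts only if the defaults underperform.
more = pairs.repartition(16)   # full shuffle
fewer = more.coalesce(4)       # avoids a full shuffle when shrinking

# countByKey returns a dict to the driver; call distinct() first if you
# only want to count unique (key, value) pairs per key.
counts = fewer.distinct().countByKey()

# reduceByKey / foldByKey / combineByKey aggregate on the map side and are
# usually preferable to groupByKey for large data sets.
sums = fewer.reduceByKey(lambda a, b: a + b).collect()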


On Thu, Aug 14, 2014 at 4:14 PM, bdev  wrote:

> Thanks Daniel for the detailed information. Since the RDD is already
> partitioned, there is no need to worry about repartitioning.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Ways-to-partition-the-RDD-tp12083p12136.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Compiling SNAPTSHOT

2014-08-14 Thread Jim Blomo
Hi, I'm having trouble compiling a snapshot, any advice would be
appreciated.  I get the error below when compiling either master or
branch-1.1.  The key error is, I believe, "[ERROR] File name too long"
but I don't understand what it is referring to.  Thanks!


./make-distribution.sh --tgz --skip-java-test -Phadoop-2.4  -Pyarn
-Dyarn.version=2.4.0

[ERROR]
 while compiling:
/home/jblomo/src/spark/core/src/test/scala/org/apache/spark/util/random/XORShiftRandomSuite.scala
during phase: jvm
 library version: version 2.10.4
compiler version: version 2.10.4
  reconstructed args: -classpath
/home/jblomo/src/spark/core/target/scala-2.10/test-classes:/home/jblomo/src/spark/core/target/scala-2.10/classes:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-client/2.4.0/hadoop-client-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-common/2.4.0/hadoop-common-2.4.0.jar:/home/jblomo/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/home/jblomo/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/home/jblomo/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/home/jblomo/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar:/home/jblomo/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar:/home/jblomo/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar:/home/jblomo/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/home/jblomo/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/home/jblomo/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/home/jblomo/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/home/jblomo/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.8.8/jackson-core-asl-1.8.8.jar:/home/jblomo/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.8.8/jackson-mapper-asl-1.8.8.jar:/home/jblomo/.m2/repository/org/apache/avro/avro/1.7.6/avro-1.7.6.jar:/home/jblomo/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-auth/2.4.0/hadoop-auth-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar:/home/jblomo/.m2/repository/org/tukaani/xz/1.0/xz-1.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.4.0/hadoop-hdfs-2.4.0.jar:/home/jblomo/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.4.0/hadoop-mapreduce-client-app-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-common/2.4.0/hadoop-mapreduce-client-common-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.4.0/hadoop-yarn-client-2.4.0.jar:/home/jblomo/.m2/repository/com/sun/jersey/jersey-client/1.9/jersey-client-1.9.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-server-common/2.4.0/hadoop-yarn-server-common-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.4.0/hadoop-mapreduce-client-shuffle-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-api/2.4.0/hadoop-yarn-api-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.4.0/hadoop-mapreduce-client-core-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-yarn-common/2.4.0/hadoop-yarn-common-2.4.0.jar:/home/jblomo/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/home/jblomo/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/home/jblomo/.m2/repository/javax/activation/activation/1.1/activation-1.1.jar:/home/jblomo/.m2/repository/com/sun/jersey/jersey-core/1.9/jersey-core-1.9.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.4.0/hadoop-mapreduce-client-jobclient-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/hadoop/hadoop-annotations/2.4.0/hadoop-annotations-2.4.0.jar:/home/jblomo/.m2/repository/net/java/dev/jets3t/jets3t/0.9.0/jets3t
-0.9.0.jar:/home/jblomo/.m2/repository/commons-codec/commons-codec/1.5/commons-codec-1.5.jar:/home/jblomo/.m2/repository/org/apache/httpcomponents/httpclient/4.1.2/httpclient-4.1.2.jar:/home/jblomo/.m2/repository/org/apache/httpcomponents/httpcore/4.1.2/httpcore-4.1.2.jar:/home/jblomo/.m2/repository/com/jamesmurty/utils/java-xmlbuilder/0.4/java-xmlbuilder-0.4.jar:/home/jblomo/.m2/repository/org/apache/curator/curator-recipes/2.4.0/curator-recipes-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/curator/curator-framework/2.4.0/curator-framework-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/curator/curator-client/2.4.0/curator-client-2.4.0.jar:/home/jblomo/.m2/repository/org/apache/zookeeper/zookeeper/3.4.5/zookeeper-3.4.5.jar:/home/jblomo/.m2/repository/jline/jline/0.9.94/jline-0.9.94.jar:/home/jblomo/.m2/repository/o

Dealing with Idle shells

2014-08-14 Thread Gary Malouf
We have our quantitative team using Spark as part of their daily work.  One
of the more common problems we run into is that people unintentionally
leave their shells open throughout the day.  This eats up memory in the
cluster and causes others to have limited resources to run their jobs.

With something like Hive or many SQL database client applications, this is
not really an issue, but with Spark it is a significant inconvenience for
non-technical users. Someone ends up having to post in chats throughout the
day to check whether people are still using their shells or to tell them to
'get off the cluster'.

Just wondering if anyone else has experienced this type of issue and how
they are managing it.  One idea we've had is to implement an 'idle timeout'
monitor for the shell, though on the surface this appears quite
challenging.


Performance hit for using sc.setCheckPointDir

2014-08-14 Thread Debasish Das
Hi,

For our large ALS runs, we are considering using sc.setCheckPointDir so
that the intermediate factors are written to HDFS and the lineage is
broken...

Is there a comparison that shows the performance degradation due to this
option? If not, I will be happy to add experiments for it...

Thanks.
Deb
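Not an answer on the numbers, but for concreteness, a minimal PySpark sketch of the setup being discussed; the checkpoint path and ALS parameters are made up:

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="als-checkpoint-example")

# Checkpointing writes intermediate RDDs to HDFS and truncates the lineage,
# which keeps long iterative jobs like ALS from building huge DAGs.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

ratings = sc.parallelize([Rating(1, 1, 5.0), Rating(1, 2, 1.0),
                          Rating(2, 1, 4.0)])
model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01)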


Re: Spark webUI - application details page

2014-08-14 Thread SK
Hi,

I am using Spark 1.0.1. But I am still not able to see the stats for
completed apps on port 4040 - only for running apps. Is this feature
supported or is there a way to log this info to some file? I am interested
in stats about the total # of executors, total runtime, and total memory
used by my Spark program.

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12144.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark streaming - lamda architecture

2014-08-14 Thread salemi
Hi,

How would you implement the batch layer of lamda architecture with
spark/spark streaming?

Thanks,
Ali



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Seattle Spark Meetup: Spark at eBay - Troubleshooting the everyday issues Slides

2014-08-14 Thread Denny Lee
For those whom were not able to attend the Seattle Spark Meetup - Spark at eBay 
- Troubleshooting the Everyday Issues, the slides have been now posted at: 
http://files.meetup.com/12063092/SparkMeetupAugust2014Public.pdf.

Enjoy!
Denny



Re: Spark Akka/actor failures.

2014-08-14 Thread ldmtwo
The reason we are not using MLlib and Breeze is the lack of control over the
data and over performance. After computing the covariance matrix, there isn't
much more we can do with it: many of the methods are private. For now, we
need the max value and the corresponding pair of columns. Later, we may run
other algorithms. The MLlib covariance computes the means and the Gramian
matrix in parallel, but after that, I believe it's back to single-node
computation. We have to bring everything back to a single node to get the
max. Making it parallel again hasn't worked well either.

The reason we are using Spark is that we want a simple way to distribute data
and work in parallel. I would prefer a SIMD/MPI type of approach, but I have
to work within this framework, which is more of a MapReduce style.

I'm looking into getting the code you sent working. It won't allow me to
reduce by key.

RE: cartesian: I agree that it generates many copies of the data. That was a
last resort. It would be a huge benefit to everyone if we could access RDDs
like a list, array, or hash map.

Here is the covariance computation that works fast for us. We get the
averages first, O(N^2). Then the differences (Vi - Avgi), O(N^2). Then we
compute the covariance without having to redo the above steps, O(N^3). Note
that I'm using Java code to compute the covariance efficiently; the Scala
code was very slow in comparison. We can use JNI next to add HW acceleration.
Matrix is a HashMap here. Also note that I am using only the lower triangle.
I'm sure MLlib/Breeze makes similar optimizations too.

This covariance is based on the two-pass algorithm, but we may change to a
one-pass approximation.
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
http://commons.apache.org/proper/commons-math/jacoco/org.apache.commons.math3.stat.correlation/Covariance.java.html
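For anyone following along, a plain-Python sketch of the two-pass covariance described in the links above (no Spark, just the arithmetic):

def covariance_two_pass(x, y):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(x)
    # Pass 1: the mean of each column.
    mean_x = sum(x) / float(n)
    mean_y = sum(y) / float(n)
    # Pass 2: sum of products of the centered values.
    s = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return s / (n - 1)

# Example: covariance of two short columns.
print(covariance_two_pass([1.0, 2.0, 3.0], [2.0, 4.0, 7.0]))  # 2.5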







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Akka-actor-failures-tp12071p12140.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: java.lang.UnknownError: no bin was found for continuous variable.

2014-08-14 Thread Joseph Bradley
I have run into that issue too, but only when the data were not
pre-processed correctly.  E.g., if a categorical feature is binary with
values in {-1, +1} instead of {0,1}.  Will be very interested to learn if
it can occur elsewhere!
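In case it helps, a minimal PySpark sketch of the remapping being described; it assumes a Spark release that exposes the Python decision tree API, and the feature layout is made up:

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

sc = SparkContext(appName="dtree-binning-example")

# Hypothetical raw rows: label, a binary categorical feature encoded as
# -1/+1, and one continuous feature.
raw = sc.parallelize([(0.0, -1.0, 1.5), (1.0, 1.0, 2.5), (1.0, 1.0, 0.5)])

# Remap the categorical feature from {-1, +1} to {0, 1}: MLlib expects
# categorical values in the range 0 .. (arity - 1).
data = raw.map(lambda r: LabeledPoint(r[0], [(r[1] + 1.0) / 2.0, r[2]]))

# Feature 0 is categorical with 2 categories; feature 1 stays continuous.
model = DecisionTree.trainClassifier(data, numClasses=2,
                                     categoricalFeaturesInfo={0: 2},
                                     maxDepth=3)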


On Thu, Aug 14, 2014 at 10:16 AM, Sameer Tilak  wrote:

>
> Hi Yanbo,
> I think it was happening because some of the rows did not have all the
> columns. We are cleaning up the data and will let you know once we confirm
> this.
>
> --
> Date: Thu, 14 Aug 2014 22:50:58 +0800
> Subject: Re: java.lang.UnknownError: no bin was found for continuous
> variable.
> From: yanboha...@gmail.com
> To: ssti...@live.com
>
> Can you supply the detail code and data you used.
> From the log, it looks like can not find the bin for specific feature.
> The bin for continuous feature is a unit that covers a specific range of
> the feature.
>
>
> 2014-08-14 7:43 GMT+08:00 Sameer Tilak :
>
> Hi All,
>
> I am using the decision tree algorithm and I get the following error. Any
> help would be great!
>
>
> java.lang.UnknownError: no bin was found for continuous variable.
>  at
> org.apache.spark.mllib.tree.DecisionTree$.findBin$1(DecisionTree.scala:492)
> at
> org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$findBinsForLevel$1(DecisionTree.scala:529)
>  at
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:653)
> at
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:653)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>  at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
> at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
>  at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
> at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
>  at
> org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>  at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:744)
> 14/08/13 16:36:06 ERROR ExecutorUncaughtExceptionHandler: Uncaught
> exception in thread Thread[Executor task launch worker-0,5,main]
> java.lang.UnknownError: no bin was found for continuous variable.
> at
> org.apache.spark.mllib.tree.DecisionTree$.findBin$1(DecisionTree.scala:492)
> at
> org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$findBinsForLevel$1(DecisionTree.scala:529)
>  at
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:653)
> at
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:653)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>  at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
> at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
>  at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
> at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
>  at
> org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>  at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:744)
>
>
>


Spark on HDP

2014-08-14 Thread Padmanabh
Hi,

I was reading the documentation at http://hortonworks.com/labs/spark/
and it seems to say that Spark is not ready for the enterprise, which I think
is not quite right. What I think they meant is that Spark on HDP is not ready
for the enterprise. I was wondering if someone here is using Spark on HDP in
production applications and could share their experience with it.

Hopefully it's just their webpage being out of date.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Support for ORC Table in Shark/Spark

2014-08-14 Thread Zhan Zhang
I agree. We need support similar to the Parquet file support for end users.
That's the purpose of SPARK-2883.

Thanks.

Zhan Zhang

On Aug 14, 2014, at 11:42 AM, Yin Huai  wrote:

> I feel that using hadoopFile and saveAsHadoopFile to read and write ORCFile 
> are more towards developers because read/write options have to be manually 
> populated. Seems those new APIs were added by 
> https://issues.apache.org/jira/browse/HIVE-5728.
> 
> For using ORCOutputFormat (the old one) with saveAsHadoopFile, I am not sure 
> it can work properly. Because getRecordWriter will be called, the ORCFile is 
> probably created in a wrong path 
> (https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java#L181).
> 




Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks Daniel for the detailed information. Since the RDD is already
partitioned, there is no need to worry about repartitioning. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Ways-to-partition-the-RDD-tp12083p12136.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-14 Thread Arpan Ghosh
The errors are occurring at the exact same point in the job as well...
right at the end of the groupByKey() when 5 tasks are left.


On Thu, Aug 14, 2014 at 12:59 PM, Arpan Ghosh  wrote:

> Hi Davies,
>
> I tried the second option and launched my ec2 cluster with master on all
> the slaves by providing the latest commit hash of master as the
> "--spark-version" option to the spark-ec2 script. However, I am getting the
> same errors as before. I am running the job with the original
> spark-defaults.conf and spark-env.conf
>
>
> java.net.SocketException: Connection reset
>   at java.net.SocketInputStream.read(SocketInputStream.java:196)
>   at java.net.SocketInputStream.read(SocketInputStream.java:122)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>   at java.io.DataInputStream.readInt(DataInputStream.java:
> 387)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:101)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:154)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/08/14 19:48:54 ERROR python.PythonRDD: This may have been caused by a 
> prior exception:
> java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
>   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>   at java.io.DataOutputStream.write(DataOutputStream.java:107)
>   at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:329)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:327)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:327)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1249)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
>
> 14/08/14 19:48:54 ERROR executor.Executor: Exception in task 1112.0 in stage 
> 0.0 (TID 3513)
> java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
>   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>   at java.io.DataOutputStream.write(DataOutputStream.java:107)
>   at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:329)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:327)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:327)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1249)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
>
>
> 14/08/14 19:48:53 ERROR executor.Executor: Exception in task 315.0 in stage 
> 0.0 (TID 2716)
> java.net.SocketException: Connection reset

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-14 Thread Arpan Ghosh
Hi Davies,

I tried the second option and launched my ec2 cluster with master on all
the slaves by providing the latest commit hash of master as the
"--spark-version" option to the spark-ec2 script. However, I am getting the
same errors as before. I am running the job with the original
spark-defaults.conf and spark-env.conf

java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:196)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:101)
at 
org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/08/14 19:48:54 ERROR python.PythonRDD: This may have been caused by
a prior exception:
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:329)
at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:327)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1249)
at 
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)

14/08/14 19:48:54 ERROR executor.Executor: Exception in task 1112.0 in
stage 0.0 (TID 3513)
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:329)
at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:327)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1249)
at 
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)


14/08/14 19:48:53 ERROR executor.Executor: Exception in task 315.0 in
stage 0.0 (TID 2716)
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:196)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedI

Documentation to start with

2014-08-14 Thread Abhilash K Challa
Hi,

  Does anyone have documentation specifically for integrating Spark with a Hadoop
distribution that does not already include Spark?

Thanks,
Abhilash


Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
First, I think you might have a misconception about partitioning. ALL RDDs
are partitioned (even if they have only a single partition). When reading from
HDFS the number of partitions depends on how the data is stored in HDFS.
After data is shuffled (generally caused by things like reduceByKey), the
number of partitions will be whatever is set as the default parallelism
(see https://spark.apache.org/docs/latest/configuration.html). Most of
these methods allow you to specify a different number of partitions as a
parameter.

The number of partitions == the number of tasks. So generally you want enough
partitions to keep all of your cores busy (count cores, not nodes).

Also, to create a key you can use the map method to return a Tuple2 as
Santosh showed. You can also use keyBy.

If you just want to get the number of unique users, I would do something
like this:

val userIdColIndex: Int = ...
val pageIdColIndex: Int = ...
val inputPath: String = ...
// I'm assuming the user and page ID are both strings
val pageVisitorRdd: RDD[(String, String)] = sc.textFile(inputPath).map { record =>
  val colValues = record.split('\t')
  // You might want error handling in here - I'll assume all records are valid
  val userId = colValues(userIdColIndex)
  val pageId = colValues(pageIdColIndex)
  (pageId, userId) // Here's your key-value pair
}

So now that you have your pageId -> userId mappings, what to do with them?
Maybe the most obvious would be:

val uniquePageVisits: RDD[(String, Int)] =
  pageVisitorRdd.groupByKey().mapValues(_.toSet.size)

But groupByKey will be a bad choice if you have many visits per page
(you'll end up with a large collection in each record). It might be better
to start with a distinct, then map to (pageId, 1) and reduceByKey(_+_) to
get the sums.
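A minimal sketch of that alternative, reusing pageVisitorRdd from above:

val uniqueVisitsPerPage: RDD[(String, Int)] = pageVisitorRdd
  .distinct()                              // one record per (pageId, userId) pair
  .map { case (pageId, _) => (pageId, 1) }
  .reduceByKey(_ + _)                      // count distinct visitors per page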

I hope that helps.


On Thu, Aug 14, 2014 at 2:14 PM, bdev  wrote:

> Thanks, will give that a try.
>
> I see the number of partitions requested is 8 (through HashPartitioner(8)).
> If I have a 40 node cluster, whats the recommended number of partitions?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Ways-to-partition-the-RDD-tp12083p12128.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


How to transform large local files into Parquet format and write into HDFS?

2014-08-14 Thread Parthus
Hi there,

I have several large files (500GB per file) to transform into Parquet format
and write to HDFS. The problems I encountered can be described as follows:

1) At first, I tried to load all the records in a file and then used
"sc.parallelize(data)" to generate RDD and finally used
"saveAsNewAPIHadoopFile(...)" to write to HDFS. However, because each file
was too large to be handled by memory (500GB), it did not work.

2) Then, I tried to load certain number of records at a time, but I had to
launch a lot of "saveAsNewAPIHadoopFile(...)" tasks and the file directory
became two levels:

data/0/part0 --- part29
data/1/part0 --- part29
..
And when I tried to access the "data" directory to process all the parts, I
did not know the directory hierarchy.

I do not know if HDFS has the ability to get the hierarchy of a directory.
If so, my problem can be solved by utilizing that information. Another way
is to generate all the files in a flat directory, like:

data/part0  part1

And then the API "newAPIHadoopFile" can read all of them.

Any suggestions? Thanks very much.
 





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-transform-large-local-files-into-Parquet-format-and-write-into-HDFS-tp12131.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Support for ORC Table in Shark/Spark

2014-08-14 Thread Zhan Zhang
Yes, you are right, but I tried the old hadoopFile with OrcInputFormat. In Hive 0.12,
OrcStruct does not expose its API, so Spark cannot access it. With Hive 0.13, an RDD
can read from an ORC file. Btw, I didn't see ORCNewOutputFormat in hive-0.13.

Direct RDD manipulation (Hive13)

val inputRead = sc.hadoopFile(
  "/user/zzhang/orc_demo",
  classOf[org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],
  classOf[org.apache.hadoop.io.NullWritable],
  classOf[org.apache.hadoop.hive.ql.io.orc.OrcStruct])

val v = inputRead.map(pair => pair._2.toString)
val c = v.collect

Thanks.

Zhan Zhang

On Aug 14, 2014, at 11:12 AM, Yin Huai  wrote:

> Hi Zhan,
> 
> Thank you for trying it. For "directly manipulate ORCFile through RDD", do 
> you mean using hadoopFile and saveAsHadoopFile? For "some ORC API", do you 
> mean ORCNewInputFormat, ORCNewOutputFormat and ORCStruct?
> 
> Thanks,
> 
> Yin
> 
>> 
> 
> 
> 




Re: Subscribing to news releases

2014-08-14 Thread Nicholas Chammas
I've created an issue to track this: SPARK-3044: Create RSS feed for Spark
News 


On Fri, May 30, 2014 at 11:07 AM, Nick Chammas 
wrote:

> Is there a way to subscribe to news releases? That would be swell.
>
> Nick
>
>
> --
> View this message in context: Subscribing to news releases
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks, will give that a try. 

I see the number of partitions requested is 8 (through HashPartitioner(8)).
If I have a 40 node cluster, whats the recommended number of partitions?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Ways-to-partition-the-RDD-tp12083p12128.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using Hadoop InputFormat in Python

2014-08-14 Thread TJ Klein
Yes, thanks great. This seems to be the issue. 
At least running with spark-submit works as well.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-Hadoop-InputFormat-in-Python-tp12067p12126.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SPARK_LOCAL_DIRS

2014-08-14 Thread Debasish Das
Actually I faced it yesterday...

I had to put it in spark-env.sh and take it out from spark-defaults.conf on
1.0.1...Note that this settings should be visible on all workers..

After that I validated that SPARK_LOCAL_DIRS was indeed getting used for
shuffling...


On Thu, Aug 14, 2014 at 10:27 AM, Brad Miller 
wrote:

> Hi All,
>
> I'm having some trouble setting the disk spill directory for spark.  The
> following approaches set "spark.local.dir" (according to the "Environment"
> tab of the web UI) but produce the indicated warnings:
>
> *In spark-env.sh:*
> export SPARK_JAVA_OPTS=-Dspark.local.dir=/spark/spill
> *Associated warning:*
> 14/08/14 10:10:39 WARN SparkConf: In Spark 1.0 and later spark.local.dir
> will be overridden by the value set by the cluster manager (via
> SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
> 14/08/14 10:10:39 WARN SparkConf:
> SPARK_JAVA_OPTS was detected (set to '-Dspark.local.dir=/spark/spill').
> This is deprecated in Spark 1.0+.
> Please instead use...
>
> *In spark-defaults.conf:*
> spark.local.dir  /spark/spill
> *Associated warning:*
> 14/08/14 10:09:04 WARN SparkConf: In Spark 1.0 and later spark.local.dir
> will be overridden by the value set by the cluster manager (via
> SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
>
> The following does not produce any warnings, but also produces no sign of
> actually setting "spark.local.dir":
>
> *In spark-env.sh:*
> export SPARK_LOCAL_DIRS=/spark/spill
>
> Does anybody know whether SPARK_LOCAL_DIRS actually works as advertised,
> or if I am perhaps using it incorrectly?
>
> best,
> -Brad
>


SPARK_LOCAL_DIRS

2014-08-14 Thread Brad Miller
Hi All,

I'm having some trouble setting the disk spill directory for spark.  The
following approaches set "spark.local.dir" (according to the "Environment"
tab of the web UI) but produce the indicated warnings:

*In spark-env.sh:*
export SPARK_JAVA_OPTS=-Dspark.local.dir=/spark/spill
*Associated warning:*
14/08/14 10:10:39 WARN SparkConf: In Spark 1.0 and later spark.local.dir
will be overridden by the value set by the cluster manager (via
SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
14/08/14 10:10:39 WARN SparkConf:
SPARK_JAVA_OPTS was detected (set to '-Dspark.local.dir=/spark/spill').
This is deprecated in Spark 1.0+.
Please instead use...

*In spark-defaults.conf:*
spark.local.dir  /spark/spill
*Associated warning:*
14/08/14 10:09:04 WARN SparkConf: In Spark 1.0 and later spark.local.dir
will be overridden by the value set by the cluster manager (via
SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).

The following does not produce any warnings, but also produces no sign of
actually setting "spark.local.dir":

*In spark-env.sh:*
export SPARK_LOCAL_DIRS=/spark/spill

Does anybody know whether SPARK_LOCAL_DIRS actually works as advertised, or
if I am perhaps using it incorrectly?

best,
-Brad


Mlib model: viewing and saving

2014-08-14 Thread Sameer Tilak
I have an MLlib model:
val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth)

I see model has the following methods: algo, asInstanceOf, isInstanceOf,
predict, toString, topNode

model.topNode outputs:

org.apache.spark.mllib.tree.model.Node = id = 0, isLeaf = false, predict = 0.5,
split = Some(Feature = 87, threshold = 0.7931471805599453, featureType =
Continuous, categories = List()), stats = Some(gain = 0.89, impurity = 0.35,
left impurity = 0.12, right impurity = 0.00, predict = 0.50)
I was wondering what the best way is to look at the model. We want to see what
the decision tree looks like: which features are selected, the details of the
splits, what the depth is, etc. Is there an easy way to see that? I can traverse
it recursively using topNode.leftNode and topNode.rightNode. However, I was
wondering if there is any other way to look at the model and also to save it on
HDFS for later use.
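For reference, a rough sketch of the kind of traversal I have in mind (field names
assumed from the 1.0 Node/Split API), plus one possible way to keep the model
around, assuming the model object itself is serializable:

import org.apache.spark.mllib.tree.model.Node

def printTree(node: Node, indent: String = ""): Unit = {
  if (node.isLeaf) {
    println(indent + "predict = " + node.predict)
  } else {
    node.split.foreach(s => println(indent + "feature " + s.feature + " <= " + s.threshold))
    node.leftNode.foreach(printTree(_, indent + "  "))
    node.rightNode.foreach(printTree(_, indent + "  "))
  }
}

printTree(model.topNode)

// one possible way to persist it for later use (path is a placeholder)
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///models/decision-tree")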
  

Re: Support for ORC Table in Shark/Spark

2014-08-14 Thread Zhan Zhang
I tried with simple spark-hive select and insert, and it works. But to directly 
manipulate the ORCFile through RDD, spark has to be upgraded to support 
hive-0.13 first, because some ORC API is not exposed until Hive 0.13.

Thanks.

Zhan Zhang


On Aug 11, 2014, at 10:23 PM, vinay.kash...@socialinfra.net wrote:

> Hi all,
> 
> Is it possible to use table with ORC format in Shark version 0.9.1 with Spark 
> 0.9.2 and Hive version 0.12.0..??
> 
> I have tried creating the ORC table in Shark using the below query
> 
> create table orc_table (x int, y string) stored as orc
> 
> create table works, but when I try to insert values from a text table 
> containing 2 rows
> 
> insert into table orc_table select * from text_table;
> 
> I get the below exception
> 
> org.apache.spark.SparkException: Job aborted: Task 3.0:1 failed 4 times (most 
> recent failure: Exception failure: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
>  Failed to create file 
> [/tmp/hive-windfarm/hive_2014-08-08_10-11-21_691_1945292644101251597/_task_tmp.-ext-1/_tmp.01_0]
>  for [DFSClient_attempt_201408081011__m_01_0_-341065575_80] on client 
> [], because this file is already being created by 
> [DFSClient_attempt_201408081011__m_01_0_82854889_71] on 
> [192.168.22.40]
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2548)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2306)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2235)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2188)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:505)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:354)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
> )
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> FAILED: Execution Error, return code -101 from shark.execution.SparkTask
>  
> Any idea how to overcome this..??
>  
>  
>  
> Thanks and regards
> Vinay Kashyap



RE: java.lang.UnknownError: no bin was found for continuous variable.

2014-08-14 Thread Sameer Tilak

Hi Yanbo,

I think it was happening because some of the rows did not have all the
columns. We are cleaning up the data and will let you know once we confirm this.
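For reference, a rough sketch of the kind of cleanup we are doing (the delimiter,
column count and path below are placeholders, not our real values):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val expectedCols = 88
val parsedData = sc.textFile("hdfs:///data/train.tsv")
  .map(_.split('\t'))
  .filter(_.length == expectedCols)   // drop rows that are missing columns
  .map { cols =>
    LabeledPoint(cols(0).toDouble, Vectors.dense(cols.tail.map(_.toDouble)))
  }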
Date: Thu, 14 Aug 2014 22:50:58 +0800
Subject: Re: java.lang.UnknownError: no bin was found for continuous variable.
From: yanboha...@gmail.com
To: ssti...@live.com

Can you supply the detailed code and data you used? From the log, it looks like
it cannot find the bin for a specific feature. The bin for a continuous feature is
a unit that covers a specific range of that feature.


2014-08-14 7:43 GMT+08:00 Sameer Tilak :




Hi All,
I am using the decision tree algorithm and I get the following error. Any help 
would be great!

java.lang.UnknownError: no bin was found for continuous variable.
at 
org.apache.spark.mllib.tree.DecisionTree$.findBin$1(DecisionTree.scala:492)  at 
org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$findBinsForLevel$1(DecisionTree.scala:529)
at 
org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:653)
at 
org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:653)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)  at 
scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)   
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)  
at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) 
at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)at 
org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
at 
org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)14/08/13 16:36:06 ERROR 
ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor 
task launch worker-0,5,main]
java.lang.UnknownError: no bin was found for continuous variable.   at 
org.apache.spark.mllib.tree.DecisionTree$.findBin$1(DecisionTree.scala:492)  at 
org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$findBinsForLevel$1(DecisionTree.scala:529)
at 
org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:653)
at 
org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:653)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)  at 
scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)   
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)  
at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) 
at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)at 
org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
at 
org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
  

  

Re: Ways to partition the RDD

2014-08-14 Thread ssb61
You can try something like this,

val kvRdd = sc.textFile("rawdata/")
  .map { m =>
    val pfUser = m.split("\t", 2)
    pfUser(0) -> pfUser(1)
  }
  .partitionBy(new org.apache.spark.HashPartitioner(8))

You have a kvRdd with pageName as Key and UserID as Value.

Thanks





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Ways-to-partition-the-RDD-tp12083p12119.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SPARK_DRIVER_MEMORY

2014-08-14 Thread Brad Miller
Hi All,

I have a Spark job for which I need to increase the amount of memory
allocated to the driver to collect a large-ish (>200M) data structure.
Formerly, I accomplished this by setting SPARK_MEM before invoking my
job (which effectively set memory on the driver) and then setting
spark.executor.memory before creating my spark context.  This was a
bit awkward since it wasn't clear exactly what SPARK_MEM was meant to
do (although in practice it affected only the driver).

Since the release of 1.0.0, I've started receiving messages saying to
set spark.executor.memory or SPARK_DRIVER_MEMORY.  This definitely
helps clear things up, but still feels a bit awkward since it seems
that most configuration can now be done from within the program
(indeed there are very few environment variables now listed on the
Spark configuration page).  Furthermore, SPARK_DRIVER_MEMORY doesn't
seem to appear anywhere in the web documentation.

Is there a better way to set SPARK_DRIVER_MEMORY, or some
documentation that I'm missing?

Is there a guiding principle that would help in figuring out which
configuration parameters are set through environment variables and
which are set programmatically, or somewhere to look in the source for
an exhaustive list of environment variable configuration options?

best,
-Brad

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using Hadoop InputFormat in Python

2014-08-14 Thread Kan Zhang
Good timing! I encountered that same issue recently and to address it, I
changed the default Class.forName call to Utils.classForName. See my patch
at https://github.com/apache/spark/pull/1916. After that change, my
bin/pyspark --jars worked.
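For reference, roughly what that looks like from the shell once the change is in
(the jar name, InputFormat class and path below are placeholders, not from this
thread):

$ bin/pyspark --jars my-inputformat.jar
>>> rdd = sc.newAPIHadoopFile(
...     "hdfs:///data/input",
...     "com.example.MyInputFormat",
...     "org.apache.hadoop.io.Text",
...     "org.apache.hadoop.io.Text")
>>> rdd.take(1)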


On Wed, Aug 13, 2014 at 11:47 PM, Tassilo Klein  wrote:

> Thanks. This was already helping a bit. But the examples don't use custom
> InputFormats. Rather, org.apache fully qualified InputFormat. If I want to
> use my own custom InputFormat in form of .class (or jar) how can I use it?
> I
> tried providing it to pyspark with --jars 
>
> and then using sc.newAPIHadoopFile(path,
> , .)
>
> However, that didn't work as it couldn't find the class.
>
> Any other idea?
>
> Thanks so far,
>  -Tassilo
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-Hadoop-InputFormat-in-Python-tp12067p12092.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Shixiong Zhu
I think in the following case

class Foo { def foo() = Array(1.0) }
val t = new Foo
val m = t.foo
val r1 = sc.parallelize(List(1, 2, 3))
val r2 = r1.map(_ + m(0))
r2.toArray

Spark should not serialize "t", but it looks like it will.


Best Regards,
Shixiong Zhu


2014-08-14 23:22 GMT+08:00 lancezhange :

> Following codes  works, too
>
> class Foo1 extends Serializable { def foo() = Array(1.0) }
> val t1 = new Foo1
> val m1 = t1.foo
> val r11 = sc.parallelize(List(1, 2, 3))
> val r22 = r11.map(_ + m1(0))
> r22.toArray
>
>
>
>
>
> On Thu, Aug 14, 2014 at 10:55 PM, Shixiong Zhu [via Apache Spark User
> List] <[hidden email] 
> > wrote:
>
>> I think I can reproduce this error.
>>
>> The following code cannot work and report "Foo" cannot be
>> serialized. (log in gist
>> https://gist.github.com/zsxwing/4f9f17201d4378fe3e16):
>>
>> class Foo { def foo() = Array(1.0) }
>> val t = new Foo
>> val m = t.foo
>> val r1 = sc.parallelize(List(1, 2, 3))
>> val r2 = r1.map(_ + m(0))
>> r2.toArray
>>
>> But the following code can work (log in gist
>> https://gist.github.com/zsxwing/802cade0facb36a37656):
>>
>>  class Foo { def foo() = Array(1.0) }
>> var m: Array[Double] = null
>> {
>> val t = new Foo
>> m = t.foo
>> }
>> val r1 = sc.parallelize(List(1, 2, 3))
>> val r2 = r1.map(_ + m(0))
>> r2.toArray
>>
>>
>> Best Regards,
>> Shixiong Zhu
>>
>>
>> 2014-08-14 22:11 GMT+08:00 Christopher Nguyen <[hidden email]
>> >:
>>
>>> Hi Hoai-Thu, the issue of private default constructor is unlikely the
>>> cause here, since Lance was already able to load/deserialize the model
>>> object.
>>>
>>> And on that side topic, I wish all serdes libraries would just use
>>> constructor.setAccessible(true) by default :-) Most of the time that
>>> privacy is not about serdes reflection restrictions.
>>>
>>> Sent while mobile. Pls excuse typos etc.
>>> On Aug 14, 2014 1:58 AM, "Hoai-Thu Vuong" <[hidden email]
>>> > wrote:
>>>
 A man in this community give me a video:
 https://www.youtube.com/watch?v=sPhyePwo7FA. I've got a same question
 in this community and other guys helped me to solve this problem. I'm
 trying to load MatrixFactorizationModel from object file, but compiler said
 that, I can not create object because the constructor is private. To solve
 this, I put my new object to same package as MatrixFactorizationModel.
 Luckly it works.


 On Wed, Aug 13, 2014 at 9:20 PM, Christopher Nguyen <[hidden email]
 > wrote:

> Lance, some debugging ideas: you might try model.predict(RDD[Vector])
> to isolate the cause to serialization of the loaded model. And also try to
> serialize the deserialized (loaded) model "manually" to see if that throws
> any visible exceptions.
>
> Sent while mobile. Pls excuse typos etc.
> On Aug 13, 2014 7:03 AM, "lancezhange" <[hidden email]
> > wrote:
>
>> my prediction codes are simple enough as follows:
>>
>>   *val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
>>   val prediction = model.predict(point.features)
>>   (point.label, prediction)
>>   }*
>>
>> when model is the loaded one, above code just can't work. Can you
>> catch the
>> error?
>> Thanks.
>>
>> PS. i use spark-shell under standalone mode, version 1.0.0
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12035.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: [hidden email]
>> 
>> For additional commands, e-mail: [hidden email]
>> 
>>
>>


 --
 Thu.

>>>
>>
>>
>> --
>>  If you reply to this email, your message will be added to the
>> discussion below:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12112.html
>>  To unsubscribe from How to save mllib model to hdfs and reload it, click
>> here.
>> NAML
>> 

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread lancezhange
Following codes  works, too

class Foo1 extends Serializable { def foo() = Array(1.0) }
val t1 = new Foo1
val m1 = t1.foo
val r11 = sc.parallelize(List(1, 2, 3))
val r22 = r11.map(_ + m1(0))
r22.toArray





On Thu, Aug 14, 2014 at 10:55 PM, Shixiong Zhu [via Apache Spark User List]
 wrote:

> I think I can reproduce this error.
>
> The following code cannot work and report "Foo" cannot be serialized. (log
> in gist https://gist.github.com/zsxwing/4f9f17201d4378fe3e16):
>
> class Foo { def foo() = Array(1.0) }
> val t = new Foo
> val m = t.foo
> val r1 = sc.parallelize(List(1, 2, 3))
> val r2 = r1.map(_ + m(0))
> r2.toArray
>
> But the following code can work (log in gist
> https://gist.github.com/zsxwing/802cade0facb36a37656):
>
>  class Foo { def foo() = Array(1.0) }
> var m: Array[Double] = null
> {
> val t = new Foo
> m = t.foo
> }
> val r1 = sc.parallelize(List(1, 2, 3))
> val r2 = r1.map(_ + m(0))
> r2.toArray
>
>
> Best Regards,
> Shixiong Zhu
>
>
> 2014-08-14 22:11 GMT+08:00 Christopher Nguyen <[hidden email]
> >:
>
>> Hi Hoai-Thu, the issue of private default constructor is unlikely the
>> cause here, since Lance was already able to load/deserialize the model
>> object.
>>
>> And on that side topic, I wish all serdes libraries would just use
>> constructor.setAccessible(true) by default :-) Most of the time that
>> privacy is not about serdes reflection restrictions.
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Aug 14, 2014 1:58 AM, "Hoai-Thu Vuong" <[hidden email]
>> > wrote:
>>
>>> A man in this community give me a video:
>>> https://www.youtube.com/watch?v=sPhyePwo7FA. I've got a same question
>>> in this community and other guys helped me to solve this problem. I'm
>>> trying to load MatrixFactorizationModel from object file, but compiler said
>>> that, I can not create object because the constructor is private. To solve
>>> this, I put my new object to same package as MatrixFactorizationModel.
>>> Luckly it works.
>>>
>>>
>>> On Wed, Aug 13, 2014 at 9:20 PM, Christopher Nguyen <[hidden email]
>>> > wrote:
>>>
 Lance, some debugging ideas: you might try model.predict(RDD[Vector])
 to isolate the cause to serialization of the loaded model. And also try to
 serialize the deserialized (loaded) model "manually" to see if that throws
 any visible exceptions.

 Sent while mobile. Pls excuse typos etc.
 On Aug 13, 2014 7:03 AM, "lancezhange" <[hidden email]
 > wrote:

> my prediction codes are simple enough as follows:
>
>   *val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
>   val prediction = model.predict(point.features)
>   (point.label, prediction)
>   }*
>
> when model is the loaded one, above code just can't work. Can you
> catch the
> error?
> Thanks.
>
> PS. i use spark-shell under standalone mode, version 1.0.0
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12035.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: [hidden email]
> 
> For additional commands, e-mail: [hidden email]
> 
>
>
>>>
>>>
>>> --
>>> Thu.
>>>
>>
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12112.html
>  To unsubscribe from How to save mllib model to hdfs and reload it, click
> here
> 
> .
> NAML
> 
>



-- 
 -- 张喜升




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12114.html
Sent from the Apache Spark User List mailing list archive at Nabble

Re: spark streaming : what is the best way to make a driver highly available

2014-08-14 Thread Silvio Fiorito
You also need to ensure you're using checkpointing and support recreating the 
context on driver failure as described in the docs here: 
http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node
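A rough sketch of the pattern from those docs (the checkpoint directory and batch
interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // build the DStream graph here ...
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()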

From: Matt Narrell <matt.narr...@gmail.com>
Date: Thursday, August 14, 2014 at 10:34 AM
To: Tobias Pfeiffer <t...@preferred.jp>
Cc: salemi <alireza.sal...@udo.edu>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: spark streaming : what is the best way to make a driver highly available

I'd suggest something like Apache YARN, or Apache Mesos with Marathon or 
something similar to allow for management, in particular restart on failure.

mn

On Aug 13, 2014, at 7:15 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:

Hi,

On Thu, Aug 14, 2014 at 5:49 AM, salemi <alireza.sal...@udo.edu> wrote:
what is the best way to make a spark streaming driver highly available.

I would also be interested in that. In particular for Streaming applications 
where the Spark driver is running for a long time, this might be important, I 
think.

Thanks
Tobias




Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Shixiong Zhu
I think I can reproduce this error.

The following code cannot work and report "Foo" cannot be serialized. (log
in gist https://gist.github.com/zsxwing/4f9f17201d4378fe3e16):

class Foo { def foo() = Array(1.0) }
val t = new Foo
val m = t.foo
val r1 = sc.parallelize(List(1, 2, 3))
val r2 = r1.map(_ + m(0))
r2.toArray

But the following code can work (log in gist
https://gist.github.com/zsxwing/802cade0facb36a37656):

class Foo { def foo() = Array(1.0) }
var m: Array[Double] = null
{
val t = new Foo
m = t.foo
}
val r1 = sc.parallelize(List(1, 2, 3))
val r2 = r1.map(_ + m(0))
r2.toArray


Best Regards,
Shixiong Zhu


2014-08-14 22:11 GMT+08:00 Christopher Nguyen :

> Hi Hoai-Thu, the issue of private default constructor is unlikely the
> cause here, since Lance was already able to load/deserialize the model
> object.
>
> And on that side topic, I wish all serdes libraries would just use
> constructor.setAccessible(true) by default :-) Most of the time that
> privacy is not about serdes reflection restrictions.
>
> Sent while mobile. Pls excuse typos etc.
> On Aug 14, 2014 1:58 AM, "Hoai-Thu Vuong"  wrote:
>
>> A man in this community give me a video:
>> https://www.youtube.com/watch?v=sPhyePwo7FA. I've got a same question in
>> this community and other guys helped me to solve this problem. I'm trying
>> to load MatrixFactorizationModel from object file, but compiler said that,
>> I can not create object because the constructor is private. To solve this,
>> I put my new object to same package as MatrixFactorizationModel. Luckly it
>> works.
>>
>>
>> On Wed, Aug 13, 2014 at 9:20 PM, Christopher Nguyen 
>> wrote:
>>
>>> Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to
>>> isolate the cause to serialization of the loaded model. And also try to
>>> serialize the deserialized (loaded) model "manually" to see if that throws
>>> any visible exceptions.
>>>
>>> Sent while mobile. Pls excuse typos etc.
>>> On Aug 13, 2014 7:03 AM, "lancezhange"  wrote:
>>>
 my prediction codes are simple enough as follows:

   *val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
   val prediction = model.predict(point.features)
   (point.label, prediction)
   }*

 when model is the loaded one, above code just can't work. Can you catch
 the
 error?
 Thanks.

 PS. i use spark-shell under standalone mode, version 1.0.0




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12035.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>
>>
>> --
>> Thu.
>>
>


Using Spark Streaming to listen to HDFS directory and handle different files by file name

2014-08-14 Thread ZhangYi
As we know, in Spark, SparkContext provides the wholeTextFiles() method to read
all files in a given directory and generate an RDD of (fileName, content) pairs:
scala> val lines = sc.wholeTextFiles("/Users/workspace/scala101/data")
14/08/14 22:43:02 INFO MemoryStore: ensureFreeSpace(35896) called with 
curMem=0, maxMem=318111744
14/08/14 22:43:02 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 35.1 KB, free 303.3 MB)
lines: org.apache.spark.rdd.RDD[(String, String)] = 
/Users/workspace/scala101/data WholeTextFileRDD[0] at wholeTextFiles at 
<console>:12


Does StreamingContext provide a similar function to listen for incoming files
on HDFS, so that I can handle different files by file name in Spark Streaming?
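For reference, the closest built-in facility I am aware of is textFileStream,
sketched below (interval and path are placeholders); it hands back only the
contents of newly arrived files, not their names, so per-file-name handling would
presumably need a custom input format:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(30))
val newLines = ssc.textFileStream("hdfs:///user/workspace/incoming")
newLines.foreachRDD(rdd => println("new lines in this batch: " + rdd.count()))
ssc.start()
ssc.awaitTermination()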
 

--  
ZhangYi (张逸)
Developer
tel: 15023157626
blog: agiledon.github.com
weibo: tw张逸
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)



Re: Down-scaling Spark on EC2 cluster

2014-08-14 Thread Shubhabrata
What about down-scaling when I use Mesos? Does that really degrade
performance? Otherwise we would probably go for Spark on Mesos on EC2 :)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Down-scaling-Spark-on-EC2-cluster-tp10494p12109.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark streaming : what is the best way to make a driver highly available

2014-08-14 Thread Matt Narrell
I’d suggest something like Apache YARN, or Apache Mesos with Marathon or 
something similar to allow for management, in particular restart on failure.

mn

On Aug 13, 2014, at 7:15 PM, Tobias Pfeiffer  wrote:

> Hi,
> 
> On Thu, Aug 14, 2014 at 5:49 AM, salemi  wrote:
> what is the best way to make a spark streaming driver highly available.
> 
> I would also be interested in that. In particular for Streaming applications 
> where the Spark driver is running for a long time, this might be important, I 
> think.
> 
> Thanks
> Tobias
> 



Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Christopher Nguyen
Hi Hoai-Thu, the issue of private default constructor is unlikely the cause
here, since Lance was already able to load/deserialize the model object.

And on that side topic, I wish all serdes libraries would just use
constructor.setAccessible(true) by default :-) Most of the time that
privacy is not about serdes reflection restrictions.

Sent while mobile. Pls excuse typos etc.
On Aug 14, 2014 1:58 AM, "Hoai-Thu Vuong"  wrote:

> A man in this community give me a video:
> https://www.youtube.com/watch?v=sPhyePwo7FA. I've got a same question in
> this community and other guys helped me to solve this problem. I'm trying
> to load MatrixFactorizationModel from object file, but compiler said that,
> I can not create object because the constructor is private. To solve this,
> I put my new object to same package as MatrixFactorizationModel. Luckly it
> works.
>
>
> On Wed, Aug 13, 2014 at 9:20 PM, Christopher Nguyen 
> wrote:
>
>> Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to
>> isolate the cause to serialization of the loaded model. And also try to
>> serialize the deserialized (loaded) model "manually" to see if that throws
>> any visible exceptions.
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Aug 13, 2014 7:03 AM, "lancezhange"  wrote:
>>
>>> my prediction codes are simple enough as follows:
>>>
>>>   *val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
>>>   val prediction = model.predict(point.features)
>>>   (point.label, prediction)
>>>   }*
>>>
>>> when model is the loaded one, above code just can't work. Can you catch
>>> the
>>> error?
>>> Thanks.
>>>
>>> PS. i use spark-shell under standalone mode, version 1.0.0
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12035.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>
>
> --
> Thu.
>


Re: Python + Spark unable to connect to S3 bucket .... "Invalid hostname in URI"

2014-08-14 Thread Miroslaw
I have tried that already but still get the same error.

To be honest, I feel as though I am missing something obvious in my
configuration; I just can't find what it may be.

Miroslaw Horbal


On Wed, Aug 13, 2014 at 10:38 PM, jerryye [via Apache Spark User List] <
ml-node+s1001560n12082...@n3.nabble.com> wrote:

> Using s3n:// worked for me.
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-unable-to-connect-to-S3-bucket-Invalid-hostname-in-URI-tp12076p12082.html
>  To unsubscribe from Python + Spark unable to connect to S3 bucket 
> "Invalid hostname in URI", click here
> 
> .
> NAML
> 
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-unable-to-connect-to-S3-bucket-Invalid-hostname-in-URI-tp12076p12107.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Should the memory of worker nodes be constrained to the size of the master node?

2014-08-14 Thread Akhil Das
Hi Darin,

This is the piece of code doing the actual work (setting the memory). As you can
see, it leaves 15GB of RAM for the OS on a >100GB machine, 2GB of RAM on a
10-20GB machine, and so on. You can always set
SPARK_WORKER_MEMORY/SPARK_EXECUTOR_MEMORY to change these values.
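For example, something like this in conf/spark-env.sh on each worker before
restarting it (the values below are just placeholders):

export SPARK_WORKER_MEMORY=26g
export SPARK_EXECUTOR_MEMORY=24g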

Thanks
Best Regards


On Thu, Aug 14, 2014 at 6:02 PM, Darin McBeath 
wrote:

> I started up a cluster on EC2 (using the provided scripts) and specified a
> different instance type for the master and the the worker nodes.  The
> cluster started fine, but when I looked at the cluster (via port 8080), it
> showed that the amount of memory available to the worker nodes did not
> match the instance type I had specified.  Instead, the amount of memory for
> the worker nodes matched the master node.  I did verify that the correct
> instance types had been started for the master and worker nodes.
>
> Curious as to whether this is expected behavior or if this might be a bug?
>
> Thanks.
>
> Darin.
>


Should the memory of worker nodes be constrained to the size of the master node?

2014-08-14 Thread Darin McBeath
I started up a cluster on EC2 (using the provided scripts) and specified a 
different instance type for the master and the worker nodes.  The cluster
started fine, but when I looked at the cluster (via port 8080), it showed that 
the amount of memory available to the worker nodes did not match the instance 
type I had specified.  Instead, the amount of memory for the worker nodes 
matched the master node.  I did verify that the correct instance types had been 
started for the master and worker nodes.

Curious as to whether this is expected behavior or if this might be a bug?

Thanks.

Darin.

Re: Script to deploy spark to Google compute engine

2014-08-14 Thread Michael Hausenblas

Did you check out http://www.spark-stack.org/spark-cluster-on-google-compute/ 
already?

Cheers,
Michael

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

On 14 Aug 2014, at 05:17, Soumya Simanta  wrote:

> 
> Before I start doing something on my own I wanted to check if someone has 
> created a script to deploy the latest version of Spark to Google Compute 
> Engine. 
> 
> Thanks
> -Soumya
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Script to deploy spark to Google compute engine

2014-08-14 Thread Mayur Rustagi
We have a version that is submitted for PR
https://github.com/sigmoidanalytics/spark_gce/tree/for_spark
We are working on a more generic implementation based on lib_cloud... would
love to collaborate if you are interested.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Thu, Aug 14, 2014 at 8:47 AM, Soumya Simanta 
wrote:

>
> Before I start doing something on my own I wanted to check if someone has
> created a script to deploy the latest version of Spark to Google Compute
> Engine.
>
> Thanks
> -Soumya
>
>


Re: how to use the method saveAsTextFile of a RDD like javaRDD

2014-08-14 Thread Gefei Li
It is interesting to save an RDD to disk or HDFS or somewhere else as a set of
objects, but I think it's more useful to save it as a text file for debugging or
just as an output file. If we want to reuse an RDD, a text file also works, but
perhaps a set of object files will reduce execution time. I wonder how much time
it can save compared with transforming a text file into the RDD you want; need
time to try :)


Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Hoai-Thu Vuong
Someone in this community gave me a video:
https://www.youtube.com/watch?v=sPhyePwo7FA. I had the same question in this
community and other people helped me solve it. I was trying to load a
MatrixFactorizationModel from an object file, but the compiler said I could not
create the object because the constructor is private. To solve this, I put my new
object in the same package as MatrixFactorizationModel. Luckily it works.
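Roughly what that workaround looks like (package placement, field types and paths
are assumptions based on the 1.0-era API, not exact code from my project):

package org.apache.spark.mllib.recommendation

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object ModelStore {
  def save(model: MatrixFactorizationModel, path: String): Unit = {
    model.userFeatures.saveAsObjectFile(path + "/userFeatures")
    model.productFeatures.saveAsObjectFile(path + "/productFeatures")
  }

  def load(sc: SparkContext, rank: Int, path: String): MatrixFactorizationModel = {
    val userFeatures: RDD[(Int, Array[Double])] = sc.objectFile(path + "/userFeatures")
    val productFeatures: RDD[(Int, Array[Double])] = sc.objectFile(path + "/productFeatures")
    new MatrixFactorizationModel(rank, userFeatures, productFeatures)
  }
}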


On Wed, Aug 13, 2014 at 9:20 PM, Christopher Nguyen  wrote:

> Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to
> isolate the cause to serialization of the loaded model. And also try to
> serialize the deserialized (loaded) model "manually" to see if that throws
> any visible exceptions.
>
> Sent while mobile. Pls excuse typos etc.
> On Aug 13, 2014 7:03 AM, "lancezhange"  wrote:
>
>> my prediction codes are simple enough as follows:
>>
>>   *val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
>>   val prediction = model.predict(point.features)
>>   (point.label, prediction)
>>   }*
>>
>> when model is the loaded one, above code just can't work. Can you catch
>> the
>> error?
>> Thanks.
>>
>> PS. i use spark-shell under standalone mode, version 1.0.0
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12035.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


-- 
Thu.


Re: How to direct insert vaules into SparkSQL tables?

2014-08-14 Thread chutium
oh, right, i meant within SqlContext alone, schemaRDD from text file with a
case class



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-direct-insert-vaules-into-SparkSQL-tables-tp11851p12100.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Viewing web UI after fact

2014-08-14 Thread Grzegorz Białek
Hi,

Thank you both for your answers. Browsing using Master UI works fine.
Unfortunately the History Server shows "No Completed Applications Found" even
though logs exist under the given directory, but using the Master UI is enough for me.
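For anyone else who hits this, a sketch of the setup that usually makes the
history server find the runs (the directory must match spark.eventLog.dir and be
readable by the history server; paths below are assumptions):

# conf/spark-defaults.conf
spark.eventLog.enabled   true
spark.eventLog.dir       file:///tmp/spark-events

# then serve those logs
./sbin/start-history-server.sh file:///tmp/spark-events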

Best regards,
Grzegorz


On Wed, Aug 13, 2014 at 8:09 PM, Andrew Or  wrote:

> The Spark UI isn't available through the same address; otherwise new
> applications won't be able to bind to it. Once the old application
> finishes, the standalone Master renders the after-the-fact application UI
> and exposes it under a different URL. To see this, go to the Master UI
> (:8080) and click on your application in the "Completed
> Applications" table.
>
>
> 2014-08-13 10:56 GMT-07:00 Matei Zaharia :
>
> Take a look at http://spark.apache.org/docs/latest/monitoring.html -- you
>> need to launch a history server to serve the logs.
>>
>> Matei
>>
>> On August 13, 2014 at 2:03:08 AM, grzegorz-bialek (
>> grzegorz.bia...@codilime.com) wrote:
>>
>> Hi,
>> I wanted to access Spark web UI after application stops. I set
>> spark.eventLog.enabled to true and logs are availaible
>> in JSON format in /tmp/spark-event but web UI isn't available under
>> address
>> http://:4040
>> I'm running Spark in standalone mode.
>>
>> What should I do to access web UI after application ends?
>>
>> Thanks,
>> Grzegorz
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Viewing-web-UI-after-fact-tp12023.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: how to use the method saveAsTextFile of a RDD like javaRDD

2014-08-14 Thread Hoai-Thu Vuong
I've found a method saveAsObjectFile in RDD (or JavaRDD). I think we can save
this array to a file and load it back as objects when reading the file. However,
I still need to work out how to load it back and cast the RDD to the specific
object type; need time to try.


On Thu, Aug 14, 2014 at 3:48 PM, Gefei Li  wrote:

> Thank you! It works so well for me!
>
> Regards,
> Gefei
>
>
> On Thu, Aug 14, 2014 at 4:25 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> FlatMap the JavaRDD to JavaRDD. Then it
>> should work.
>>
>> TD
>>
>>
>> On Thu, Aug 14, 2014 at 1:23 AM, Gefei Li  wrote:
>>
>>> Hello,
>>> I wrote a class named BooleanPair:
>>>
>>> public static class BooleanPairet implements Serializable{
>>> public Boolean elementBool1;
>>> public Boolean elementBool2;
>>> BooleanPair(Boolean bool1, Boolean bool2){elementBool1 = bool1;
>>> elementBool2 = bool2;}
>>> public String toString(){return new String("(" + elementBool1 +
>>> "," + elementBool2 + ")");}
>>> }
>>>
>>> And I create a RDD like this: javaRDD, a RDD of a
>>> BooleanPair array, when I use the method saveAsTextFile, and it gives me
>>> something like this"[Lmy.package.name.MyClassName$BooleanPair;@5b777a88",
>>> while it goes well with a javaRDD, just a value instead of an
>>> array.
>>>
>>> What should I do to save it in a format like "(true, false), (false,
>>> false)..."?
>>>
>>> My Spark version is 1.0.0, and I use the CDH-5.1.0 distribution.
>>>
>>> Thanks & Best Regards,
>>> Gefei
>>>
>>
>>
>


-- 
Thu.


Re: how to use the method saveAsTextFile of a RDD like javaRDD

2014-08-14 Thread Gefei Li
Thank you! It works so well for me!

Regards,
Gefei


On Thu, Aug 14, 2014 at 4:25 PM, Tathagata Das 
wrote:

> FlatMap the JavaRDD to JavaRDD. Then it should
> work.
>
> TD
>
>
> On Thu, Aug 14, 2014 at 1:23 AM, Gefei Li  wrote:
>
>> Hello,
>> I wrote a class named BooleanPair:
>>
>> public static class BooleanPairet implements Serializable{
>> public Boolean elementBool1;
>> public Boolean elementBool2;
>> BooleanPair(Boolean bool1, Boolean bool2){elementBool1 = bool1;
>> elementBool2 = bool2;}
>> public String toString(){return new String("(" + elementBool1 +
>> "," + elementBool2 + ")");}
>> }
>>
>> And I create an RDD like this: javaRDD<BooleanPair[]>, an RDD of
>> BooleanPair arrays. When I use the method saveAsTextFile, it gives me
>> something like "[Lmy.package.name.MyClassName$BooleanPair;@5b777a88",
>> while it goes well with a javaRDD<BooleanPair>, just a value instead of an
>> array.
>>
>> What should I do to save it in a format like "(true, false), (false,
>> false)..."?
>>
>> My Spark version is 1.0.0, and I use the CDH-5.1.0 distribution.
>>
>> Thanks & Best Regards,
>> Gefei
>>
>
>


read performance issue

2014-08-14 Thread Gurvinder Singh
Hi,

I am running Spark built directly from git. I recently compiled the newer
version (Aug 13) and it shows a 2-3x drop in HDFS read performance compared
with the Aug 1 git version. So I am wondering which commit
could have caused such a regression in read performance. The performance is
almost the same once the data is cached in memory, but reading from HDFS is
much slower than with the Aug 1 version.

- Gurvinder

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to use the method saveAsTextFile of a RDD like javaRDD

2014-08-14 Thread Tathagata Das
FlatMap the JavaRDD<BooleanPair[]> to JavaRDD<BooleanPair>. Then it should
work.

TD
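A minimal self-contained Scala sketch of that suggestion (a hypothetical case class stands in for the Java BooleanPair; in the Java API the same step is JavaRDD.flatMap with a FlatMapFunction<BooleanPair[], BooleanPair>):

~~~
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for the BooleanPair class from the question below.
case class BooleanPair(elementBool1: Boolean, elementBool2: Boolean) {
  override def toString: String = "(" + elementBool1 + "," + elementBool2 + ")"
}

object FlatMapSaveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FlatMapSaveExample").setMaster("local[2]"))
    // An RDD of BooleanPair arrays, like the javaRDD<BooleanPair[]> in the question.
    val arrays = sc.parallelize(Seq(
      Array(BooleanPair(true, false), BooleanPair(false, false)),
      Array(BooleanPair(true, true))))
    // Flatten each array into its elements; saveAsTextFile then writes one
    // toString() per element, e.g. "(true,false)", instead of the array's hash.
    arrays.flatMap(_.toSeq).saveAsTextFile("/tmp/boolean-pairs")  // hypothetical output path
    sc.stop()
  }
}
~~~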


On Thu, Aug 14, 2014 at 1:23 AM, Gefei Li  wrote:

> Hello,
> I wrote a class named BooleanPair:
>
> public static class BooleanPair implements Serializable{
> public Boolean elementBool1;
> public Boolean elementBool2;
> BooleanPair(Boolean bool1, Boolean bool2){elementBool1 = bool1;
> elementBool2 = bool2;}
> public String toString(){return new String("(" + elementBool1 +
> "," + elementBool2 + ")");}
> }
>
> And I create an RDD like this: javaRDD<BooleanPair[]>, an RDD of
> BooleanPair arrays. When I use the method saveAsTextFile, it gives me
> something like "[Lmy.package.name.MyClassName$BooleanPair;@5b777a88",
> while it goes well with a javaRDD<BooleanPair>, just a value instead of an
> array.
>
> What should I do to save it in a format like "(true, false), (false,
> false)..."?
>
> My Spark version is 1.0.0, and I use the CDH-5.1.0 distribution.
>
> Thanks & Best Regards,
> Gefei
>


how to use the method saveAsTextFile of a RDD like javaRDD

2014-08-14 Thread Gefei Li
Hello,
I wrote a class named BooleanPair:

public static class BooleanPair implements Serializable {
    public Boolean elementBool1;
    public Boolean elementBool2;
    BooleanPair(Boolean bool1, Boolean bool2) { elementBool1 = bool1; elementBool2 = bool2; }
    public String toString() { return "(" + elementBool1 + "," + elementBool2 + ")"; }
}

And I create an RDD like this: javaRDD<BooleanPair[]>, an RDD of
BooleanPair arrays. When I use the method saveAsTextFile, it gives me
something like "[Lmy.package.name.MyClassName$BooleanPair;@5b777a88",
while it goes well with a javaRDD<BooleanPair>, just a value instead of an
array.

What should I do to save it in a format like "(true, false), (false,
false)..."?

My Spark version is 1.0.0, and I use the CDH-5.1.0 distribution.

Thanks & Best Regards,
Gefei


Re: Job aborted due to stage failure: TID x failed for unknown reasons

2014-08-14 Thread jerryye
bump. same problem here.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Job-aborted-due-to-stage-failure-TID-x-failed-for-unknown-reasons-tp10187p12095.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Akka/actor failures.

2014-08-14 Thread Xiangrui Meng
Could you try to map it to row-major form first? Your approach may
generate multiple copies of the data. The code should look like this:

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = rdd.flatMap { case (j, values) =>
  values.view.zipWithIndex.map { case (v, i) =>
    (i, (j, v))
  }
}.groupByKey().map { case (i, entries) =>
  Vectors.dense(entries.toSeq.sortBy(_._1).map(_._2.toDouble).toArray)
}

val mat = new RowMatrix(rows)
val cov = mat.computeCovariance()
~~~
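One thing to keep in mind with this approach: computeCovariance() returns a local numCols x numCols matrix on the driver, so 20,000 columns means a dense 20,000 x 20,000 array of doubles (roughly 3.2 GB), and the 70,000-column target would be roughly 39 GB. A tiny self-contained sketch of the shape check, with toy data:

~~~
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object CovarianceShape {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CovarianceShape").setMaster("local[2]"))
    // Tiny row-major matrix: 4 rows x 3 columns.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(2.0, 1.0, 0.0),
      Vectors.dense(0.0, 3.0, 1.0),
      Vectors.dense(4.0, 2.0, 2.0)))
    val cov = new RowMatrix(rows).computeCovariance()
    // The result is a *local* matrix of size numCols x numCols, kept on the driver.
    println(s"covariance is ${cov.numRows} x ${cov.numCols}")
    sc.stop()
  }
}
~~~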

On Wed, Aug 13, 2014 at 3:56 PM, ldmtwo  wrote:
> Need help getting around these errors.
>
> I have this program that runs fine on smaller input sizes. As it gets
> larger, Spark has increasing difficulty staying efficient and running
> without errors. We have about 46GB free on each node. The workers and
> executors are configured to use this up (the only way not to have Heap Space
> or GC overhead errors). On the driver, the data only uses 1.2GB RAM and is
> in the form of matrix: RDD[(Integer, Array[Float])]. It's a matrix that is
> column major with dimensions of 15k x 20k (columns). Each column takes about
> 4*15k = 60KB. 60KB*20k = 1.2GB. The data is not even that large. Eventually,
> I want to test 60k x 70k.
>
> The covariance matrix algorithm we are using is basically O(N^3). At minimum,
> the outer loop needs to be parallelized.
>   for each column i in matrix
>     for each column j in matrix
>       get the covariance between columns i and j
>
> Covariance is practically this (no need to parallelize it, since we already
> have enough work to do and this part is small):
>   for the two columns, get the sum of squares. O(N)
>
>
> Since I can't figure out any other way to do a permutation or a nested for
> loop over an RDD, I had to call matrix.cartesian(matrix).map{ pair => ... }. I
> could do 5k x 5k (1/4 of the work) using a HashMap instead of an RDD and finish
> in 10 sec. If I partition with 3k, it takes 18 hours; 300 takes 12 hours;
> 200 fails (error #1). 16 would be ideal (error #2). Note that I set the Akka
> frame size (in spark-defaults.conf) to 15 to address some of the other Akka
> errors.
>
>
>
>
>
> This is error #1
>
>
>
> This is error 2
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Akka-actor-failures-tp12071.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: training recsys model

2014-08-14 Thread Xiangrui Meng
Try many combinations of parameters on a small dataset, find the best,
and then try to map them to a big dataset. You can also reduce the
search region iteratively based on the best combination in the current
iteration. -Xiangrui
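A rough, self-contained sketch of that kind of grid search for ALS (toy ratings and hypothetical parameter grids; in practice use a real held-out validation split and far more data):

~~~
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

object ALSGridSearch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ALSGridSearch").setMaster("local[2]"))

    // Toy data; every user and product in the validation set also appears in training.
    val training = sc.parallelize(Seq(
      Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0),
      Rating(2, 3, 2.0), Rating(3, 2, 3.0)))
    val validation = sc.parallelize(Seq(Rating(3, 3, 5.0)))

    // Evaluation metric: root-mean-square error on held-out ratings.
    def rmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
      val predictions = model.predict(data.map(r => (r.user, r.product)))
        .map(p => ((p.user, p.product), p.rating))
      val actual = data.map(r => ((r.user, r.product), r.rating))
      math.sqrt(actual.join(predictions).values
        .map { case (a, p) => (a - p) * (a - p) }.mean())
    }

    // Hypothetical grids; shrink or refine them around the best point found so far.
    val grid = for (rank <- Seq(4, 8); lambda <- Seq(0.01, 0.1); iters <- Seq(10, 20))
      yield (rank, lambda, iters)
    val scored = grid.map { case (rank, lambda, iters) =>
      ((rank, lambda, iters), rmse(ALS.train(training, rank, iters, lambda), validation))
    }
    val best = scored.minBy(_._2)
    println(s"best (rank, lambda, iterations) = ${best._1}, validation RMSE = ${best._2}")
    sc.stop()
  }
}
~~~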

On Wed, Aug 13, 2014 at 1:13 AM, Hoai-Thu Vuong  wrote:
> Thank you very much. I've read this tutorial and understand what they've
> done. However, the set of ranks and the number of iterations to try are
> human-defined; we cannot be sure the optimal values are in those sets. In any
> case, I may be expecting or doing something wrong; I still want to find the best model.
>
>
> On Wed, Aug 13, 2014 at 1:26 PM, Xiangrui Meng  wrote:
>>
>> You can define an evaluation metric first and then use a grid search
>> to find the best set of training parameters. Ampcamp has a tutorial
>> showing how to do this for ALS:
>>
>> http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
>> -Xiangrui
>>
>> On Tue, Aug 12, 2014 at 8:01 PM, Hoai-Thu Vuong  wrote:
>> > In MLlib, I found the method to train a matrix factorization model to
>> > predict a user's taste. This function takes some parameters, such as
>> > lambda and rank, and I cannot find the best values to set for these
>> > parameters or how to optimize them. Could you please give me some
>> > recommendations?
>> >
>> > --
>> > Thu.
>
>
>
>
> --
> Thu.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org