Re: get host from rdd map

2015-10-25 Thread Deenar Toraskar
   You can call any API that returns the hostname in your map function. Here's
   a simplified example. You would generally use mapPartitions, as it saves
   the overhead of retrieving the hostname multiple times:

   import scala.sys.process._
   val distinctHosts = sc.parallelize(0 to 100).map { _ =>
     val hostname = ("hostname".!!).trim
     // your code
     (hostname)
   }.collect.distinct
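
   For reference, a minimal sketch of the mapPartitions variant mentioned
   above (the input range and partition count are just placeholders), so the
   external hostname call runs once per partition rather than once per record:

   import scala.sys.process._

   val distinctHosts = sc.parallelize(0 to 100, 8).mapPartitions { iter =>
     // Resolve the hostname once for the whole partition, then tag each record.
     val hostname = ("hostname".!!).trim
     iter.map(_ => hostname)
   }.collect.distinct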


On 24 October 2015 at 01:41, weoccc  wrote:

> yea,
>
> my use cases is that i want to have some external communications where rdd
> is being run in map. The external communication might be handled separately
> transparent to spark.  What will be the hacky way and nonhacky way to do
> that ? :)
>
> Weide
>
>
>
> On Fri, Oct 23, 2015 at 5:32 PM, Ted Yu  wrote:
>
>> Can you outline your use case a bit more ?
>>
>> Do you want to know all the hosts which would run the map ?
>>
>> Cheers
>>
>> On Fri, Oct 23, 2015 at 5:16 PM, weoccc  wrote:
>>
>>> in rdd map function, is there a way i can know the list of host names
>>> where the map runs ? any code sample would be appreciated ?
>>>
>>> thx,
>>>
>>> Weide
>>>
>>>
>>>
>>
>


Re: Spark scala REPL - Unable to create sqlContext

2015-10-25 Thread Deenar Toraskar
Embedded Derby, which Hive/Spark SQL uses as the default metastore, only
supports a single user at a time. Until this issue is fixed, you could use
another metastore that supports multiple concurrent users (e.g. networked
Derby or MySQL) to work around it.
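
For example, a minimal hive-site.xml pointing the metastore at MySQL might
look like the sketch below (host, database, user, and password are
placeholders, and the MySQL JDBC driver must be on the classpath). Dropping
it into Spark's conf/ directory makes Spark SQL use that metastore instead of
the embedded Derby one:

<configuration>
  <!-- Placeholder JDBC URL: point this at your own MySQL metastore database -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- Placeholder credentials -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>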

On 25 October 2015 at 16:15, Ge, Yao (Y.)  wrote:

> Thanks. I wonder why this is not widely reported in the user forum. The
> REPL shell is basically broken in 1.5.0 and 1.5.1.
>
> -Yao
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* Sunday, October 25, 2015 12:01 PM
> *To:* Ge, Yao (Y.)
> *Cc:* user
> *Subject:* Re: Spark scala REPL - Unable to create sqlContext
>
>
>
> Have you taken a look at the fix for SPARK-11000 which is in the upcoming
> 1.6.0 release ?
>
>
>
> Cheers
>
>
>
> On Sun, Oct 25, 2015 at 8:42 AM, Yao  wrote:
>
> I have not been able to start Spark scala shell since 1.5 as it was not
> able
> to create the sqlContext during the startup. It complains the metastore_db
> is already locked: "Another instance of Derby may have already booted the
> database". The Derby log is attached.
>
> I only have this problem with starting the shell in yarn-client mode. I am
> working with HDP2.2.6 which runs Hadoop 2.6.
>
> -Yao derby.log
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-scala-REPL-Unable-to-create-sqlContext-tp25195.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>


[Yarn-Client] Can not access SparkUI

2015-10-25 Thread Earthson
We are using Spark 1.5.1 with `--master yarn`; the YARN RM is running in HA mode.

direct visit:
[screenshot not preserved in the archive]

click ApplicationMaster link:
[screenshot not preserved in the archive]

YARN RM log:
[log output not preserved in the archive]


--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-Client-Can-not-access-SparkUI-tp25197.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread Meihua Wu
please add "setFitIntercept(false)" to your LinearRegression.

LinearRegression by default includes an intercept in the model, e.g.
label = intercept + features dot weight

To get the result you want, you need to force the intercept to be zero.

Just curious, are you trying to solve systems of linear equations? If
so, you can probably try breeze.
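
A minimal Scala sketch of both suggestions (the `training` DataFrame is
assumed to hold the LabeledPoints from the quoted code below, and the
matrix/vector values mirror that example):

  import org.apache.spark.ml.regression.LinearRegression

  // Force the fit through the origin so the weights solve Ax = b directly.
  val lr = new LinearRegression()
    .setMaxIter(100)
    .setRegParam(0.0)
    .setElasticNetParam(0.0)
    .setFitIntercept(false)
  val model = lr.fit(training)      // `training` is assumed to already exist
  println("Final w: " + model.weights)

  // Or, for a plain linear system, solve it directly with Breeze.
  import breeze.linalg.{DenseMatrix, DenseVector}
  val A = DenseMatrix(
    (1.0, 2.0, 3.0, 4.0),
    (0.0, 2.0, 3.0, 4.0),
    (0.0, 0.0, 3.0, 4.0),
    (0.0, 0.0, 0.0, 4.0))
  val b = DenseVector(30.0, 29.0, 25.0, 16.0)
  val x = A \ b                     // expected: DenseVector(1.0, 2.0, 3.0, 4.0)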



On Sun, Oct 25, 2015 at 9:10 PM, Zhiliang Zhu
 wrote:
>
>
>
> On Monday, October 26, 2015 11:26 AM, Zhiliang Zhu
>  wrote:
>
>
> Hi DB Tsai,
>
> Thanks very much for your kind help. I  get it now.
>
> I am sorry that there is another issue, the weight/coefficient result is
> perfect while A is triangular matrix, however, while A is not triangular
> matrix (but
> transformed from triangular matrix, still is invertible), the result seems
> not perfect and difficult to make it better by resetting the parameter.
> Would you help comment some about that...
>
> List localTraining = Lists.newArrayList(
>   new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
>   new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
>   new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
>   new LabeledPoint(-3.0, Vectors.dense(0.0, 0.0, -1.0, 0.0)));
> ...
> LinearRegression lr = new LinearRegression()
>   .setMaxIter(2)
>   .setRegParam(0)
>   .setElasticNetParam(0);
> 
>
> --
>
> It seems that no matter how to reset the parameters for lr , the output of
> x3 and x4 is always nearly the same .
> Whether there is some way to make the result a little better...
>
>
> --
>
> x3 and x4 could not become better, the output is:
> Final w:
> [0.999477672867,1.999748740578,3.500112393734,3.50011239377]
>
> Thank you,
> Zhiliang
>
>
>
> On Monday, October 26, 2015 10:25 AM, DB Tsai  wrote:
>
>
> Column 4 is always constant, so it has no predictive power, resulting in a zero weight.
>
> On Sunday, October 25, 2015, Zhiliang Zhu  wrote:
>
> Hi DB Tsai,
>
> Thanks very much for your kind reply help.
>
> As for your comment, I just modified and tested the key part of the codes:
>
>  LinearRegression lr = new LinearRegression()
>.setMaxIter(1)
>.setRegParam(0)
>.setElasticNetParam(0);  //the number could be reset
>
>  final LinearRegressionModel model = lr.fit(training);
>
> Now the output is much reasonable, however, x4 is always 0 while repeatedly
> reset those parameters in lr , would you help some about it how to properly
> set the parameters ...
>
> Final w: [1.00127825909,1.99979185054,2.3307136,0.0]
>
> Thank you,
> Zhiliang
>
>
>
>
> On Monday, October 26, 2015 5:14 AM, DB Tsai  wrote:
>
>
> LinearRegressionWithSGD is not stable. Please use linear regression in
> ML package instead.
> http://spark.apache.org/docs/latest/ml-linear-methods.html
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Sun, Oct 25, 2015 at 10:14 AM, Zhiliang Zhu
>  wrote:
>> Dear All,
>>
>> I have some program as below which makes me very much confused and
>> inscrutable, it is about multiple dimension linear regression mode, the
>> weight / coefficient is always perfect while the dimension is smaller than
>> 4, otherwise it is wrong all the time.
>> Or, whether the LinearRegressionWithSGD would be selected for another one?
>>
>> public class JavaLinearRegression {
>>  public static void main(String[] args) {
>>SparkConf conf = new SparkConf().setAppName("Linear Regression
>> Example");
>>JavaSparkContext sc = new JavaSparkContext(conf);
>>SQLContext jsql = new SQLContext(sc);
>>
>>//Ax = b, x = [1, 2, 3, 4] would be the only one output about weight
>>//x1 + 2 * x2 + 3 * x3 + 4 * x4 = y would be the multiple linear mode
>>List localTraining = Lists.newArrayList(
>>new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
>>new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
>>new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
>>new LabeledPoint(16.0, Vectors.dense(0.0, 0.0, 0.0, 4.0)));
>>
>>JavaRDD training = sc.parallelize(localTraining).cache();
>>
>>// Building the model
>>int numIterations = 1000; //the number could be reset large
>>final LinearRegressionModel model =
>> LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numIterations);
>>
>>//the coefficient weights are perfect while dimension of LabeledPoint
>> is
>> SMALLER than 4.
>>//otherwise the output is always wrong and inscru

Re: Secondary Sorting in Spark

2015-10-25 Thread swetha kasireddy
Hi,

Does the use of a custom partitioner in Streaming affect performance?
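
For reference, a minimal sketch of the secondary-sort pattern discussed in
the blog post below (field names, types, and the partition count are
placeholders): partition on the primary key only, and let the tuple ordering
sort by (key, secondary) within each partition.

  import org.apache.spark.Partitioner

  // Route records by the primary key only, so each user's events land in one partition.
  class PrimaryKeyPartitioner(override val numPartitions: Int) extends Partitioner {
    override def getPartition(key: Any): Int = key match {
      case (user: String, _) => math.abs(user.hashCode) % numPartitions
    }
  }

  // Placeholder data: (user, timestamp) pairs.
  val events = sc.parallelize(Seq(("user1", 3L), ("user2", 1L), ("user1", 1L)))
  val sorted = events
    .map { case (user, ts) => ((user, ts), ()) }   // composite key: (user, timestamp)
    .repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(4))
    .keys                                          // per partition: grouped by user, sorted by ts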

On Mon, Oct 5, 2015 at 1:06 PM, Adrian Tanase  wrote:

> Great article, especially the use of a custom partitioner.
>
> Also, sorting by multiple fields by creating a tuple out of them is an
> awesome, easy to miss, Scala feature.
>
> Sent from my iPhone
>
> On 04 Oct 2015, at 21:41, Bill Bejeck  wrote:
>
> I've written blog post on secondary sorting in Spark and I'd thought I'd
> share it with the group
>
> http://codingjunkie.net/spark-secondary-sort/
>
> Thanks,
> Bill
>
>


Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread Zhiliang Zhu
 


 On Monday, October 26, 2015 11:26 AM, Zhiliang Zhu wrote:

Hi DB Tsai,

Thanks very much for your kind help. I get it now.

I am sorry that there is another issue: the weight/coefficient result is
perfect while A is a triangular matrix; however, while A is not a triangular
matrix (but transformed from a triangular matrix, and still invertible), the
result seems not perfect and is difficult to improve by resetting the
parameters. Would you help comment some about that...

List localTraining = Lists.newArrayList(
  new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
  new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
  new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
  new LabeledPoint(-3.0, Vectors.dense(0.0, 0.0, -1.0, 0.0)));
...
LinearRegression lr = new LinearRegression()
  .setMaxIter(2)
  .setRegParam(0)
  .setElasticNetParam(0);

--

It seems that no matter how the parameters for lr are reset, the output for
x3 and x4 is always nearly the same. Is there some way to make the result a
little better...

--

x3 and x4 could not become better; the output is:
Final w: [0.999477672867,1.999748740578,3.500112393734,3.50011239377]

Thank you,
Zhiliang
 


 On Monday, October 26, 2015 10:25 AM, DB Tsai  wrote:
   

 Column 4 is always constant, so it has no predictive power, resulting in a zero weight.

On Sunday, October 25, 2015, Zhiliang Zhu  wrote:

Hi DB Tsai,

Thanks very much for your kind reply help.

As for your comment, I just modified and tested the key part of the codes:

 LinearRegression lr = new LinearRegression()
   .setMaxIter(1)
   .setRegParam(0)
   .setElasticNetParam(0);  //the number could be reset

 final LinearRegressionModel model = lr.fit(training);

Now the output is much more reasonable; however, x4 is always 0 no matter how
I reset those parameters in lr. Would you help some about how to properly set
the parameters ...

Final w: [1.00127825909,1.99979185054,2.3307136,0.0]

Thank you,
Zhiliang

 


 On Monday, October 26, 2015 5:14 AM, DB Tsai  wrote:
   

 LinearRegressionWithSGD is not stable. Please use linear regression in
ML package instead.
http://spark.apache.org/docs/latest/ml-linear-methods.html

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sun, Oct 25, 2015 at 10:14 AM, Zhiliang Zhu
 wrote:
> Dear All,
>
> I have some program as below which makes me very much confused and
> inscrutable, it is about multiple dimension linear regression mode, the
> weight / coefficient is always perfect while the dimension is smaller than
> 4, otherwise it is wrong all the time.
> Or, whether the LinearRegressionWithSGD would be selected for another one?
>
> public class JavaLinearRegression {
>  public static void main(String[] args) {
>    SparkConf conf = new SparkConf().setAppName("Linear Regression
> Example");
>    JavaSparkContext sc = new JavaSparkContext(conf);
>    SQLContext jsql = new SQLContext(sc);
>
>    //Ax = b, x = [1, 2, 3, 4] would be the only one output about weight
>    //x1 + 2 * x2 + 3 * x3 + 4 * x4 = y would be the multiple linear mode
>    List localTraining = Lists.newArrayList(
>        new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
>        new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
>        new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
>        new LabeledPoint(16.0, Vectors.dense(0.0, 0.0, 0.0, 4.0)));
>
>    JavaRDD training = sc.parallelize(localTraining).cache();
>
>    // Building the model
>    int numIterations = 1000; //the number could be reset large
>    final LinearRegressionModel model =
> LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numIterations);
>
>    //the coefficient weights are perfect while dimension of LabeledPoint is
> SMALLER than 4.
>    //otherwise the output is always wrong and inscrutable.
>    //for instance, one output is
>    //Final w:
> [2.537341836047772E25,-7.744333206289736E24,6.697875883454909E23,-2.6704705246777624E22]
>    System.out.print("Final w: " + model.weights() + "\n\n");
>  }
> }
>
>  I would appreciate your kind help or guidance very much~~
>
> Thank you!
> Zhiliang
>
>


   


-- 
- DB
Sent from my iPhone


   

  

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread Zhiliang Zhu
Hi DB Tsai,

Thanks very much for your kind help. I get it now.

I am sorry that there is another issue: the weight/coefficient result is
perfect while A is a triangular matrix; however, while A is not a triangular
matrix (but transformed from a triangular matrix, and still invertible), the
result seems not perfect and is difficult to improve by resetting the
parameters. Would you help comment some about that...

List localTraining = Lists.newArrayList(
  new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
  new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
  new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
  new LabeledPoint(-3.0, Vectors.dense(0.0, 0.0, -1.0, 0.0)));
...
LinearRegression lr = new LinearRegression()
  .setMaxIter(2)
  .setRegParam(0)
  .setElasticNetParam(0);


x3 and x4 could not become better; the output is:
Final w: [0.999477672867,1.999748740578,3.500112393734,3.50011239377]

Thank you,
Zhiliang
 


 On Monday, October 26, 2015 10:25 AM, DB Tsai  wrote:
   

 Column 4 is always constant, so it has no predictive power, resulting in a zero weight.

On Sunday, October 25, 2015, Zhiliang Zhu  wrote:

Hi DB Tsai,

Thanks very much for your kind reply help.

As for your comment, I just modified and tested the key part of the codes:

 LinearRegression lr = new LinearRegression()
   .setMaxIter(1)
   .setRegParam(0)
   .setElasticNetParam(0);  //the number could be reset

 final LinearRegressionModel model = lr.fit(training);

Now the output is much more reasonable; however, x4 is always 0 no matter how
I reset those parameters in lr. Would you help some about how to properly set
the parameters ...

Final w: [1.00127825909,1.99979185054,2.3307136,0.0]

Thank you,
Zhiliang

 


 On Monday, October 26, 2015 5:14 AM, DB Tsai  wrote:
   

 LinearRegressionWithSGD is not stable. Please use linear regression in
ML package instead.
http://spark.apache.org/docs/latest/ml-linear-methods.html

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sun, Oct 25, 2015 at 10:14 AM, Zhiliang Zhu
 wrote:
> Dear All,
>
> I have some program as below which makes me very much confused and
> inscrutable, it is about multiple dimension linear regression mode, the
> weight / coefficient is always perfect while the dimension is smaller than
> 4, otherwise it is wrong all the time.
> Or, whether the LinearRegressionWithSGD would be selected for another one?
>
> public class JavaLinearRegression {
>  public static void main(String[] args) {
>    SparkConf conf = new SparkConf().setAppName("Linear Regression
> Example");
>    JavaSparkContext sc = new JavaSparkContext(conf);
>    SQLContext jsql = new SQLContext(sc);
>
>    //Ax = b, x = [1, 2, 3, 4] would be the only one output about weight
>    //x1 + 2 * x2 + 3 * x3 + 4 * x4 = y would be the multiple linear mode
>    List localTraining = Lists.newArrayList(
>        new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
>        new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
>        new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
>        new LabeledPoint(16.0, Vectors.dense(0.0, 0.0, 0.0, 4.0)));
>
>    JavaRDD training = sc.parallelize(localTraining).cache();
>
>    // Building the model
>    int numIterations = 1000; //the number could be reset large
>    final LinearRegressionModel model =
> LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numIterations);
>
>    //the coefficient weights are perfect while dimension of LabeledPoint is
> SMALLER than 4.
>    //otherwise the output is always wrong and inscrutable.
>    //for instance, one output is
>    //Final w:
> [2.537341836047772E25,-7.744333206289736E24,6.697875883454909E23,-2.6704705246777624E22]
>    System.out.print("Final w: " + model.weights() + "\n\n");
>  }
> }
>
>  I would appreciate your kind help or guidance very much~~
>
> Thank you!
> Zhiliang
>
>


   


-- 
- DB
Sent from my iPhone


  

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi guys,

I mentioned that the partitions are generated so I tried to read the
partition data from it. The driver is OOM after few minutes. The stack
trace is below. It looks very similar to the the jstack above (note on the
refresh method). Thanks!

Name: java.lang.OutOfMemoryError
Message: GC overhead limit exceeded
StackTrace: java.util.Arrays.copyOf(Arrays.java:2367)
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
java.lang.StringBuilder.append(StringBuilder.java:132)
org.apache.hadoop.fs.Path.toString(Path.java:384)
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(interfaces.scala:447)
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(interfaces.scala:447)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(interfaces.scala:447)
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh(interfaces.scala:453)
org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute(interfaces.scala:465)
org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache(interfaces.scala:463)
org.apache.spark.sql.sources.HadoopFsRelation.cachedLeafStatuses(interfaces.scala:470)
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:381)
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCache$lzycompute(ParquetRelation.scala:145)
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCache(ParquetRelation.scala:143)
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$6.apply(ParquetRelation.scala:196)
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$6.apply(ParquetRelation.scala:196)
scala.Option.getOrElse(Option.scala:120)
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.dataSchema(ParquetRelation.scala:196)
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:31)
org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:395)
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:267)


On Sun, Oct 25, 2015 at 10:25 PM, Jerry Lam  wrote:

> Hi Josh,
>
> No I don't have speculation enabled. The driver took about few hours until
> it was OOM. Interestingly, all partitions are generated successfully
> (_SUCCESS file is written in the output directory). Is there a reason why
> the driver needs so much memory? The jstack revealed that it called refresh
> some file statuses. Is there a way to avoid OutputCommitCoordinator to
> use so much memory?
>
> Ultimately, I choose to use partitions because most of the queries I have
> will execute based the partition field. For example, "SELECT events from
> customer where customer_id = 1234". If the partition is based on
> customer_id, all events for a customer can be easily retrieved without
> filtering the entire dataset which is much more efficient (I hope).
> However, I notice that the implementation of the partition logic does not
> seem to allow this type of use cases without using a lot of memory which is
> a bit odd in my opinion. Any help will be greatly appreciated.
>
> Best Regards,
>
> Jerry
>
>
>
> On Sun, Oct 25, 2015 at 9:25 PM, Josh Rosen  wrote:
>
>> Hi Jerry,
>>
>> Do you have speculation enabled? A write which produces one million files
>> / output partitions might be using tons of driver memory via the
>> OutputCommitCoordinator's bookkeeping data structures.
>>
>> On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam  wrote:
>>
>>> Hi spark guys,
>>>
>>> I think I hit the same issue SPARK-8890
>>> https://issues.apache.org/jira/browse/SPARK-8890. It is marked as
>>> resolved. However it is not. I have over a million output directories for 1
>>> single column in partitionBy. Not sure if this is a regression issue? Do I
>>> need to set some parameters to make it more memory efficient?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi Josh,

No, I don't have speculation enabled. The driver ran for a few hours until
it went OOM. Interestingly, all partitions are generated successfully (the
_SUCCESS file is written in the output directory). Is there a reason why
the driver needs so much memory? The jstack revealed that it called refresh
on some file statuses. Is there a way to keep the OutputCommitCoordinator
from using so much memory?

Ultimately, I chose to use partitions because most of my queries filter on
the partition field. For example, "SELECT events from customer where
customer_id = 1234". If the data is partitioned by customer_id, all events
for a customer can be retrieved without scanning the entire dataset, which
is much more efficient (I hope). However, the implementation of the
partitioning logic does not seem to allow this type of use case without
using a lot of memory, which is a bit odd in my opinion. Any help will be
greatly appreciated.

Best Regards,

Jerry
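
For context, a rough sketch of the write/read pattern being described above
(paths, the input source, and the filter value are placeholders, not the
actual job):

  // Write, creating one output directory per customer_id value.
  val events = sqlContext.read.json("/input/events")          // placeholder input
  events.write.partitionBy("customer_id").parquet("/output/events")

  // Read back: partition pruning should only scan the customer_id=1234 directory.
  val one = sqlContext.read.parquet("/output/events")
    .filter("customer_id = 1234")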



On Sun, Oct 25, 2015 at 9:25 PM, Josh Rosen  wrote:

> Hi Jerry,
>
> Do you have speculation enabled? A write which produces one million files
> / output partitions might be using tons of driver memory via the
> OutputCommitCoordinator's bookkeeping data structures.
>
> On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam  wrote:
>
>> Hi spark guys,
>>
>> I think I hit the same issue SPARK-8890
>> https://issues.apache.org/jira/browse/SPARK-8890. It is marked as
>> resolved. However it is not. I have over a million output directories for 1
>> single column in partitionBy. Not sure if this is a regression issue? Do I
>> need to set some parameters to make it more memory efficient?
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>>
>>
>> On Sun, Oct 25, 2015 at 8:39 PM, Jerry Lam  wrote:
>>
>>> Hi guys,
>>>
>>> After waiting for a day, it actually causes OOM on the spark driver. I
>>> configure the driver to have 6GB. Note that I didn't call refresh myself.
>>> The method was called when saving the dataframe in parquet format. Also I'm
>>> using partitionBy() on the DataFrameWriter to generate over 1 million
>>> files. Not sure why it OOM the driver after the job is marked _SUCCESS in
>>> the output folder.
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>
>>> On Sat, Oct 24, 2015 at 9:35 PM, Jerry Lam  wrote:
>>>
 Hi Spark users and developers,

 Does anyone encounter any issue when a spark SQL job produces a lot of
 files (over 1 millions), the job hangs on the refresh method? I'm using
 spark 1.5.1. Below is the stack trace. I saw the parquet files are produced
 but the driver is doing something very intensively (it uses all the cpus).
 Does it mean Spark SQL cannot be used to produce over 1 million files in a
 single job?

 Thread 528: (state = BLOCKED)
  - java.util.Arrays.copyOf(char[], int) @bci=1, line=2367 (Compiled
 frame)
  - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=43,
 line=130 (Compiled frame)
  - java.lang.AbstractStringBuilder.ensureCapacityInternal(int) @bci=12,
 line=114 (Compiled frame)
  - java.lang.AbstractStringBuilder.append(java.lang.String) @bci=19,
 line=415 (Compiled frame)
  - java.lang.StringBuilder.append(java.lang.String) @bci=2, line=132
 (Compiled frame)
  - org.apache.hadoop.fs.Path.toString() @bci=128, line=384 (Compiled
 frame)
  -
 org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(org.apache.hadoop.fs.FileStatus)
 @bci=4, line=447 (Compiled frame)
  -
 org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(java.lang.Object)
 @bci=5, line=447 (Compiled frame)
  -
 scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object)
 @bci=9, line=244 (Compiled frame)
  -
 scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object)
 @bci=2, line=244 (Compiled frame)
  -
 scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized,
 scala.Function1) @bci=22, line=33 (Compiled frame)
  - scala.collection.mutable.ArrayOps$ofRef.foreach(scala.Function1)
 @bci=2, line=108 (Compiled frame)
  -
 scala.collection.TraversableLike$class.map(scala.collection.TraversableLike,
 scala.Function1, scala.collection.generic.CanBuildFrom) @bci=17, line=244
 (Compiled frame)
  - scala.collection.mutable.ArrayOps$ofRef.map(scala.Function1,
 scala.collection.generic.CanBuildFrom) @bci=3, line=108 (Interpreted frame)
  -
 org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(java.lang.String[])
 @bci=279, line=447 (Interpreted frame)
  -
 org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh()
 @bci=8, line=453 (Interpreted frame)
  - 
 org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute()
 @bci=26, line

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread DB Tsai
Column 4 is always constant, so it has no predictive power, resulting in a zero weight.

On Sunday, October 25, 2015, Zhiliang Zhu  wrote:

> Hi DB Tsai,
>
> Thanks very much for your kind reply help.
>
> As for your comment, I just modified and tested the key part of the codes:
>
>  LinearRegression lr = new LinearRegression()
>.setMaxIter(1)
>.setRegParam(0)
>.setElasticNetParam(0);  //the number could be reset
>
>  final LinearRegressionModel model = lr.fit(training);
>
> Now the output is much reasonable, however, x4 is always 0 while
> repeatedly reset those parameters in lr , would you help some about it how
> to properly set the parameters ...
>
> Final w: [1.00127825909,1.99979185054,2.3307136,0.0]
>
> Thank you,
> Zhiliang
>
>
>
>
> On Monday, October 26, 2015 5:14 AM, DB Tsai  > wrote:
>
>
> LinearRegressionWithSGD is not stable. Please use linear regression in
> ML package instead.
> http://spark.apache.org/docs/latest/ml-linear-methods.html
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Sun, Oct 25, 2015 at 10:14 AM, Zhiliang Zhu
>  > wrote:
> > Dear All,
> >
> > I have some program as below which makes me very much confused and
> > inscrutable, it is about multiple dimension linear regression mode, the
> > weight / coefficient is always perfect while the dimension is smaller
> than
> > 4, otherwise it is wrong all the time.
> > Or, whether the LinearRegressionWithSGD would be selected for another
> one?
> >
> > public class JavaLinearRegression {
> >  public static void main(String[] args) {
> >SparkConf conf = new SparkConf().setAppName("Linear Regression
> > Example");
> >JavaSparkContext sc = new JavaSparkContext(conf);
> >SQLContext jsql = new SQLContext(sc);
> >
> >//Ax = b, x = [1, 2, 3, 4] would be the only one output about weight
> >//x1 + 2 * x2 + 3 * x3 + 4 * x4 = y would be the multiple linear mode
> >List localTraining = Lists.newArrayList(
> >new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
> >new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
> >new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
> >new LabeledPoint(16.0, Vectors.dense(0.0, 0.0, 0.0, 4.0)));
> >
> >JavaRDD training =
> sc.parallelize(localTraining).cache();
> >
> >// Building the model
> >int numIterations = 1000; //the number could be reset large
> >final LinearRegressionModel model =
> > LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numIterations);
> >
> >//the coefficient weights are perfect while dimension of LabeledPoint
> is
> > SMALLER than 4.
> >//otherwise the output is always wrong and inscrutable.
> >//for instance, one output is
> >//Final w:
> >
> [2.537341836047772E25,-7.744333206289736E24,6.697875883454909E23,-2.6704705246777624E22]
> >System.out.print("Final w: " + model.weights() + "\n\n");
> >  }
> > }
> >
> >  I would appreciate your kind help or guidance very much~~
> >
> > Thank you!
> > Zhiliang
> >
> >
>
>
>

-- 
- DB

Sent from my iPhone


RE: How to set memory for SparkR with master="local[*]"

2015-10-25 Thread Sun, Rui
As documented in
http://spark.apache.org/docs/latest/configuration.html#available-properties,
the note for "spark.driver.memory" reads:
Note: In client mode, this config must not be set through the SparkConf
directly in your application, because the driver JVM has already started at
that point. Instead, please set this through the --driver-memory command line
option or in your default properties file.

If you are starting a SparkR shell using bin/sparkR, then you can use
bin/sparkR --driver-memory. You have no chance to set the driver memory size
after the R shell has been launched via bin/sparkR.

But if you are starting a SparkR shell manually without using bin/sparkR (for
example, in RStudio), you can:
library(SparkR)
Sys.setenv("SPARKR_SUBMIT_ARGS" = "--conf spark.driver.memory=2g sparkr-shell")
sc <- sparkR.init()
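
If the shell is launched through bin/sparkR or spark-submit instead, the same
setting can also live in conf/spark-defaults.conf (the value below is only
illustrative):

spark.driver.memory    5g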

From: Dirceu Semighini Filho [mailto:dirceu.semigh...@gmail.com]
Sent: Friday, October 23, 2015 7:53 PM
Cc: user
Subject: Re: How to set memory for SparkR with master="local[*]"

Hi Matej,
I'm also using this and I'm having the same behavior here, my driver has only 
530mb which is the default value.

Maybe this is a bug.

2015-10-23 9:43 GMT-02:00 Matej Holec <hol...@gmail.com>:
Hello!

How to adjust the memory settings properly for SparkR with master="local[*]"
in R?


*When running from  R -- SparkR doesn't accept memory settings :(*

I use the following commands:

R>  library(SparkR)
R>  sc <- sparkR.init(master = "local[*]", sparkEnvir =
list(spark.driver.memory = "5g"))

Despite the variable spark.driver.memory is correctly set (checked in
http://node:4040/environment/), the driver has only the default amount of
memory allocated (Storage Memory 530.3 MB).

*But when running from  spark-1.5.1-bin-hadoop2.6/bin/sparkR -- OK*

The following command:

]$ spark-1.5.1-bin-hadoop2.6/bin/sparkR --driver-memory 5g

creates SparkR session with properly adjustest driver memory (Storage Memory
2.6 GB).


Any suggestion?

Thanks
Matej



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-memory-for-SparkR-with-master-local-tp25178.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org



Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread Zhiliang Zhu
Hi DB Tsai,

Thanks very much for your kind reply help.

As for your comment, I just modified and tested the key part of the codes:

 LinearRegression lr = new LinearRegression()
   .setMaxIter(1)
   .setRegParam(0)
   .setElasticNetParam(0);  //the number could be reset

 final LinearRegressionModel model = lr.fit(training);

Now the output is much more reasonable; however, x4 is always 0 no matter how
I reset those parameters in lr. Would you help some about how to properly set
the parameters ...

Final w: [1.00127825909,1.99979185054,2.3307136,0.0]

Thank you,
Zhiliang

 


 On Monday, October 26, 2015 5:14 AM, DB Tsai  wrote:
   

 LinearRegressionWithSGD is not stable. Please use linear regression in
ML package instead.
http://spark.apache.org/docs/latest/ml-linear-methods.html

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sun, Oct 25, 2015 at 10:14 AM, Zhiliang Zhu
 wrote:
> Dear All,
>
> I have some program as below which makes me very much confused and
> inscrutable, it is about multiple dimension linear regression mode, the
> weight / coefficient is always perfect while the dimension is smaller than
> 4, otherwise it is wrong all the time.
> Or, whether the LinearRegressionWithSGD would be selected for another one?
>
> public class JavaLinearRegression {
>  public static void main(String[] args) {
>    SparkConf conf = new SparkConf().setAppName("Linear Regression
> Example");
>    JavaSparkContext sc = new JavaSparkContext(conf);
>    SQLContext jsql = new SQLContext(sc);
>
>    //Ax = b, x = [1, 2, 3, 4] would be the only one output about weight
>    //x1 + 2 * x2 + 3 * x3 + 4 * x4 = y would be the multiple linear mode
>    List localTraining = Lists.newArrayList(
>        new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
>        new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
>        new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
>        new LabeledPoint(16.0, Vectors.dense(0.0, 0.0, 0.0, 4.0)));
>
>    JavaRDD training = sc.parallelize(localTraining).cache();
>
>    // Building the model
>    int numIterations = 1000; //the number could be reset large
>    final LinearRegressionModel model =
> LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numIterations);
>
>    //the coefficient weights are perfect while dimension of LabeledPoint is
> SMALLER than 4.
>    //otherwise the output is always wrong and inscrutable.
>    //for instance, one output is
>    //Final w:
> [2.537341836047772E25,-7.744333206289736E24,6.697875883454909E23,-2.6704705246777624E22]
>    System.out.print("Final w: " + model.weights() + "\n\n");
>  }
> }
>
>  I would appreciate your kind help or guidance very much~~
>
> Thank you!
> Zhiliang
>
>


  

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Josh Rosen
Hi Jerry,

Do you have speculation enabled? A write which produces one million files /
output partitions might be using tons of driver memory via the
OutputCommitCoordinator's bookkeeping data structures.
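
For what it's worth, speculation is controlled by the spark.speculation
setting (false by default); a minimal sketch of setting it explicitly on the
job's SparkConf to rule it out:

  import org.apache.spark.SparkConf

  // spark.speculation defaults to false; set it explicitly if unsure.
  val conf = new SparkConf().set("spark.speculation", "false")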

On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam  wrote:

> Hi spark guys,
>
> I think I hit the same issue SPARK-8890
> https://issues.apache.org/jira/browse/SPARK-8890. It is marked as
> resolved. However it is not. I have over a million output directories for 1
> single column in partitionBy. Not sure if this is a regression issue? Do I
> need to set some parameters to make it more memory efficient?
>
> Best Regards,
>
> Jerry
>
>
>
>
> On Sun, Oct 25, 2015 at 8:39 PM, Jerry Lam  wrote:
>
>> Hi guys,
>>
>> After waiting for a day, it actually causes OOM on the spark driver. I
>> configure the driver to have 6GB. Note that I didn't call refresh myself.
>> The method was called when saving the dataframe in parquet format. Also I'm
>> using partitionBy() on the DataFrameWriter to generate over 1 million
>> files. Not sure why it OOM the driver after the job is marked _SUCCESS in
>> the output folder.
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Sat, Oct 24, 2015 at 9:35 PM, Jerry Lam  wrote:
>>
>>> Hi Spark users and developers,
>>>
>>> Does anyone encounter any issue when a spark SQL job produces a lot of
>>> files (over 1 millions), the job hangs on the refresh method? I'm using
>>> spark 1.5.1. Below is the stack trace. I saw the parquet files are produced
>>> but the driver is doing something very intensively (it uses all the cpus).
>>> Does it mean Spark SQL cannot be used to produce over 1 million files in a
>>> single job?
>>>
>>> Thread 528: (state = BLOCKED)
>>>  - java.util.Arrays.copyOf(char[], int) @bci=1, line=2367 (Compiled
>>> frame)
>>>  - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=43, line=130
>>> (Compiled frame)
>>>  - java.lang.AbstractStringBuilder.ensureCapacityInternal(int) @bci=12,
>>> line=114 (Compiled frame)
>>>  - java.lang.AbstractStringBuilder.append(java.lang.String) @bci=19,
>>> line=415 (Compiled frame)
>>>  - java.lang.StringBuilder.append(java.lang.String) @bci=2, line=132
>>> (Compiled frame)
>>>  - org.apache.hadoop.fs.Path.toString() @bci=128, line=384 (Compiled
>>> frame)
>>>  -
>>> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(org.apache.hadoop.fs.FileStatus)
>>> @bci=4, line=447 (Compiled frame)
>>>  -
>>> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(java.lang.Object)
>>> @bci=5, line=447 (Compiled frame)
>>>  -
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object)
>>> @bci=9, line=244 (Compiled frame)
>>>  -
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object)
>>> @bci=2, line=244 (Compiled frame)
>>>  -
>>> scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized,
>>> scala.Function1) @bci=22, line=33 (Compiled frame)
>>>  - scala.collection.mutable.ArrayOps$ofRef.foreach(scala.Function1)
>>> @bci=2, line=108 (Compiled frame)
>>>  -
>>> scala.collection.TraversableLike$class.map(scala.collection.TraversableLike,
>>> scala.Function1, scala.collection.generic.CanBuildFrom) @bci=17, line=244
>>> (Compiled frame)
>>>  - scala.collection.mutable.ArrayOps$ofRef.map(scala.Function1,
>>> scala.collection.generic.CanBuildFrom) @bci=3, line=108 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(java.lang.String[])
>>> @bci=279, line=447 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh()
>>> @bci=8, line=453 (Interpreted frame)
>>>  - 
>>> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute()
>>> @bci=26, line=465 (Interpreted frame)
>>>  - 
>>> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache()
>>> @bci=12, line=463 (Interpreted frame)
>>>  - org.apache.spark.sql.sources.HadoopFsRelation.refresh() @bci=1,
>>> line=540 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh()
>>> @bci=1, line=204 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp()
>>> @bci=392, line=152 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
>>> @bci=1, line=108 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
>>> @bci=1, line=108 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(org.apache.spark.sql.SQLContext,
>>> org.apache.spark.sql.SQLContext$QueryExecution, scala.Function0) @bci=96,
>>> line=56 (Interpreted frame)
>>>  -
>>> org.apache.spark.sql.execution.datasources

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi spark guys,

I think I hit the same issue, SPARK-8890
https://issues.apache.org/jira/browse/SPARK-8890. It is marked as resolved;
however, it is not. I have over a million output directories for a single
partitionBy column. Not sure if this is a regression? Do I need to set some
parameters to make it more memory efficient?

Best Regards,

Jerry




On Sun, Oct 25, 2015 at 8:39 PM, Jerry Lam  wrote:

> Hi guys,
>
> After waiting for a day, it actually causes OOM on the spark driver. I
> configure the driver to have 6GB. Note that I didn't call refresh myself.
> The method was called when saving the dataframe in parquet format. Also I'm
> using partitionBy() on the DataFrameWriter to generate over 1 million
> files. Not sure why it OOM the driver after the job is marked _SUCCESS in
> the output folder.
>
> Best Regards,
>
> Jerry
>
>
> On Sat, Oct 24, 2015 at 9:35 PM, Jerry Lam  wrote:
>
>> Hi Spark users and developers,
>>
>> Does anyone encounter any issue when a spark SQL job produces a lot of
>> files (over 1 millions), the job hangs on the refresh method? I'm using
>> spark 1.5.1. Below is the stack trace. I saw the parquet files are produced
>> but the driver is doing something very intensively (it uses all the cpus).
>> Does it mean Spark SQL cannot be used to produce over 1 million files in a
>> single job?
>>
>> Thread 528: (state = BLOCKED)
>>  - java.util.Arrays.copyOf(char[], int) @bci=1, line=2367 (Compiled frame)
>>  - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=43, line=130
>> (Compiled frame)
>>  - java.lang.AbstractStringBuilder.ensureCapacityInternal(int) @bci=12,
>> line=114 (Compiled frame)
>>  - java.lang.AbstractStringBuilder.append(java.lang.String) @bci=19,
>> line=415 (Compiled frame)
>>  - java.lang.StringBuilder.append(java.lang.String) @bci=2, line=132
>> (Compiled frame)
>>  - org.apache.hadoop.fs.Path.toString() @bci=128, line=384 (Compiled
>> frame)
>>  -
>> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(org.apache.hadoop.fs.FileStatus)
>> @bci=4, line=447 (Compiled frame)
>>  -
>> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(java.lang.Object)
>> @bci=5, line=447 (Compiled frame)
>>  -
>> scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object)
>> @bci=9, line=244 (Compiled frame)
>>  -
>> scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object)
>> @bci=2, line=244 (Compiled frame)
>>  -
>> scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized,
>> scala.Function1) @bci=22, line=33 (Compiled frame)
>>  - scala.collection.mutable.ArrayOps$ofRef.foreach(scala.Function1)
>> @bci=2, line=108 (Compiled frame)
>>  -
>> scala.collection.TraversableLike$class.map(scala.collection.TraversableLike,
>> scala.Function1, scala.collection.generic.CanBuildFrom) @bci=17, line=244
>> (Compiled frame)
>>  - scala.collection.mutable.ArrayOps$ofRef.map(scala.Function1,
>> scala.collection.generic.CanBuildFrom) @bci=3, line=108 (Interpreted frame)
>>  -
>> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(java.lang.String[])
>> @bci=279, line=447 (Interpreted frame)
>>  -
>> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh()
>> @bci=8, line=453 (Interpreted frame)
>>  - 
>> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute()
>> @bci=26, line=465 (Interpreted frame)
>>  - 
>> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache()
>> @bci=12, line=463 (Interpreted frame)
>>  - org.apache.spark.sql.sources.HadoopFsRelation.refresh() @bci=1,
>> line=540 (Interpreted frame)
>>  -
>> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh()
>> @bci=1, line=204 (Interpreted frame)
>>  -
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp()
>> @bci=392, line=152 (Interpreted frame)
>>  -
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
>> @bci=1, line=108 (Interpreted frame)
>>  -
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
>> @bci=1, line=108 (Interpreted frame)
>>  -
>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(org.apache.spark.sql.SQLContext,
>> org.apache.spark.sql.SQLContext$QueryExecution, scala.Function0) @bci=96,
>> line=56 (Interpreted frame)
>>  -
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(org.apache.spark.sql.SQLContext)
>> @bci=718, line=108 (Interpreted frame)
>>  -
>> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute()
>> @bci=20, line=57 (Interpreted frame)
>>  - org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult()
>> @bci=15, line=57 (Interpreted frame)
>>  - org.apache.spark.sql.executio

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi guys,

After waiting for a day, it actually caused an OOM on the Spark driver. I
configured the driver to have 6GB. Note that I didn't call refresh myself;
the method was called when saving the dataframe in parquet format. Also, I'm
using partitionBy() on the DataFrameWriter to generate over 1 million files.
Not sure why it OOMs the driver after the job is marked _SUCCESS in the
output folder.

Best Regards,

Jerry


On Sat, Oct 24, 2015 at 9:35 PM, Jerry Lam  wrote:

> Hi Spark users and developers,
>
> Does anyone encounter any issue when a spark SQL job produces a lot of
> files (over 1 millions), the job hangs on the refresh method? I'm using
> spark 1.5.1. Below is the stack trace. I saw the parquet files are produced
> but the driver is doing something very intensively (it uses all the cpus).
> Does it mean Spark SQL cannot be used to produce over 1 million files in a
> single job?
>
> Thread 528: (state = BLOCKED)
>  - java.util.Arrays.copyOf(char[], int) @bci=1, line=2367 (Compiled frame)
>  - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=43, line=130
> (Compiled frame)
>  - java.lang.AbstractStringBuilder.ensureCapacityInternal(int) @bci=12,
> line=114 (Compiled frame)
>  - java.lang.AbstractStringBuilder.append(java.lang.String) @bci=19,
> line=415 (Compiled frame)
>  - java.lang.StringBuilder.append(java.lang.String) @bci=2, line=132
> (Compiled frame)
>  - org.apache.hadoop.fs.Path.toString() @bci=128, line=384 (Compiled frame)
>  -
> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(org.apache.hadoop.fs.FileStatus)
> @bci=4, line=447 (Compiled frame)
>  -
> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache$$anonfun$listLeafFiles$1.apply(java.lang.Object)
> @bci=5, line=447 (Compiled frame)
>  - scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object)
> @bci=9, line=244 (Compiled frame)
>  - scala.collection.TraversableLike$$anonfun$map$1.apply(java.lang.Object)
> @bci=2, line=244 (Compiled frame)
>  -
> scala.collection.IndexedSeqOptimized$class.foreach(scala.collection.IndexedSeqOptimized,
> scala.Function1) @bci=22, line=33 (Compiled frame)
>  - scala.collection.mutable.ArrayOps$ofRef.foreach(scala.Function1)
> @bci=2, line=108 (Compiled frame)
>  -
> scala.collection.TraversableLike$class.map(scala.collection.TraversableLike,
> scala.Function1, scala.collection.generic.CanBuildFrom) @bci=17, line=244
> (Compiled frame)
>  - scala.collection.mutable.ArrayOps$ofRef.map(scala.Function1,
> scala.collection.generic.CanBuildFrom) @bci=3, line=108 (Interpreted frame)
>  -
> org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(java.lang.String[])
> @bci=279, line=447 (Interpreted frame)
>  - org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh()
> @bci=8, line=453 (Interpreted frame)
>  - 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute()
> @bci=26, line=465 (Interpreted frame)
>  - 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache()
> @bci=12, line=463 (Interpreted frame)
>  - org.apache.spark.sql.sources.HadoopFsRelation.refresh() @bci=1,
> line=540 (Interpreted frame)
>  -
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh()
> @bci=1, line=204 (Interpreted frame)
>  -
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp()
> @bci=392, line=152 (Interpreted frame)
>  -
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
> @bci=1, line=108 (Interpreted frame)
>  -
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply()
> @bci=1, line=108 (Interpreted frame)
>  -
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(org.apache.spark.sql.SQLContext,
> org.apache.spark.sql.SQLContext$QueryExecution, scala.Function0) @bci=96,
> line=56 (Interpreted frame)
>  -
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(org.apache.spark.sql.SQLContext)
> @bci=718, line=108 (Interpreted frame)
>  -
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute()
> @bci=20, line=57 (Interpreted frame)
>  - org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult()
> @bci=15, line=57 (Interpreted frame)
>  - org.apache.spark.sql.execution.ExecutedCommand.doExecute() @bci=12,
> line=69 (Interpreted frame)
>  - org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply()
> @bci=11, line=140 (Interpreted frame)
>  - org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply()
> @bci=1, line=138 (Interpreted frame)
>  -
> org.apache.spark.rdd.RDDOperationScope$.withScope(org.apache.spark.SparkContext,
> java.lang.String, boolean, boolean, scala.Function0) @bci=131, line=147
> (Interpreted frame)
>  - org.apache.spark.sql.execution.SparkPlan.execut

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Todd Nist
So yes, the individual artifacts are released; however, there is no prebuilt
deployable bundle for Spark 1.5.1 and Scala 2.11.7, something like
spark-1.5.1-bin-hadoop-2.6_scala-2.11.tgz. The Spark site even states this:

*Note: Scala 2.11 users should download the Spark source package and
build with Scala 2.11 support.*

So if you want one simple deployable for a standalone environment, I thought
you had to perform the make-distribution build like I described.

Clearly the individual artifacts are there as you state; is there a provided
2.11 tgz available as well? I did not think there was. If there is, then
should the documentation on the download site be changed to reflect this?

Sorry for the confusion.

-Todd

On Sun, Oct 25, 2015 at 4:07 PM, Sean Owen  wrote:

> No, 2.11 artifacts are in fact published:
> http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-parent_2.11%22
>
> On Sun, Oct 25, 2015 at 7:37 PM, Todd Nist  wrote:
> > Sorry Sean, you are absolutely right, it supports 2.11; all I meant is
> > there is no release available as a standard download and that one has
> > to build it. Thanks for the clarification.
> > -Todd
> >
>


Re: SparkR in yarn-client mode needs sparkr.zip

2015-10-25 Thread Ram Venkatesh
Felix,

Missed your reply - agree looks like the same issue, resolved mine as
Duplicate.

Thanks!
Ram

On Sun, Oct 25, 2015 at 2:47 PM, Felix Cheung 
wrote:

>
>
> This might be related to https://issues.apache.org/jira/browse/SPARK-10500
>
>
>
> On Sun, Oct 25, 2015 at 9:57 AM -0700, "Ted Yu" 
> wrote:
>
> In zipRLibraries():
>
> // create a zip file from scratch, do not append to existing file.
> val zipFile = new File(dir, name)
>
> I guess instead of creating sparkr.zip in the same directory as R lib,
> the zip file can be created under some directory writable by the user
> launching the app and accessible by user 'yarn'.
>
> Cheers
>
> On Sun, Oct 25, 2015 at 8:29 AM, Ram Venkatesh 
> wrote:
>
> 
>
> If you run sparkR in yarn-client mode, it fails with
>
> Exception in thread "main" java.io.FileNotFoundException:
> /usr/hdp/2.3.2.1-12/spark/R/lib/sparkr.zip (Permission denied)
> at java.io.FileOutputStream.open0(Native Method)
> at java.io.FileOutputStream.open(FileOutputStream.java:270)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
> at
>
> org.apache.spark.deploy.RPackageUtils$.zipRLibraries(RPackageUtils.scala:215)
> at
>
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:371)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> The behavior is the same when I use the pre-built spark-1.5.1-bin-hadoop2.6
> version also.
>
> Interestingly if I run as a user with write permissions to the R/lib
> directory, it succeeds. However, the sparkr.zip file is recreated each time
> sparkR is launched, so even if the file is present it has to be writable by
> the submitting user.
>
> Couple questions:
> 1. Can spark.zip be packaged once and placed in that location for multiple
> users
> 2. If not, is this location configurable, so that each user can specify a
> directory that they can write?
>
> Thanks!
> Ram
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-in-yarn-client-mode-needs-sparkr-zip-tp25194.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>


Re: SparkR in yarn-client mode needs sparkr.zip

2015-10-25 Thread Ram Venkatesh
Ted Yu,

Agree that either picking up sparkr.zip if it already exists, or creating a
zip in a local scratch directory will work. This code is called by the
client side job submission logic and the resulting zip is already added to
the local resources for the YARN job, so I don't think the directory needs
be accessible by the user yarn or from the cluster. Filed
https://issues.apache.org/jira/browse/SPARK-11304 for this issue.

As a temporary hack workaround, I created a writable file called sparkr.zip
in R/lib and made it world writable.

Thanks
Ram

On Sun, Oct 25, 2015 at 9:56 AM, Ted Yu  wrote:

> In zipRLibraries():
>
> // create a zip file from scratch, do not append to existing file.
> val zipFile = new File(dir, name)
>
> I guess instead of creating sparkr.zip in the same directory as R lib,
> the zip file can be created under some directory writable by the user
> launching the app and accessible by user 'yarn'.
>
> Cheers
>
> On Sun, Oct 25, 2015 at 8:29 AM, Ram Venkatesh 
> wrote:
>
>> 
>>
>> If you run sparkR in yarn-client mode, it fails with
>>
>> Exception in thread "main" java.io.FileNotFoundException:
>> /usr/hdp/2.3.2.1-12/spark/R/lib/sparkr.zip (Permission denied)
>> at java.io.FileOutputStream.open0(Native Method)
>> at java.io.FileOutputStream.open(FileOutputStream.java:270)
>> at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>> at
>>
>> org.apache.spark.deploy.RPackageUtils$.zipRLibraries(RPackageUtils.scala:215)
>> at
>>
>> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:371)
>> at
>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>> at
>> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> The behavior is the same when I use the pre-built
>> spark-1.5.1-bin-hadoop2.6
>> version also.
>>
>> Interestingly if I run as a user with write permissions to the R/lib
>> directory, it succeeds. However, the sparkr.zip file is recreated each
>> time
>> sparkR is launched, so even if the file is present it has to be writable
>> by
>> the submitting user.
>>
>> Couple questions:
>> 1. Can spark.zip be packaged once and placed in that location for multiple
>> users
>> 2. If not, is this location configurable, so that each user can specify a
>> directory that they can write?
>>
>> Thanks!
>> Ram
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-in-yarn-client-mode-needs-sparkr-zip-tp25194.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: SparkR in yarn-client mode needs sparkr.zip

2015-10-25 Thread Felix Cheung
This might be related to  https://issues.apache.org/jira/browse/SPARK-10500



On Sun, Oct 25, 2015 at 9:57 AM -0700, "Ted Yu"  wrote:
In zipRLibraries():

// create a zip file from scratch, do not append to existing file.
val zipFile = new File(dir, name)

I guess instead of creating sparkr.zip in the same directory as R lib, the
zip file can be created under some directory writable by the user launching
the app and accessible by user 'yarn'.

Cheers

On Sun, Oct 25, 2015 at 8:29 AM, Ram Venkatesh 
wrote:

> 
>
> If you run sparkR in yarn-client mode, it fails with
>
> Exception in thread "main" java.io.FileNotFoundException:
> /usr/hdp/2.3.2.1-12/spark/R/lib/sparkr.zip (Permission denied)
> at java.io.FileOutputStream.open0(Native Method)
> at java.io.FileOutputStream.open(FileOutputStream.java:270)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
> at
>
> org.apache.spark.deploy.RPackageUtils$.zipRLibraries(RPackageUtils.scala:215)
> at
>
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:371)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> The behavior is the same when I use the pre-built spark-1.5.1-bin-hadoop2.6
> version also.
>
> Interestingly if I run as a user with write permissions to the R/lib
> directory, it succeeds. However, the sparkr.zip file is recreated each time
> sparkR is launched, so even if the file is present it has to be writable by
> the submitting user.
>
> Couple questions:
> 1. Can spark.zip be packaged once and placed in that location for multiple
> users
> 2. If not, is this location configurable, so that each user can specify a
> directory that they can write?
>
> Thanks!
> Ram
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-in-yarn-client-mode-needs-sparkr-zip-tp25194.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: "Failed to bind to" error with spark-shell on CDH5 and YARN

2015-10-25 Thread Lin Zhao
The issue is resolved. In this case, the hostname of my machine was configured
to a public domain name that resolves to the EC2 machine's public IP, and the
process is not allowed to bind to an Elastic IP. I changed the hostnames to
Amazon's private hostnames (ip-72-xxx-xxx) and then it worked.
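
(For jobs that construct their own SparkContext, a minimal sketch of the
equivalent fix at the Spark level rather than changing the OS hostname;
spark.driver.host and the SPARK_LOCAL_IP environment variable are standard
Spark settings, but the hostname below is just the placeholder from this
thread:)

import org.apache.spark.{SparkConf, SparkContext}

// Alternatively: export SPARK_LOCAL_IP=<private address> before launching.
val conf = new SparkConf()
  .setAppName("bind-to-private-address")
  .set("spark.driver.host", "ip-72-xxx-xxx")  // the EC2 private hostname

val sc = new SparkContext(conf)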



From: Steve Loughran 
Sent: Saturday, October 24, 2015 5:15 AM
To: Lin Zhao
Cc: user@spark.apache.org
Subject: Re: "Failed to bind to" error with spark-shell on CDH5 and YARN


On 24 Oct 2015, at 00:46, Lin Zhao <l...@exabeam.com> wrote:

I have a spark on YARN deployed using Cloudera Manager 5.4. The installation 
went smoothly. But when I try to run spark-shell I get a long list of 
exceptions saying "failed to bind to: /public_ip_of_host:0" and "Service 
'sparkDriver' could not bind on port 0. Attempting port 1.".

Any idea why spark-shell tries to bind to port 0 and 1? And it doesn't seem 
right to trying to bind to the public ip of the host. I tried "nc -l" on the 
host and it won't work.



"0" means any free port. If a service cannot bind to port 0, you are in 
trouble, Usually with the IP address being wrong

https://wiki.apache.org/hadoop/UnsetHostnameOrPort



Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread DB Tsai
LinearRegressionWithSGD is not stable. Please use the linear regression in
the ML package instead.
http://spark.apache.org/docs/latest/ml-linear-methods.html
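
For illustration, here is a minimal Scala sketch of the same model using the
ML-package LinearRegression. It is only a sketch: it assumes Spark 1.5.x with
an existing SQLContext named sqlContext, uses a few made-up sample rows
satisfying y = x1 + 2*x2 + 3*x3 + 4*x4 (so that no feature column is constant),
and turns the intercept off because the target model has none.

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Toy rows that satisfy y = x1 + 2*x2 + 3*x3 + 4*x4 exactly.
val points = Seq(
  LabeledPoint(10.0, Vectors.dense(1.0, 1.0, 1.0, 1.0)),
  LabeledPoint( 8.0, Vectors.dense(2.0, 1.0, 0.0, 1.0)),
  LabeledPoint(19.0, Vectors.dense(0.0, 2.0, 1.0, 3.0)),
  LabeledPoint(15.0, Vectors.dense(1.0, 0.0, 2.0, 2.0)),
  LabeledPoint(10.0, Vectors.dense(3.0, 2.0, 1.0, 0.0)))
val training = sqlContext.createDataFrame(points)  // columns "label" and "features"

val lr = new LinearRegression()
  .setMaxIter(100)        // quasi-Newton optimizer, so no SGD step size to hand-tune
  .setRegParam(0.0)
  .setFitIntercept(false) // the target model has no intercept term

val model = lr.fit(training)
// With noise-free, full-rank data the weights should come out close to [1.0, 2.0, 3.0, 4.0].
println("Weights: " + model.weights + ", intercept: " + model.intercept)

Unlike LinearRegressionWithSGD, there is no step size to tune here; a too-large
step size on unscaled features is typically what produces huge diverging
weights like the ones in the quoted output below.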

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sun, Oct 25, 2015 at 10:14 AM, Zhiliang Zhu
 wrote:
> Dear All,
>
> I have some program as below which makes me very much confused and
> inscrutable, it is about multiple dimension linear regression mode, the
> weight / coefficient is always perfect while the dimension is smaller than
> 4, otherwise it is wrong all the time.
> Or, whether the LinearRegressionWithSGD would be selected for another one?
>
> public class JavaLinearRegression {
>   public static void main(String[] args) {
> SparkConf conf = new SparkConf().setAppName("Linear Regression
> Example");
> JavaSparkContext sc = new JavaSparkContext(conf);
> SQLContext jsql = new SQLContext(sc);
>
> //Ax = b, x = [1, 2, 3, 4] would be the only one output about weight
> //x1 + 2 * x2 + 3 * x3 + 4 * x4 = y would be the multiple linear mode
> List localTraining = Lists.newArrayList(
> new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
> new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
> new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
> new LabeledPoint(16.0, Vectors.dense(0.0, 0.0, 0.0, 4.0)));
>
> JavaRDD training = sc.parallelize(localTraining).cache();
>
> // Building the model
> int numIterations = 1000; //the number could be reset large
> final LinearRegressionModel model =
> LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numIterations);
>
> //the coefficient weights are perfect while dimension of LabeledPoint is
> SMALLER than 4.
> //otherwise the output is always wrong and inscrutable.
> //for instance, one output is
> //Final w:
> [2.537341836047772E25,-7.744333206289736E24,6.697875883454909E23,-2.6704705246777624E22]
> System.out.print("Final w: " + model.weights() + "\n\n");
>   }
> }
>
>   I would appreciate your kind help or guidance very much~~
>
> Thank you!
> Zhiliang
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Newbie Help for spark compilation problem

2015-10-25 Thread Sean Owen
No, 2.11 artifacts are in fact published:
http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-parent_2.11%22

On Sun, Oct 25, 2015 at 7:37 PM, Todd Nist  wrote:
> Sorry Sean you are absolutely right it supports 2.11 all o meant is there is
> no release available as a standard download and that one has to build it.
> Thanks for the clairification.
> -Todd
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Error building Spark on Windows with sbt

2015-10-25 Thread Richard Eggert
Yes, I know, but it would be nice to be able to test things myself before I
push commits.

On Sun, Oct 25, 2015 at 3:50 PM, Ted Yu  wrote:

> If you have a pull request, Jenkins can test your change for you.
>
> FYI
>
> On Oct 25, 2015, at 12:43 PM, Richard Eggert 
> wrote:
>
> Also, if I run the Maven build on Windows or Linux without setting
> -DskipTests=true, it hangs indefinitely when it gets to
> org.apache.spark.JavaAPISuite.
>
> It's hard to test patches when the build doesn't work. :-/
>
> On Sun, Oct 25, 2015 at 3:41 PM, Richard Eggert 
> wrote:
>
>> By "it works", I mean, "It gets past that particular error". It still
>> fails several minutes later with a different error:
>>
>> java.lang.IllegalStateException: impossible to get artifacts when data
>> has not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3
>>
>>
>> On Sun, Oct 25, 2015 at 3:38 PM, Richard Eggert > > wrote:
>>
>>> When I try to start up sbt for the Spark build,  or if I try to import
>>> it in IntelliJ IDEA as an sbt project, it fails with a "No such file or
>>> directory" error when it attempts to "git clone" sbt-pom-reader into
>>> .sbt/0.13/staging/some-sha1-hash.
>>>
>>> If I manually create the expected directory before running sbt or
>>> importing into IntelliJ, then it works. Why is it necessary to do this,
>>> and what can be done to make it not necessary?
>>>
>>> Rich
>>>
>>
>>
>>
>> --
>> Rich
>>
>
>
>
> --
> Rich
>
>


-- 
Rich


Re: Error building Spark on Windows with sbt

2015-10-25 Thread Ted Yu
If you have a pull request, Jenkins can test your change for you. 

FYI 

> On Oct 25, 2015, at 12:43 PM, Richard Eggert  wrote:
> 
> Also, if I run the Maven build on Windows or Linux without setting 
> -DskipTests=true, it hangs indefinitely when it gets to 
> org.apache.spark.JavaAPISuite.
> 
> It's hard to test patches when the build doesn't work. :-/
> 
>> On Sun, Oct 25, 2015 at 3:41 PM, Richard Eggert  
>> wrote:
>> By "it works", I mean, "It gets past that particular error". It still fails 
>> several minutes later with a different error: 
>> 
>> java.lang.IllegalStateException: impossible to get artifacts when data has 
>> not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3
>> 
>> 
>>> On Sun, Oct 25, 2015 at 3:38 PM, Richard Eggert  
>>> wrote:
>>> When I try to start up sbt for the Spark build,  or if I try to import it 
>>> in IntelliJ IDEA as an sbt project, it fails with a "No such file or 
>>> directory" error when it attempts to "git clone" sbt-pom-reader into 
>>> .sbt/0.13/staging/some-sha1-hash.
>>> 
>>> If I manually create the expected directory before running sbt or importing 
>>> into IntelliJ, then it works. Why is it necessary to do this,  and what can 
>>> be done to make it not necessary?
>>> 
>>> Rich
>>> 
>> 
>> 
>> 
>> -- 
>> Rich
> 
> 
> 
> -- 
> Rich


Re: Error building Spark on Windows with sbt

2015-10-25 Thread Richard Eggert
Also, if I run the Maven build on Windows or Linux without setting
-DskipTests=true, it hangs indefinitely when it gets to
org.apache.spark.JavaAPISuite.

It's hard to test patches when the build doesn't work. :-/

On Sun, Oct 25, 2015 at 3:41 PM, Richard Eggert 
wrote:

> By "it works", I mean, "It gets past that particular error". It still
> fails several minutes later with a different error:
>
> java.lang.IllegalStateException: impossible to get artifacts when data has
> not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3
>
>
> On Sun, Oct 25, 2015 at 3:38 PM, Richard Eggert 
> wrote:
>
>> When I try to start up sbt for the Spark build,  or if I try to import it
>> in IntelliJ IDEA as an sbt project, it fails with a "No such file or
>> directory" error when it attempts to "git clone" sbt-pom-reader into
>> .sbt/0.13/staging/some-sha1-hash.
>>
>> If I manually create the expected directory before running sbt or
>> importing into IntelliJ, then it works. Why is it necessary to do this,
>> and what can be done to make it not necessary?
>>
>> Rich
>>
>
>
>
> --
> Rich
>



-- 
Rich


Re: Error building Spark on Windows with sbt

2015-10-25 Thread Richard Eggert
By "it works", I mean, "It gets past that particular error". It still fails
several minutes later with a different error:

java.lang.IllegalStateException: impossible to get artifacts when data has
not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3


On Sun, Oct 25, 2015 at 3:38 PM, Richard Eggert 
wrote:

> When I try to start up sbt for the Spark build,  or if I try to import it
> in IntelliJ IDEA as an sbt project, it fails with a "No such file or
> directory" error when it attempts to "git clone" sbt-pom-reader into
> .sbt/0.13/staging/some-sha1-hash.
>
> If I manually create the expected directory before running sbt or
> importing into IntelliJ, then it works. Why is it necessary to do this,
> and what can be done to make it not necessary?
>
> Rich
>



-- 
Rich


Error building Spark on Windows with sbt

2015-10-25 Thread Richard Eggert
When I try to start up sbt for the Spark build,  or if I try to import it
in IntelliJ IDEA as an sbt project, it fails with a "No such file or
directory" error when it attempts to "git clone" sbt-pom-reader into
.sbt/0.13/staging/some-sha1-hash.

If I manually create the expected directory before running sbt or importing
into IntelliJ, then it works. Why is it necessary to do this,  and what can
be done to make it not necessary?

Rich


Re: Newbie Help for spark compilation problem

2015-10-25 Thread Todd Nist
Sorry Sean, you are absolutely right, it supports 2.11. All I meant is that
there is no release available as a standard download, and that one has to
build it. Thanks for the clarification.
-Todd

On Sunday, October 25, 2015, Sean Owen  wrote:

> Hm, why do you say it doesn't support 2.11? It does.
>
> It is not even this difficult; you just need a source distribution,
> and then run "./dev/change-scala-version.sh 2.11" as you say. Then
> build as normal
>
> On Sun, Oct 25, 2015 at 4:00 PM, Todd Nist  > wrote:
> > Hi Bilnmek,
> >
> > Spark 1.5.x does not support Scala 2.11.7 so the easiest thing to do it
> > build it like your trying.  Here are the steps I followed to build it on
> a
> > Max OS X 10.10.5 environment, should be very similar on ubuntu.
> >
> > 1.  set theJAVA_HOME environment variable in my bash session via export
> > JAVA_HOME=$(/usr/libexec/java_home).
> > 2. Spark is easiest to build with Maven so insure maven is installed, I
> > installed 3.3.x.
> > 3.  Download the source form Spark's site and extract.
> > 4.  Change into the spark-1.5.1 folder and run:
> >./dev/change-scala-version.sh 2.11
> > 5.  Issue the following command to build and create a distribution;
> >
> > ./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn
> > -Phadoop-2.6 -Dhadoop.version=2.6.0 -Dscala-2.11 -DskipTests
> >
> > This will provide you with a a fully self-contained installation of Spark
> > for Scala 2.11 including scripts and the like.  There are some
> limitations
> > see this,
> >
> http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
> ,
> > for what is not supported.
> >
> > HTH,
> >
> > -Todd
> >
> >
> > On Sun, Oct 25, 2015 at 10:56 AM, Bilinmek Istemiyor <
> benibi...@gmail.com >
> > wrote:
> >>
> >>
> >> I am just starting out apache spark. I hava zero knowledge about the
> spark
> >> environment, scala and sbt. I have a built problems which I could not
> solve.
> >> Any help much appreciated.
> >>
> >> I am using kubuntu 14.04, java "1.7.0_80, scala 2.11.7 and spark 1.5.1
> >>
> >> I tried to compile spark from source an and receive following errors
> >>
> >> [0m[[31merror[0m] [0mimpossible to get artifacts when data has not been
> >> loaded. IvyNode = org.scala-lang#scala-library;2.10.3[0m
> >> [0m[[31merror[0m] [0m(hive/*:[31mupdate[0m)
> >> java.lang.IllegalStateException: impossible to get artifacts when data
> has
> >> not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3[0m
> >> [0m[[31merror[0m] [0m(streaming-flume-sink/avro:[31mgenerate[0m)
> >> org.apache.avro.SchemaParseException: Undefined name: "strıng"[0m
> >> [0m[[31merror[0m] [0m(streaming-kafka-assembly/*:[31massembly[0m)
> >> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> >> [0m[[31merror[0m] [0m(streaming-mqtt/test:[31massembly[0m)
> >> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> >> [0m[[31merror[0m] [0m(assembly/*:[31massembly[0m)
> >> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> >> [0m[[31merror[0m] [0m(streaming-mqtt-assembly/*:[31massembly[0m)
> >> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> >> [0m[[31merror[0m] [0mTotal time: 1128 s, completed 25.Eki.2015
> 11:00:52[0m
> >>
> >> Sorry about some strange characters. I tried to capture the output with
> >>
> >> sbt clean assembly 2>&1 | tee compile.txt
> >>
> >> compile.txt was full of these characters.  I have attached the output of
> >> full compile process "compile.txt".
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> 
> >> For additional commands, e-mail: user-h...@spark.apache.org
> 
> >
> >
>


Re: Newbie Help for spark compilation problem

2015-10-25 Thread Ted Yu
A dependency couldn't be downloaded:

[INFO] +- com.h2database:h2:jar:1.4.183:test

Have you checked your network settings ?

Cheers

On Sun, Oct 25, 2015 at 10:22 AM, Bilinmek Istemiyor 
wrote:

> Thank you for the quick reply. You are God Send. I have long not been
> programming in java, nothing know about maven, scala, sbt ant spark stuff.
> I used java 7 since build failed with java 8. Which java version do you
> advise in general to use spark. I can downgrade scala version as well. Can
> you advise me a version number. I did not see any information at spark
> build page.
>
> I  followed your directions. However build finished with following output.
> I do not know if it is normal SQL Project build failure.
>
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [04:49
> min]
> [INFO] Spark Project Launcher . SUCCESS [04:14
> min]
> [INFO] Spark Project Networking ... SUCCESS [
> 20.354 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
> 5.287 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 18.095 s]
> [INFO] Spark Project Core . SUCCESS [12:59
> min]
> [INFO] Spark Project Bagel  SUCCESS [02:54
> min]
> [INFO] Spark Project GraphX ... SUCCESS [
> 29.764 s]
> [INFO] Spark Project Streaming  SUCCESS [01:12
> min]
> [INFO] Spark Project Catalyst . SUCCESS [02:58
> min]
> [INFO] Spark Project SQL .. FAILURE [02:50
> min]
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project YARN . SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Spark Project External Twitter . SKIPPED
> [INFO] Spark Project External Flume Sink .. SKIPPED
> [INFO] Spark Project External Flume ... SKIPPED
> [INFO] Spark Project External Flume Assembly .. SKIPPED
> [INFO] Spark Project External MQTT  SKIPPED
> [INFO] Spark Project External MQTT Assembly ... SKIPPED
> [INFO] Spark Project External ZeroMQ .. SKIPPED
> [INFO] Spark Project External Kafka ... SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Project External Kafka Assembly .. SKIPPED
> [INFO] Spark Project YARN Shuffle Service . SKIPPED
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 33:15 min
> [INFO] Finished at: 2015-10-25T19:12:15+02:00
> [INFO] Final Memory: 63M/771M
> [INFO]
> 
> [ERROR] Failed to execute goal on project spark-sql_2.11: Could not
> resolve dependencies for project org.apache.spark:spark-sql_2.11:jar:1.5.1:
> Could not transfer artifact com.h2database:h2:jar:1.4.183 from/to central (
> https://repo1.maven.org/maven2): GET request of:
> com/h2database/h2/1.4.183/h2-1.4.183.jar from central failed: SSL peer shut
> down incorrectly -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]   mvn  -rf :spark-sql_2.11
>
>
> On Sun, Oct 25, 2015 at 6:00 PM, Todd Nist  wrote:
>
>> Hi Bilnmek,
>>
>> Spark 1.5.x does not support Scala 2.11.7 so the easiest thing to do it
>> build it like your trying.  Here are the steps I followed to build it on a
>> Max OS X 10.10.5 environment, should be very similar on ubuntu.
>>
>> 1.  set theJAVA_HOME environment variable in my bash session via export
>> JAVA_HOME=$(/usr/libexec/java_home).
>> 2. Spark is easiest to build with Maven so insure maven is installed, I
>> installed 3.3.x.
>> 3.  Download the source form Spark's site and extract.
>> 4.  Change into the spark-1.5.1 folder and run:
>>./dev/change-scala-version.sh 2.11
>> 5.  Issue the following command to build and create a distribution;
>>
>> ./make-d

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Bilinmek Istemiyor
Thank you for the quick reply. You are a godsend. I have not programmed in
Java for a long time, and I know nothing about Maven, Scala, sbt, and the
Spark stack. I used Java 7 since the build failed with Java 8. Which Java
version do you advise in general for Spark? I can downgrade the Scala version
as well. Can you advise me on a version number? I did not see any information
on the Spark build page.

I followed your directions. However, the build finished with the following
output. I do not know whether this SQL project build failure is normal.

[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [04:49
min]
[INFO] Spark Project Launcher . SUCCESS [04:14
min]
[INFO] Spark Project Networking ... SUCCESS [
20.354 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
5.287 s]
[INFO] Spark Project Unsafe ... SUCCESS [
18.095 s]
[INFO] Spark Project Core . SUCCESS [12:59
min]
[INFO] Spark Project Bagel  SUCCESS [02:54
min]
[INFO] Spark Project GraphX ... SUCCESS [
29.764 s]
[INFO] Spark Project Streaming  SUCCESS [01:12
min]
[INFO] Spark Project Catalyst . SUCCESS [02:58
min]
[INFO] Spark Project SQL .. FAILURE [02:50
min]
[INFO] Spark Project ML Library ... SKIPPED
[INFO] Spark Project Tools  SKIPPED
[INFO] Spark Project Hive . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project YARN . SKIPPED
[INFO] Spark Project Assembly . SKIPPED
[INFO] Spark Project External Twitter . SKIPPED
[INFO] Spark Project External Flume Sink .. SKIPPED
[INFO] Spark Project External Flume ... SKIPPED
[INFO] Spark Project External Flume Assembly .. SKIPPED
[INFO] Spark Project External MQTT  SKIPPED
[INFO] Spark Project External MQTT Assembly ... SKIPPED
[INFO] Spark Project External ZeroMQ .. SKIPPED
[INFO] Spark Project External Kafka ... SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Project External Kafka Assembly .. SKIPPED
[INFO] Spark Project YARN Shuffle Service . SKIPPED
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 33:15 min
[INFO] Finished at: 2015-10-25T19:12:15+02:00
[INFO] Final Memory: 63M/771M
[INFO]

[ERROR] Failed to execute goal on project spark-sql_2.11: Could not resolve
dependencies for project org.apache.spark:spark-sql_2.11:jar:1.5.1: Could
not transfer artifact com.h2database:h2:jar:1.4.183 from/to central (
https://repo1.maven.org/maven2): GET request of:
com/h2database/h2/1.4.183/h2-1.4.183.jar from central failed: SSL peer shut
down incorrectly -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the
command
[ERROR]   mvn  -rf :spark-sql_2.11


On Sun, Oct 25, 2015 at 6:00 PM, Todd Nist  wrote:

> Hi Bilnmek,
>
> Spark 1.5.x does not support Scala 2.11.7 so the easiest thing to do it
> build it like your trying.  Here are the steps I followed to build it on a
> Max OS X 10.10.5 environment, should be very similar on ubuntu.
>
> 1.  set theJAVA_HOME environment variable in my bash session via export
> JAVA_HOME=$(/usr/libexec/java_home).
> 2. Spark is easiest to build with Maven so insure maven is installed, I
> installed 3.3.x.
> 3.  Download the source form Spark's site and extract.
> 4.  Change into the spark-1.5.1 folder and run:
>./dev/change-scala-version.sh 2.11
> 5.  Issue the following command to build and create a distribution;
>
> ./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn
> -Phadoop-2.6 -Dhadoop.version=2.6.0 -Dscala-2.11 -DskipTests
>
> This will provide you with a a fully self-contained installation of Spark
> for Scala 2.11 including scripts and the like.  There are some limitations
> see this,
> http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211,
> for what

[SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread Zhiliang Zhu
Dear All,

I have a program (below) whose result confuses me. It is a multiple-dimension
linear regression model: the weight / coefficient vector is always perfect
while the dimension is smaller than 4, but otherwise it is wrong all the time.
Or should another algorithm be selected instead of LinearRegressionWithSGD?

import java.util.List;

import com.google.common.collect.Lists;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
import org.apache.spark.sql.SQLContext;

public class JavaLinearRegression {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("Linear Regression Example");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext jsql = new SQLContext(sc);

    // Ax = b, x = [1, 2, 3, 4] would be the only weight vector:
    // x1 + 2 * x2 + 3 * x3 + 4 * x4 = y is the multiple linear model.
    List<LabeledPoint> localTraining = Lists.newArrayList(
        new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
        new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
        new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
        new LabeledPoint(16.0, Vectors.dense(0.0, 0.0, 0.0, 4.0)));

    JavaRDD<LabeledPoint> training = sc.parallelize(localTraining).cache();

    // Building the model
    int numIterations = 1000; // the number could be set larger
    final LinearRegressionModel model =
        LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numIterations);

    // The coefficient weights are perfect while the dimension of LabeledPoint
    // is SMALLER than 4; otherwise the output is always wrong and inscrutable.
    // For instance, one output is
    // Final w: [2.537341836047772E25,-7.744333206289736E24,6.697875883454909E23,-2.6704705246777624E22]
    System.out.print("Final w: " + model.weights() + "\n\n");
  }
}

I would appreciate your kind help or guidance very much~~

Thank you!
Zhiliang



Re: SparkR in yarn-client mode needs sparkr.zip

2015-10-25 Thread Ted Yu
In zipRLibraries():

// create a zip file from scratch, do not append to existing file.
val zipFile = new File(dir, name)

I guess instead of creating sparkr.zip in the same directory as R lib, the
zip file can be created under some directory writable by the user launching
the app and accessible by user 'yarn'.
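
A rough sketch of that idea (illustrative only, not the actual Spark code;
the names below are made up):

import java.io.File
import java.nio.file.Files

// Build sparkr.zip in a per-user scratch directory instead of next to R/lib,
// then hand that path on as the YARN local resource.
val stagingDir: File = Files.createTempDirectory("sparkr-staging").toFile
stagingDir.deleteOnExit()
val zipFile = new File(stagingDir, "sparkr.zip")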

Cheers

On Sun, Oct 25, 2015 at 8:29 AM, Ram Venkatesh 
wrote:

> 
>
> If you run sparkR in yarn-client mode, it fails with
>
> Exception in thread "main" java.io.FileNotFoundException:
> /usr/hdp/2.3.2.1-12/spark/R/lib/sparkr.zip (Permission denied)
> at java.io.FileOutputStream.open0(Native Method)
> at java.io.FileOutputStream.open(FileOutputStream.java:270)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
> at
>
> org.apache.spark.deploy.RPackageUtils$.zipRLibraries(RPackageUtils.scala:215)
> at
>
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:371)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> The behavior is the same when I use the pre-built spark-1.5.1-bin-hadoop2.6
> version also.
>
> Interestingly if I run as a user with write permissions to the R/lib
> directory, it succeeds. However, the sparkr.zip file is recreated each time
> sparkR is launched, so even if the file is present it has to be writable by
> the submitting user.
>
> Couple questions:
> 1. Can spark.zip be packaged once and placed in that location for multiple
> users
> 2. If not, is this location configurable, so that each user can specify a
> directory that they can write?
>
> Thanks!
> Ram
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-in-yarn-client-mode-needs-sparkr-zip-tp25194.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Newbie Help for spark compilation problem

2015-10-25 Thread Sean Owen
Hm, why do you say it doesn't support 2.11? It does.

It is not even this difficult; you just need a source distribution,
and then run "./dev/change-scala-version.sh 2.11" as you say. Then
build as normal

On Sun, Oct 25, 2015 at 4:00 PM, Todd Nist  wrote:
> Hi Bilnmek,
>
> Spark 1.5.x does not support Scala 2.11.7 so the easiest thing to do it
> build it like your trying.  Here are the steps I followed to build it on a
> Max OS X 10.10.5 environment, should be very similar on ubuntu.
>
> 1.  set theJAVA_HOME environment variable in my bash session via export
> JAVA_HOME=$(/usr/libexec/java_home).
> 2. Spark is easiest to build with Maven so insure maven is installed, I
> installed 3.3.x.
> 3.  Download the source form Spark's site and extract.
> 4.  Change into the spark-1.5.1 folder and run:
>./dev/change-scala-version.sh 2.11
> 5.  Issue the following command to build and create a distribution;
>
> ./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn
> -Phadoop-2.6 -Dhadoop.version=2.6.0 -Dscala-2.11 -DskipTests
>
> This will provide you with a a fully self-contained installation of Spark
> for Scala 2.11 including scripts and the like.  There are some limitations
> see this,
> http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211,
> for what is not supported.
>
> HTH,
>
> -Todd
>
>
> On Sun, Oct 25, 2015 at 10:56 AM, Bilinmek Istemiyor 
> wrote:
>>
>>
>> I am just starting out apache spark. I hava zero knowledge about the spark
>> environment, scala and sbt. I have a built problems which I could not solve.
>> Any help much appreciated.
>>
>> I am using kubuntu 14.04, java "1.7.0_80, scala 2.11.7 and spark 1.5.1
>>
>> I tried to compile spark from source an and receive following errors
>>
>> [0m[[31merror[0m] [0mimpossible to get artifacts when data has not been
>> loaded. IvyNode = org.scala-lang#scala-library;2.10.3[0m
>> [0m[[31merror[0m] [0m(hive/*:[31mupdate[0m)
>> java.lang.IllegalStateException: impossible to get artifacts when data has
>> not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3[0m
>> [0m[[31merror[0m] [0m(streaming-flume-sink/avro:[31mgenerate[0m)
>> org.apache.avro.SchemaParseException: Undefined name: "strıng"[0m
>> [0m[[31merror[0m] [0m(streaming-kafka-assembly/*:[31massembly[0m)
>> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
>> [0m[[31merror[0m] [0m(streaming-mqtt/test:[31massembly[0m)
>> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
>> [0m[[31merror[0m] [0m(assembly/*:[31massembly[0m)
>> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
>> [0m[[31merror[0m] [0m(streaming-mqtt-assembly/*:[31massembly[0m)
>> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
>> [0m[[31merror[0m] [0mTotal time: 1128 s, completed 25.Eki.2015 11:00:52[0m
>>
>> Sorry about some strange characters. I tried to capture the output with
>>
>> sbt clean assembly 2>&1 | tee compile.txt
>>
>> compile.txt was full of these characters.  I have attached the output of
>> full compile process "compile.txt".
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark scala REPL - Unable to create sqlContext

2015-10-25 Thread Ge, Yao (Y.)
Thanks. I wonder why this is not widely reported in the user forum. The REPL
shell is basically broken in 1.5.0 and 1.5.1.
-Yao

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Sunday, October 25, 2015 12:01 PM
To: Ge, Yao (Y.)
Cc: user
Subject: Re: Spark scala REPL - Unable to create sqlContext

Have you taken a look at the fix for SPARK-11000 which is in the upcoming 1.6.0 
release ?

Cheers

On Sun, Oct 25, 2015 at 8:42 AM, Yao <y...@ford.com> wrote:
I have not been able to start Spark scala shell since 1.5 as it was not able
to create the sqlContext during the startup. It complains the metastore_db
is already locked: "Another instance of Derby may have already booted the
database". The Derby log is attached.

I only have this problem with starting the shell in yarn-client mode. I am
working with HDP2.2.6 which runs Hadoop 2.6.

-Yao derby.log




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-scala-REPL-Unable-to-create-sqlContext-tp25195.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark scala REPL - Unable to create sqlContext

2015-10-25 Thread Ted Yu
Have you taken a look at the fix for SPARK-11000 which is in the upcoming
1.6.0 release ?

Cheers

On Sun, Oct 25, 2015 at 8:42 AM, Yao  wrote:

> I have not been able to start Spark scala shell since 1.5 as it was not
> able
> to create the sqlContext during the startup. It complains the metastore_db
> is already locked: "Another instance of Derby may have already booted the
> database". The Derby log is attached.
>
> I only have this problem with starting the shell in yarn-client mode. I am
> working with HDP2.2.6 which runs Hadoop 2.6.
>
> -Yao derby.log
>  >
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-scala-REPL-Unable-to-create-sqlContext-tp25195.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Newbie Help for spark compilation problem

2015-10-25 Thread Todd Nist
Hi Bilnmek,

Spark 1.5.x does not come with a pre-built Scala 2.11 distribution, so the
easiest thing to do is to build it yourself, as you are trying. Here are the
steps I followed to build it on a Mac OS X 10.10.5 environment; they should be
very similar on Ubuntu.

1.  Set the JAVA_HOME environment variable in my bash session via
    export JAVA_HOME=$(/usr/libexec/java_home).
2.  Spark is easiest to build with Maven, so ensure Maven is installed; I
    installed 3.3.x.
3.  Download the source from Spark's site and extract it.
4.  Change into the spark-1.5.1 folder and run:
    ./dev/change-scala-version.sh 2.11
5.  Issue the following command to build and create a distribution:

./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn
-Phadoop-2.6 -Dhadoop.version=2.6.0 -Dscala-2.11 -DskipTests

This will provide you with a fully self-contained installation of Spark for
Scala 2.11, including scripts and the like. There are some limitations; see
http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
for what is not supported.

HTH,

-Todd

On Sun, Oct 25, 2015 at 10:56 AM, Bilinmek Istemiyor 
wrote:

>
> I am just starting out apache spark. I hava zero knowledge about the spark
> environment, scala and sbt. I have a built problems which I could not
> solve. Any help much appreciated.
>
> I am using kubuntu 14.04, java "1.7.0_80, scala 2.11.7 and spark 1.5.1
>
> I tried to compile spark from source an and receive following errors
>
> [0m[[31merror[0m] [0mimpossible to get artifacts when data has not been
> loaded. IvyNode = org.scala-lang#scala-library;2.10.3[0m
> [0m[[31merror[0m] [0m(hive/*:[31mupdate[0m)
> java.lang.IllegalStateException: impossible to get artifacts when data has
> not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3[0m
> [0m[[31merror[0m] [0m(streaming-flume-sink/avro:[31mgenerate[0m)
> org.apache.avro.SchemaParseException: Undefined name: "strıng"[0m
> [0m[[31merror[0m] [0m(streaming-kafka-assembly/*:[31massembly[0m)
> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> [0m[[31merror[0m] [0m(streaming-mqtt/test:[31massembly[0m)
> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> [0m[[31merror[0m] [0m(assembly/*:[31massembly[0m)
> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> [0m[[31merror[0m] [0m(streaming-mqtt-assembly/*:[31massembly[0m)
> java.util.zip.ZipException: duplicate entry: META-INF/MANIFEST.MF[0m
> [0m[[31merror[0m] [0mTotal time: 1128 s, completed 25.Eki.2015 11:00:52[0m
>
> Sorry about some strange characters. I tried to capture the output with
>
> sbt clean assembly 2>&1 | tee compile.txt
>
> compile.txt was full of these characters.  I have attached the output of
> full compile process "compile.txt".
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Spark scala REPL - Unable to create sqlContext

2015-10-25 Thread Yao
I have not been able to start Spark scala shell since 1.5 as it was not able
to create the sqlContext during the startup. It complains the metastore_db
is already locked: "Another instance of Derby may have already booted the
database". The Derby log is attached.

I only have this problem with starting the shell in yarn-client mode. I am
working with HDP2.2.6 which runs Hadoop 2.6.

-Yao derby.log
  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-scala-REPL-Unable-to-create-sqlContext-tp25195.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkR in yarn-client mode needs sparkr.zip

2015-10-25 Thread Ram Venkatesh


If you run sparkR in yarn-client mode, it fails with

Exception in thread "main" java.io.FileNotFoundException:
/usr/hdp/2.3.2.1-12/spark/R/lib/sparkr.zip (Permission denied)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at
org.apache.spark.deploy.RPackageUtils$.zipRLibraries(RPackageUtils.scala:215)
at
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:371)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The behavior is the same when I use the pre-built spark-1.5.1-bin-hadoop2.6
version also.

Interestingly if I run as a user with write permissions to the R/lib
directory, it succeeds. However, the sparkr.zip file is recreated each time
sparkR is launched, so even if the file is present it has to be writable by
the submitting user.

A couple of questions:
1. Can sparkr.zip be packaged once and placed in that location for multiple
users?
2. If not, is this location configurable, so that each user can specify a
directory that they can write to?

Thanks!
Ram



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-in-yarn-client-mode-needs-sparkr-zip-tp25194.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: question about HadoopFsRelation

2015-10-25 Thread Koert Kuipers
Thanks, I will read up on that.

On Sat, Oct 24, 2015 at 12:53 PM, Ted Yu  wrote:

> The code below was introduced by SPARK-7673 / PR #6225
>
> See item #1 in the description of the PR.
>
> Cheers
>
> On Sat, Oct 24, 2015 at 12:59 AM, Koert Kuipers  wrote:
>
>> the code that seems to flatMap directories to all the files inside is in
>> the private HadoopFsRelation.buildScan:
>>
>> // First assumes `input` is a directory path, and tries to get all
>> files contained in it.
>>   fileStatusCache.leafDirToChildrenFiles.getOrElse(
>> path,
>> // Otherwise, `input` might be a file path
>> fileStatusCache.leafFiles.get(path).toArray
>>
>> does anyone know why we want to get all the files when all hadoop
>> inputformats can handle directories (and automatically get the files
>> inside), and the recommended way of doing this in map-red is to pass in
>> directories (to avoid the overhead and very large serialized jobconfs)?
>>
>>
>> On Sat, Oct 24, 2015 at 12:23 AM, Koert Kuipers 
>> wrote:
>>
>>> i noticed in the comments for HadoopFsRelation.buildScan it says:
>>>   * @param inputFiles For a non-partitioned relation, it contains paths
>>> of all data files in the
>>>*relation. For a partitioned relation, it contains paths of
>>> all data files in a single
>>>*selected partition.
>>>
>>> do i understand it correctly that it actually lists all the data files
>>> (part files), not just data directories that contain the files?
>>> if so,that sounds like trouble to me, because most implementations will
>>> use this info to set the input paths for FileInputFormat. for example in
>>> ParquetRelation:
>>> if (inputFiles.nonEmpty) {
>>>   FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)
>>> }
>>>
>>> if all the part files are listed this will make the jobConf huge and
>>> could cause issues upon serialization and/or broadcasting.
>>>
>>> it can also lead to other inefficiencies, for example spark-avro creates
>>> a RDD for every input (part) file, which quickly leads to thousands of RDDs.
>>>
>>> i think instead of files only the directories should be listed in the
>>> input path?
>>>
>>
>>
>


Re: [SPARK STREAMING] Concurrent operations in spark streaming

2015-10-25 Thread Nipun Arora
So essentially the driver/client program needs to explicitly have two
threads to ensure concurrency?

What happens when the program is sequential, i.e., I execute function A
and then function B? Does this mean that each RDD first goes through
function A, and then stream X is persisted, but processed in function B
only after the RDD has been processed by A?

Thanks
Nipun
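
(For reference, a minimal non-streaming sketch of the two-driver-thread
pattern Andy describes below; sc is assumed to be an existing SparkContext,
and the input, transformations, and output path are made up:)

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Thread pool for submitting independent Spark actions from the driver.
implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))

val x = sc.parallelize(1 to 1000000).cache()  // stand-in for the shared input X

// "Function A": transform/filter, then write the result out.
val jobA = Future { x.map(_ * 2).filter(_ % 3 == 0).saveAsTextFile("/tmp/out-a") }
// "Function B": a reduction, e.g. the minimum value.
val jobB = Future { x.min() }

// Both actions are now in flight; whichever was submitted first is scheduled
// first, and they can run concurrently if the scheduler and resources allow.
Await.ready(jobA, Duration.Inf)
println("min = " + Await.result(jobB, Duration.Inf))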
On Sat, Oct 24, 2015 at 5:32 PM Andy Dang  wrote:

> If you execute the collect step (foreach in 1, possibly reduce in 2) in
> two threads in the driver then both of them will be executed in parallel.
> Whichever gets submitted to Spark first gets executed first - you can use a
> semaphore if you need to ensure the ordering of execution, though I would
> assume that the ordering wouldn't matter.
>
> ---
> Regards,
> Andy
>
> On Sat, Oct 24, 2015 at 10:08 PM, Nipun Arora 
> wrote:
>
>> I wanted to understand something about the internals of spark streaming
>> executions.
>>
>> If I have a stream X, and in my program I send stream X to function A and
>> function B:
>>
>> 1. In function A, I do a few transform/filter operations etc. on X->Y->Z
>> to create stream Z. Now I do a forEach Operation on Z and print the output
>> to a file.
>>
>> 2. Then in function B, I reduce stream X -> X2 (say min value of each
>> RDD), and print the output to file
>>
>> Are both functions being executed for each RDD in parallel? How does it
>> work?
>>
>> Thanks
>> Nipun
>>
>>
>