[jira] [Issue Comment Deleted] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dileep updated SPARK-12843:
---------------------------
    Comment: was deleted

(was: It is not selecting all records when I put a LIMIT after caching, so you can close this issue.)

> Spark should avoid scanning all partitions when limit is set
> -------------------------------------------------------------
>
>                 Key: SPARK-12843
>                 URL: https://issues.apache.org/jira/browse/SPARK-12843
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>
> The SQL query
> {code}
> select * from table limit 100
> {code}
> forces Spark to scan all partitions even when enough data is available at the beginning of the scan. This behaviour should be avoided: the scan should stop once enough data has been collected.
> Is it related to [SPARK-9850]?
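A minimal sketch of how the reported behaviour can be inspected in Spark 1.6. This is illustrative, not from the ticket; the local-mode setup, the sample JSON path, and the people table name are assumptions for the example:

{code}
// Register a small temp table, then print the physical plan of a LIMIT query.
// In 1.6 the limit is applied on top of a scan of the underlying relation,
// rather than stopping the scan early once 100 rows have been produced.
SparkConf conf = new SparkConf().setAppName("LimitScanDemo").setMaster("local[4]");
JavaSparkContext ctx = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(ctx);

sqlContext.read().json("examples/src/main/resources/people.json")
          .registerTempTable("people");

sqlContext.sql("SELECT * FROM people LIMIT 100").explain();  // prints the physical plan
{code}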
[jira] [Issue Comment Deleted] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dileep updated SPARK-12843:
---------------------------
    Comment: was deleted

(was: It is a caching issue: while scanning the table you need to cache the DataFrame, so subsequent scans will not take as much time.)
[jira] [Issue Comment Deleted] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dileep updated SPARK-12843:
---------------------------
    Comment: was deleted

(was: I will look into this issue.)
[jira] [Issue Comment Deleted] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dileep updated SPARK-12843:
---------------------------
    Comment: was deleted

(was: When I verified with 200,000 (2 lakh) records, I could measure the millisecond-level difference through the program, which clearly shows it is not scanning all records. You can also improve query performance by using the DataFrame's cache method; I saw a significant improvement in query performance. Please see the code snippet below. We need to make use of the DataFrame's caching mechanism:

{code}
DataFrame teenagers = sqlContext.sql("SELECT * FROM people LIMIT 1");
teenagers.cache();
{code}

This gives a significant improvement to the SELECT query, so subsequent SELECT queries will not read the entire data set.)
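For reference, a sketch of the kind of timing comparison the deleted comment describes, reusing the sqlContext and people temp table from the earlier sketch (again illustrative, not from the ticket), with plain System.currentTimeMillis() deltas for the measurement:

{code}
// Cache the limited result, then time a first action (which populates the
// cache) against a second action (which is served from the cache).
DataFrame limited = sqlContext.sql("SELECT * FROM people LIMIT 1");
limited.cache();

long t0 = System.currentTimeMillis();
limited.collectAsList();   // first action: reads the input and fills the cache
long t1 = System.currentTimeMillis();
limited.collectAsList();   // second action: answered from the cached rows
long t2 = System.currentTimeMillis();

System.out.println("first run:  " + (t1 - t0) + " ms");
System.out.println("second run: " + (t2 - t1) + " ms");
{code}

Note that caching only speeds up re-reading the already-limited result; it does not change how much of the table the first scan touches, which is the behaviour this issue reports.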
[jira] [Issue Comment Deleted] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dileep updated SPARK-12843:
---------------------------
    Comment: was deleted

(was: Maciej Bryński, can you elaborate?)
[jira] [Issue Comment Deleted] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dileep updated SPARK-12843:
---------------------------
    Comment: was deleted

(was:
{code}
import java.io.Serializable;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class JavaSparkSQL {

  public static class Person implements Serializable {
    private String name;
    private int age;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
  }

  public static void main(String[] args) throws Exception {
    long millis1 = System.currentTimeMillis(); // wall-clock start for the end-to-end timing

    SparkConf sparkConf = new SparkConf().setAppName("JavaSparkSQL").setMaster("local[4]");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    SQLContext sqlContext = new SQLContext(ctx);

    // Load a text file and convert each line to a Java Bean.
    JavaRDD<Person> people = ctx.textFile(
        "/home/394036/spark-1.6.0-bin-hadoop2.3/examples/src/main/resources/people_1.txt").map(
      new Function<String, Person>() {
        @Override
        public Person call(String line) {
          String[] parts = line.split(",");
          Person person = new Person();
          person.setName(parts[0]);
          person.setAge(Integer.parseInt(parts[1].trim()));
          return person;
        }
      });

    // Apply a schema to an RDD of Java Beans and register it as a table.
    DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);
    schemaPeople.registerTempTable("people");

    // SQL can be run over RDDs that have been registered as tables.
    // DataFrame teenagers = sqlContext.sql("SELECT age, name FROM people WHERE age >= 13 AND age <= 19");
    // DataFrame teenagers = sqlContext.sql("SELECT * FROM people");
    DataFrame teenagers = sqlContext.sql("SELECT * FROM people LIMIT 1");
    teenagers.cache();

    // The results of SQL queries are DataFrames and support all the normal RDD
    // operations. Columns of a row can be accessed by ordinal; bean-derived
    // columns are ordered alphabetically, so 0 = age and 1 = name.
    List<Person> teenagerNames = teenagers.toJavaRDD().map(new Function<Row, Person>() {
      @Override
      public Person call(Row row) {
        Person person = new Person();
        person.setAge(row.getInt(0));
        person.setName(row.getString(1));
        return person;
      }
    }).collect();

    long millis2 = System.currentTimeMillis();
    System.out.println("difference = " + (millis2 - millis1));
    // for (Person p : teenagerNames) { System.out.println("=>" + p.getName()); }

    /*
    System.out.println("=== Data source: Parquet File ===");
    // DataFrames can be saved as parquet files, maintaining the schema information.
    schemaPeople.write().parquet("people.parquet");

    // Read the parquet file created above. Parquet files are self-describing, so
    // the schema is preserved; the result of loading one is also a DataFrame.
    DataFrame parquetFile = sqlContext.read().parquet("people.parquet");

    // Parquet files can also be registered as tables and then used in SQL statements.
    parquetFile.registerTempTable("parquetFile");
    DataFrame teenagers2 = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19");
    List<String> names = teenagers2.toJavaRDD().map(new Function<Row, String>() {
      @Override
      public String call(Row row) { return "Name: " + row.getString(0); }
    }).collect();
    for (String name : names) { System.out.println(name); }

    System.out.println("=== Data source: JSON Dataset ===");
    // A JSON dataset is pointed to by path; the path can be a single text file
    // or a directory storing text files.
    String path = "/home/394036/spark-1.6.0-bin-hadoop2.3/examples/src/main/resources/people.json";
    // Create a DataFrame from the file(s) pointed to by path.
    DataFrame peopleFromJsonFile = sqlContext.read().json(path);

    // The schema of a JSON dataset is automatically inferred; print it to inspect it:
    // root
    //  |-- age: IntegerType
    //  |-- name: StringType
    peopleFromJsonFile.printSchema();

    // Register this DataFrame as a table.
    peopleFromJsonFile.registerTempTable("people");

    // SQL statements can be run via the sql method provided by sqlContext.
    DataFrame teenagers3 = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
    names = teenagers3.toJavaRDD().map(new Function<Row, String>() {
      @Override
      public String call(Row row) { return "Name: " + row.getString(0); }
    }).collect();
    for (String name : names) { System.out.println(name); }
    */

    ctx.stop();
  }
}
{code}
)